Regional Information Center for Science and Technology - Using compression to identify acronyms in text

DocumentCode :

1939981

Title :

Using compression to identify acronyms in text

Author :

Yeates, Stuart ; Bainbridge, David ; Witten, Ian H.

Author_Institution :

Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand

fYear :

2000

fDate :

2000

Firstpage :

582

Abstract :

Summary form only given. Finding acronyms and their definitions in free text is useful for many purposes. We have developed a new method that uses several PPM models to encode the acronym in terms of its definition. Four different attributes of each acronym are encoded using a PPMD order 5 model: (a) whether the acronym occurred before or after its definition (direction); (b) the distance between the acronym and the definition (first-word offset); (c) the pattern of words in the definition with letters in the acronym (subsequent-word offsets); and (d) the number of letters taken from each of those words. These models, taken together, give a complete encoding of the acronym in terms of its definition. The models were trained on 1080 acronyms extracted from 150 documents. A model of plain text was trained using 100 independent documents from the same collection.
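The four attributes listed in the abstract can be illustrated with a minimal Python sketch. It only derives the attribute values for one acronym/definition pair; it does not implement PPM or PPMD modelling. The abstract does not give the exact representation the authors use, so the names `AcronymEncoding` and `encode_acronym`, the use of word-level tokens, the measurement of offsets in words, and the example sentence are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class AcronymEncoding(NamedTuple):
    """Hypothetical container for the four attributes named in the abstract."""
    direction: str                 # "before" if the acronym precedes its definition, else "after"
    first_word_offset: int         # distance in words from the acronym to the first definition word
    subsequent_offsets: List[int]  # word gaps between consecutive contributing definition words
    letters_per_word: List[int]    # number of leading letters taken from each contributing word


def encode_acronym(tokens: List[str],
                   acronym_index: int,
                   alignment: List[Tuple[int, int]]) -> AcronymEncoding:
    """Describe an acronym in terms of its definition.

    `alignment` lists (word_index, letters_taken) pairs in definition order.
    Assumption: letters are always taken from the start of a word.
    """
    if not alignment:
        raise ValueError("alignment must name at least one definition word")

    acronym = tokens[acronym_index]
    word_indices = [i for i, _ in alignment]
    letter_counts = [n for _, n in alignment]

    # Check that the alignment really spells out the acronym.
    rebuilt = "".join(tokens[i][:n] for i, n in alignment)
    if rebuilt.upper() != acronym.upper():
        raise ValueError(f"alignment spells {rebuilt!r}, not {acronym!r}")

    # (a) direction: does the acronym come before or after its definition?
    direction = "before" if acronym_index < word_indices[0] else "after"
    # (b) first-word offset: distance from the acronym to the first definition word.
    first_word_offset = abs(word_indices[0] - acronym_index)
    # (c) subsequent-word offsets: gaps between successive contributing words.
    subsequent_offsets = [b - a for a, b in zip(word_indices, word_indices[1:])]
    # (d) letters taken from each contributing word.
    return AcronymEncoding(direction, first_word_offset,
                           subsequent_offsets, letter_counts)


# Made-up example sentence; "by" contributes no letter and is skipped.
tokens = "prediction by partial matching ( PPM ) models".split()
encoding = encode_acronym(tokens, acronym_index=5,
                          alignment=[(0, 1), (2, 1), (3, 1)])
print(encoding)
```

In a compression-based approach such as the one the abstract describes, each of these attribute streams would then be coded by a trained model, which this sketch leaves out.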

Keywords :

data compression; text analysis; PPM models; PPMD; acronym encoding; compression; definition; direction; first-word offset; plain text; subsequent-word offsets; Encoding;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Data Compression Conference, 2000. Proceedings. DCC 2000

Conference_Location :

Snowbird, UT

ISSN :

1068-0314

Print_ISBN :

0-7695-0592-9

Type :

conf

DOI :

10.1109/DCC.2000.838229

Filename :

838229

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1939981