DocumentCode :
1939981
Title :
Using compression to identify acronyms in text
Author :
Yeates, Stuart ; Bainbridge, David ; Witten, Ian H.
Author_Institution :
Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
fYear :
2000
fDate :
2000
Firstpage :
582
Abstract :
Summary form only given. Finding acronyms and their definitions in free text is useful for many purposes. We have developed a new method that uses several PPM models to encode each acronym in terms of its definition. Four attributes of each acronym are encoded using a PPMD order 5 model: (a) whether the acronym occurs before or after its definition (direction); (b) the distance between the acronym and the definition (first-word offset); (c) the pattern of words in the definition that contribute letters to the acronym (subsequent-word offsets); and (d) the number of letters taken from each of those words. Taken together, these models give a complete encoding of the acronym in terms of its definition. The models were trained on 1080 acronyms extracted from 150 documents. A model of plain text was trained using 100 independent documents from the same collection.
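The four attributes listed in the abstract can be illustrated with a minimal sketch that derives them from one acronym/definition pair. It only produces the symbols that would be handed to the models and does no PPM coding itself. The names `AcronymAttributes` and `describe_acronym`, the word-level tokenisation, the unsigned first-word offset, and the assumption that each word contributes its leading letters are all illustrative choices, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AcronymAttributes:
    direction: str                      # (a) "before" or "after" the definition
    first_word_offset: int              # (b) words between acronym and first definition word
    subsequent_word_offsets: List[int]  # (c) gaps between successive contributing words
    letters_per_word: List[int]         # (d) letters taken from each contributing word


def describe_acronym(tokens: List[str], acronym_index: int,
                     contributions: List[Tuple[int, int]]) -> AcronymAttributes:
    """Derive the four attributes for one acronym.

    `contributions` is a hypothetical alignment: (word index, letter count)
    pairs in definition order.
    """
    if not contributions:
        raise ValueError("at least one definition word is required")
    acronym = tokens[acronym_index]
    # Assumption: each contributing word supplies its leading letters.
    spelled = "".join(tokens[i][:n] for i, n in contributions)
    if spelled.lower() != acronym.lower():
        raise ValueError(f"{spelled!r} does not spell {acronym!r}")
    word_indices = [i for i, _ in contributions]
    first = word_indices[0]
    return AcronymAttributes(
        direction="before" if acronym_index < first else "after",
        first_word_offset=abs(first - acronym_index),
        subsequent_word_offsets=[b - a for a, b in zip(word_indices, word_indices[1:])],
        letters_per_word=[n for _, n in contributions],
    )
```

For the token list `["PPM", "(", "prediction", "by", "partial", "matching", ")"]` with the acronym at index 0 and contributions `[(2, 1), (4, 1), (5, 1)]`, the sketch reports direction "before", a first-word offset of 2, subsequent-word offsets `[2, 1]` (the word "by" is skipped), and one letter per word.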
Keywords :
data compression; text analysis; PPM models; PPMD; acronym encoding; compression; definition; direction; first-word offset; plain text; subsequent-word offsets; Encoding
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Compression Conference, 2000. Proceedings. DCC 2000
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-7695-0592-9
Type :
conf
DOI :
10.1109/DCC.2000.838229
Filename :
838229