Title :
Using compression to identify acronyms in text
Author :
Yeates, Stuart ; Bainbridge, David ; Witten, Ian H.
Author_Institution :
Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
Abstract :
Summary form only given. Finding acronyms and their definitions in free text is useful for many purposes. We have developed a new method that uses several PPM models to encode the acronym in terms of its definition. Four different attributes of each acronym are encoded using a PPMD order 5 model: (a) whether the acronym occurred before or after its definition (direction); (b) the distance between the acronym and the definition (first-word offset); (c) the pattern of words in the definition that contribute letters to the acronym (subsequent-word offsets); and (d) the number of letters taken from each of those words. These models, taken together, give a complete encoding of the acronym in terms of its definition. The models were trained on 1080 acronyms extracted from 150 documents. A model of plain text was trained using 100 independent documents from the same collection.
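The four attributes listed in the abstract can be illustrated with a minimal sketch. It only derives the attribute values for one acronym/definition pair; it does not implement the PPM models that encode them. Every name here (`AcronymEncoding`, `align`, `encode`) is hypothetical, and the prefix-matching alignment and the exact way offsets are counted are assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AcronymEncoding:
    direction: str                 # "before" if the acronym precedes its definition, else "after"
    first_word_offset: int         # words between the acronym and the first contributing word
    subsequent_offsets: List[int]  # gaps between consecutive contributing words
    letters_per_word: List[int]    # letters taken from each contributing word


def align(acronym: str, words: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Find (word_index, n_letters) pairs whose word prefixes spell the acronym.

    Assumption: letters are taken only from the start of each word, in order.
    Words may be skipped. Returns None when no such alignment exists.
    """
    target = acronym.lower()

    def search(pos: int, start: int) -> Optional[List[Tuple[int, int]]]:
        if pos == len(target):
            return []
        for i in range(start, len(words)):
            word = words[i].lower()
            # Length of the longest prefix of this word matching the acronym at pos.
            k = 0
            while k < len(word) and pos + k < len(target) and word[k] == target[pos + k]:
                k += 1
            # Try taking 1..k letters from this word, then continue with later words.
            for n in range(1, k + 1):
                rest = search(pos + n, i + 1)
                if rest is not None:
                    return [(i, n)] + rest
        return None

    return search(0, 0)


def encode(tokens: List[str], acronym_index: int,
           def_start: int, def_end: int) -> Optional[AcronymEncoding]:
    """Derive the four attributes for the acronym at acronym_index and the
    candidate definition tokens[def_start:def_end]."""
    pairs = align(tokens[acronym_index].strip("()"), tokens[def_start:def_end])
    if pairs is None:
        return None
    positions = [def_start + i for i, _ in pairs]  # absolute token positions
    first = positions[0]
    return AcronymEncoding(
        direction="before" if acronym_index < first else "after",
        first_word_offset=abs(first - acronym_index),
        subsequent_offsets=[b - a for a, b in zip(positions, positions[1:])],
        letters_per_word=[n for _, n in pairs],
    )


tokens = ["the", "Data", "Compression", "Conference", "DCC", "was", "held"]
encoding = encode(tokens, acronym_index=4, def_start=1, def_end=4)
print(encoding)
```

In the method described above, each of these four value streams would then be passed to a compression model; that step is left out of this sketch.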
Keywords :
data compression; text analysis; PPM models; PPMD; acronym encoding; compression; definition; direction; first-word offset; plain text; subsequent-word offsets; encoding
Conference_Title :
Data Compression Conference, 2000. Proceedings. DCC 2000
Conference_Location :
Snowbird, UT
Print_ISBN :
0-7695-0592-9
DOI :
10.1109/DCC.2000.838229