Title :
Tag insertion complexity
Author :
Yeates, Stuart ; Witten, Ian H. ; Bainbridge, David
Author_Institution :
Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
Abstract :
This paper is about inferring markup information, a generalization of part-of-speech tagging. We use compression models based on a marked-up training corpus and apply them to fresh, unmarked text. In effect, this technique builds filters that extract information from text in a way that is generalized because it depends on training text rather than preprogrammed heuristics. We use SGML tags to represent the extracted information. However, we work in a more controlled textual environment: we use bibliographic text rather than plain English and mark up entities such as author, date, and title rather than syntactic parts of speech. Such entities are generically called "metadata" (data about data) and form an important component of the information present in a bibliography. The aim of our work is to automatically enhance bibliographies with metadata tags, based on a training corpus of annotated bibliography entries.
Keywords :
data compression; meta data; page description languages; search problems; SGML tags; Viterbi search; bibliographic text; bibliography; compression models; extracted information; filters; marked-up training corpus; markup information; metadata; part-of-speech tagging; tag insertion complexity; training text; Bibliographies; Computer science; Data mining; Dictionaries; Information filtering; Information filters; SGML; Tagging; Technical Activities Guide -TAG; Testing
Conference_Title :
Data Compression Conference, 2001. Proceedings. DCC 2001.
Conference_Location :
Snowbird, UT
Print_ISBN :
0-7695-1031-0
DOI :
10.1109/DCC.2001.917155