DocumentCode
1939335
Title
Text categorization using compression models
Author
Frank, Eibe ; Chui, Chang ; Witten, Ian H.
Author_Institution
Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
fYear
2000
fDate
2000
Firstpage
555
Abstract
Summary form only given. Text categorization is the assignment of natural language texts to predefined categories based on their content. The use of predefined categories implies a “supervised learning” approach to categorization, in which already-classified articles, which effectively define the categories, serve as “training data” for building a model that can then classify new articles, the “test data”. Typical approaches extract features from articles and use the feature vectors as input to a machine learning scheme that learns how to classify articles. The features are generally words. It has often been observed that compression seems to provide a very promising alternative approach to categorization. The overall compression of an article with respect to different models, one per category, can be compared to see which one it fits most closely. Such a scheme has several potential advantages: it yields an overall judgement on the document as a whole, rather than discarding information by pre-selecting features; it avoids the messy and rather artificial problem of defining word boundaries; it deals uniformly with morphological variants of words; depending on the model (and its order), it can take account of phrasal effects that span word boundaries; it offers a uniform way of dealing with different types of documents (for example, arbitrary files in a computer system); and it generally minimizes the arbitrary decisions that inevitably need to be taken to render any learning scheme practical.
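The comparison of compression under different category models can be illustrated with a minimal sketch. It does not reproduce the authors' compression models: it substitutes Python's general-purpose `zlib` compressor as a stand-in, and approximates "compression with respect to a category model" by the number of extra bytes needed to compress a document appended to that category's training text. The function names (`compressed_size`, `extra_bytes`, `categorize`) and the toy training data are hypothetical, chosen only for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(training_text: str, document: str) -> int:
    """Approximate cost of coding document given a category's training text.

    Assumption: the growth in compressed size when the document is appended
    to the training text stands in for the document's compressed length
    under a model trained on that category.
    """
    combined = training_text + " " + document
    return compressed_size(combined) - compressed_size(training_text)


def categorize(document: str, training: dict[str, str]) -> str:
    """Return the category whose training text compresses the document best."""
    return min(training, key=lambda cat: extra_bytes(training[cat], document))


if __name__ == "__main__":
    # Toy training data, invented for this sketch only.
    training = {
        "sports": "the team won the match after scoring two goals in the final minutes " * 5,
        "finance": "the bank raised interest rates as inflation and bond yields climbed " * 5,
    }
    print(categorize("the team won the match after scoring two goals", training))
```

Because the compressor works on raw characters, no word boundaries or feature selection are needed, which mirrors the advantages listed in the abstract. A general-purpose compressor with a limited window is only a rough proxy for a per-category adaptive model, so results on real corpora may differ considerably.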
Keywords
data compression; feature extraction; learning (artificial intelligence); natural languages; pattern classification; text analysis; article classification; article compression; compression models; feature vectors; machine learning; morphological variants; natural language texts; predefined categories; supervised learning; text categorization; training data
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Compression Conference, 2000. Proceedings. DCC 2000
Conference_Location
Snowbird, UT
ISSN
1068-0314
Print_ISBN
0-7695-0592-9
Type
conf
DOI
10.1109/DCC.2000.838202
Filename
838202