DocumentCode :
1939335
Title :
Text categorization using compression models
Author :
Frank, Eibe ; Chui, Chang ; Witten, Ian H.
Author_Institution :
Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand
fYear :
2000
fDate :
2000
Firstpage :
555
Abstract :
Summary form only given. Text categorization is the assignment of natural language texts to predefined categories based on their content. The use of predefined categories implies a “supervised learning” approach to categorization, where already-classified articles, which effectively define the categories, are used as “training data” to build a model that can be used for classifying new articles that comprise the “test data”. Typical approaches extract features from articles and use the feature vectors as input to a machine learning scheme that learns how to classify articles. The features are generally words. It has often been observed that compression seems to provide a very promising alternative approach to categorization. The overall compression of an article with respect to different models can be compared to see which one it fits most closely. Such a scheme has several potential advantages: it yields an overall judgement on the document as a whole, rather than discarding information by pre-selecting features; it avoids the messy and rather artificial problem of defining word boundaries; it deals uniformly with morphological variants of words; depending on the model (and its order), it can take account of phrasal effects that span word boundaries; it offers a uniform way of dealing with different types of documents, for example, arbitrary files in a computer system; it generally minimizes arbitrary decisions that inevitably need to be taken to render any learning scheme practical.
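The comparison described in the abstract (compress an article with respect to each category's model and pick the closest fit) can be illustrated with a minimal sketch. It rests on assumptions that are not in the record: the abstract does not name a compressor, so Python's general-purpose `zlib` stands in for the compression model, and the helper names `compressed_size`, `extra_bytes` and `categorize` are hypothetical. The "fit" of an article to a category is approximated by how many extra bytes are needed to compress the article once the compressor has already seen that category's training text.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (assumption: zlib as the model)."""
    return len(zlib.compress(data, 9))


def extra_bytes(training: bytes, article: bytes) -> int:
    """Approximate cost of encoding the article given the training text.

    Computed as the growth in compressed size when the article is
    appended to the training text.
    """
    return compressed_size(training + article) - compressed_size(training)


def categorize(article: str, training_by_category: Mapping[str, str]) -> str:
    """Return the category whose training text makes the article cheapest to encode."""
    encoded = article.encode("utf-8")
    return min(
        training_by_category,
        key=lambda name: extra_bytes(
            training_by_category[name].encode("utf-8"), encoded
        ),
    )
```

Calling `categorize(article, {"sports": sports_text, "finance": finance_text})` returns the key whose training text leaves the smallest residual cost. Note that `zlib` only looks back over a 32 KB window, so this sketch is only meaningful for small training texts; it illustrates the idea rather than reproducing the paper's experiments.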
Keywords :
data compression; feature extraction; learning (artificial intelligence); natural languages; pattern classification; text analysis; article classification; article compression; compression models; feature vectors; machine learning; morphological variants; natural language texts; predefined categories; supervised learning; text categorization; training data
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 2000. Proceedings. DCC 2000
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-7695-0592-9
Type :
conf
DOI :
10.1109/DCC.2000.838202
Filename :
838202