Text categorization using compression models

Author

Frank, Eibe ; Chui, Chang ; Witten, Ian H.

Author_Institution

Dept. of Comput. Sci., Waikato Univ., Hamilton, New Zealand

fYear

2000

fDate

2000

Firstpage

555

Abstract

Summary form only given. Text categorization is the assignment of natural language texts to predefined categories based on their content. The use of predefined categories implies a “supervised learning” approach to categorization, where already-classified articles, which effectively define the categories, are used as “training data” to build a model that can then classify new articles, which comprise the “test data”. Typical approaches extract features from articles and use the feature vectors as input to a machine learning scheme that learns how to classify articles. The features are generally words. It has often been observed that compression seems to provide a very promising alternative approach to categorization. The overall compression of an article with respect to different models can be compared to see which one it fits most closely. Such a scheme has several potential advantages: it yields an overall judgement on the document as a whole, rather than discarding information by pre-selecting features; it avoids the messy and rather artificial problem of defining word boundaries; it deals uniformly with morphological variants of words; depending on the model (and its order), it can take account of phrasal effects that span word boundaries; it offers a uniform way of dealing with different types of documents (for example, arbitrary files in a computer system); and it generally minimizes the arbitrary decisions that inevitably need to be taken to render any learning scheme practical.
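
The idea of comparing how well an article compresses under different category models can be illustrated with a minimal sketch. The abstract does not name a specific compressor, so this example makes an assumption: it uses Python's general-purpose zlib as a stand-in for a per-category model, and measures how many extra bytes an article costs when appended to each category's training text. The function names (compressed_size, extra_bytes, classify) and the toy training texts are hypothetical and are not taken from the paper.

```python
import zlib
from typing import Dict


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed form of data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def extra_bytes(training: bytes, document: bytes) -> int:
    """Approximate cost of encoding document given a category's training text.

    Assumption: zlib primed with the training text stands in for a
    category-specific compression model.
    """
    return compressed_size(training + b" " + document) - compressed_size(training)


def classify(document: str, training_by_category: Dict[str, str]) -> str:
    """Return the category whose training text lets the document compress best."""
    doc = document.encode("utf-8")
    costs = {
        category: extra_bytes(text.encode("utf-8"), doc)
        for category, text in training_by_category.items()
    }
    return min(costs, key=costs.get)


# Toy training data, invented purely for illustration.
TRAINING = {
    "sports": "the team won the match after scoring two late goals "
              "in the second half of the game",
    "finance": "the bank raised interest rates as inflation and bond "
               "yields climbed over the quarter",
}

if __name__ == "__main__":
    print(classify("scoring two late goals in the second half", TRAINING))
```

Note that the sketch works on raw bytes, so no word boundaries or feature selection are involved, which mirrors the advantages listed in the abstract. A general-purpose compressor with a limited window is only a rough proxy; an adaptive character-level model trained per category would be closer in spirit to the approach described.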

Keywords

data compression; feature extraction; learning (artificial intelligence); natural languages; pattern classification; text analysis; article classification; article compression; compression models; feature vectors; machine learning; morphological variants; natural language texts; predefined categories; supervised learning; text categorization; training data

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Compression Conference, 2000. Proceedings. DCC 2000

Conference_Location

Snowbird, UT

ISSN

1068-0314

Print_ISBN

0-7695-0592-9

Type

conf

DOI

10.1109/DCC.2000.838202

Filename

838202