  • DocumentCode
    907078
  • Title
    On using partial supervision for text categorization
  • Author
    Aggarwal, Charu C.; Gates, Stephen C.; Yu, Philip S.
  • Author_Institution
    IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
  • Volume
    16
  • Issue
    2
  • fYear
    2004
  • Firstpage
    245
  • Lastpage
    255
  • Abstract
    We discuss the merits of building text categorization systems by using supervised clustering techniques. Traditional approaches for document classification on a predefined set of classes are often unable to provide sufficient accuracy because of the difficulty of fitting a manually categorized collection of documents in a given classification model. This is especially the case for heterogeneous collections of Web documents which have varying styles, vocabulary, and authorship. Hence, we investigate the use of clustering in order to create the set of categories and its use for classification of documents. Completely unsupervised clustering has the disadvantage that it has difficulty in isolating sufficiently fine-grained classes of documents relating to a coherent subject matter. We use the information from a preexisting taxonomy in order to supervise the creation of a set of related clusters, though with some freedom in defining and creating the classes. We show that the advantage of using partially supervised clustering is that it is possible to have some control over the range of subjects that one would like the categorization system to address, but with a precise mathematical definition of how each category is defined. An extremely effective way then to categorize documents is to use this a priori knowledge of the definition of each category. We also discuss a new technique to help the classifier distinguish better among closely related clusters.
  • Keywords
    pattern classification; pattern clustering; pattern matching; text analysis; Web documents; document classification; partial supervision; supervised clustering techniques; text categorization; Automatic testing; Content based retrieval; Control systems; Filtering; Helium; Performance evaluation; Taxonomy; Text categorization; Vocabulary; Web sites
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2004.1269601
  • Filename
    1269601