  • DocumentCode
    2370189
  • Title
    Mining relevant text from unlabelled documents
  • Author
    Barbará, Daniel ; Domeniconi, Carlotta ; Kang, Ning
  • Author_Institution
    Inf. & Software Eng. Dept., George Mason Univ., Fairfax, VA, USA
  • fYear
    2003
  • fDate
    19-22 Nov. 2003
  • Firstpage
    489
  • Lastpage
    492
  • Abstract
    Automatic classification of documents is an important area of research with many applications in document searching, forensics, and other fields. Text classification methods rely on a sample of documents whose class labels are known. In many situations, however, obtaining such a sample is difficult or even impossible. We focus on classifying unlabelled documents into two classes, relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, the answers returned by different search engines) and using association rule mining to find sets of words common to the buckets, we can efficiently obtain a sample of documents that contains a large percentage of relevant ones. This sample can then be used to train models that classify the entire set of documents. We show experimentally that our method can filter relevant documents even in adverse conditions, where the percentage of irrelevant documents in the buckets is relatively high.
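The bucket idea in the abstract can be illustrated with a small sketch. This record does not describe the authors' actual procedure, so everything below is an assumption made for illustration: the function names (`frequent_pairs`, `common_pairs`, `select_sample`), the restriction to word pairs rather than general itemsets, the whitespace tokenizer, and the `min_support` threshold are all hypothetical. The sketch finds word pairs that are frequent within every bucket and keeps the documents that contain at least one such pair.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(bucket, min_support):
    """Return word pairs found in at least min_support (a fraction) of the bucket's documents."""
    counts = Counter()
    for doc in bucket:
        # Each document contributes each of its word pairs at most once.
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    threshold = min_support * len(bucket)
    return {pair for pair, n in counts.items() if n >= threshold}


def common_pairs(buckets, min_support=0.5):
    """Return word pairs that are frequent in every bucket (illustrative simplification)."""
    per_bucket = [frequent_pairs(b, min_support) for b in buckets]
    return set.intersection(*per_bucket) if per_bucket else set()


def select_sample(buckets, min_support=0.5):
    """Keep documents containing at least one word pair shared by all buckets."""
    pairs = common_pairs(buckets, min_support)
    sample = []
    for bucket in buckets:
        for doc in bucket:
            words = set(doc.lower().split())
            if any(a in words and b in words for a, b in pairs):
                sample.append(doc)
    return sample


# Toy buckets standing in for result lists from two different search engines.
buckets = [
    ["association rule mining", "rule mining tutorial", "cooking pasta recipes"],
    ["mining association rule sets", "rule mining survey", "football scores today"],
]
```

In this toy input, calling `select_sample(buckets)` keeps the four documents that mention both "mining" and "rule" and drops the two off-topic ones. A sample selected this way could then serve as training data for a classifier, which is the role the abstract assigns to it.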
  • Keywords
    classification; data mining; document handling; information filters; information retrieval; sampling methods; search engines; association rule mining; class labels; document filtering; document searching; forensics; relevant text; unlabelled document classification; Application software; Association rules; Content based retrieval; Image retrieval; Software engineering; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
  • Print_ISBN
    0-7695-1978-4
  • Type
    conf
  • DOI
    10.1109/ICDM.2003.1250959
  • Filename
    1250959