  • DocumentCode
    3084413
  • Title
    Domain concept handling in automated text categorization
  • Author
    Liu, Ying ; Loh, Han Tong
  • Author_Institution
    Ind. & Syst. Eng., Hong Kong Polytech. Univ., Hong Kong, China
  • fYear
    2010
  • fDate
    15-17 June 2010
  • Firstpage
    1543
  • Lastpage
    1549
  • Abstract
    Single-term-based document representations, e.g. bag-of-words, have been widely accepted in the machine learning, information retrieval and text mining communities. One notable limitation of such methods is that they do not consider the rich information residing in the semantic relations among terms. This paper reports our approach to concept handling in document representation and its effect on the performance of text categorization. We introduce a Frequent word Sequence algorithm that generates concept-centered phrases to render domain knowledge concepts. Our experimental study, based on a domain-centered corpus, shows that a consistent performance improvement can be achieved when concept-centered phrases are included in addition to the single-term-based features in document representations. We also observed that a universally suitable support threshold does not exist and that removing concept-irrelevant sequences may further improve performance at a lower support level.
  • Keywords
    data mining; information retrieval; learning (artificial intelligence); text analysis; automated text categorization; bag-of-words; document representations; domain concept handling; domain knowledge; information retrieval; machine learning; text mining; Data mining; Document handling; Humans; Information retrieval; Machine learning; Mining industry; Support vector machine classification; Support vector machines; Text categorization; Text mining; Domain Concept Representation; Information Management; Text Categorization; Text Mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Industrial Electronics and Applications (ICIEA), 2010 the 5th IEEE Conference on
  • Conference_Location
    Taichung
  • Print_ISBN
    978-1-4244-5045-9
  • Electronic_ISBN
    978-1-4244-5046-6
  • Type
    conf
  • DOI
    10.1109/ICIEA.2010.5514692
  • Filename
    5514692