• DocumentCode
    583030
  • Title

    An Exploratory Study of Enhancing Text Clustering with Auto-Generated Semantic Tags

  • Author

    Tang, Xuning ; Dang, Jiangbo

  • Author_Institution
    Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA, USA
  • fYear
    2012
  • fDate
    22-24 Oct. 2012
  • Firstpage
    104
  • Lastpage
    111
  • Abstract
    With the exponentially growing volume of digital documents and Internet content, it has become very challenging to locate the right information when it is needed. We rely heavily on search engines, but most existing search tools are keyword-based and often return results with low precision and recall. Emerging semantic tagging technology provides an automatic way to generate semantic tags from text, and it has drawn increasing interest from the text mining research community. It is critical to study how semantic tags can be used to improve text mining tasks, including clustering, which enhances users' experience of searching and browsing documents. Unfortunately, most previous work on text clustering is based solely on content information. A few recent studies take user-generated tags into account; however, user-generated tags are often noisy, inconsistent, and redundant, and they lack semantic information and hierarchical structure. In this work, we propose a Semantic Text Mining (STeM) framework that generates semantic tags for given documents and then uses those tags to improve text clustering. Unlike previous work, we represent a document by a combination of domains and high-quality noun phrases. We investigate the performance of our methods on two different datasets and evaluate the results by normalized mutual information. Experimental results demonstrate that our proposed method substantially outperforms traditional clustering based on Term Frequency-Inverse Document Frequency (TF-IDF) term vectors. We find that incorporating semantic information into document representation is critical to improving the performance of text clustering.
  • Keywords
    data mining; information retrieval; pattern clustering; semantic Web; text analysis; Internet content; STeM; TF-IDF term vector based clustering; autogenerated semantic tag; content information; digital documents; document browsing; document representation; document searching; hierarchical structure; high quality noun phrase; information location; keyword based search tool; normalized mutual information; search engine; semantic information; semantic tagging technology; semantic text mining; term frequency-inverse document frequency term vector; text clustering enhancement; user-generated tags; Clustering algorithms; Frequency domain analysis; Knowledge based systems; Ontologies; Semantics; Tagging; Vectors; STeM; clustering; document; semantic tags
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantics, Knowledge and Grids (SKG), 2012 Eighth International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4673-2561-5
  • Type

    conf

  • DOI
    10.1109/SKG.2012.17
  • Filename
    6391817