An Exploratory Study of Enhancing Text Clustering with Auto-Generated Semantic Tags

Author

Tang, Xuning ; Dang, Jiangbo

Author_Institution

Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA, USA

fYear

2012

fDate

22-24 Oct. 2012

Firstpage

104

Lastpage

111

Abstract

With the exponentially growing volume of digital documents and Internet content, it has become very challenging to locate the right information when it is needed. We rely heavily on search engines, but most existing search tools are keyword-based and often return results with low precision and recall. Emerging semantic tagging technology provides an automatic way to generate semantic tags from text and has drawn increasing interest from the text mining research community. It is therefore important to study how semantic tags can improve text mining tasks such as clustering, which helps users search and browse documents. Unfortunately, most previous work on text clustering is based solely on content information. A few recent studies take user-generated tags into account; however, user-generated tags are often noisy, inconsistent, and redundant, and they lack semantic information and hierarchical structure. In this work, we propose a Semantic Text Mining (STeM) framework that generates semantic tags for given documents and then uses those tags to improve text clustering. Unlike previous work, we represent a document by a combination of domains and high-quality noun phrases. We investigate the performance of our methods on two different datasets and evaluate the results by normalized mutual information (NMI). Experimental results demonstrate that the proposed method substantially outperforms traditional Term Frequency-Inverse Document Frequency (TF-IDF) term-vector-based clustering. We find that incorporating semantic information into document representation is critical to improving the performance of text clustering.
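
The abstract names normalized mutual information as the evaluation measure but does not say how it is normalized. As a minimal sketch of how such a score can be computed for a clustering result, the following assumes normalization by the arithmetic mean of the two label entropies; that choice, and the function names `entropy`, `mutual_information`, and `nmi`, are illustrative assumptions and not taken from the paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * log(n_ij * n / (count_a[x] * count_b[y]))
        for (x, y), n_ij in joint.items()
    )


def nmi(truth, predicted):
    """NMI normalized by the mean of the two entropies (an assumed variant)."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("label lists must be non-empty and of equal length")
    h_truth, h_pred = entropy(truth), entropy(predicted)
    if h_truth == 0 and h_pred == 0:
        # Both assignments put everything in one group: treat as identical.
        return 1.0
    return mutual_information(truth, predicted) / ((h_truth + h_pred) / 2)


if __name__ == "__main__":
    # Cluster ids are arbitrary, so a relabeled perfect clustering still scores 1.
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
    # A clustering that is independent of the true classes scores 0.
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the score depends only on how documents are grouped and not on the cluster ids themselves, it can compare clusterings produced from different document representations against the same reference labels.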

Keywords

data mining; information retrieval; pattern clustering; semantic Web; text analysis; Internet content; STeM; TF-IDF term vector based clustering; autogenerated semantic tag; content information; digital documents; document browsing; document representation; document searching; hierarchical structure; high quality noun phrase; information location; keyword based search tool; normalized mutual information; search engine; semantic information; semantic tagging technology; semantic text mining; term frequency-inverse document frequency term vector; text clustering enhancement; user-generated tags; Clustering algorithms; Frequency domain analysis; Knowledge based systems; Ontologies; Semantics; Tagging; Vectors; clustering; document; semantic tags

fLanguage

English

Publisher

IEEE

Conference_Titel

Semantics, Knowledge and Grids (SKG), 2012 Eighth International Conference on

Conference_Location

Beijing

Print_ISBN

978-1-4673-2561-5

Type

conf

DOI

10.1109/SKG.2012.17

Filename

6391817