• DocumentCode
    3237959
  • Title

    Document clustering and topic discovery based on semantic similarity in scientific literature

  • Author

    Jayabharathy, J. ; Kanmani, S. ; Parveen, A. Ayeshaa

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Pondicherry Eng. Coll., Pondicherry, India
  • fYear
    2011
  • fDate
    27-29 May 2011
  • Firstpage
    425
  • Lastpage
    429
  • Abstract
    Unlabeled document collections are becoming increasingly common, and mining such collections is a major challenge, in particular retrieving relevant documents from a large collection. Clustering text documents groups together the documents that share similar topics, and incorporating semantic features improves the accuracy of document clustering methods. To determine at a glance whether the contents of a cluster are of interest to the user, topic discovery methods are required to tag each cluster with a distinct and representative topic. Most existing topic discovery methods assign labels to clusters based on the terms that the clustered documents contain. This paper proposes a modified semantic-based model in which related terms are extracted as concepts for concept-based document clustering with the bisecting k-means algorithm, together with a topic detection method that uses Testor theory to discover meaningful labels for the document clusters based on semantic similarity. The proposed method is compared with the Topic Detection by Clustering Keywords method using F-measure and purity as evaluation metrics. Experimental results show that the proposed semantic-based model outperforms the existing work.
  • Keywords
    data mining; information retrieval; pattern clustering; text analysis; Testor theory; concept-based document clustering; database mining; distinct topic identification; k-means algorithm; representative topic identification; scientific literature; semantic similarity; text document clustering; topic detection method; topic discovery method; unlabeled document collection; Data mining; Electronic publishing; Information retrieval; Information services; Internet; Concept; Document clustering; Semantic similarity; Testor theory; Topic discovery;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communication Software and Networks (ICCSN), 2011 IEEE 3rd International Conference on
  • Conference_Location
    Xi'an
  • Print_ISBN
    978-1-61284-485-5
  • Type

    conf

  • DOI
    10.1109/ICCSN.2011.6014600
  • Filename
    6014600
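
The abstract names purity and F-measure as the evaluation metrics but this record does not give their formulas. The sketch below uses the definitions commonly applied to clustering (purity as the fraction of documents in the majority class of their cluster; F-measure as the class-size-weighted best F1 over clusters). These definitions, and the function names `purity` and `clustering_f_measure`, are assumptions for illustration and may differ from what the paper uses.

```python
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for true_label, cluster in zip(labels_true, labels_pred):
        per_cluster.setdefault(cluster, Counter())[true_label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(labels_true)


def clustering_f_measure(labels_true, labels_pred):
    """Class-size-weighted average of the best F1 each class reaches in any cluster."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


if __name__ == "__main__":
    # Toy labels, invented for illustration only (not data from the paper).
    truth = ["a", "a", "a", "b", "b", "b"]
    clusters = [0, 0, 1, 1, 1, 1]
    print(purity(truth, clusters))
    print(clustering_f_measure(truth, clusters))
```

Both functions take parallel sequences of gold class labels and assigned cluster identifiers, so they can score the output of any clustering method, including bisecting k-means.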