Clustering sentence level-text using fuzzy hierarchical algorithm

Author

Priya, G. Krishna ; Anupriya, G.

Author_Institution

Dr. Mahalingam Coll. of Eng. & Technol., Pollachi, India

fYear

2013

fDate

23-24 Aug. 2013

Firstpage

1

Lastpage

8

Abstract

Clustering is a popular technique for unsupervised text analysis and is often used to explore the content of large collections of sentences. Sentences are grouped according to their similarity. A sentence may contain several interrelated concepts, yet flat clustering algorithms allow each sentence to belong to only one cluster. Moreover, sentences are related to each other semantically, so word co-occurrence is not a valid measure for sentence-level clustering. A WordNet-based semantic similarity measure combined with a fuzzy sentence clustering algorithm is therefore proposed. The existing system uses the Fuzzy C-Means algorithm, which requires the number of clusters to be specified as an input and, because of its rigorous convergence criteria, has a much higher time complexity. Most natural-language documents are hierarchical in nature, so a fuzzy hierarchical sentence clustering algorithm is used here. Each cluster is labeled according to the hierarchy formed, and only the verbs and nouns in a sentence, rather than all of its terms, are considered in the similarity computation. Agglomerative clustering based on verbs and divisive clustering based on nouns are proposed. The methodology is validated using performance measures such as purity, entropy, and time. Comparing the results across several datasets shows an overall improvement of 36.6% in purity and 31% in entropy. The time complexity of the hierarchical algorithm is also much lower than that of the EM algorithm. Thus, the Fuzzy Hierarchical Sentence Clustering Algorithm forms better-quality clusters in comparatively less time.
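The abstract names purity and entropy as validation measures but does not define them. The sketch below uses the common textbook definitions, which may differ in detail from those used in the paper. The function names, the input layout, and the normalization of entropy by the logarithm of the number of classes are all assumptions made for illustration. Because fuzzy clustering can place a sentence in more than one cluster, the sketch assumes memberships have already been converted to explicit (possibly overlapping) cluster lists.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Fraction of cluster members that carry their cluster's majority label.

    clusters: list of lists of sentence indices (hypothetical layout).
    labels:   list with the reference class label of each sentence.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(
        max(Counter(labels[i] for i in c).values()) for c in clusters if c
    )
    return majority / total


def entropy(clusters, labels):
    """Size-weighted average of per-cluster label entropy.

    Normalized by log(number of classes) so values fall in [0, 1];
    this normalization is an assumption, not taken from the paper.
    """
    total = sum(len(c) for c in clusters)
    n_classes = len(set(labels))
    result = 0.0
    for c in clusters:
        if not c:
            continue
        counts = Counter(labels[i] for i in c)
        h = -sum((n / len(c)) * math.log(n / len(c)) for n in counts.values())
        if n_classes > 1:
            h /= math.log(n_classes)
        result += (len(c) / total) * h
    return result


# Toy data: four sentences, two reference classes.
labels = ["sports", "sports", "politics", "politics"]
good = [[0, 1], [2, 3]]   # each cluster holds a single class
mixed = [[0, 2], [1, 3]]  # each cluster is an even mix of both classes
```

Higher purity and lower entropy indicate clusters that agree more closely with the reference classes: on the toy data, `good` reaches the best possible value of each measure and `mixed` the worst.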

Keywords

computational complexity; fuzzy set theory; natural language processing; pattern clustering; text analysis; NLP documents; WordNet based semantic similarity measure; fuzzy c-means algorithm; fuzzy hierarchical sentence clustering algorithm; sentence level-text clustering; time complexity; unsupervised text analysis; Algorithm design and analysis; Clustering algorithms; Convergence; Natural languages; Semantics; Speech; Agglomerative and Divisive Clustering; Fuzzy C-Means (FCM) Clustering; Natural Language Processing (NLP); WordNet Similarity

fLanguage

English

Publisher

IEEE

Conference_Title

Human Computer Interactions (ICHCI), 2013 International Conference on

Conference_Location

Chennai

Type

conf

DOI

10.1109/ICHCI-IEEE.2013.6887778

Filename

6887778