DocumentCode :
3580549
Title :
Towards Reliable Clustering of English Text Documents Using Correlation Coefficient
Author :
Bhaumik, Hrishikesh ; Chakraborty, Biswanath ; Mukherjee, Anirban ; Bhattacharyya, Siddhartha ; Chattopadhyay, Manojit
Author_Institution :
Dept. of Inf. Technol., RCC Inst. of Inf. Technol., Kolkata, India
fYear :
2014
Firstpage :
530
Lastpage :
535
Abstract :
This paper proposes a new approach to clustering English text documents, based on the pairwise correlation of the documents in a given set. The correlation coefficient for each pair of documents is calculated from ranks assigned to the words in the documents. Each word's rank within a document is derived from its weight, computed with the conventional TF-IDF factor. The proposed method clusters a given set of text documents into classes according to their content, without the number of classes being known a priori. Experimental results show that the proposed correlation-based method of text categorization performs better than some other text categorization methods, including methods that use artificial neural networks.
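The pipeline outlined in the abstract (TF-IDF weights, word ranks, pairwise correlation, grouping) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not say which correlation coefficient is used or how clusters are formed from the pairwise values. The sketch assumes a Spearman-style coefficient (Pearson correlation of tie-averaged ranks over a shared vocabulary) and, as a further assumption, groups documents whose correlation meets a hypothetical `threshold` using connected components. All function names are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one TF-IDF weight list per document over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        length = max(len(toks), 1)  # guard against empty documents
        vectors.append([tf[w] / length * math.log(n / df[w]) for w in vocab])
    return vectors


def average_ranks(values):
    """Rank values (1 = largest weight); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_correlation(u, v):
    """Assumed Spearman-style coefficient: Pearson correlation of the word ranks."""
    return pearson(average_ranks(u), average_ranks(v))


def cluster(docs, threshold=0.5):
    """Group documents linked by correlation >= threshold (an assumed rule)."""
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if rank_correlation(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    sample = ["cat dog cat", "cat dog cat", "stock market bond"]
    print(cluster(sample))
```

Because the groups emerge from the thresholded pairwise correlations, the number of clusters does not have to be fixed in advance, which matches the property claimed in the abstract. The threshold value itself is an arbitrary choice made for this sketch.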
Keywords :
natural language processing; pattern clustering; statistical analysis; text analysis; English text document; TF-IDF factor; correlation coefficient; pairwise correlation; reliable clustering; text categorization; Classification algorithms; Clustering algorithms; Correlation; Correlation coefficient; Equations; Text categorization; Vectors; clustering; correlation coefficient; text classification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence and Communication Networks (CICN), 2014 International Conference on
Print_ISBN :
978-1-4799-6928-9
Type :
conf
DOI :
10.1109/CICN.2014.121
Filename :
7065541