DocumentCode :
1900202
Title :
Improved Document Clustering using k-means algorithm
Author :
Bide, Pramod ; Shedge, Rajashree
Author_Institution :
Dept. Comput. Eng., Ramrao Adik Inst. of Technol., Navi Mumbai, India
fYear :
2015
fDate :
5-7 March 2015
Firstpage :
1
Lastpage :
5
Abstract :
Searching for similar documents plays a crucial role in document management. Because the number of documents grows tremendously day by day, it is essential to segregate them into proper clusters. Forensic investigation requires fast categorization of documents, but analyzing these documents is very difficult. There is therefore a need to separate multiple collections of documents into groups of similar ones through clustering. Existing partitioning algorithms require the number of clusters to be specified in advance, and their output depends entirely on that input. Over-clustering is the major problem in document clustering. The proposed algorithm takes as input the keywords found after extraction and solves the over-clustering problem by dividing the documents into small groups using a divide-and-conquer strategy. This paper presents an Improved Document Clustering algorithm that generates the number of clusters for any set of text documents and uses cosine similarity measures to place similar documents in proper clusters. Experimental results showed that the proposed algorithm outperforms the existing algorithm in terms of F-measure and time complexity.
Keywords :
digital forensics; divide and conquer methods; pattern clustering; text analysis; cosine similarity measures; divide and conquer strategy; document categorization; document clustering algorithm; document management; forensic investigation; k-means algorithm; partitioning algorithms; similar document searching; text documents; Clustering algorithms; Cosine Similarity; Divide and Conquer; Document Clustering; Tf-Idf; Threshold;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Electrical, Computer and Communication Technologies (ICECCT), 2015 IEEE International Conference on
Conference_Location :
Coimbatore
Print_ISBN :
978-1-4799-6084-2
Type :
conf
DOI :
10.1109/ICECCT.2015.7226065
Filename :
7226065