DocumentCode :
2160249
Title :
A novel approach for feature selection method TF-IDF in document clustering
Author :
Patil, L.H. ; Atique, Mohammad
Author_Institution :
Dept. of Comput. Sci. & Eng., Sant Gadge Baba Amravati Univ., Amravati, India
fYear :
2013
fDate :
22-23 Feb. 2013
Firstpage :
858
Lastpage :
862
Abstract :
Text documents on the internet, in e-mail, and on web pages are growing rapidly in number and are stored in electronic database formats, which makes them difficult to arrange and browse. To overcome this problem, document preprocessing, term selection, attribute reduction, and maintaining the relationships between important terms using background knowledge (WordNet) become important steps in data mining. This paper proceeds in three stages. First, documents are preprocessed: stop words are removed, stemming is performed with the Porter stemmer algorithm, the WordNet thesaurus is applied to maintain the relationships between important terms, and global unique words and frequent word sets are generated. Second, a data matrix is formed. Third, terms are extracted from the documents using the term selection approaches tf-idf, tf-df, and tf2, based on their minimum threshold values. Each document's terms are further preprocessed by counting the frequency of each term within the document for representation. The purpose of this approach is to reduce the number of attributes and to find an effective term selection method using WordNet for better clustering accuracy. Experiments are evaluated on the Reuters Transcription Subsets: wheat, trade, money grain, and ship.
Keywords :
data mining; data structures; pattern clustering; text analysis; Internet; Reuters Transcription Subsets data; TF-IDF feature selection method; TF-DF approach; TF-IDF approach; TF2 approach; Web page; WordNet; attribute reduction; background knowledge; clustering accuracy; data matrix; data mining; document preprocessing; document representation; e-mail; electronic database format; electronic mail; frequent word set; money grain data; porter stemmer algorithm; ship data; stop word; term selection; term selection approach; text document clustering; threshold value; trade data; wheat data; word net thesaurus; Databases; Feature extraction; Frequency conversion; Marine vehicles; Text categorization; Time-frequency analysis; Document Preprocessing; Experimental Results; Introduction; Term Selection approach; WordNet;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advance Computing Conference (IACC), 2013 IEEE 3rd International
Conference_Location :
Ghaziabad
Print_ISBN :
978-1-4673-4527-9
Type :
conf
DOI :
10.1109/IAdCC.2013.6514339
Filename :
6514339