DocumentCode
2160249
Title
A novel approach for feature selection method TF-IDF in document clustering
Author
Patil, L.H. ; Atique, Mohammad
Author_Institution
Dept. of Comput. Sci. & Eng., Sant Gadge Baba Amravati Univ., Amravati, India
fYear
2013
fDate
22-23 Feb. 2013
Firstpage
858
Lastpage
862
Abstract
Text documents on the internet, in e-mail, and on web pages are growing rapidly in number and are stored in electronic database formats, which makes them difficult to arrange and browse. To overcome this problem, document preprocessing, term selection, attribute reduction, and maintaining the relationships between important terms using background knowledge (WordNet) become important tasks in data mining. This paper proceeds in stages. First, documents are preprocessed: stop words are removed, stemming is performed with the Porter stemmer algorithm, the WordNet thesaurus is applied to maintain relationships between important terms, and global unique words and frequent word sets are generated. Second, a data matrix is formed. Third, terms are extracted from the documents using the term selection approaches tf-idf, tf-df, and tf2, based on their minimum threshold values. Further, the terms of every document are preprocessed, and the frequency of each term within the document is counted for representation. The purpose of this approach is to reduce the number of attributes and to find the most effective term selection method using WordNet for better clustering accuracy. Experiments are evaluated on the Reuters Transcription Subsets wheat, trade, money grain, and ship.
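The threshold-based term selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the sample documents, the function names, and the threshold value are all hypothetical, and the score uses the textbook form TF-IDF = tf * log(N / df), which the abstract does not spell out. Porter stemming and the WordNet step are omitted, and the tf-df and tf2 variants are not shown because the abstract does not define them.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def select_terms_tfidf(docs, min_score):
    """Return the terms whose TF-IDF score reaches min_score in some document.

    TF is the raw count of a term in a document; IDF is log(N / df).
    This is the textbook form, assumed here rather than taken from the paper.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    selected = set()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            if count * math.log(n_docs / df[term]) >= min_score:
                selected.add(term)
    return selected


# Hypothetical sample documents.
docs = [
    "wheat prices rise in the grain market",
    "ship carries wheat to the port",
    "trade talks and money supply",
]
selected = select_terms_tfidf(docs, min_score=1.0)
print(sorted(selected))
```

In this toy corpus a term that occurs in two of the three documents scores lower than a term confined to one document, so raising the threshold drops the more widely shared terms first.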
Keywords
data mining; data structures; pattern clustering; text analysis; Internet; Reuters Transcription Subsets data; TF-IDF feature selection method; TF-DF approach; TF-IDF approach; TF2 approach; Web page; WordNet; attribute reduction; background knowledge; clustering accuracy; data matrix; data mining; document preprocessing; document representation; e-mail; electronic database format; electronic mail; frequent word set; money grain data; porter stemmer algorithm; ship data; stop word; term selection; term selection approach; text document clustering; threshold value; trade data; wheat data; word net thesaurus; Databases; Feature extraction; Frequency conversion; Marine vehicles; Text categorization; Time-frequency analysis; Document Preprocessing; Experimental Results; Introduction; Term Selection approach; WordNet;
fLanguage
English
Publisher
ieee
Conference_Titel
Advance Computing Conference (IACC), 2013 IEEE 3rd International
Conference_Location
Ghaziabad
Print_ISBN
978-1-4673-4527-9
Type
conf
DOI
10.1109/IAdCC.2013.6514339
Filename
6514339