A novel approach for feature selection method TF-IDF in document clustering

Author

Patil, L.H. ; Atique, Mohammad

Author_Institution

Dept. of Comput. Sci. & Eng., Sant Gadge Baba Amravati Univ., Amravati, India

fYear

2013

fDate

22-23 Feb. 2013

Firstpage

858

Lastpage

862

Abstract

Text documents on the internet, in e-mail, and on web pages are growing rapidly in number, and they are stored in electronic database formats. This makes the documents difficult to arrange and browse. To overcome this problem, document preprocessing, term selection, attribute reduction, and maintaining the relationships between important terms using background knowledge (WordNet) become important steps in data mining. In this paper the work is organized in stages. First, the documents are preprocessed: stop words are removed, stemming is performed with the Porter stemmer algorithm, the WordNet thesaurus is applied to maintain the relationships between important terms, and global unique words and frequent word sets are generated. Second, a data matrix is formed. Third, terms are extracted from the documents using the term selection approaches tf-idf, tf-df, and tf2, based on their minimum threshold values. Each document's terms are further preprocessed, and the frequency of each term within the document is counted for representation. The purpose of this approach is to reduce the number of attributes and to find the most effective term selection method using WordNet for better clustering accuracy. Experiments are evaluated on the Reuters Transcription Subsets wheat, trade, money grain, and ship.
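
The stages named in the abstract (stop-word removal, a document-term data matrix, and threshold-based term selection) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the common weighting tf-idf(t, d) = tf(t, d) * log(N / df(t)), which the abstract does not spell out, and it omits Porter stemming and the WordNet step to stay dependency-free. The stop-word list, the function names, the sample documents, and the threshold of 1.0 are all hypothetical. The tf-df and tf2 variants are not shown because the abstract does not define them.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def preprocess(text):
    """Lowercase, strip surrounding punctuation, and drop stop words.
    Stemming and WordNet lookups are intentionally left out of this sketch."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return one {term: weight} row per document.
    Assumed weighting: raw term frequency times log(N / document frequency)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        rows.append({term: count * math.log(n_docs / df[term])
                     for term, count in tf.items()})
    return rows


def select_terms(rows, threshold):
    """Keep every term whose weight reaches the threshold in at least one document."""
    selected = set()
    for row in rows:
        for term, weight in row.items():
            if weight >= threshold:
                selected.add(term)
    return sorted(selected)


# Made-up sample documents, loosely themed on the subset names in the abstract.
docs = [
    "Wheat exports rose and the wheat price fell.",
    "Trade talks on wheat and grain continue.",
    "The ship carried grain to the port.",
]
rows = tfidf_matrix(docs)
terms = select_terms(rows, threshold=1.0)
```

In this toy corpus, terms that occur in two of the three documents, such as "wheat" and "grain", fall below the threshold and are dropped, while terms confined to a single document are kept. That is the attribute-reduction effect the abstract describes.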

Keywords

data mining; data structures; pattern clustering; text analysis; Internet; Reuters Transcription Subsets data; TF-IDF feature selection method; TF-DF approach; TF-IDF approach; TF2 approach; Web page; WordNet; attribute reduction; background knowledge; clustering accuracy; data matrix; document preprocessing; document representation; e-mail; electronic database format; electronic mail; frequent word set; money grain data; porter stemmer algorithm; ship data; stop word; term selection; term selection approach; text document clustering; threshold value; trade data; wheat data; word net thesaurus; Databases; Feature extraction; Frequency conversion; Marine vehicles; Text categorization; Time-frequency analysis

fLanguage

English

Publisher

IEEE

Conference_Titel

Advance Computing Conference (IACC), 2013 IEEE 3rd International

Conference_Location

Ghaziabad

Print_ISBN

978-1-4673-4527-9

Type

conf

DOI

10.1109/IAdCC.2013.6514339

Filename

6514339