Regional Information Center for Science and Technology - A novel approach for feature selection method TF-IDF in document clustering

DocumentCode :

2160249

Title :

A novel approach for feature selection method TF-IDF in document clustering

Author :

Patil, L.H. ; Atique, Mohammad

Author_Institution :

Dept. of Comput. Sci. & Eng., Sant Gadge Baba Amravati Univ., Amravati, India

fYear :

2013

fDate :

22-23 Feb. 2013

Firstpage :

858

Lastpage :

862

Abstract :

Nowadays, the volume of text documents on the Internet, in e-mail, and on web pages is growing rapidly, and these documents are stored in electronic database formats. Arranging and browsing them becomes difficult. To overcome this problem, document preprocessing, term selection, attribute reduction, and maintaining the relationships between important terms using background knowledge (WordNet) become important steps in data mining. This paper proceeds in stages. First, the documents are preprocessed: stop words are removed, stemming is performed with the Porter stemmer algorithm, the WordNet thesaurus is applied to maintain relationships between important terms, and global unique words and frequent word sets are generated. Second, a data matrix is formed. Third, terms are extracted from the documents using the term selection approaches tf-idf, tf-df, and tf2, based on their minimum threshold values. Further, the terms of every document are preprocessed, and the frequency of each term within the document is counted for representation. The purpose of this approach is to reduce the number of attributes and to find an effective term selection method using WordNet for better clustering accuracy. Experiments are evaluated on the Reuters Transcription Subsets wheat, trade, money grain, and ship.
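The threshold-based term selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define its weighting formulas, so the sketch assumes the common form tf-idf = tf * ln(N / df), and the toy corpus, the stop-word list, the threshold of 1.0, and the function names `tokenize`, `tf_idf`, and `select_terms` are all hypothetical. Stemming and the WordNet step are omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: raw term frequency times ln(N / df).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

def select_terms(weights, threshold):
    """Keep terms whose weight reaches the threshold in at least one document."""
    return sorted({t for w in weights for t, v in w.items() if v >= threshold})

# Hypothetical toy corpus.
docs = [
    "Wheat prices rise in the grain market",
    "The ship carries wheat and grain",
    "Trade talks and money markets",
]
weights = tf_idf(docs)
selected = select_terms(weights, threshold=1.0)
print(selected)
```

Under this assumed weighting, terms that occur in most documents receive low weights and fall below the threshold, which is how the attribute set shrinks before the data matrix is clustered.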

Keywords :

data mining; data structures; pattern clustering; text analysis; Internet; Reuters Transcription Subsets data; TF-IDF feature selection method; TF-DF approach; TF-IDF approach; TF2 approach; Web page; WordNet; attribute reduction; background knowledge; clustering accuracy; data matrix; document preprocessing; document representation; e-mail; electronic database format; electronic mail; frequent word set; money grain data; porter stemmer algorithm; ship data; stop word; term selection; term selection approach; text document clustering; threshold value; trade data; wheat data; word net thesaurus; Databases; Feature extraction; Frequency conversion; Marine vehicles; Text categorization; Time-frequency analysis; Document Preprocessing; Experimental Results; Introduction; Term Selection approach; WordNet

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Advance Computing Conference (IACC), 2013 IEEE 3rd International

Conference_Location :

Ghaziabad

Print_ISBN :

978-1-4673-4527-9

Type :

conf

DOI :

10.1109/IAdCC.2013.6514339

Filename :

6514339

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2160249