Title :
Micro-blog commercial word extraction based on improved TF-IDF algorithm
Author :
Xing Huang ; Qing Wu
Author_Institution :
Sch. of Comput. Sci. & Technol., Hangzhou Dianzi Univ., Hangzhou, China
Abstract :
Some existing micro-blog commercial word extraction algorithms consider only the relationship between a keyword and the number of times it appears in the text, ignoring the keyword's distribution within a category, which reduces the accuracy of micro-blog commercial word extraction. To address this problem, this paper studies the application of the TF-IDF algorithm to term weight calculation. Drawing on information theory and an analysis of how keywords are distributed within a class, the paper proposes an improved TF-IDF algorithm and applies it to term weight calculation. To test the feasibility of the improved algorithm, the massive micro-blog data were first classified into categories, and the improved TF-IDF algorithm was then used to calculate term weights among the categories; this calculation was implemented on the Hadoop distributed framework. The experimental results show that the improved TF-IDF algorithm is effective and feasible for micro-blog commercial word extraction. Compared with traditional algorithms, the improved algorithm greatly improves accuracy. In addition, data processing speed is greatly improved under the Hadoop framework.
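The abstract does not give the improved weighting formula, so the following is only a minimal sketch of the idea it describes: a standard TF-IDF weight, plus a separate factor that measures how a term is distributed within one class. The function names `tf_idf` and `class_evenness`, and the choice of normalized entropy as the within-class factor, are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Standard TF-IDF. `docs` is a list of token lists; returns one dict per doc."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def class_evenness(term, class_docs):
    """Hypothetical within-class factor (an assumption, not the paper's formula).

    Normalized entropy of the term's occurrences over the documents of one
    class: close to 1.0 when the term is spread evenly across the class,
    0.0 when it is concentrated in a single document or absent.
    """
    counts = [doc.count(term) for doc in class_docs]
    total = sum(counts)
    if total == 0 or len(class_docs) < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(class_docs))
```

One way to combine the two, again as an assumption, is to multiply a term's TF-IDF weight by `class_evenness` for the class under consideration, so that terms appearing consistently throughout a category outrank terms that occur many times in a single post.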
Keywords :
Web sites; distributed processing; information retrieval; pattern classification; Hadoop distributed framework; improved TF-IDF algorithm; information classification; information theory; keywords distribution; microblog commercial word extraction; term frequency-inverse document frequency algorithm; term weight calculation; words weight calculation; Accuracy; Blogs; Classification algorithms; Data mining; Entertainment industry; Games; Internet; Commercial Word Extract; Hadoop; Mass Data; Micro-blog; TF-IDF
Conference_Title :
TENCON 2013 - 2013 IEEE Region 10 Conference (31194)
Conference_Location :
Xi'an
Print_ISBN :
978-1-4799-2825-5
DOI :
10.1109/TENCON.2013.6718884