DocumentCode :
235478
Title :
Bayes topic prediction model for focused crawling of vertical search engine
Author :
Weihong Zhang ; Yong Chen
Author_Institution :
Nat. Eng. Res. Center for S&T Resources Sharing Services, Beihang Univ., Beijing, China
fYear :
2014
fDate :
20-22 Oct. 2014
Firstpage :
294
Lastpage :
299
Abstract :
Vertical search is an important topic in search engine design because it offers more abundant and more precise results in a specific domain than large-scale general search engines such as Google and Baidu. Before this work, most vertical search engines were built from manually selected and edited materials, which cost considerable time and money. In this paper, we propose a new information resource discovery model and build a crawler for a vertical search engine that selectively fetches webpages relevant to a predefined topic. The model has three parts. First, webpages are transformed into term vectors. We propose TF-TUF, short for Term Frequency-Topic Unbalanced Factor, as the weighting scheme in the vector space model. The scheme puts more weight on terms whose frequencies differ widely among topics, which we believe contribute more to topic prediction. Second, we use a Bayes method to predict the topics of webpages, with topic-labeled text used for training in advance. The details of the Bayes prediction method are given in the algorithm section. Third, we build a focused crawler on the topic prediction result. The prediction result is used not only to filter out irrelevant webpages but also to direct the crawler toward the areas most likely to be topic-relevant. The three parts work together to discover topic-relevant materials on the web efficiently when building a vertical search engine. Our experiments show that the average prediction accuracy of the proposed model reaches more than 85%. As an application, we also used the proposed model to build "Search Engine for S&T" (http://nstr.com.cn/search), a vertical search engine for the science field.
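The second and third parts of the abstract (Bayes topic prediction, then filtering and ordering the crawl) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a multinomial Naive Bayes model with add-one smoothing over raw term counts, because the abstract does not give the TF-TUF formula. The function names, the 0.5 relevance threshold, and the example.org URLs are all hypothetical.

```python
import heapq
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Fit a multinomial Naive Bayes model from (topic, tokens) pairs."""
    topic_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for topic, tokens in labeled_docs:
        topic_docs[topic] += 1
        term_counts[topic].update(tokens)
        vocab.update(tokens)
    total_docs = sum(topic_docs.values())
    log_priors = {t: math.log(n / total_docs) for t, n in topic_docs.items()}
    return log_priors, term_counts, vocab


def topic_posteriors(tokens, model):
    """Return P(topic | tokens) for every topic seen in training."""
    log_priors, term_counts, vocab = model
    scores = {}
    for topic, log_prior in log_priors.items():
        total = sum(term_counts[topic].values())
        score = log_prior
        for tok in tokens:
            if tok in vocab:  # terms never seen in training are ignored
                # Add-one (Laplace) smoothing; raw counts stand in for TF-TUF.
                count = term_counts[topic][tok]
                score += math.log((count + 1) / (total + len(vocab)))
        scores[topic] = score
    # Normalize in log space to avoid underflow on long pages.
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    return {t: math.exp(s - top) / norm for t, s in scores.items()}


def prioritize(frontier, model, target_topic, threshold=0.5):
    """Drop pages below the (assumed) threshold; return the rest, best first."""
    heap = []
    for url, tokens in frontier:
        p = topic_posteriors(tokens, model)[target_topic]
        if p >= threshold:
            heapq.heappush(heap, (-p, url))  # negate: heapq is a min-heap
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    training = [
        ("science", ["physics", "quantum", "energy"]),
        ("science", ["biology", "cell", "energy"]),
        ("sports", ["football", "goal", "team"]),
        ("sports", ["tennis", "match", "team"]),
    ]
    frontier = [
        ("http://example.org/a", ["team", "goal"]),
        ("http://example.org/b", ["quantum", "energy"]),
        ("http://example.org/c", ["cell"]),
    ]
    model = train_naive_bayes(training)
    print(prioritize(frontier, model, "science"))
```

In a real focused crawler the token list would come from the anchor text or the parent page of a not-yet-fetched link, and the predicted relevance would decide the fetch order. Here the tokens are supplied directly to keep the sketch self-contained.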
Keywords :
information retrieval; learning (artificial intelligence); search engines; Baidu; Bayes topic prediction model; Google; TF-TUF factor; Web page fetching; focused crawling; information resource discovery model; term frequency-topic unbalanced factor; term vectors; topic labeled text; vector space model; vertical search engine; Accuracy; Crawlers; Information filtering; Predictive models; Search engines; Training; Vectors; Focused Crawler; Naïve Bayes; Term Frequency; Topic Unbalanced Factor; Vertical Search Engine;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computing, Communications and IT Applications Conference (ComComAp), 2014 IEEE
Conference_Location :
Beijing
Print_ISBN :
978-1-4799-4813-0
Type :
conf
DOI :
10.1109/ComComAp.2014.7017213
Filename :
7017213