DocumentCode
235478
Title
Bayes topic prediction model for focused crawling of vertical search engine
Author
Weihong Zhang ; Yong Chen
Author_Institution
Nat. Eng. Res. Center for S&T Resources Sharing Services, Beihang Univ., Beijing, China
fYear
2014
fDate
20-22 Oct. 2014
Firstpage
294
Lastpage
299
Abstract
Vertical search is an important topic in search engine design because it offers more abundant and more precise results in a specific domain than general-purpose, large-scale search engines such as Google and Baidu. Until now, most vertical search engines have been built from manually selected and edited materials, which is costly in both time and money. In this paper, we propose a new information resource discovery model and build a crawler for a vertical search engine that selectively fetches webpages relevant to a pre-defined topic. The model has three parts. First, webpages are transformed into term vectors. We propose TF-TUF, short for Term Frequency-Topic Unbalanced Factor, as the weighting scheme in the vector space model. This scheme puts more weight on terms whose frequencies differ widely among topics, since we believe such terms contribute more to topic prediction. Second, we use a Bayes method to predict the topics of webpages, with topic-labeled text used for training in advance. The specific method of using Bayes to predict the topic is described in the algorithm section. Third, we create a focused crawler that uses the topic prediction result. The prediction result is used not only to filter out irrelevant webpages but also to direct the crawler toward the areas most likely to be topic-relevant. The three parts work together to discover topic-relevant materials on the web efficiently when building a vertical search engine. Our experiment shows that the average prediction accuracy of the proposed model exceeds 85%. As an application, we also used the proposed model to build "Search Engine for S&T" (http://nstr.com.cn/search), a vertical search engine for the science field.
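The predict-then-prioritize idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the TF-TUF formula, so raw term counts stand in for it here, and the classifier is a plain multinomial Naive Bayes with add-one smoothing. The names `NaiveBayesTopicModel` and `prioritize`, and the `threshold` parameter, are hypothetical and chosen only for illustration.

```python
import heapq
import math
from collections import Counter, defaultdict


class NaiveBayesTopicModel:
    """Multinomial Naive Bayes over raw term counts (assumption: TF-TUF
    weights from the paper are NOT reproduced here)."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # topic -> term -> count
        self.doc_counts = Counter()              # topic -> number of training docs
        self.vocab = set()

    def train(self, labeled_docs):
        # labeled_docs: iterable of (topic, list_of_terms) pairs
        for topic, terms in labeled_docs:
            self.doc_counts[topic] += 1
            self.term_counts[topic].update(terms)
            self.vocab.update(terms)

    def posteriors(self, terms):
        # Returns {topic: P(topic | terms)}, computed in log space for stability.
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for topic, n_docs in self.doc_counts.items():
            counts = self.term_counts[topic]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)           # log prior
            for term in terms:
                if term in self.vocab:  # ignore terms never seen in training
                    score += math.log((counts[term] + 1) / denom)
            log_scores[topic] = score
        peak = max(log_scores.values())
        exp_scores = {t: math.exp(s - peak) for t, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {t: v / norm for t, v in exp_scores.items()}


def prioritize(model, target_topic, candidates, threshold=0.5):
    """Drop pages predicted off-topic and return the remaining URLs,
    most likely on-topic first. `threshold` is an illustrative assumption."""
    heap = []
    for url, terms in candidates:
        p = model.posteriors(terms)[target_topic]
        if p >= threshold:
            heapq.heappush(heap, (-p, url))  # negate for max-first ordering
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

In a crawler, `candidates` would come from pages already fetched (or from link context), and the returned order would drive which URLs are expanded next; how the paper scores unfetched links is not stated in the abstract.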
Keywords
information retrieval; learning (artificial intelligence); search engines; Baidu; Bayes topic prediction model; Google; TF-TUF factor; Web page fetching; focused crawling; information resource discovery model; term frequency-topic unbalanced factor; term vectors; topic labeled text; vector space model; vertical search engine; Accuracy; Crawlers; Information filtering; Predictive models; Search engines; Training; Vectors; Focused Crawler; Naïve Bayes; Term Frequency; Topic Unbalanced Factor; Vertical Search Engine;
fLanguage
English
Publisher
ieee
Conference_Titel
Computing, Communications and IT Applications Conference (ComComAp), 2014 IEEE
Conference_Location
Beijing
Print_ISBN
978-1-4799-4813-0
Type
conf
DOI
10.1109/ComComAp.2014.7017213
Filename
7017213