Title :
Co-citation & co-reference concepts to control focused crawler exploration
Author :
Maimunah, S. ; Widyantoro, Dwi H. ; Kuspriyanto ; Sastramihardja, Husni S.
Author_Institution :
Inf. Syst. Dept., Surabaya Adhi Tama Inst. of Technol., Surabaya, Indonesia
Abstract :
A focused crawler is an agent that indexes information on a specific topic. To traverse the WWW, a focused crawler predicts the visiting priority of each hyperlink in order to download as many relevant documents as possible while minimizing the number of irrelevant documents downloaded. Many researchers have proposed methods that improve focused crawling precision by reducing the number of irrelevant documents. However, there is a trade-off between precision and recall: higher precision results in lower recall. This research studied the conventional focused crawling search strategy (forward crawling) and the structure of Web documents. The results show that the low recall of conventional focused crawling is caused by certain structural characteristics of the WWW. Therefore, this research proposes a new focused crawling strategy that combines bidirectional (forward and backward) crawling with bibliometric concepts (co-citation & co-reference). Bidirectional crawling improves exploration, while the co-citation & co-reference concepts control the focus. With this new strategy, a focused crawler can obtain relevant documents that are connected through co-citations, or relevant communities that are connected through co-references. Experiments show that a focused crawler using this new strategy, named CT-FC (more Comprehensive Traversal Focused Crawler), has better exploration capability, so that recall increases while precision can remain high.
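The two bibliometric relations named in the abstract can be illustrated with a minimal sketch. It is not the CT-FC implementation, which the abstract does not describe in detail. It assumes the usual bibliometric reading: two pages are co-cited when a common parent links to both (one backward step, then one forward step), and two pages are co-referencing when both link to a common target (one forward step, then one backward step). The function names, the toy link graph, and the in-memory backlink index are all hypothetical; a real crawler would need an external source of backlinks.

```python
from collections import defaultdict


def invert(out_links):
    """Build a backlink index: target page -> set of pages linking to it."""
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)
    return in_links


def co_cited(page, out_links, in_links):
    """Pages that share a parent with `page` (backward step, then forward)."""
    found = set()
    for parent in in_links.get(page, ()):
        found |= set(out_links.get(parent, ()))
    found.discard(page)
    return found


def co_referencing(page, out_links, in_links):
    """Pages that link to a target `page` also links to (forward, then backward)."""
    found = set()
    for target in out_links.get(page, ()):
        found |= in_links.get(target, set())
    found.discard(page)
    return found


# Hypothetical toy Web graph: page -> pages it links to.
out_links = {
    "hub": ["a", "b"],  # "hub" cites both "a" and "b"
    "a": ["ref"],       # "a" and "c" both reference "ref"
    "c": ["ref"],
    "b": [],
}
in_links = invert(out_links)

print(co_cited("a", out_links, in_links))
print(co_referencing("a", out_links, in_links))
```

In this toy graph, "b" is not reachable from "a" by forward links alone, and neither is "c". Both become candidates only once backward steps are allowed, which is the exploration gain the abstract attributes to bidirectional crawling.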
Keywords :
Internet; bibliographic systems; citation analysis; document handling; indexing; query formulation; Web document download; bibliometric concepts; bidirectional crawling; co-citation concept; co-reference concept; focused crawler; hyperlink; information index; search strategy; Bibliometrics; Communities; Context; Crawlers; Search problems; Target tracking; World Wide Web; co-citation & co-reference; focused crawler; forward & backward crawling; recall;
Conference_Title :
Electrical Engineering and Informatics (ICEEI), 2011 International Conference on
Conference_Location :
Bandung
Print_ISBN :
978-1-4577-0753-7
DOI :
10.1109/ICEEI.2011.6021677