DocumentCode :
3717355
Title :
Optimizing Apache Nutch for Domain-Specific Crawling at Large Scale
Author :
Luis A. Lopez;Ruth Duerr;Siri Jodha Singh Khalsa
Author_Institution :
NSIDC, Boulder, Colorado
fYear :
2015
Firstpage :
1967
Lastpage :
1971
Abstract :
Focused crawls are key to acquiring data at large scale in order to implement systems like domain search engines and knowledge databases. Focused crawls introduce non-trivial problems on top of the already difficult problem of web-scale crawling. To address some of these issues, BCube - a building block of the National Science Foundation's EarthCube program - has developed a tailored version of Apache Nutch for data and web services discovery at scale. We describe how we started with a vanilla version of Apache Nutch and how we optimized and scaled it, reaching gigabytes of discovered links and almost half a billion documents of interest crawled so far.
Keywords :
"Big data","Conferences"
Publisher :
IEEE
Conference_Titel :
2015 IEEE International Conference on Big Data (Big Data)
Type :
conf
DOI :
10.1109/BigData.2015.7363976
Filename :
7363976