DocumentCode :
3717355
Title :
Optimizing Apache Nutch for Domain-Specific Crawling at Large Scale
Author :
Luis A. Lopez;Ruth Duerr;Siri Jodha Singh Khalsa
Author_Institution :
NSIDC, Boulder, Colorado
fYear :
2015
Firstpage :
1967
Lastpage :
1971
Abstract :
Focused crawls are key to acquiring data at large scale in order to implement systems like domain search engines and knowledge databases. Focused crawls introduce non-trivial problems on top of the already difficult problem of web-scale crawling. To address some of these issues, BCube - a building block of the National Science Foundation's EarthCube program - has developed a tailored version of Apache Nutch for data and web services discovery at scale. We describe how we started with a vanilla version of Apache Nutch and how we optimized and scaled it, reaching gigabytes of discovered links and almost half a billion documents of interest crawled so far.
Keywords :
"Big data","Conferences"
Publisher :
IEEE
Conference_Titel :
2015 IEEE International Conference on Big Data (Big Data)
Type :
conf
DOI :
10.1109/BigData.2015.7363976
Filename :
7363976