DocumentCode :
2447315
Title :
Tibetan Web Information Collection System
Author :
Xu, Guixian ; Zhong, Dunhao ; Gao, Xu ; Lin, Yuan ; Zhao, Xiaobing ; Yang, Guosheng
Author_Institution :
Coll. of Inf. Eng., Minzu Univ. of China, Beijing, China
fYear :
2012
fDate :
1-3 Nov. 2012
Firstpage :
236
Lastpage :
238
Abstract :
Nutch is an open source web-search software project. This paper introduces the Tibetan Web Information Collection System, which is based on Apache Nutch. It points out the original program's shortcomings and proposes an improved method that uses Nutch to process Tibetan web pages and generate the required files. The paper also shows how to update the data regularly and delete duplicate data. The system is useful for the study of Tibetan information processing.
Keywords :
Internet; Web sites; information retrieval; public domain software; Apache Nutch; Tibetan Web information collection system; Tibetan Web pages; Tibetan information processing; data update; duplicate data deletion; file generation; open source Web-search software project; Crawlers; Data mining; Educational institutions; HTML; Information processing; Software; Web pages; Information Collection; Web crawler
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Intelligent Networks and Intelligent Systems (ICINIS), 2012 Fifth International Conference on
Conference_Location :
Tianjin
Print_ISBN :
978-1-4673-3083-1
Type :
conf
DOI :
10.1109/ICINIS.2012.46
Filename :
6376530