DocumentCode :
1801504
Title :
The study of web information extraction technology based on VietSpider
Author :
Gao Tao ; Wu Hongna
Author_Institution :
Beijing Inst. of Technol., Beijing, China
fYear :
2013
fDate :
26-28 July 2013
Firstpage :
8465
Lastpage :
8470
Abstract :
Network information extraction is currently an active and challenging topic in Web data mining. This paper introduces VietSpider, a new open-source information collection tool, covering its system structure, core technology, and a worked case. The authors also compare it with another approach, the Heritrix+HtmlParser combination, and analyze the advantages and disadvantages of the two methods to help users and researchers choose between them and apply them. Finally, the authors give a solution to the garbled-character ("messy code") problem that arises when acquiring Chinese-language information.
Keywords :
Internet; data acquisition; graphical user interfaces; information retrieval; public domain software; storage management; Chinese information acquisition; Heritrix+HtmlParser combination; VietSpider; Web data excavation area; Web information extraction technology; core technology; garbage problem; graphical interface; network information extraction technology; open source information collection tool; system structure; Crawlers; Data mining; Databases; Encoding; Information filters; Heritrix+HtmlParser; Messy code; VietSpider; Web information extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Control Conference (CCC), 2013 32nd Chinese
Conference_Location :
Xi'an
Type :
conf
Filename :
6640939