Title :
The study of web information extraction technology based on VietSpider
Author :
Gao Tao ; Wu Hongna
Author_Institution :
Beijing Inst. of Technol., Beijing, China
Abstract :
Web information extraction is currently an active and challenging topic in Web data mining. This paper introduces VietSpider, a new open-source information collection tool, covering its system structure, core technology, and an application case. The authors also compare it with another approach, the Heritrix+HtmlParser combination, and analyze the advantages and disadvantages of the two, to help users and researchers choose between them and apply them. Finally, the authors give a solution to the garbled-character ("messy code") problem that arises in Chinese information acquisition.
Keywords :
Internet; data acquisition; graphical user interfaces; information retrieval; public domain software; storage management; Chinese information acquisition; Heritrix+HtmlParser combination; VietSpider; Web data excavation area; Web information extraction technology; core technology; garbage problem; graphical interface; network information extraction technology; open source information collection tool; system structure; Crawlers; Data mining; Databases; Encoding; Information filters; Heritrix+HtmlParser; Messy code; VietSpider; Web information extraction;
Conference_Title :
2013 32nd Chinese Control Conference (CCC)
Conference_Location :
Xi'an