Title :
Facilitating wrapper generation with page analysis
Author :
Wu, Bo ; Cheng, Xueqi ; Wang, Yu ; Zhang, Gang ; Ding, Guodong
Author_Institution :
Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing
Abstract :
Current approaches to generating wrappers for web page extraction require a large number of labeled training pages to obtain satisfactory results. Fully automatic methods, on the other hand, extract data of unreliable quality. In this paper, we propose a novel method that facilitates wrapper generation by combining wrapper induction with page analysis. In addition to manually labeled data, we use a set of unlabeled pages to improve the quality of the induced wrappers. Our experiments demonstrate that the system achieves satisfactory results with fewer manually labeled training pages.
Keywords :
Internet; information retrieval; text analysis; labeled training pages; page analysis; web page extraction; wrapper generation; Classification tree analysis; Computers; Data mining; Humans; Induction generators; Intersymbol interference; Labeling; Skeleton; USA Councils; Web pages; information extraction; web mining; wrapper;
Conference_Title :
Intelligence and Security Informatics, 2009. ISI '09. IEEE International Conference on
Conference_Location :
Dallas, TX
Print_ISBN :
978-1-4244-4171-6
Electronic_ISBN :
978-1-4244-4173-0
DOI :
10.1109/ISI.2009.5137299