Title :
Automatic web page segmentation and information extraction using conditional random fields
Author :
Gong, Yunfei ; Liu, Qiang
Author_Institution :
Sch. of Software, Tsinghua Univ., Beijing, China
Abstract :
With the rapid development of the Internet, Web pages have become increasingly complex, and useful information is mixed with a large amount of redundant information. Most current Web information extraction systems rely on manual or semi-manual methods. Improving the efficiency of information extraction therefore requires further research into automatic methods. First, we analyze the basic objects of a Web page according to the Function-based Object Model. We then give an automatic method that segments the Web page into semantic blocks using conditional random fields (CRFs). To further improve the semantic block segmentation, we give an optimization algorithm for the semantic blocks that combines DOM structure and tree edit distance. Finally, we present an automatic Web information extraction tool and use it in experiments to evaluate the efficiency of information extraction. Compared with DOM-based Web information extraction systems, the experimental results show an increase in accuracy and recall rate.
Keywords :
Internet; Web sites; information retrieval; statistical analysis; Web information extraction systems; automatic Web page segmentation; conditional random fields; document object model; functional-based object model; optimization algorithm; semantic block segmentation; semimanual methods; tree edit distance; Data mining; Educational institutions; Web pages; CRFs; DOM; Function-based Object Model; information extraction
Conference_Title :
Computer Supported Cooperative Work in Design (CSCWD), 2012 IEEE 16th International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4673-1211-0
DOI :
10.1109/CSCWD.2012.6221840