DocumentCode :
2374960
Title :
Automatic web page segmentation and information extraction using conditional random fields
Author :
Gong, Yunfei ; Liu, Qiang
Author_Institution :
Sch. of Software, Tsinghua Univ., Beijing, China
fYear :
2012
fDate :
23-25 May 2012
Firstpage :
334
Lastpage :
340
Abstract :
With the rapid development of the Internet, Web pages have become increasingly complex, and useful information is mixed with a large amount of redundant information. Most current Web information extraction systems rely on manual or semi-manual methods. To improve the efficiency of information extraction, automatic methods of Web information extraction need further research. First, we analyze the basic objects of a Web page according to the Function-based Object Model. We then give an automatic method that segments the Web page into semantic blocks using conditional random fields (CRFs). To further improve semantic block segmentation, we give an optimization algorithm for the semantic blocks that combines DOM structure and tree edit distance. Finally, we present an automatic Web information extraction tool and use it in experiments to evaluate the efficiency of information extraction. The experimental results show increases in accuracy and recall compared with DOM-based Web information extraction systems.
Keywords :
Internet; Web sites; information retrieval; statistical analysis; Web information extraction systems; automatic Web page segmentation; conditional random fields; document object model; functional-based object model; optimization algorithm; semantic block segmentation; semimanual methods; tree edit distance; Data mining; Educational institutions; Web pages; CRFs; DOM; Function-based Object Model; information extraction
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Supported Cooperative Work in Design (CSCWD), 2012 IEEE 16th International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4673-1211-0
Type :
conf
DOI :
10.1109/CSCWD.2012.6221840
Filename :
6221840