DocumentCode
1930670
Title
Study of Web Page Information topic extraction technology based on vision
Author
Li, Qingshui ; Wu, Kai
Author_Institution
Comput. Sci. & Technol. Coll., Zhejiang Univ. of Technol., Hangzhou, China
Volume
9
fYear
2010
fDate
9-11 July 2010
Firstpage
781
Lastpage
784
Abstract
Visual information from Web pages can be applied to information extraction, which avoids the need for sophisticated natural language processing. This paper studies information extraction for Web pages by combining natural language processing with the visual characteristics (the "vision character") of HTML pages. We propose a Web page information extraction algorithm based on these visual characteristics. Using the vision character rules of Web pages, the algorithm addresses two problems: coarse-grained Web page segmentation and the restructuring of the smallest Web page segments. It then analyzes the visual characteristics of each page block to accurately determine the topic data region. Applying this extraction technique reduces the number of information blocks in the Web page content, which lowers the cost of index generation and increases the hit rate of the search engine.
Keywords
Web sites; computer vision; data mining; image segmentation; natural language processing; text analysis; HTML page; Web page information topic extraction technology; coarse-grained Web page segmentation; index generation; natural language processing; page block; search engine; topic data region; vision character rule; vision information; Bars; Data mining; Data Region; Information Extraction; Topic Extraction; Vision Character;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on
Conference_Location
Chengdu
Print_ISBN
978-1-4244-5537-9
Type
conf
DOI
10.1109/ICCSIT.2010.5563688
Filename
5563688