DocumentCode :
2256917
Title :
An adaptive bottom up clustering approach for Web news extraction
Author :
Chen, Jinlin ; Shankar, Subash ; Kelly, Angela ; Gningue, Serigne ; Rajaravivarma, Rathika
Author_Institution :
Queens Coll., Comput. Sci. Dept., CUNY, Flushing, NY, USA
fYear :
2009
fDate :
1-2 May 2009
Firstpage :
1
Lastpage :
5
Abstract :
This paper presents an adaptive bottom-up approach to Web news extraction based on human perception. The approach simulates how a human perceives and identifies Web news information, using an adaptive bottom-up clustering strategy to detect possible news areas. It first detects news areas based on the content function, space continuity, and formatting continuity of news information. It then identifies the detailed news content based on the position, format, and semantics of the detected news areas. Experimental results show that the approach performs much better (an F1 value above 99% on average) than previous approaches such as the tree edit distance based and visual wrapper based approaches. Furthermore, the approach does not assume that the tested Web pages are built from Web templates, as the tree edit distance based approach requires, nor does it need training sets, as the visual wrapper based approach requires. The success of the approach demonstrates the strength of the perception-based Web information extraction methodology and makes it a promising option for automatic information extraction from sources whose presentation is designed for humans.
Keywords :
Internet; information retrieval; pattern clustering; Web information extraction methodology; Web news extraction approach; Web template; adaptive bottom-up clustering approach; content function; formatting continuity; human perception; space continuity; tree edit distance-based approach; visual wrapper-based approach; Cadaver; Cities and towns; Computational modeling; Computer science; Data mining; Educational institutions; HTML; Humans; Testing; Web pages; Web news; clustering; component; information extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Wireless and Optical Communications Conference, 2009. WOCC 2009. 18th Annual
Conference_Location :
Newark, NJ
Print_ISBN :
978-1-4244-5217-0
Type :
conf
DOI :
10.1109/WOCC.2009.5312904
Filename :
5312904