DocumentCode :
2552213
Title :
Extraction of relevant components using shallow structure of HTML documents
Author :
Zeng, Jun ; Sakai, Toshihiko ; Flanagan, Brendan ; Hirokawa, Sachio
Author_Institution :
Grad. Sch. of Inf. Sci., Kyushu Univ., Fukuoka, Japan
fYear :
2012
fDate :
29-31 May 2012
Firstpage :
1186
Lastpage :
1190
Abstract :
As the number of web pages increases, searching semi-structured documents is gaining greater attention. The traditional approach to extracting data from web pages is to write specialized programs, called wrappers, that identify data of interest and map them to a suitable format. However, developing wrappers manually has many well-known shortcomings, mainly due to the difficulty of writing and maintaining them for continually changing web data. Moreover, no single wrapper program can handle all kinds of web pages. In this paper, we aim to extract relevant and meaningful snippets from as many web pages as possible, using shallow features of HTML documents to discover and analyze the relevant components. We also introduce a new feature called GAP and verify its effectiveness through an SVM learning experiment.
Keywords :
Internet; document handling; hypermedia markup languages; support vector machines; GAP; HTML documents; SVM learning experiment; Web page documents; relevant components extraction; semistructured documents; wrappers; Diseases; Educational institutions; Feature extraction; HTML; Search engines; Support vector machines; Web pages; Extraction of contents; SVM; relevant component; shallow feature
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on
Conference_Location :
Sichuan
Print_ISBN :
978-1-4673-0025-4
Type :
conf
DOI :
10.1109/FSKD.2012.6234295
Filename :
6234295