DocumentCode :
3507995
Title :
The RDF-based Information Capturing System from Web Pages
Author :
Ushioda, Tatsuya ; Fujita, Shigeru
Author_Institution :
Grad. Sch. of Inf. & Comput. Sci., Chiba Inst. of Technol., Narashino, Japan
fYear :
2010
fDate :
4-6 Nov. 2010
Firstpage :
201
Lastpage :
206
Abstract :
The aim of this work is to acquire event information from municipality websites and convert the extracted information into the XML form of the RDF model. Web-wrapper methods use HTML tags as the basis for information extraction from Web pages, but their extraction performance depends on the structure of those tags. In this paper, we propose a dictionary-based method for extracting information from an HTML document. The HTML tags are deleted from the document to convert it into plain text, and target character strings are then extracted by comparing the text with a collection of words prepared beforehand. Finally, the extracted information is converted into the XML form of the RDF model. The evaluation experiment was carried out on municipality websites in the 23 districts of Tokyo and 56 in Chiba Prefecture, Japan. Overall, the proposed method extracted event information at a rate of 73%, compared with 52% for the LR-Wrapper, 55% for the Tree-Wrapper, and 32% for the PLR-Wrapper. These results confirm that the proposed method, which combines a simple algorithm with a collection of words, extracts event information at a higher rate than the existing methods.
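The three steps named in the abstract (tag removal, dictionary comparison, RDF/XML output) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `ex:` vocabulary at `http://example.org/event#`, the `#eventN` identifiers, and the plain substring match against the word collection are all assumptions made for illustration. The paper's keywords mention morphological analysis, which this sketch does not attempt.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/event#"  # hypothetical vocabulary, not from the paper


class TextExtractor(HTMLParser):
    """Collects non-empty text nodes and discards every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_lines(html):
    """Step 1: delete the HTML tags and keep the text fragments."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def extract_events(lines, dictionary):
    """Step 2: keep fragments that contain at least one dictionary word.

    Substring matching is an assumption; the paper may match differently.
    """
    return [line for line in lines if any(word in line for word in dictionary)]


def to_rdf_xml(events, source_url):
    """Step 3: serialize the extracted fragments as RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("ex", EX_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for i, text in enumerate(events, start=1):
        desc = ET.SubElement(
            root,
            f"{{{RDF_NS}}}Description",
            {f"{{{RDF_NS}}}about": f"{source_url}#event{i}"},
        )
        ET.SubElement(desc, f"{{{EX_NS}}}title").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = (
        "<html><body><h1>City News</h1><ul>"
        "<li>Summer Festival on 5 Aug</li>"
        "<li>Office hours changed</li>"
        "</ul></body></html>"
    )
    words = {"Festival", "Concert"}  # illustrative word collection
    found = extract_events(html_to_lines(page), words)
    print(to_rdf_xml(found, "http://example.org/city"))
```

Because the match runs on text with the tags already removed, the result does not depend on how the page nests its markup, which is the property the abstract contrasts with wrapper-based extraction.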
Keywords :
Internet; information retrieval systems; HTML document; LR-wrapper; PLR-wrapper; RDF-based information capturing system; Web pages; Web-wrapper method; information extraction; resource description framework; tree-wrapper; Morphological Analysis; Resource Description Framework; Text Mining; Web wrapper;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2010 International Conference on
Conference_Location :
Fukuoka
Print_ISBN :
978-1-4244-8538-3
Electronic_ISBN :
978-0-7695-4237-9
Type :
conf
DOI :
10.1109/3PGCIC.2010.34
Filename :
5662790