Title :
HTML Pattern Generator--Automatic Data Extraction from Web Pages
Author :
Cosulschi, Mirel ; Giurca, Adrian ; Udrescu, Bogdan ; Constantinescu, Nicolae ; Gabroveanu, Mihai
Author_Institution :
Dept. of Comput. Sci., Craiova Univ.
Abstract :
Existing methods of information extraction from HTML documents include manual approaches, supervised learning, and automatic techniques. The manual method achieves high precision and recall, but it is difficult to apply to a large number of pages. Supervised learning requires human interaction to create positive and negative samples. Automatic techniques require less human effort, but the information they retrieve is less reliable.
Keywords :
Web sites; hypermedia markup languages; information retrieval; knowledge acquisition; learning (artificial intelligence); HTML documents; HTML pattern generator; Web pages; automatic data extraction; information extraction; supervised learning; Computer science; Costs; Data mining; Databases; HTML; Humans; Internet; Manuals
Conference_Title :
Eighth International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC '06), 2006
Conference_Location :
Timisoara
Print_ISBN :
0-7695-2740-X
DOI :
10.1109/SYNASC.2006.43