Title :
Text Extraction from the Web via Text-to-Tag Ratio
Author :
Weninger, Tim ; Hsu, William H.
Author_Institution :
Comput. & Inf. Sci., Kansas State Univ., Manhattan, KS
Abstract :
We describe a method to extract content text from diverse Web pages by using the HTML document's text-to-tag ratio rather than specific HTML cues that may not be constant across various Web pages. We describe how to compute the text-to-tag ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we then show surprisingly high levels of recall for all levels of precision, and large space savings.
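The line-by-line computation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names `text_to_tag_ratio` and `split_content` are hypothetical, the ratio is taken as non-whitespace text characters divided by the number of tags on the line (falling back to the text length when a line has no tags), and the clustering step is replaced by a plain threshold at the mean ratio. Script and style blocks are not treated specially.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <p>, </a>, <img src="x">.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Ratio of non-whitespace text characters to tags on one line.

    Assumption: a line without tags gets its text length as the ratio,
    so the division is always defined.
    """
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line)
    text_len = len("".join(text.split()))  # drop all whitespace before counting
    return text_len / len(tags) if tags else float(text_len)


def split_content(html: str) -> list[str]:
    """Return the text of lines whose ratio exceeds the mean ratio.

    Assumption: a mean threshold stands in for the clustering step
    mentioned in the abstract; it is only a placeholder.
    """
    lines = [ln for ln in html.splitlines() if ln.strip()]
    ratios = [text_to_tag_ratio(ln) for ln in lines]
    if not ratios:
        return []
    threshold = sum(ratios) / len(ratios)
    return [
        TAG_RE.sub("", ln).strip()
        for ln, ratio in zip(lines, ratios)
        if ratio > threshold
    ]
```

On a page where navigation lines are dense with tags and article lines are mostly text, the article lines score far above the mean and are the ones kept. A real extractor would likely need smoothing across neighbouring lines, since short content lines can fall below any fixed threshold.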
Keywords :
Internet; hypermedia markup languages; information retrieval; text analysis; HTML document; diverse Web pages; text extraction; text-to-tag ratio; Art; Cascading style sheets; Data mining; Databases; Expert systems; HTML; Histograms; Testing; Web pages; Histogram; Information Extraction; Web
Conference_Title :
Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on
Conference_Location :
Turin
Print_ISBN :
978-0-7695-3299-8
DOI :
10.1109/DEXA.2008.12