Regional Information Center for Science and Technology - Text Extraction from the Web via Text-to-Tag Ratio

DocumentCode :

2830395

Title :

Text Extraction from the Web via Text-to-Tag Ratio

Author :

Weninger, Tim ; Hsu, William H.

Author_Institution :

Comput. & Inf. Sci., Kansas State Univ., Manhattan, KS

fYear :

2008

fDate :

1-5 Sept. 2008

Firstpage :

Lastpage :

Abstract :

We describe a method to extract content text from diverse Web pages by using the HTML document's text-to-tag ratio rather than specific HTML cues that may not be constant across various Web pages. We describe how to compute the text-to-tag ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we then show surprisingly high levels of recall for all levels of precision, and a large space savings.
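
The line-by-line idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the regex-based tag counting, the function names `text_to_tag_ratio` and `extract_content`, and the use of a simple mean threshold in place of the clustering step are all assumptions made for illustration only.

```python
import re
from statistics import mean

# Assumption: a tag is anything between '<' and '>' on a single line.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Count of non-tag characters divided by the number of tags on the line.

    If the line has no tags, the text length itself is returned
    (an assumed convention to avoid division by zero).
    """
    tags = TAG_RE.findall(line)
    text_len = len(TAG_RE.sub("", line).strip())
    return text_len / len(tags) if tags else float(text_len)


def extract_content(html: str) -> list[str]:
    """Keep lines whose ratio exceeds the mean ratio of the page.

    The mean threshold is a stand-in for the clustering step the
    abstract mentions; it is an assumption, not the paper's method.
    """
    lines = html.splitlines()
    if not lines:
        return []
    ratios = [text_to_tag_ratio(line) for line in lines]
    threshold = mean(ratios)
    return [
        TAG_RE.sub("", line).strip()
        for line, ratio in zip(lines, ratios)
        if ratio > threshold
    ]


# Hypothetical sample page: one navigation line, one content line.
SAMPLE = "\n".join([
    "<html><head><title>T</title></head>",
    "<body>",
    '<div><a href="/">Home</a><a href="/x">X</a></div>',
    "<p>This is a long paragraph of real article content that should be kept.</p>",
    "</body></html>",
])

content = extract_content(SAMPLE)
```

Lines dominated by markup (navigation, headers) get a low ratio, while long runs of prose wrapped in few tags get a high one, so even this crude threshold separates the two on the sample page.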

Keywords :

Internet; hypermedia markup languages; information retrieval; text analysis; HTML document; diverse Web pages; text extraction; text-to-tag ratio; Art; Cascading style sheets; Data mining; Databases; Expert systems; HTML; Histograms; Internet; Testing; Web pages; Histogram; Information Extraction; Web

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on

Conference_Location :

Turin

ISSN :

1529-4188

Print_ISBN :

978-0-7695-3299-8

Type :

conf

DOI :

10.1109/DEXA.2008.12

Filename :

4624686

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2830395