Title :
Combining DOM tree and geometric layout analysis for online medical journal article segmentation
Author :
Zou, Jie ; Le, Daniel ; Thoma, George R.
Abstract :
We describe an HTML Web page segmentation algorithm and apply it to online medical journal articles (regular HTML and PDF-converted HTML files). The Web page content is modeled by a zone tree structure based primarily on the geometric layout of the Web page. For a given journal article, a zone tree is generated by combining DOM tree analysis with a recursive X-Y cut algorithm. Using additional visual cues, such as background color, font size, and font color, the page is then segmented into homogeneous regions. The evaluation uses 104 articles from 11 journals. Out of 9726 ground-truth zones, 9376 are correctly segmented, for an accuracy of 96.40%. Segmenting the entire Web page into zones can significantly expedite the subsequent information retrieval steps and increase their accuracy.
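For readers unfamiliar with the recursive X-Y cut mentioned in the abstract, the following is a minimal sketch of the generic technique, not the authors' implementation. It assumes the rendered elements are already available as axis-aligned bounding boxes, and it leaves out the DOM tree analysis and the visual cues (background color, font size, font color) that the abstract combines with it. The names `Box`, `_split`, `xy_cut`, `leaves`, and the `min_gap` threshold are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def _split(boxes: List[Box], axis: int, min_gap: int) -> List[List[Box]]:
    """Group boxes separated by whitespace of at least min_gap along one axis.

    axis 0 compares left/right edges, axis 1 compares top/bottom edges.
    """
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest edge covered by the current group
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])  # wide enough gap: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: int = 10) -> Dict:
    """Build a zone tree by cutting along whitespace gaps, recursively."""
    if len(boxes) > 1:
        # Try a horizontal cut (gaps along y) first, then a vertical one.
        for axis, name in ((1, "horizontal"), (0, "vertical")):
            groups = _split(boxes, axis, min_gap)
            if len(groups) > 1:
                children = [xy_cut(g, min_gap) for g in groups]
                return {"cut": name, "boxes": boxes, "children": children}
    # No usable gap in either direction: this node is a leaf zone.
    return {"cut": None, "boxes": boxes, "children": []}


def leaves(node: Dict) -> List[List[Box]]:
    """Collect the leaf zones of a zone tree in reading order."""
    if not node["children"]:
        return [node["boxes"]]
    zones: List[List[Box]] = []
    for child in node["children"]:
        zones.extend(leaves(child))
    return zones


# Toy page: a full-width header above two side-by-side columns.
page = [(0, 0, 200, 20), (0, 40, 90, 100), (110, 40, 200, 100)]
tree = xy_cut(page)
for zone in leaves(tree):
    print(zone)
```

On the toy page, the first cut separates the header from the body, and a second cut separates the two columns, so each box ends up in its own leaf zone. In this sketch the gap threshold is the only stopping criterion; a real segmenter would need further cues to decide when a region is homogeneous.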
Keywords :
Internet; bibliographic systems; hypermedia markup languages; information retrieval; medical information systems; HTML Web page; X-Y cut algorithm; document object model tree analysis; geometric layout analysis; online medical journal article segmentation; zone tree structure; Algorithm design and analysis; Content based retrieval; Government; HTML; Information analysis; Information retrieval; Software libraries; Storage automation; Text analysis; Web pages; HTML document segmentation; document layout analysis; document object model (DOM); web information retrieval
Conference_Title :
Digital Libraries, 2006. JCDL '06. Proceedings of the 6th ACM/IEEE-CS Joint Conference on
Conference_Location :
Chapel Hill, NC
Print_ISBN :
1-59593-354-9
DOI :
10.1145/1141753.1141777