DocumentCode
2036115
Title
Automated extraction of non <h>-tagged headers in webpages by decision trees
Author
Okada, Hidehiko ; Arakawa, Hiroki
Author_Institution
Grad. Sch. of Eng., Kyoto Sangyo Univ., Kyoto, Japan
fYear
2011
fDate
13-18 Sept. 2011
Firstpage
2117
Lastpage
2120
Abstract
Guideline #5.2a in JIS X 8341-3 recommends that authors “represent headings with heading elements instead of difference in font size, etc”. Thus, in checking webpage accessibility, headers that are not tagged with heading tags (<h1>-<h6>) should be extracted as problems. In this paper, we propose a method for this extraction. Our idea is to let a machine learning method automatically derive extraction rules from problem instances on the web. We define 26 attributes of HTML elements for deriving the rules. Values of these attributes are calculated by parsing the HTML source of the webpage. The accuracy of our method was evaluated by 10-fold cross-validation on data we collected from the web. The accuracy was 85-88% on average in terms of F-measure. Non <h>-tagged image headers were discriminated slightly better than non <h>-tagged text headers.
Keywords
Internet; decision trees; hypermedia markup languages; learning (artificial intelligence); program compilers; text analysis; 10-fold cross validation; F measure; HTML element; HTML source parsing; JIS X 8341-3; Web page; automated extraction rule; decision tree; heading element; heading tag; machine learning method; non <h>-tagged text header; non <h>-tagged image header; Accuracy; Decision trees; Guidelines; HTML; Learning systems; Machine learning; Vegetation; Web accessibility; automated checking; decision tree; heading; machine learning;
fLanguage
English
Publisher
IEEE
Conference_Titel
SICE Annual Conference (SICE), 2011 Proceedings of
Conference_Location
Tokyo
ISSN
pending
Print_ISBN
978-1-4577-0714-8
Type
conf
Filename
6060321