Title :
Web document information extraction using class attribute approach
Author :
Srivastava, Sanjeev ; Haroon, Mohd ; Bajaj, Anu
Author_Institution :
CSE Deptt. IET, Dr. R.M.L. Avadh Univ., Faizabad, India
Abstract :
It is well known that “change is the nature”, and in the world of information technology changes happen rapidly. As new technologies keep changing how information is represented, finding the relevant pieces of information becomes difficult because of heavy noise: the whole web page is cluttered with distracting features (advertisements, links, scrollers, etc.). Extracting information or useful content from web pages (structured or semi-structured) has therefore become a critical issue for web users and web miners. Users can be misled by the noise on a web page, so information extraction from web pages is of great importance. A central puzzle in information extraction is how to define the noise domain and how to remove it. Recent studies have made techniques such as wrapper induction, feature extractors, the backpropagation algorithm of neural networks, content extractors, and PAT trees well known. In this paper we investigate DOM tree segmentation with a class-attribute-based approach. The class attribute can be used with all HTML elements inside the ‘BODY’ section of the document. It is used to create different classes of an element, where each class can have its own properties. To evaluate system performance, several experiments were conducted on different commercial, news, and entertainment websites. The experiments indicate that our method is applicable to extracting informative content from the web pages of these websites.
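To make the idea of class-attribute-based segmentation concrete, the following is a minimal sketch and not the authors' system. It assumes that noise can be recognised from a fixed set of class names; the names in NOISE_CLASSES, the ClassSegmenter class, and the extract_informative function are all hypothetical and chosen only for illustration. It walks the element tree with Python's standard html.parser, groups text by the class of the nearest ancestor that carries a class attribute, and drops text that sits under any element with a noise class.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical noise class names (an assumption, not taken from the paper).
NOISE_CLASSES = {"ad", "advert", "sidebar", "nav", "footer", "scroller"}

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassSegmenter(HTMLParser):
    """Group page text by the class of the nearest classed ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []                     # (tag, [classes]) per open element
        self.segments = defaultdict(list)   # class label -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # The attribute value may be None when "class" is given without a value.
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Ignore script and style bodies.
        if any(tag in ("script", "style") for tag, _ in self.stack):
            return
        # Drop text that lies under any element carrying a noise class.
        if any(NOISE_CLASSES.intersection(cls) for _, cls in self.stack):
            return
        # Label the text with the nearest ancestor that has a class attribute.
        label = "(no class)"
        for _, cls in reversed(self.stack):
            if cls:
                label = " ".join(cls)
                break
        self.segments[label].append(text)


def extract_informative(html):
    """Return a dict mapping class label to the joined text of that segment."""
    parser = ClassSegmenter()
    parser.feed(html)
    parser.close()
    return {label: " ".join(parts) for label, parts in parser.segments.items()}
```

Calling extract_informative on a page string returns one entry per class label that survived the noise filter. A fixed list of noise class names is the simplest possible stand-in for defining the noise domain; in practice class names differ from site to site, so such a list would have to be tuned or learned for each site.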
Keywords :
Web sites; data mining; hypermedia markup languages; information retrieval; tree data structures; DOM tree segmentation; HTML elements; PAT trees; Web document information extraction; Web miners; Web page content extraction; Web users; backpropagation algorithm; class attribute approach; commercial Websites; content extractor; entertainment Websites; feature extractor; information representation; informative content extraction; neural network; news Websites; wrapper induction; Computers; Feature extraction; HTML; Noise; Web pages; XML; Classes; DOM; DOM tree; HTML; XHTML Segmentation;
Conference_Title :
Computer and Communication Technology (ICCCT), 2013 4th International Conference on
Conference_Location :
Allahabad
Print_ISBN :
978-1-4799-1569-9
DOI :
10.1109/ICCCT.2013.6749596