Regional Information Center for Science and Technology (RICeST) - Web document information extraction using class attribute approach

DocumentCode :

3401909

Title :

Web document information extraction using class attribute approach

Author :

Srivastava, Sanjeev ; Haroon, Mohd ; Bajaj, Anu

Author_Institution :

CSE Deptt. IET, Dr. R.M.L. Avadh Univ., Faizabad, India

fYear :

2013

fDate :

20-22 Sept. 2013

Firstpage :

Lastpage :

Abstract :

Change is the nature of things, and in information technology changes happen rapidly. As new technologies keep changing how information is represented, finding the relevant pieces of information on a web page has become difficult because of heavy noise and clutter from distracting features (such as advertisements, links, and scrollers) across the whole page. Extracting information or useful content from web pages (structured or semi-structured) has therefore become a critical issue for web users and web miners. Users can be misled by the noise on a web page, so information extraction from web pages is of great importance. A central difficulty in information extraction is defining the noise domain and removing it. Recent studies have made techniques such as wrapper induction, feature extractors, the backpropagation algorithm for neural networks, content extractors, and PAT trees well known. In this paper we investigate DOM tree segmentation with a class-attribute-based approach. The class attribute can be used with all HTML elements inside the "BODY" section of the document. It is used to create different classes of an element, where each class can have its own properties. To evaluate system performance, several experiments were conducted on different commercial, news, and entertainment websites. The experiments indicate that our method is applicable for extracting informative content from the web pages of these websites.
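
The abstract names the idea (segmenting the DOM tree by class attribute) but this record does not give the algorithm. The following is only a minimal illustrative sketch of that general idea, not the authors' method: it walks the elements inside BODY with Python's standard html.parser module and groups each piece of text under the class of its nearest enclosing element that has one. The names ClassSegmenter, segment_by_class, and drop_noise, the "(no class)" label, and the notion of a caller-supplied list of noise class names are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClassSegmenter(HTMLParser):
    """Group text inside <body> by the class of the nearest classed ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []          # (tag, class value or None) per open element
        self.in_body = False
        self.segments = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        if tag in VOID:
            return
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not (self.in_body and text):
            return
        # Skip script and style bodies, which are not readable content.
        if self.stack and self.stack[-1][0] in ("script", "style"):
            return
        # Label the text with the nearest enclosing class attribute, if any.
        label = next((c for _, c in reversed(self.stack) if c), "(no class)")
        self.segments[label].append(text)


def segment_by_class(html):
    """Return a dict mapping class value -> list of text fragments."""
    parser = ClassSegmenter()
    parser.feed(html)
    parser.close()
    return dict(parser.segments)


def drop_noise(segments, noise_classes):
    """Remove segments whose class is in a caller-supplied (hypothetical) noise list."""
    return {k: v for k, v in segments.items() if k not in noise_classes}
```

In such a sketch, a caller might pass class names like "ad" or "sidebar" to drop_noise; how the paper itself decides which classes are noise is not stated in this record.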

Keywords :

Web sites; data mining; hypermedia markup languages; information retrieval; tree data structures; DOM tree segmentation; HTML elements; PAT trees; Web document information extraction; Web miners; Web page content extraction; Web users; backpropagation algorithm; class attribute approach; commercial Websites; content extractor; entertainment Websites; feature extractor; information representation; informative content extraction; neural network; news Websites; wrapper induction; Computers; Feature extraction; HTML; Noise; Web pages; XML; Classes; DOM; DOM tree; HTML; XHTML Segmentation;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Computer and Communication Technology (ICCCT), 2013 4th International Conference on

Conference_Location :

Allahabad

Print_ISBN :

978-1-4799-1569-9

Type :

conf

DOI :

10.1109/ICCCT.2013.6749596

Filename :

6749596

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3401909