DocumentCode
1717295
Title
DOM tree based approach for Web content extraction
Author
Mehta, Bhavdeep ; Narvekar, Meera
Author_Institution
Dept. of Comput. Eng., D.J. Sanghvi Coll. of Eng., Mumbai, India
fYear
2015
Firstpage
1
Lastpage
6
Abstract
The World Wide Web plays an important role in searching for information on the data network, and users are constantly exposed to an ever-growing flood of information. Our approach will help in finding exactly the user-relevant content from multiple search engines, thus making the search more efficient and reliable. Our framework will extract the relevant result records using two approaches: a stored URL list and a run-time generated URL list. Finally, the unique set of records is displayed in the common framework's search result page. The extraction is performed using the concepts of the Document Object Model (DOM) tree. The paper introduces a threshold concept and data filters to detect and remove irrelevant and redundant data from the web page. The data filters will also be used to further improve the similarity check of data records. Our system will be able to extract 75%-80% of user-relevant content by eliminating noisy content from differently structured web pages such as blogs, forums, and articles in a dynamic environment. Our approach shows significant advantages in both precision and recall.
Keywords
Internet; Web sites; application program interfaces; content-based retrieval; information filters; DOM tree; Document Object Model; Web content extraction; Web page; data filters; information searching; run time generated URL list; stored URL list; threshold concept; Accuracy; Data mining; HTML; Information filtering; Noise measurement; Uniform resource locators; Web pages; Content extraction techniques; Information extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Communication, Information & Computing Technology (ICCICT), 2015 International Conference on
Conference_Location
Mumbai
Print_ISBN
978-1-4799-5521-3
Type
conf
DOI
10.1109/ICCICT.2015.7045706
Filename
7045706