DocumentCode :
1717295
Title :
DOM tree based approach for Web content extraction
Author :
Mehta, Bhavdeep ; Narvekar, Meera
Author_Institution :
Dept. of Comput. Eng., D.J. Sanghvi Coll. of Eng., Mumbai, India
fYear :
2015
Firstpage :
1
Lastpage :
6
Abstract :
The World Wide Web plays an important role in searching for information on the data network. Users are constantly exposed to an ever-growing flood of information. Our approach will help in finding exactly the user-relevant content from multiple search engines, thus making the search more efficient and reliable. Our framework will extract the relevant result records based on two approaches, i.e., a stored URL list and a run-time generated URL list. Finally, the unique set of records is displayed in a common framework's search result page. The extraction is performed using the concepts of the Document Object Model (DOM) tree. The paper introduces a threshold concept and data filters to detect and remove irrelevant and redundant data from the web page. The data filters will also be used to further improve the similarity check of data records. Our system will be able to extract 75%-80% of user-relevant content by eliminating noisy content from differently structured web pages such as blogs, forums, and articles in a dynamic environment. Our approach shows significant advantages in both precision and recall.
Keywords :
Internet; Web sites; application program interfaces; content-based retrieval; information filters; DOM tree; Document Object Model; Web content extraction; Web page; data filters; information searching; run time generated URL list; stored URL list; threshold concept; Accuracy; Data mining; HTML; Information filtering; Noise measurement; Uniform resource locators; Web pages; Content extraction techniques etc.; Information extraction
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Communication, Information & Computing Technology (ICCICT), 2015 International Conference on
Conference_Location :
Mumbai
Print_ISBN :
978-1-4799-5521-3
Type :
conf
DOI :
10.1109/ICCICT.2015.7045706
Filename :
7045706