DocumentCode :
3545258
Title :
A novel ensemble vision based deep web data extraction technique for web mining applications
Author :
Banu, B. Aysha ; Chitra, M.
Author_Institution :
Dept. of Comput. Sci. & Eng., Mohamed Sathak Eng. Coll., Kilakarai, India
fYear :
2012
fDate :
23-25 Aug. 2012
Firstpage :
110
Lastpage :
114
Abstract :
Web content extraction is the task of extracting structured information from unstructured and semi-structured machine-readable documents. In most cases, this activity involves processing human-language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, and video, can also be seen as information extraction. Similarly, information retrieval is a process driven by the user's query. The retrieved information is then extracted using web content extraction. The challenges of this type of web page content extraction are increasing. In this work, we study the problem of automatically extracting content from web pages. Much research has been done to address this problem. Existing approaches have limitations: they lack the power to handle large numbers of web pages, and they depend on the web page programming language (HTML). The proposed work aims to overcome these limitations. It addresses an information retrieval process in which a vision-based approach is applied to extract both images and text from web pages. Most research shows that when a page is presented to the user, spatial and visual features play a very important role, because they help the user unconsciously divide the web page into several semantic parts. Hence, the proposed work focuses on the primary visual features of a web page, and extraction is carried out on the basis of these features. This approach can achieve better performance than traditional methods.
Keywords :
Internet; data mining; document handling; hypermedia markup languages; multimedia computing; natural language processing; query processing; NLP; Web mining applications; Web page content extraction; Web-page-programming-language; automatic annotation; content extraction; human language texts; information extraction; information retrieval; multimedia document processing; natural language processing; novel ensemble vision based deep Web data extraction technique; semi-structured machine-readable documents; spatial features; unstructured machine-readable documents; visual features; Data mining; Engines; Feature extraction; HTML; Merging; Particle separators; Visualization; Deep web pages; page ranking; visual data extraction; web content extraction; web mining
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advanced Communication Control and Computing Technologies (ICACCCT), 2012 IEEE International Conference on
Conference_Location :
Ramanathapuram
Print_ISBN :
978-1-4673-2045-0
Type :
conf
DOI :
10.1109/ICACCCT.2012.6320752
Filename :
6320752