DocumentCode :
3243515
Title :
A system's approach towards domain identification of web pages
Author :
Gupta, Swastik ; Bhatia, Komal Kumar
Author_Institution :
Dept. of Comput. Eng., YMCA Univ. of Sci. & Technol., Faridabad, India
fYear :
2012
fDate :
6-8 Dec. 2012
Firstpage :
870
Lastpage :
875
Abstract :
With the proliferation of document corpora (commonly called HTML documents or web pages) on the WWW, efficient ways of exploring relevant documents are of increasing importance [4, 8]. The key challenge lies in tackling the sheer volume of documents on the Web and evaluating relevance across such a huge number. Efficient exploration needs a web crawler that can semantically understand and predict the domain of a web page through analytical processing. This will not only facilitate efficient exploration but also help in better organization of web content. As a search engine classifies search results by keyword matches, link analysis and other such mechanisms, the paper proposes a solution to the domain identification problem by finding keywords or key terms that are representative of the page's content through elements like <META> and <TITLE> in the HTML structure of the web page [11]. This paper proposes a two-step framework that first automatically identifies the domain of the specified web page and then, using the domain information thus obtained, classifies the web content according to different prespecified categories. The former uses the various HTML elements present in the web page, while the latter is achieved using Artificial Neural Networks (ANN).
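The first step described in the abstract (finding representative key terms through the TITLE and META elements) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `TitleMetaParser`, `candidate_terms`, `guess_domain`, and the `domain_vocab` dictionary are hypothetical, and the vocabulary-overlap scoring is a simple stand-in for whatever scoring the paper uses. The ANN-based classification step is not shown.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TitleMetaParser(HTMLParser):
    """Collects <TITLE> text and the content of keyword/description <META> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.meta_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description") and attr.get("content"):
                self.meta_parts.append(attr["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def candidate_terms(html):
    """Count lowercase alphabetic terms found in the title and meta content."""
    parser = TitleMetaParser()
    parser.feed(html)
    text = " ".join(parser.title_parts + parser.meta_parts).lower()
    return Counter(re.findall(r"[a-z]+", text))


def guess_domain(html, domain_vocab):
    """Return the domain whose (assumed) vocabulary overlaps most, or None."""
    terms = candidate_terms(html)
    scores = {d: sum(terms[w] for w in words) for d, words in domain_vocab.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical vocabulary and page, for illustration only.
vocab = {"sports": {"cricket", "football", "match"}, "finance": {"stock", "bank"}}
page = (
    "<html><head><TITLE>Cricket scores and football news</TITLE>"
    '<META name="keywords" content="cricket, football, match"></head>'
    "<body>stock</body></html>"
)
print(guess_domain(page, vocab))
```

Only the title and meta content are scored here, so the word "stock" in the body does not influence the result. In a two-step setup like the one the abstract describes, the term counts could serve as input features for a later classifier.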
Keywords :
Web sites; hypermedia markup languages; neural nets; search engines; ANN; HTML documents; WWW; Web content; Web crawler; Web pages; artificial neural networks; document corpora; domain identification; link analysis; search engine; Artificial neural networks; Neurons; World Wide Web; Artificial Neural Networks; HTML elements; META; Search engine; TITLE; categorization; classification; crawler; domain-specific;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel Distributed and Grid Computing (PDGC), 2012 2nd IEEE International Conference on
Conference_Location :
Solan
Print_ISBN :
978-1-4673-2922-4
Type :
conf
DOI :
10.1109/PDGC.2012.6449938
Filename :
6449938