DocumentCode :
2213712
Title :
TOPCRAWL: Community mining in web search engines with emphasize on topical crawling
Author :
Balaji, S. ; Sarumathi, S.
Author_Institution :
Dept. of IT, K.S. Rangasamy Coll. of Technol., Tiruchengode, India
fYear :
2012
fDate :
21-23 March 2012
Firstpage :
20
Lastpage :
24
Abstract :
Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing web documents. The crawler is an important module of a web search engine, and its quality directly affects the search quality of the engine. A web crawler may interact with millions of hosts over a period of weeks or months, so robustness, flexibility, and manageability are of major importance. Given a set of URLs, the crawler retrieves the corresponding web pages, parses the HTML files, adds newly found URLs to its queue, and returns to the first phase of the cycle. While parsing the HTML files for new URLs, the crawler can also extract other information from them. This paper proposes a framework and algorithm, TOPCRAWL, for mining. TOPCRAWL is a new crawling method that emphasizes topic relevancy and outperforms state-of-the-art approaches with respect to the recall achievable within a given period of time. The method also presents its results in community format and uses a new combination of ideas and techniques to identify and exploit navigational structures of websites, such as hierarchies, lists, or maps. The algorithm is simulated with the web mining tool Deixto, the basic idea is implemented in Java, and results are given. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall whilst maintaining precision.
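The generic crawl cycle described in the abstract (retrieve a page, parse its HTML, queue the new URLs, repeat) can be sketched in Java as follows. This is only an illustration of that basic loop, not the TOPCRAWL algorithm itself: it visits pages in plain breadth-first order and does no topic-relevancy scoring. The class name `CrawlLoopSketch`, the `fetcher` parameter, and the `maxPages` limit are hypothetical names introduced for the example, and the page fetcher is injected so the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrawlLoopSketch {
    // Naive matcher for double-quoted href attributes (assumption:
    // a real crawler would use a proper HTML parser instead).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Parse step: collect every href value found in the page text.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Crawl cycle: take a URL from the queue, fetch it, parse it,
    // enqueue unseen URLs, and repeat until the queue is empty
    // or maxPages pages have been retrieved.
    static List<String> crawl(List<String> seeds,
                              Function<String, String> fetcher,
                              int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(seed);
            }
        }
        List<String> visited = new ArrayList<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // page could not be retrieved; skip it
            }
            visited.add(url);
            for (String link : extractLinks(html)) {
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

A topical crawler would typically replace the first-in, first-out queue with a priority queue ordered by an estimated relevance score for each URL; how TOPCRAWL computes such a score is not stated in the abstract, so it is left out here.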
Keywords :
Internet; Web sites; data mining; information retrieval; search engines; HTML files parsing; JAVA; TOPCRAWL algorithm; URL; Web documents; Web mining systems; Web mining tool Deixto; Web page retrieval; Web search engines; Websites navigational structures; community mining; crawler quality; data redundancy; flexibility issue; information retrieval; manageability issue; robustness issue; topic relevancy; topical crawling; Communities; Couplings; Crawlers; Data mining; Java; Web pages; Association Rule mining; B-SIGNET; Blog; Clustering; Mining; Social network analysis;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Pattern Recognition, Informatics and Medical Engineering (PRIME), 2012 International Conference on
Conference_Location :
Salem, Tamilnadu
Print_ISBN :
978-1-4673-1037-6
Type :
conf
DOI :
10.1109/ICPRIME.2012.6208281
Filename :
6208281