• DocumentCode
    3002804
  • Title

    Effect of feature selection method on the performance of focused crawlers—A case study on traditional and accelerated focused crawlers

  • Author

    Gadiraju, N. V G Sirisha ; Chaitanya, R. Krishna ; Raju, G. V Padma

  • Author_Institution
    Dept. of CSE, S.R.K.R. Eng. Coll., Bhimavaram, India
  • fYear
    2010
  • fDate
    11-12 June 2010
  • Firstpage
    482
  • Lastpage
    487
  • Abstract
    This paper examines the effect of the feature selection method on the performance of the Traditional Focused Crawler (TFC) and the Accelerated Focused Crawler (AFC). Information retrieval methods such as querying a search engine, using a web catalog, and browsing may not satisfy the information needs of all users. When the information requirement concerns a specific topic, focused crawlers complement these methods. The aim of these crawlers is to download web pages that are highly relevant to a pre-defined topic. A naive Bayesian classifier guides the crawlers by rating each web page before it is downloaded. For this analysis, the topics to be crawled are represented by a set of relevant documents. The features the Bayesian classifier uses to construct its model are collected from the document corpus using the Document Frequency and Information Gain feature selection methods. The performance of both crawlers is evaluated when 500 features are selected using each of these two methods. The Accelerated Focused Crawler's performance is also evaluated for varying numbers of features gathered using both methods. Target pages recall and target description recall are used to evaluate the crawlers.
  • Keywords
    Bayes methods; Internet; pattern classification; query processing; search engines; Information retrieval methods; Web catalog; Web pages; accelerated focused crawler; document frequency; feature selection method; information gain feature selection methods; naive Bayesian classifier; search engine querying; target description recall; target pages recall; traditional focused crawler; Acceleration; Bayesian methods; Crawlers; Educational institutions; Frequency; Information retrieval; Information technology; Search engines; Taxonomy; Web pages; Accelerated Focused Crawler; Classifier; Feature Selection; Focused Crawler; Performance;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Networking and Information Technology (ICNIT), 2010 International Conference on
  • Conference_Location
    Manila
  • Print_ISBN
    978-1-4244-7579-7
  • Electronic_ISBN
    978-1-4244-7578-0
  • Type

    conf

  • DOI
    10.1109/ICNIT.2010.5508468
  • Filename
    5508468
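
The abstract names two feature selection methods, Document Frequency and Information Gain, without giving their formulas. The sketch below uses the common textbook definitions, which may differ from the exact formulation in the paper: document frequency counts the documents that contain a term, and information gain is the reduction in class entropy from knowing whether a term is present. All function names and the tokenized-document input format are assumptions made for illustration.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Information gain of each term with respect to the class labels.

    IG(t) = H(C) - [P(t) * H(C | t present) + P(not t) * H(C | t absent)]
    """
    n = len(docs)
    class_totals = Counter(labels)
    classes = sorted(class_totals)
    base = entropy([class_totals[c] for c in classes])

    # For each term, count the documents of each class that contain it.
    present = {}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            present.setdefault(term, Counter())[label] += 1

    gains = {}
    for term, by_class in present.items():
        with_term = [by_class[c] for c in classes]
        without_term = [class_totals[c] - by_class[c] for c in classes]
        n_with = sum(with_term)
        conditional = (n_with / n) * entropy(with_term) \
            + ((n - n_with) / n) * entropy(without_term)
        gains[term] = base - conditional
    return gains


def top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

Under these assumptions, `top_k(document_frequency(docs), 500)` and `top_k(information_gain(docs, labels), 500)` would give the two 500-term vocabularies that a classifier could then be trained on. Document frequency ignores the labels entirely, while information gain favors terms whose presence separates relevant from non-relevant documents.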