• DocumentCode
    575083
  • Title

    To create a confusion matrix in respect of threshold being fixed for effective detection of near duplicate web documents in Web Crawling

  • Author

    Narayana, V.A. ; Govardhan, A. ; Premchand, P.

  • Author_Institution
    Dept. of CSE, CMR Coll. of Eng. & Tech., Hyderabad, India
  • fYear
    2011
  • fDate
    Nov. 29 2011-Dec. 1 2011
  • Firstpage
    763
  • Lastpage
    768
  • Abstract
    The rapid growth of the WWW in recent times has given Web crawling remarkable significance. The voluminous amounts of web documents on the web pose huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near-duplicate web documents creates additional overhead for search engines and critically affects their performance and quality; such documents have to be removed to provide users with relevant results for their queries. In this paper, we present a novel and efficient approach for detecting near-duplicate web pages in web crawling, in which keywords are extracted from the crawled pages and a similarity score between two pages is calculated. Documents with a similarity score less than a threshold value are considered near duplicates. The approximate value of the threshold is 19.5043. We also create a confusion matrix from which the efficiency of the near-duplicate detection algorithm is determined.
  • Keywords
    Internet; matrix algebra; query processing; search engines; WWW; Web crawling; Web search engines; confusion matrix; keyword extraction; near duplicate Web document detection; querying; similarity score; Confusion Matrix; Fingerprint; Near-duplicate; Similarity score; Web crawling and Threshold; false negative; false positive; true negative; true positive;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Sciences and Convergence Information Technology (ICCIT), 2011 6th International Conference on
  • Conference_Location
    Seogwipo
  • Print_ISBN
    978-1-4577-0472-7
  • Type

    conf

  • Filename
    6316719
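
The thresholding and confusion-matrix steps described in the abstract can be illustrated with a minimal sketch. Only the threshold value (19.5043) and the rule "score below the threshold means near duplicate" come from the abstract. The function names, the `ConfusionMatrix` type, the sample scores, and the ground-truth labels are hypothetical and are not taken from the paper. How the similarity score itself is computed is not stated in this record, so it is treated as a given input.

```python
from typing import Iterable, NamedTuple, Tuple

# Approximate threshold quoted in the abstract.
THRESHOLD = 19.5043


class ConfusionMatrix(NamedTuple):
    """Counts of the four outcomes (hypothetical helper type)."""
    tp: int  # predicted near duplicate, actually near duplicate
    fp: int  # predicted near duplicate, actually distinct
    tn: int  # predicted distinct, actually distinct
    fn: int  # predicted distinct, actually near duplicate


def is_near_duplicate(score: float, threshold: float = THRESHOLD) -> bool:
    # Per the abstract, a score *less than* the threshold marks a near duplicate.
    return score < threshold


def confusion_matrix(
    pairs: Iterable[Tuple[float, bool]], threshold: float = THRESHOLD
) -> ConfusionMatrix:
    """Build a confusion matrix from (similarity_score, actual_label) pairs."""
    tp = fp = tn = fn = 0
    for score, actual in pairs:
        predicted = is_near_duplicate(score, threshold)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return ConfusionMatrix(tp, fp, tn, fn)


def accuracy(cm: ConfusionMatrix) -> float:
    """Fraction of page pairs classified correctly."""
    total = sum(cm)
    return (cm.tp + cm.tn) / total if total else 0.0


# Hypothetical scores and labels, invented purely for illustration.
sample_pairs = [
    (3.2, True), (12.7, True), (18.9, False),
    (25.4, True), (40.1, False), (57.8, False),
]
cm = confusion_matrix(sample_pairs)
print(cm, accuracy(cm))
```

Keeping the threshold as a parameter makes it easy to recompute the matrix for other candidate thresholds and compare the resulting false-positive and false-negative counts.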