• DocumentCode
    152780
  • Title
    Language based web crawling on big data
  • Author
    Girgin, Canan ; Gonultac, Hayati ; Muhtaroglu, F. Canan Pembe ; Demir, Simsek ; Akin, Ahmet Afsin ; Obali, M.

  • fYear
    2014
  • fDate
    23-25 April 2014
  • Firstpage
    1528
  • Lastpage
    1531
  • Abstract
    Online textual and visual data that are created and used by web users have been increasing dramatically and continually. This increase has caused the need for easy and fast access to online data and facilitated the development of alternative means of access to this data. Nowadays, web crawlers are the most efficient and popular tools used for accessing big volumes of data available on the web. In this paper, a web crawler which works on a distributed Hadoop cluster for crawling web pages with content of a predefined language is described. A language identification tool is developed for enabling the system to focus only on a specific language. In this study, the accuracy of the language identification tool is evaluated on a small data set (consisting of 4729 web pages). The performance of the focused web crawling system is reported on a big data set of 86 million web pages containing Turkish content.
  • Keywords
    Big Data; Internet; natural language processing; search engines; Turkish content; Web crawlers; Web crawling system; Web users; distributed Hadoop cluster; language identification tool; visual data; Conferences; Crawlers; Signal processing; Visualization; Web pages; Language Identification; Web Crawling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    2014 22nd Signal Processing and Communications Applications Conference (SIU)
  • Conference_Location
    Trabzon
  • Type
    conf
  • DOI
    10.1109/SIU.2014.6830532
  • Filename
    6830532