• DocumentCode
    3398818
  • Title

    A novel approach to build Kannada web Corpus

  • Author

    Parameswarappa, S. ; Narayana, V.N. ; Bharathi, G.N.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Malnad Coll. of Eng., Hassan, India
  • fYear
    2012
  • fDate
    10-12 Jan. 2012
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    This paper introduces the Kannada Corpus tool, a suite of Perl (Practical Extraction and Report Language) programs implementing an iterative procedure to build Kannada corpora from the web. The procedure works as follows: first, a list of "seed" words is built; then a set of "seed" URLs (Uniform Resource Locators) pointing to documents in the Kannada language is collected by sending queries to commercial search engines (Google and Yahoo). The obtained seeds are then used to start a crawling job with the open-source, command-line download tool "wget". The downloaded documents are then processed in several steps to build raw Kannada corpora: HTML (HyperText Markup Language) code removal, boilerplate stripping, language identification, and duplicate and near-duplicate detection. We evaluated the tool by applying it to the construction of Kannada corpora from domains such as Recent Discussions, Articles, Recent Activities, Proverbs, Recent Feedbacks, Poems and Fifteen Books, Novels, Newspaper, Dictionary, Blogs and Informal Chats. The results illustrate the potential usefulness of the tool.
  • Keywords
    Internet; Perl; document handling; hypermedia markup languages; natural language processing; public domain software; query processing; search engines; Google; HTML code removal; Kannada Web corpus; Kannada corpus tool; Kannada language; Perl programs; Yahoo; boilerplate stripping; crawling job; duplicate detection; hyper text markup language; iterative procedure; language identification; near duplicate detection; open-source command-line based downloading tool; program extraction-and-reporting language; seed URL; seed words list; uniform resource locator; wget; Corpora; Kannada corpus; Part-of-Speech (POS) tagging; Tokenizer; World Wide Web
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Communication and Informatics (ICCCI), 2012 International Conference on
  • Conference_Location
    Coimbatore
  • Print_ISBN
    978-1-4577-1580-8
  • Type

    conf

  • DOI
    10.1109/ICCCI.2012.6158824
  • Filename
    6158824
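
The post-processing steps named in the abstract (HTML code removal, language identification, duplicate and near-duplicate detection) can be illustrated with a minimal sketch. This is not the authors' tool, which is written in Perl and whose internals the abstract does not describe. It is a Python illustration under stated assumptions: language identification is approximated by the share of characters in the Unicode Kannada block (U+0C80 to U+0CFF), near-duplicates are detected with Jaccard similarity over word 3-shingles, and the function names and both thresholds (`min_kannada`, `dup_threshold`) are hypothetical. Seed collection, crawling with wget, and boilerplate stripping are left out.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_html(html):
    """Remove markup and collapse whitespace (HTML code removal)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def kannada_ratio(text):
    """Share of non-whitespace characters in the Kannada Unicode block.

    Assumption: a crude stand-in for real language identification.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    kannada = sum(1 for c in chars if "\u0c80" <= c <= "\u0cff")
    return kannada / len(chars)


def shingles(text, n=3):
    """Set of word n-grams used to compare documents."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_corpus(pages, min_kannada=0.5, dup_threshold=0.8):
    """Keep Kannada pages that are not (near-)duplicates of earlier ones.

    Both thresholds are hypothetical values chosen for illustration.
    """
    kept, kept_shingles = [], []
    for html in pages:
        text = strip_html(html)
        if kannada_ratio(text) < min_kannada:
            continue  # language filter
        sh = shingles(text)
        if any(jaccard(sh, other) >= dup_threshold for other in kept_shingles):
            continue  # duplicate or near-duplicate of a kept page
        kept.append(text)
        kept_shingles.append(sh)
    return kept
```

Calling `build_corpus` on a list of downloaded HTML strings returns the cleaned texts in their original order. The pairwise comparison is quadratic in the number of kept pages, so a larger crawl would likely need a hashing scheme instead.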