A novel approach to build Kannada web Corpus

Author

Parameswarappa, S. ; Narayana, V.N. ; Bharathi, G.N.

Author_Institution

Dept. of Comput. Sci. & Eng., Malnad Coll. of Eng., Hassan, India

fYear

2012

fDate

10-12 Jan. 2012

Firstpage

1

Lastpage

6

Abstract

This paper introduces the Kannada Corpus tool, a suite of Perl (Practical Extraction and Report Language) programs implementing an iterative procedure to build Kannada corpora from the web. The procedure first builds a list of "seed" words and then collects a set of "seed" URLs (Uniform Resource Locators) pointing to documents in the Kannada language by sending queries to commercial search engines (Google and Yahoo). The obtained seeds are then used to start a crawling job using the open-source, command-line downloading tool "wget". The downloaded documents are then processed in several steps to build raw Kannada corpora: HTML (Hyper Text Markup Language) code removal, boilerplate stripping, language identification, and duplicate and near-duplicate detection. We evaluated the tool by applying it to the construction of Kannada corpora from domains such as Recent Discussions, Articles, Recent Activities, Proverbs, Recent Feedbacks, Poems, Fifteen Books, Novels, Newspapers, Dictionary, Blogs and Informal Chats. The results illustrate the potential usefulness of the tool.
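
The post-download steps named in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the tool's Perl, and it is not the authors' implementation. It assumes that language identification can be approximated by the share of characters in the Unicode Kannada block (U+0C80 to U+0CFF) and that duplicates are exact matches found by hashing. The function names and the 0.5 threshold are hypothetical, and boilerplate stripping and near-duplicate detection are not covered.

```python
import hashlib
import re
from html.parser import HTMLParser

# Unicode Kannada block; used here as a crude language-identification signal.
KANNADA_START, KANNADA_END = 0x0C80, 0x0CFF


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Remove HTML markup and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def kannada_ratio(text):
    """Fraction of non-whitespace characters that fall in the Kannada block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if KANNADA_START <= ord(c) <= KANNADA_END)
    return hits / len(chars)


def build_corpus(pages, min_ratio=0.5):
    """Keep mostly-Kannada pages and drop exact duplicates (hypothetical threshold)."""
    seen = set()
    corpus = []
    for html in pages:
        text = strip_html(html)
        if kannada_ratio(text) < min_ratio:
            continue  # treated as not Kannada
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a page already kept
        seen.add(digest)
        corpus.append(text)
    return corpus
```

Calling `build_corpus` on a list of downloaded HTML strings returns the cleaned texts that pass both filters. A character-range check cannot distinguish Kannada from other languages that might share the script, so a real pipeline would likely need a stronger identifier.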

Keywords

Internet; Perl; document handling; hypermedia markup languages; natural language processing; public domain software; query processing; search engines; Google; HTML code removal; Kannada Web corpus; Kannada corpus tool; Kannada language; Perl programs; Yahoo; boilerplate stripping; crawling job; duplicate detection; hyper text markup language; iterative procedure; language identification; near duplicate detection; open-source command-line based downloading tool; program extraction-and-reporting language; seed URL; seed words list; uniform resource locator; wget; Corpora; Kannada corpus; Part-of-Speech (POS) tagging; Tokenizer; World Wide Web

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Communication and Informatics (ICCCI), 2012 International Conference on

Conference_Location

Coimbatore

Print_ISBN

978-1-4577-1580-8

Type

conf

DOI

10.1109/ICCCI.2012.6158824

Filename

6158824