Linguini: language identification for multilingual documents

Author

Prager, J.M.

Author_Institution

IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA

Volume

Track2

fYear

1999

fDate

5-8 Jan. 1999

Abstract

This paper presents Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify, with 100% accuracy, the language of documents as short as 5-10% of the size of an average Web document. We also describe how to determine whether a document is written in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond that of monolingual analysis. The same approach can be applied to subject categorization systems: when such a system recommends two or more categories, it can distinguish a document that belongs strongly to all of them from one that really belongs to none.
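
As a rough illustration of what a vector-space language identifier can look like, the sketch below represents each language and each input document as a vector of character trigram counts and picks the language whose vector has the highest cosine similarity to the document. This is not Linguini's actual implementation: the trigram features, the tiny training sentences, and the names `ngram_vector`, `cosine`, `SAMPLES`, `PROFILES`, and `identify` are all assumptions made for this example, and the sketch covers only the monolingual case.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in whitespace-normalized, lowercased text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy training text; a real system would use far larger corpora.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then runs "
               "through the forest with the other animals",
    "french": "le renard brun rapide saute par dessus le chien paresseux et "
              "court dans la forêt avec les autres animaux",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "läuft dann mit den anderen tieren durch den wald",
}

# One feature vector per language, built once from the training text.
PROFILES = {lang: ngram_vector(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return (best_language, scores), where scores maps language to cosine similarity."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = identify("le chien court dans la forêt")
    print(best)
    for lang, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{lang}: {score:.3f}")
```

With training text this small the scores are only indicative; the abstract's point that accuracy depends on input size, the language set, and the features used applies directly to choices such as `n` and the contents of `SAMPLES`.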

Keywords

information resources; languages; linguistics; text analysis; Linguini; World Wide Web documents; accuracy; category recommendation; computational overhead; features; high-precision language identification; input document size; multilingual documents; subject categorization systems; vector-space based categorizer; Frequency; Information filtering; Information filters; Internet; Organizing; Testing; Text categorization

fLanguage

English

Publisher

IEEE

Conference_Titel

Systems Sciences, 1999. HICSS-32. Proceedings of the 32nd Annual Hawaii International Conference on

Conference_Location

Maui, HI, USA

Print_ISBN

0-7695-0001-3

Type

conf

DOI

10.1109/HICSS.1999.772689

Filename

772689