Title :
Linguini: language identification for multilingual documents
Author_Institution :
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Abstract :
This paper presents Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how the accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy. We also describe how to determine whether a document is in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond that of monolingual analysis. The same approach can be applied to subject categorization systems: when such a system recommends two or more categories, it can distinguish a document that belongs strongly to all of them from one that really belongs to none.
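As a rough illustration of what a vector-space categorizer for language identification can look like, the sketch below represents each language and each input document as a vector of character n-gram counts and picks the language whose vector has the highest cosine similarity to the document. This is not the Linguini implementation: the abstract does not state which features or similarity measure the system uses, so the character trigrams, the cosine measure, the toy training sentences, and the names `ngram_vector`, `cosine`, `TRAINING`, `PROFILES`, and `identify` are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Count overlapping character n-grams in lowercased, space-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy training text; a real system would use far larger samples.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog runs away from the fox",
    "french": "le renard brun saute par dessus le chien paresseux "
              "et puis le chien s'enfuit loin du renard",
}

# One feature vector (profile) per language, built once from the training text.
PROFILES = {lang: ngram_vector(text) for lang, text in TRAINING.items()}


def identify(text):
    """Return the best-matching language and the similarity score for each language."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    for sample in ("the dog and the fox", "le chien et le renard"):
        best, scores = identify(sample)
        print(sample, "->", best, scores)
```

In a setup like this, the per-language scores are a by-product of the single pass needed to pick the best language, which is one plausible reading of how a multilingual check could avoid extra computation; how Linguini actually derives language proportions is not described in the abstract.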
Keywords :
information resources; languages; linguistics; text analysis; Linguini; World Wide Web documents; accuracy; category recommendation; computational overhead; features; high-precision language identification; input document size; multilingual documents; subject categorization systems; vector-space based categorizer; Frequency; Information filtering; Information filters; Internet; Organizing; Testing; Text categorization
Conference_Title :
Systems Sciences, 1999. HICSS-32. Proceedings of the 32nd Annual Hawaii International Conference on
Conference_Location :
Maui, HI, USA
Print_ISBN :
0-7695-0001-3
DOI :
10.1109/HICSS.1999.772689