DocumentCode
339686
Title
Linguini: language identification for multilingual documents
Author
Prager, J.M.
Author_Institution
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Volume
Track2
fYear
1999
fDate
5-8 Jan. 1999
Abstract
Presents Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how the accuracy depends on the size of the input document, the set of languages under consideration and the features used. We found that Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy. We also describe how to determine if a document is in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond that of monolingual analysis. The same approach can be applied to subject categorization systems: when the system recommends two or more categories, it can distinguish a document that belongs strongly to all of them from one that really belongs to none.
Keywords
information resources; languages; linguistics; text analysis; Linguini; World Wide Web documents; accuracy; category recommendation; computational overhead; features; high-precision language identification; input document size; multilingual documents; subject categorization systems; vector-space based categorizer; Frequency; Information filtering; Information filters; Internet; Organizing; Testing; Text categorization
fLanguage
English
Publisher
IEEE
Conference_Title
Systems Sciences, 1999. HICSS-32. Proceedings of the 32nd Annual Hawaii International Conference on
Conference_Location
Maui, HI, USA
Print_ISBN
0-7695-0001-3
Type
conf
DOI
10.1109/HICSS.1999.772689
Filename
772689