DocumentCode :
339686
Title :
Linguini: language identification for multilingual documents
Author :
Prager, J.M.
Author_Institution :
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
Volume :
Track2
fYear :
1999
fDate :
5-8 Jan. 1999
Abstract :
We present Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy. We also describe how to determine whether a document is in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond that of monolingual analysis. The same approach can be applied to subject categorization systems: when the system recommends two or more categories, it can distinguish documents that belong strongly to all of them from documents that really belong to none.
Keywords :
information resources; languages; linguistics; text analysis; Linguini; World Wide Web documents; accuracy; category recommendation; computational overhead; features; high-precision language identification; input document size; multilingual documents; subject categorization systems; vector-space based categorizer; Frequency; Information filtering; Information filters; Internet; Organizing; Testing; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Systems Sciences, 1999. HICSS-32. Proceedings of the 32nd Annual Hawaii International Conference on
Conference_Location :
Maui, HI, USA
Print_ISBN :
0-7695-0001-3
Type :
conf
DOI :
10.1109/HICSS.1999.772689
Filename :
772689