• DocumentCode
    339686
  • Title
    Linguini: language identification for multilingual documents
  • Author
    Prager, J.M.
  • Author_Institution
    IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
  • Volume
    Track2
  • fYear
    1999
  • fDate
    5-8 Jan. 1999
  • Abstract
    Presents Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify, with 100% accuracy, the language of documents as short as 5-10% of the size of an average Web document. We also describe how to determine whether a document is written in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond that of monolingual analysis. The same approach can be applied to subject categorization systems: when such a system recommends two or more categories, it can distinguish documents that belong strongly to all of them from documents that really belong to none.
  • Keywords
    information resources; languages; linguistics; text analysis; Linguini; World Wide Web documents; accuracy; category recommendation; computational overhead; features; high-precision language identification; input document size; multilingual documents; subject categorization systems; vector-space based categorizer; Frequency; Information filtering; Information filters; Internet; Organizing; Testing; Text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), 1999
  • Conference_Location
    Maui, HI, USA
  • Print_ISBN
    0-7695-0001-3
  • Type
    conf
  • DOI
    10.1109/HICSS.1999.772689
  • Filename
    772689