  • DocumentCode
    3530737
  • Title
    Efficacy of a constantly adaptive language modeling technique for web-scale applications
  • Author
    Wang, Kuansan ; Li, Xiaolong
  • Author_Institution
    Internet Service Res. Center (ISRC), Microsoft Corp., Redmond, WA
  • fYear
    2009
  • fDate
    19-24 April 2009
  • Firstpage
    4733
  • Lastpage
    4736
  • Abstract
    In this paper, we describe CALM, a method for building statistical language models for the Web. CALM addresses several unique challenges in dealing with Web content. First, CALM does not require the whole corpus to be available to build the language model. Instead, we design CALM to progressively adapt itself as Web chunks are made available by the crawler. Second, given the dynamic and dramatic changes in Web content, CALM is designed to quickly enrich its lexicon and N-grams as new vocabulary and phrases are discovered. To reduce the number of heuristics and the amount of human intervention typically needed for model adaptation, we derive an information-theoretic formula for CALM to facilitate optimal adaptation in the maximum a posteriori (MAP) sense. Testing against a collection of Web chunks where new vocabulary and phrases are dominant, we show that CALM can achieve a comparable and satisfactory model as measured by perplexity. We also show that CALM is robust against overtraining and the choice of initial condition, suggesting that the impact of any assumptions made in obtaining the initial model gradually diminishes as CALM runs its full course and adapts to more data.
  • Keywords
    Internet; vocabulary; CALM addresses; N-grams; Web contents; Web-scale applications; adaptive language modeling technique; maximum a posteriori; statistical language models; Adaptation model; Buildings; Crawlers; Humans; Large-scale systems; Natural languages; Speech recognition; Testing; Vocabulary; Web and internet services; CALM; MAP adaptation; N-gram; Statistical language model; Web applications
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on
  • Conference_Location
    Taipei
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4244-2353-8
  • Electronic_ISBN
    1520-6149
  • Type
    conf
  • DOI
    10.1109/ICASSP.2009.4960688
  • Filename
    4960688