• DocumentCode
    1161244
  • Title

    Association pattern language modeling

  • Author

    Chien, Jen-Tzung

  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. Cheng Kung Univ., Tainan
  • Volume
    14
  • Issue
    5
  • fYear
    2006
  • Firstpage
    1719
  • Lastpage
    1728
  • Abstract
    Statistical n-gram language modeling is popular for speech recognition and many other applications. The conventional n-gram, however, models long-distance language dependencies inadequately. This paper presents a novel approach that mines long-distance word associations and incorporates these features into language models based on linear interpolation and maximum entropy (ME) principles. We highlight the discovery of associations among multiple distant words from the training corpus. A mining algorithm is exploited to recursively merge the frequent word subsets and efficiently construct the set of association patterns. By combining the features of association patterns into n-gram models, the association pattern n-grams are estimated; the trigger pair n-gram, in which only the associations of two distant words are considered, is a special realization. In experiments on Chinese language modeling, we find that incorporating association patterns significantly reduces the perplexities of n-gram models. Incorporation using ME outperforms that using linear interpolation. The association pattern n-gram is superior to the trigger pair n-gram. The perplexities are further reduced by using more association steps. Furthermore, the proposed association pattern n-grams not only raise document classification accuracies but also improve speech recognition rates.
  • Keywords
    computational linguistics; data mining; interpolation; maximum entropy methods; speech recognition; Chinese language modeling; association pattern language modeling; linear interpolation; long distance word associations mining; maximum entropy principles; perplexities reduction; statistical n-gram language modeling; training corpus; Biomedical optical imaging; Entropy; Information retrieval; Natural language processing; Natural languages; Probability; Testing; Association pattern; language model; long distance association; maximum entropy; trigger pair
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type

    jour

  • DOI
    10.1109/TSA.2005.858551
  • Filename
    1677991
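
The abstract describes a mining algorithm that recursively merges frequent word subsets into association patterns. The record gives no further detail, so the following is only a minimal sketch of that general idea, using an Apriori-style procedure as a stand-in. The function name `mine_association_patterns`, the parameters `min_support` and `max_size`, and the treatment of each sentence as an unordered set of words are all assumptions made for illustration, not the paper's actual method.

```python
from collections import Counter
from itertools import combinations


def mine_association_patterns(sentences, min_support=2, max_size=3):
    """Apriori-style sketch (an assumption, not the paper's algorithm).

    Each sentence is treated as an unordered set of words. Frequent word
    sets are grown one word at a time by merging smaller frequent sets.
    Returns a dict mapping each frequent word set (size >= 2) to its support.
    """
    transactions = [frozenset(s.split()) for s in sentences]

    # Start from the frequent single words.
    counts = Counter(w for t in transactions for w in t)
    frequent = {frozenset([w]): c for w, c in counts.items() if c >= min_support}

    patterns = {}
    size = 1
    while frequent and size < max_size:
        size += 1
        keys = list(frequent)
        # Merge pairs of frequent sets whose union has exactly `size` words.
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # Prune candidates that contain any infrequent (size - 1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
        }
        # Count support and keep only candidates that reach the threshold.
        frequent = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                frequent[c] = support
        patterns.update(frequent)
    return patterns
```

With `max_size=2` the sketch returns only word pairs, which loosely mirrors the abstract's remark that the trigger pair n-gram is the special case restricted to associations of two distant words. How such patterns would then be turned into features for linear interpolation or ME estimation is not specified in this record and is left out here.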