Unsupervised word segmentation from noisy input

Author

Heymann, Jahn ; Walter, O. ; Haeb-Umbach, Reinhold ; Raj, Bhiksha

Author_Institution

Dept. of Commun. Eng., Univ. of Paderborn, Paderborn, Germany

fYear

2013

fDate

8-12 Dec. 2013

Firstpage

458

Lastpage

463

Abstract

In this paper we present an algorithm for the unsupervised segmentation of a character or phoneme lattice into words. Using a lattice as input, rather than a single string, accounts for the uncertainty of the character/phoneme recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model is known. We show that a recently published approach based on Weighted Finite State Transducers (WFSTs) suffers from an issue: language model probabilities of known words are computed incorrectly. Fixing this issue greatly improves precision and recall, but at the cost of increased computational complexity, so the corrected approach is practical only for single input strings. To allow for lattice input, and thus for errors in the character/phoneme recognizer, we propose a computationally efficient, suboptimal two-stage approach, which is shown to significantly improve word segmentation performance compared to the earlier WFST approach.

Keywords

probability; speech recognition; unsupervised learning; word processing; character recognizer; computationally efficient suboptimal two-stage approach; error-prone phoneme recognizer; label sequence; language model probabilities; lexical unit discovery; noisy input; phoneme lattice; unsupervised word segmentation algorithm; word segmentation performance; zero-resource setting; Acoustics; Computational modeling; Context; Lattices; Probability; Speech; Transducers; Automatic speech recognition; Unsupervised learning

fLanguage

English

Publisher

IEEE

Conference_Titel

Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on

Conference_Location

Olomouc

Type

conf

DOI

10.1109/ASRU.2013.6707773

Filename

6707773