Regional Information Center for Science and Technology - Creating word-level language models for handwriting recognition

DocumentCode :

1583081

Title :

Creating word-level language models for handwriting recognition

Author :

Pitrelli, John E. ; Roy, Amit

Author_Institution :

IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA

fYear :

2001

fDate :

2001

Firstpage :

721

Lastpage :

725

Abstract :

For large-vocabulary handwriting-recognition applications, such as note-taking, word-level language modeling is of key importance to constrain the recognizer's search and to contribute to the scoring of hypothesized texts. We discuss the creation of a word-unigram language model, which associates probabilities with individual words. Typically, such models are derived from a large, diverse text corpus. We describe a three-stage algorithm for determining a word unigram from such a corpus: 1) tokenization, the segmenting of a corpus into words; 2) selection, for the model, of a subset of the distinct words found during tokenization; and 3) creation of recognizer-specific data structures for the word set and unigram. Complexities of the first two stages are discussed. Applying our method to a 600-million-word corpus, we generate a 50,000-word model which eliminates 45% of word-recognition errors made by a baseline system employing only a character-level language model.
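The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenization rule (lowercased alphabetic runs with internal apostrophes), selection by raw frequency, renormalization over the selected words, and the prefix trie with a "$" end-of-word marker are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed tokenization rule: lowercase alphabetic runs, allowing internal apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Stage 1 (assumed rule): segment raw text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_unigram(corpus_lines, vocab_size):
    """Stage 2 (assumed criterion): keep the vocab_size most frequent words.

    Probabilities are renormalized over the selected words only, which is an
    illustrative choice; out-of-vocabulary handling is not modeled here.
    """
    counts = Counter()
    for line in corpus_lines:
        counts.update(tokenize(line))
    selected = counts.most_common(vocab_size)
    total = sum(count for _, count in selected)
    return {word: count / total for word, count in selected}


def build_trie(model):
    """Stage 3 (hypothetical structure): a prefix trie of nested dicts.

    Each complete word stores its log-probability under the "$" key, which
    cannot collide with a token character under the rule above.
    """
    root = {}
    for word, prob in model.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = math.log(prob)
    return root


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "The dog's bone"]
    model = build_unigram(corpus, vocab_size=2)
    trie = build_trie(model)
    print(model)
```

A trie is used here only because a recognizer that hypothesizes text one character at a time can prune its search with a prefix lookup; the abstract does not say which data structures the authors built.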

Keywords :

data structures; document image processing; handwriting recognition; natural languages; text corpus; tokenization; word segmentation; word-level language modeling; word-unigram; Character generation; Frequency; Hidden Markov models; Law; Legal factors; Text recognition; Writing

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on

Conference_Location :

Seattle, WA

Print_ISBN :

0-7695-1263-1

Type :

conf

DOI :

10.1109/ICDAR.2001.953884

Filename :

953884

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1583081