DocumentCode :
3529021
Title :
Optimization of text database using hierarchical clustering
Author :
Tian, Jilei ; Nurminen, Jani
Author_Institution :
Media Lab., Nokia Res. Center, Tampere
fYear :
2009
fDate :
19-24 April 2009
Firstpage :
4269
Lastpage :
4272
Abstract :
Many speech and language related techniques employ models that are trained using text data. In this paper, we introduce a novel method for selecting optimized training sets from text databases. The coverage of the subset selected for training is optimized using hierarchical clustering and the generalized Levenshtein distance. The validity of the proposed subset optimization technique is verified in a data-driven syllabification task. The results clearly indicate that the proposed approach meaningfully optimizes the training set, which in turn improves the quality of the trained model. Compared to the existing state-of-the-art data selection technique, the proposed hierarchical clustering approach improves the compactness of data clusters, decreases the computational complexity and makes data set selection scalable. The presented idea can be used in a wide variety of language processing applications that require training with text data.
Keywords :
database management systems; optimisation; pattern clustering; speech processing; text analysis; computational complexity; data selection technique; data-driven syllabification task; generalized Levenshtein distance; hierarchical clustering; optimized training sets; subset optimization technique; text database optimization; Clustering algorithms; Computational complexity; Databases; Decision trees; Laboratories; Natural languages; Neural networks; Optimization methods; Research and development; Speech processing; Levenshtein distance; hierarchical clustering; text data selection;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on
Conference_Location :
Taipei
ISSN :
1520-6149
Print_ISBN :
978-1-4244-2353-8
Electronic_ISBN :
1520-6149
Type :
conf
DOI :
10.1109/ICASSP.2009.4960572
Filename :
4960572