Title :
Clustering of Short Strings in Large Databases
Author :
Kazimianec, Michail ; Mazeika, Arturas
Author_Institution :
Fac. of Comput. Sci., Free Univ. of Bozen-Bolzano, Bolzano, Italy
Date :
Aug. 31 2009-Sept. 4 2009
Abstract :
A novel method, CLOSS, is proposed for clustering short strings in textual databases. It identifies clusters of misspelled strings even when the cluster border is not prominent. The method uses a q-gram approach to represent the data and a string proximity graph to find the cluster. The contribution addresses short-string clustering in text mining for cases in which the proximity graph has multiple horizontal lines or no horizontal line at all.
Keywords :
data mining; pattern clustering; string matching; text analysis; very large databases; CLOSS; cluster border; clustering of short strings; large databases; q-gram approach; string proximity graph; text mining; textual databases; Application software; Clustering methods; Computer science; Databases; Detection algorithms; Expert systems; Robustness; Smoothing methods; Tagging; Text mining; clustering; q-grams; short strings
Conference_Title :
Database and Expert Systems Application, 2009. DEXA '09. 20th International Workshop on
Conference_Location :
Linz
Print_ISBN :
978-0-7695-3763-4
DOI :
10.1109/DEXA.2009.73