Title :
Similarity Calculation with Length Delimiting Dictionary Distance
Author :
Burkovski, Andre ; Klenk, Sebastian ; Heidemann, Gunther
Author_Institution :
Dept. for Intell. Syst., Univ. of Stuttgart, Stuttgart, Germany
Abstract :
The Normalized Compression Distance (NCD) has gained considerable interest in pattern recognition as a similarity measure applicable to unstructured data from widely differing domains, such as text, DNA sequences, or images. NCD uses existing compression programs such as gzip to compute similarity between objects. NCD has unique features: it does not require any prior knowledge, data preprocessing, feature extraction, domain adaptation, or parameter settings. Further, the NCD can be applied to symbolic data and raw signals alike. In this paper we decompose the NCD and introduce a method to measure compression-based similarity without performing compression. The Length Delimiting Dictionary Distance (LD3) takes the one component essential in compression methods, the dictionary generation, and strips the NCD of all dispensable components. The LD3 performs "compression based pattern recognition without compression", keeping all of the above benefits of the NCD while achieving better speed and recognition rates. We first review the NCD, introduce LD3 as the "essence" of NCD, and evaluate the LD3 on language tree experiments, authorship recognition, and genome phylogeny data.
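For readers unfamiliar with the NCD mentioned in the abstract, the following is a minimal sketch of its commonly cited form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. It assumes Python's zlib (DEFLATE, the algorithm behind gzip) as the compressor; the function names and the sample strings are illustrative assumptions and are not taken from the paper. It does not implement LD3, whose details are not given in this record.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Assumption: zlib approximates the 'ideal' compressor; any real
    compressor only gives an approximation of the theoretical distance.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample inputs for illustration only.
english_a = b"the quick brown fox jumps over the lazy dog. " * 20
digits = b"3141592653589793238462643383279502884197" * 20

same = ncd(english_a, english_a)    # object compared with itself
different = ncd(english_a, digits)  # unrelated content
print(f"NCD(a, a) = {same:.3f}, NCD(a, digits) = {different:.3f}")
```

Values near 0 suggest that one input helps compress the other; values near 1 suggest little shared structure. With a real compressor the distance of an object to itself is small but usually not exactly 0.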
Keywords :
data mining; dictionaries; pattern recognition; trees (mathematics); NCD; compression-based similarity; feature extraction; genome phylogeny data; language tree experiments; length delimiting dictionary distance; normalized compression distance; parameter-free data mining; complexity theory; compression algorithms; compressors; image coding; measurement; dictionary-based compression; similarity metric
Conference_Title :
Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on
Conference_Location :
Boca Raton, FL
Print_ISBN :
978-1-4577-2068-0
ISSN :
1082-3409
DOI :
10.1109/ICTAI.2011.133