Title :
Normalized Compression Distance of Multisets with Applications
Author :
Cohen, Andrew R. ; Vitanyi, Paul M. B.
Author_Institution :
Dept. of Electr. & Comput. Eng., Drexel Univ., Philadelphia, PA, USA
Abstract :
Pairwise normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity metric based on compression. We propose an NCD of multisets that is also metric. Previously, attempts to obtain such an NCD failed. For classification purposes it is superior to the pairwise NCD in accuracy and implementation complexity. We cover the entire trajectory from theoretical underpinning to feasible practice. It is applied to biological (stem cell, organelle transport) and OCR classification questions that were earlier treated with the pairwise NCD. With the new method we achieved significantly better results. The theoretic foundation is Kolmogorov complexity.
Keywords :
bioinformatics; computational complexity; image classification; optical character recognition; Kolmogorov complexity; NCD; OCR classification; biological task; multisets; normalized compression distance; organelle transport; pairwise NCD; pairwise normalized compression distance; parameter-free-feature-free-alignment-free similarity metric; stem cell; Accuracy; Additives; Complexity theory; Educational institutions; Measurement; Pattern recognition; Retina; Kolmogorov complexity; Normalized compression distance; classification; data mining; handwritten character recognition; multisets or multiples; organelle transport; pattern recognition; retinal progenitor cells; similarity; synthetic data;
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
DOI :
10.1109/TPAMI.2014.2375175