• DocumentCode
    2137962
  • Title
    A similarity algorithm for categorical variables
  • Author
    Liang Zhao ; Jian-hui Liu
  • Author_Institution
    Inst. of Grad., Liaoning Tech. Univ., Huludao, China
  • fYear
    2013
  • fDate
    23-25 July 2013
  • Firstpage
    878
  • Lastpage
    883
  • Abstract
    How to measure the similarity of data objects is one of the most important problems in data analysis. This paper proposes a method that uses only the distribution of attributes to measure the similarity between nominal data objects. The algorithm centers on the logarithm of the conditional probability, because we consider distribution information to be the only information a dataset provides without domain knowledge. First, we calculate the conditional probabilities between the target data objects and every other attribute. Then we convert them to logarithmic form and sort them by data object. In the last step, we use the average value of each attribute column to compose the feature vector of each data object, and the Euclidean distance between feature vectors serves as the similarity metric. Extensive experiments on UCI data sets show that the derived similarity metric achieves considerable accuracy.
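    The three steps in the abstract can be illustrated with a minimal sketch. The abstract does not give formulas, so this is one possible reading and not the authors' implementation: for a data object with value x_j in attribute j, the feature for column j is assumed to be the mean of log P(x_k | x_j) over all other attributes k, with probabilities estimated from raw counts. The function names `feature_vectors` and `distance` and the toy rows are hypothetical.

```python
import math
from collections import Counter


def feature_vectors(rows):
    """Build one feature vector per row of categorical values.

    Assumed reading of the abstract: feature j of a row is the average,
    over every other attribute k, of log P(X_k = x_k | X_j = x_j),
    estimated from counts in `rows`. Requires at least two attributes.
    """
    n_attrs = len(rows[0])
    # Frequency of each value within each attribute column.
    single = [Counter(r[j] for r in rows) for j in range(n_attrs)]
    # Joint frequency of value pairs for every ordered attribute pair (j, k).
    pair = {
        (j, k): Counter((r[j], r[k]) for r in rows)
        for j in range(n_attrs)
        for k in range(n_attrs)
        if j != k
    }
    feats = []
    for r in rows:
        vec = []
        for j in range(n_attrs):
            # Each row is part of the dataset, so every count is >= 1
            # and the logarithm is always defined.
            logs = [
                math.log(pair[(j, k)][(r[j], r[k])] / single[j][r[j]])
                for k in range(n_attrs)
                if k != j
            ]
            vec.append(sum(logs) / len(logs))
        feats.append(vec)
    return feats


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.dist(u, v)


# Hypothetical toy data: three objects described by two nominal attributes.
rows = [("a", "x"), ("a", "y"), ("b", "y")]
feats = feature_vectors(rows)
d01 = distance(feats[0], feats[1])
d02 = distance(feats[0], feats[2])
```

    A smaller distance indicates more similar objects under this reading. Estimating probabilities from raw counts is an assumption; the paper may use a different estimator or a different averaging scheme.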
  • Keywords
    category theory; data analysis; probability; Euclidean distance; UCI data sets; attribute column; categorical variables; conditional probability; distribution information; domain knowledge; feature vector; nominal data object; similarity algorithm; similarity metrics; target data objects; Accuracy; Algorithm design and analysis; Bayes methods; Frequency measurement; Support vector machine classification; Vectors; Similarity measure
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Computation (ICNC), 2013 Ninth International Conference on
  • Conference_Location
    Shenyang
  • Type
    conf
  • DOI
    10.1109/ICNC.2013.6818100
  • Filename
    6818100