• DocumentCode
    1830907
  • Title
    Detecting the Same Text in Different Languages
  • Author
    Koroutchev, Kostadin; Cebrián, Manuel
  • Author_Institution
    Depto. de Ingeniería Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain. k.koroutchev@uam.es
  • fYear
    2006
  • fDate
    Oct. 2006
  • Firstpage
    337
  • Lastpage
    341
  • Abstract
    Compression-based similarity distances have the main drawback of needing the same coding scheme for the objects to be compared. When two texts are translated, there exists significant similarity with no literal coincidence. In this article, we present an algorithm that compares the redundancy structure of the data extracted by means of a Lempel-Ziv compression scheme. Each text is represented as a graph, and two texts are considered similar under our measure if they have the same referential topology when compressed. We give empirical evidence that this measure detects similarity between data coded in different languages.
  • Keywords
    Compression algorithms; Computer science education; Conferences; Data mining; Entropy; Humans; Information theory; Length measurement; Object detection; Topology;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Theory Workshop, 2006. ITW '06 Chengdu. IEEE
  • Conference_Location
    Chengdu, China
  • Print_ISBN
    1-4244-0067-8
  • Electronic_ISBN
    1-4244-0068-6
  • Type
    conf
  • DOI
    10.1109/ITW2.2006.323816
  • Filename
    4119314