DocumentCode
1830907
Title
Detecting the Same Text in Different Languages
Author
Koroutchev, Kostadin ; Cebrián, Manuel
Author_Institution
Depto. de Ingeniería Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain. k.koroutchev@uam.es
fYear
2006
fDate
Oct. 2006
Firstpage
337
Lastpage
341
Abstract
Compression-based similarity distances have the main drawback of needing the same coding scheme for the objects to be compared. When two texts are translated, there exists significant similarity with no literal coincidence. In this article, we present an algorithm that compares the redundancy structure of the data extracted by means of a Lempel-Ziv compression scheme. Each text is presented as a graph and two texts are considered similar with our measure if they have the same referential topology when compressed. We give empirical evidence that this measure detects similarity between data coded in different languages.
Keywords
Compression algorithms; Computer science education; Conferences; Data mining; Entropy; Humans; Information theory; Length measurement; Object detection; Topology;
fLanguage
English
Publisher
ieee
Conference_Titel
Information Theory Workshop, 2006. ITW '06 Chengdu. IEEE
Conference_Location
Chengdu, China
Print_ISBN
1-4244-0067-8
Electronic_ISBN
1-4244-0068-6
Type
conf
DOI
10.1109/ITW2.2006.323816
Filename
4119314