Extracting Digital Fingerprints from Chinese Documents

Author

Liu, Guo-Hua ; Ma, Hui-Dong ; Li, Xu ; Liang, Peng

fYear

2007

fDate

15-19 Dec. 2007

Firstpage

438

Lastpage

441

Abstract

Extracting features from Chinese documents is an important problem in protecting intellectual property. Existing approaches are mainly oriented toward word frequency or semantics, and they cannot extract features efficiently. By mapping Chinese documents into ordered sets of integers, we find that a Chinese document corresponds to a unique ordered set of integers, and that the set is isomorphic to the document. We therefore propose an algorithm that hashes the set into three kinds of hash value sequences: a paragraph sequence, a sentence sequence, and a chunk sequence, which together represent the features of the document completely. To reduce the number of features, which are defined as digital fingerprints in this paper, we present an optimal strategy for selecting some hash values from the sequences. The experimental results show that the proposed algorithms are efficient.
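
The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so everything concrete here is an assumption: characters are mapped to Unicode code points, chunks are taken as overlapping k-character windows, SHA-1 truncated to 64 bits stands in for the paper's hash function, and the common "0 mod p" rule is swapped in for the paper's selection strategy, which is not described. All function names are hypothetical.

```python
import hashlib
import re


def to_codepoints(text):
    """Map text to an ordered list of integers (assumed: Unicode code points)."""
    return [ord(ch) for ch in text]


def from_codepoints(ints):
    """Inverse mapping, showing the integer sequence loses no information."""
    return "".join(chr(i) for i in ints)


def hash_ints(ints):
    """Hash an integer sequence to a 64-bit value (SHA-1 is a stand-in)."""
    data = b"".join(i.to_bytes(4, "big") for i in ints)
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def paragraphs(text):
    """Split on newlines, dropping empty paragraphs."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def sentences(text):
    """Split on common Chinese and ASCII sentence terminators."""
    return [s.strip() for s in re.split(r"[。！？!?]", text) if s.strip()]


def chunks(text, k=4):
    """Overlapping k-character windows over non-whitespace characters
    (an assumed definition of 'chunk')."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return []
    return ["".join(chars[i:i + k]) for i in range(max(len(chars) - k + 1, 1))]


def hash_sequences(text, k=4):
    """Build the three hash value sequences named in the abstract."""
    return {
        "paragraph": [hash_ints(to_codepoints(p)) for p in paragraphs(text)],
        "sentence": [hash_ints(to_codepoints(s)) for s in sentences(text)],
        "chunk": [hash_ints(to_codepoints(c)) for c in chunks(text, k)],
    }


def select_fingerprints(hashes, p=4):
    """Keep hashes divisible by p. This is the generic '0 mod p' rule,
    NOT the paper's optimal strategy."""
    return [h for h in hashes if h % p == 0]


if __name__ == "__main__":
    doc = "今天天气很好。我们去公园。\n明天下雨！"
    seqs = hash_sequences(doc)
    for name, seq in seqs.items():
        print(name, len(seq), len(select_fingerprints(seq)))
```

Under these assumptions, two documents could be compared by intersecting their selected chunk fingerprints, with the paragraph and sentence sequences used to detect larger verbatim overlaps.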

Keywords

Computational intelligence; Data mining; Dictionaries; Feature extraction; Fingerprint recognition; Frequency; Intellectual property; Natural languages; Protection; Sequences;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computational Intelligence and Security, 2007 International Conference on

Conference_Location

Harbin, China

Print_ISBN

0-7695-3072-9

Electronic_ISBN

978-0-7695-3072-7

Type

conf

DOI

10.1109/CIS.2007.53

Filename

4415381

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2564389