DocumentCode :
1855054
Title :
Text clustering on authorship attribution based on the features of punctuations usage
Author :
Jin Mingzhe ; Minghu Jiang
Author_Institution :
Dept. & Grad. Sch. of Culture & Inf. Sci., Doshisha Univ., Kyoto, Japan
Volume :
3
fYear :
2012
fDate :
21-25 Oct. 2012
Firstpage :
2175
Lastpage :
2178
Abstract :
This paper proposes a method for extracting the writing characteristics of authors from their use of punctuation marks. The clustering performance of the proposed method is compared with that of the character bigram method on 200 articles by five well-known modern writers. The analysis also compares Euclidean distance, cosine distance, and Kullback-Leibler (KLD) distance as measures for text clustering. The results show that (1) the proposed method is low-dimensional and also outperforms the bigram method, and (2) KLD has a clear advantage over Euclidean distance and cosine distance: the F1 value of Ward hierarchical clustering with KLD distance reaches 96% to 99%.
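The three distance measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the punctuation inventory, the smoothing constant, the symmetrised form of KLD, and all function names below are assumptions made for illustration, since this record does not give the paper's feature set or formulas. The Ward clustering step is not covered.

```python
import math
from collections import Counter

# Hypothetical punctuation inventory; the paper's actual feature set
# is not stated in this record.
PUNCTUATION = [",", ".", ";", ":", "!", "?"]


def punctuation_profile(text, marks=PUNCTUATION, smoothing=1e-6):
    """Relative frequency of each mark, smoothed so that no entry is zero
    (KLD is undefined when a probability is zero). Smoothing is an assumption."""
    counts = Counter(ch for ch in text if ch in marks)
    total = sum(counts.values()) + smoothing * len(marks)
    return [(counts[m] + smoothing) / total for m in marks]


def euclidean(p, q):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    """One minus the cosine similarity of two profiles."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def symmetric_kld(p, q):
    """Symmetrised Kullback-Leibler divergence: 0.5 * (KL(p||q) + KL(q||p)).
    The symmetric form is an assumption; plain KL is not symmetric."""
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)
```

Each text is reduced to a vector with one entry per punctuation mark, which is what makes the representation low-dimensional compared with a character bigram vector.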
Keywords :
natural language processing; pattern clustering; text analysis; Euclidean distance; KLD distance; Kullback-Leibler distance; authorship attribution; character Bigram method; cosine distance; punctuation marks; punctuations usage; text clustering; writing characteristic extraction; bigram of characters; distance; usage of punctuations
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Signal Processing (ICSP), 2012 IEEE 11th International Conference on
Conference_Location :
Beijing
ISSN :
2164-5221
Print_ISBN :
978-1-4673-2196-9
Type :
conf
DOI :
10.1109/ICoSP.2012.6492012
Filename :
6492012