• DocumentCode
    3571007
  • Title

    Who wrote this paper? Learning for authorship de-identification using stylometric features

  • Author

    Hurtado, Jose ; Taweewitchakreeya, Napat ; Zhu, Xingquan

  • Author_Institution
    Dept. of Comput. & Electr. Eng. & Comput. Sci., Florida Atlantic Univ., Boca Raton, FL, USA
  • fYear
    2014
  • Firstpage
    859
  • Lastpage
    862
  • Abstract
    In this paper, we propose to combine stylometric features and neural networks for authorship de-identification. Our research mainly focuses on scientific publications, because scholarly journals are publicly available with plenty of labeled data to learn an author's style or traits. The main challenge of authorship de-identification is to identify features which can properly capture an author's writing style. In the proposed design, we choose a combination of stylometric features, including lexical, syntactic, structural and content-specific features, to represent each author's style and use them to build classification models. We manually collect publications from computer science and biomedicine domains and validate our designs by using a number of classification methods. Our experiments show that among four well-known classifiers, Multilayer Perceptron (MLP) classifiers achieve the best performance for authorship de-identification.
  • Keywords
    feature extraction; learning (artificial intelligence); multilayer perceptrons; pattern classification; text analysis; MLP classifier; author style learning; author trait learning; author writing style; authorship deidentification; biomedicine publications; classification method; classification model; computer science publications; content-specific features; feature identification; lexical features; multilayer perceptron classifier; neural networks; publicly available scholarly journals; scientific publications; structural features; stylometric features; syntactic features; Abstracts; Computer science; Feature extraction; Radio frequency; Support vector machines; Training data; Machine learning; artificial neural network; authorship de-identification; text classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on
  • Type

    conf

  • DOI
    10.1109/IRI.2014.7051981
  • Filename
    7051981