Regional Information Center for Science and Technology - Finding Syntactic Structure in Unparsed Corpora: The Gsearch Corpus Query System

Title of article :

Finding Syntactic Structure in Unparsed Corpora: The Gsearch Corpus Query System

Author/Authors :

TREWIN, SHARI; CORLEY, STEFFAN; CORLEY, MARTIN; KELLER, FRANK; CROCKER, MATTHEW W.

Issue Information :

Periodical with serial number, year 2001

Pages :

-80

From page :

To page :

Abstract :

The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to distinguish reliably the authors of a randomly-chosen group and performs better than a lexically-based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.

Keywords :

Parsing , syntactic annotation , computational linguistics , SGML , psycho-linguistics , corpus search

Journal title :

COMPUTERS AND THE HUMANITIES

Serial Year :

2001

Record number :

32085

Link To Document :

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=32085