Deducing linguistic structure from the statistics of large corpora

Author

Brill, Eric ; Magerman, David ; Marcus, Mitchell ; Santorini, Beatrice

Author_Institution

Dept. of Comput. & Inf. Sci., Pennsylvania Univ., Philadelphia, PA, USA

fYear

1990

fDate

22-25 Oct 1990

Firstpage

380

Lastpage

389

Abstract

Two experiments are described that strongly suggest that largely distributional techniques might be developed to automatically provide both a set of part-of-speech tags for English and a skeletal parsing of free English text. In one experiment, the authors have developed a constituent boundary parsing algorithm that derives an (unlabeled) bracketing, given text annotated for part of speech as input. In the other experiment, the authors have investigated whether a distributional analysis can discover a part-of-speech tag set which might prove adequate to support experiments. The state of a tagged natural language corpus to aid such experiments is summarized.

Keywords

computational linguistics; grammars; linguistics; natural languages; English text; boundary parsing algorithm; distributional analysis; large corpora; linguistic structure; skeletal parsing; speech tags; tagged natural language corpus; Data mining; Distributed computing; Error analysis; Information analysis; Mutual information; Natural languages; Speech analysis; Statistical distributions; Statistics; Stochastic processes

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Technology, 1990. 'Next Decade in Information Technology', Proceedings of the 5th Jerusalem Conference on (Cat. No.90TH0326-9)

Conference_Location

Jerusalem

Print_ISBN

0-8186-2078-1

Type

conf

DOI

10.1109/JCIT.1990.128309

Filename

128309