Title :
Exploiting limited data for parsing
Author :
Dongchen Li ; Xiantao Zhang ; Xihong Wu
Author_Institution :
Key Laboratory of Machine Perception and Intelligence, Peking University, Beijing, China
Abstract :
Data sparsity is extremely severe for parsing due to the flexibility of tree structures. Many tags and productions appear only rarely, yet they are crucial for parse disambiguation where they occur. Moreover, when a common tag occurs somewhat regularly in a non-canonical position, its distribution there is usually distinct. In this paper, we propose a metric that measures the scarcity of any phrase with an arbitrary span size. To strike a better compromise between training trees with high confidence and those with high scarcity, we impose constraints in response to rare but discriminative categories when training latent variable grammars. We exploit the limited data more fully by capturing the descriptive power of rare tree structure configurations in the Expectation-Maximization procedure and the Split-Merge framework. The resulting grammars are interpretable, as intended. Based on this approach, we further propose a method that exploits the limited training data from multiple perspectives and accumulates their advantages in a product model. Despite its limited training data, our model improves parsing performance on the Penn Chinese Treebank Fifth Edition, surpassing even some systems that use extra unlabeled data and external resources. Furthermore, this method is easily generalized to cope with data sparsity in other natural language processing tasks.
Keywords :
expectation-maximisation algorithm; grammars; merging; natural language processing; tree data structures; Penn Chinese treebank fifth edition; data sparsity; expectation and maximization procedure; latent variable grammar; natural language processing tasks; parse disambiguation; parsing performance improvement; rare tree structure configuration; scarcity measurement; split and merge framework; training trees; tree structure flexibility; Computational linguistics; Data models; Grammar; Merging; Production; Training;
Conference_Titel :
2014 IEEE/ACIS 13th International Conference on Computer and Information Science (ICIS)
Conference_Location :
Taiyuan
DOI :
10.1109/ICIS.2014.6912128