SAZEH: A Wide Coverage Persian Constituency Tree Bank and Parser

Authors

Tabatabayi Seifi, Shohreh (RCDAT: Research Center for Development of Advanced Technologies); Sarraf Rezaee, Iman (RCDAT: Research Center for Development of Advanced Technologies)

Number of pages

4

Keywords

Constituency Treebank, Constituency Parser, Natural Language Processing

Publication year

1396 (Solar Hijri calendar; 2017/2018 CE)

Conference title

19th International Conference on Artificial Intelligence and Signal Processing

Document language

English

Abstract

Constituency parsing is a basic operation in many NLP tasks, such as translation, information extraction, and abstractive summarization. Training a probabilistic parser requires a wide-coverage constituency treebank. SAZEH is the first large-volume Persian constituency treebank, with more than 21,000 parsed trees and 627,000 tokens. Its sentences average 30 words in length and are drawn from the Peykare corpus, which already carries POS tags. The Berkeley Lexical Parser was trained on the SAZEH corpus, and the best F-measure attained on the test portion of the corpus is 81.65% using gold POS tags.

Country

Iran