Regional Information Center for Science and Technology (RICeST) - Corpus-based coreference resolution in Persian texts

Record number :

1141491

Article title :

Corpus-based coreference resolution in Persian texts

Title in another language :

Corpus based coreference resolution for Farsi text

Authors :

Rahimi, Zeinab (Khajeh Nasir al-Din Toosi Research Institute for the Development of Advanced Technologies, Tehran - Speech and Natural Language Processing Group) , Hosseinnejad, Shadi (Khajeh Nasir al-Din Toosi Research Institute for the Development of Advanced Technologies, Tehran - Speech and Natural Language Processing Group)

Number of pages :

From page :

To page :

Keywords :

automatic coreference resolution , reference resolution , pronoun anaphora resolution , referring expressions

Persian abstract :

Coreference resolution (also called reference resolution), that is, finding the coreferent words in a text, is an important task in natural language processing and a key operational component of problems such as automatic summarization, automatic question answering, and information extraction. By definition, two words are coreferent when both refer to a single entity in the text or in the real world. Many attempts have been made to solve this problem, and their results show that coreference resolution can be carried out with different methods, such as rule-based methods, methods based on heuristic rules, and machine learning methods (supervised or unsupervised). Notably, the use of annotated corpora has become widespread in this field in recent years and has produced good results. Building on this, the present study produced a coreference corpus of about one million words that also carries named entity tags. For coreference, all noun phrases, pronouns, and named entities are annotated, and the corpus uses seven named entity labels. Using this corpus, an automatic coreference resolution tool based on a support vector machine was built; its precision on gold test data is about sixty percent.

English abstract :

Coreference resolution, that is, finding all expressions in a text that refer to the same entity, is an important requirement in natural language processing. Two words are coreferent when both refer to a single entity in the text or in the real world, so the main task of a coreference resolution system is to identify the terms that refer to the same entity. A coreference resolution tool can be used in many natural language processing tasks, such as machine translation, automatic text summarization, question answering, and information extraction, and adding coreference information can make such systems more capable. Coreference resolution can be done in different ways, including heuristic rule-based methods and supervised or unsupervised machine learning methods. Corpus-based and machine learning methods have been widely used for coreference resolution in recent years and have led to good performance. These methods need a manually labeled corpus of sufficient size, and before this research no such corpus existed for Persian. One important goal of this work was therefore to produce a thorough corpus that can be used for coreference resolution and in related areas of linguistics and computational linguistics. In this research, a manually annotated corpus of coreference-tagged phrases was produced that contains about one million words. It also carries named entity recognition (NER) tags. The corpus uses 7 named entity labels, and for the coreference task all noun phrases, pronouns, and named entities have been tagged. Using this corpus, a coreference resolution tool was built with a support vector machine, with a precision of about 60% on gold test data. This article presents the procedure for producing that tool.
The tool is produced with a machine learning method and is based on the tagged corpus of 900 thousand tokens. Several different features and tools were used in building the system, each of which affects the accuracy of the whole tool. Increasing the number of features, especially semantic features, can help improve the results. At present, suitable syntactic and semantic tools are not available for Persian, and this research is limited in that respect. The coreference-tagged corpus produced in this study is more than 500 times larger than previous Persian corpora, and it is comparable to the prominent ACE and OntoNotes corpora. The resulting system has an F-measure of nearly 60 according to the standard CoNLL criterion. Other, more limited studies on Farsi have reported accuracies ranging from 40% to 90%, but these figures are not comparable to the present study because they were not measured with the standard criteria of the coreference resolution field.
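The abstract reports its result with the CoNLL criterion, which is conventionally the unweighted average of the F1 scores of three coreference metrics (MUC, B-cubed, and entity-based CEAF). As an illustration of why scores from non-standard measures are hard to compare, the following is a minimal sketch of one of those components, B-cubed, in Python. It is not the authors' evaluation code: the function names, the representation of mentions as plain strings, and the assumption that gold and predicted clusters cover the same mentions are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set, Tuple


def mention_to_cluster(clusters: List[Set[str]]) -> Dict[str, Set[str]]:
    """Map every mention to the cluster (entity chain) that contains it."""
    return {mention: cluster for cluster in clusters for mention in cluster}


def b_cubed(gold: List[Set[str]], predicted: List[Set[str]]) -> Tuple[float, float, float]:
    """Return B-cubed (precision, recall, F1) for two clusterings.

    Assumption: mentions are hashable identifiers (strings here) and only
    mentions present in both clusterings are scored.
    """
    gold_of = mention_to_cluster(gold)
    pred_of = mention_to_cluster(predicted)
    mentions = set(gold_of) & set(pred_of)
    if not mentions:
        return 0.0, 0.0, 0.0

    # Per-mention precision: share of the predicted cluster that is correct.
    precision = sum(
        len(gold_of[m] & pred_of[m]) / len(pred_of[m]) for m in mentions
    ) / len(mentions)
    # Per-mention recall: share of the gold cluster that was recovered.
    recall = sum(
        len(gold_of[m] & pred_of[m]) / len(gold_of[m]) for m in mentions
    ) / len(mentions)

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def conll_score(muc_f1: float, b3_f1: float, ceaf_e_f1: float) -> float:
    """Average the three F1 scores, as the CoNLL criterion does."""
    return (muc_f1 + b3_f1 + ceaf_e_f1) / 3.0


# Hypothetical toy data: mention "c" is attached to the wrong entity.
gold_chains = [{"a", "b", "c"}, {"d", "e"}]
predicted_chains = [{"a", "b"}, {"c", "d", "e"}]
print(b_cubed(gold_chains, predicted_chains))
```

MUC and CEAF are omitted here for brevity; in practice a reference scorer would normally be used so that results remain comparable across studies.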

Publication year :

1399

Journal title :

Signal and Data Processing

PDF file :

8113646

Link to this record :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=8&DC=1141491