Regional Information Center for Science and Technology (RICeST) - PAYMA: A Tagged Corpus of Persian Named Entities

Record number :

1122927

Article title :

PAYMA: A Tagged Corpus of Persian Named Entities

Title in another language :

PAYMA: A Tagged Corpus of Persian Named Entities

Authors :

Shahshahani, Mahsa Sadat (University of Tehran - School of Electrical and Computer Engineering - College of Engineering), Mohseni, Mahdi (University of Tehran - School of Electrical and Computer Engineering - College of Engineering), Shakery, Azadeh (University of Tehran - School of Electrical and Computer Engineering - College of Engineering), Faili, Heshaam (University of Tehran - School of Electrical and Computer Engineering - College of Engineering)

Number of pages :

From page :

To page :

109

Keywords :

Named entity corpus , Named entity recognition , Rule-based method , Deep-learning-based method , Conditional random fields method

Persian abstract :

The goal of named entity recognition is to classify the proper nouns in a text with labels such as person, location, and organization. The task is a preprocessing step for many natural language processing problems. Although much research has been done in this area for English, and systems there have reached F1 scores above ninety percent, little research has been done for Persian because no standard dataset exists. In this work we build such a dataset and make it freely available to researchers. We then use it to design a statistical system based on conditional random fields, as well as a system based on LSTM recurrent neural networks, for named entity recognition. Seven entity types are tagged in the corpus: person, location, organization, time, date, percent, and monetary amounts. All evaluations of the designed system are therefore carried out on these seven labels. To build the system, we first train a statistical system based on the CRF algorithm and then use its output as a feature for training a bidirectional LSTM recurrent neural network. In addition to this feature, we use k-means word clustering: the cluster number of each word is given to the LSTM network as a feature, which yields the final hybrid system. This way of combining the CRF model with the neural network model, together with the use of each word's k-means cluster number, is the novelty of this work. Experimental results show that the final model reaches an F1 of 87 percent at the word level and 80 percent at the level of named entity phrases. The experiments also show that using the CRF output as an input feature of the neural network makes it possible to reach acceptable recognition quality with a smaller amount of tagged data, which can be useful for languages in which tagged data is limited.
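The hybrid input described above can be illustrated with a minimal sketch. The abstract does not say how the CRF output and the cluster number are encoded, so the one-hot encoding, the use of a word embedding, and all names below (`build_inputs`, `tag_set`, `cluster_of`, the toy values) are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of length `size` with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_inputs(
    tokens: List[str],
    crf_tags: List[str],               # tags predicted by an already-trained CRF
    embeddings: Dict[str, List[float]],  # hypothetical word vectors
    cluster_of: Dict[str, int],        # hypothetical k-means cluster id per word
    tag_set: List[str],
    n_clusters: int,
) -> List[List[float]]:
    """Concatenate, per token: word vector + CRF tag one-hot + cluster one-hot."""
    rows = []
    for token, tag in zip(tokens, crf_tags):
        row = (
            list(embeddings[token])
            + one_hot(tag_set.index(tag), len(tag_set))
            + one_hot(cluster_of[token], n_clusters)
        )
        rows.append(row)
    return rows


# Toy values, invented for this example only.
tag_set = ["O", "B-ORG", "I-ORG"]
embeddings = {"Tehran": [0.1, 0.3], "University": [0.7, 0.2]}
cluster_of = {"Tehran": 2, "University": 0}

rows = build_inputs(
    ["Tehran", "University"], ["B-ORG", "I-ORG"],
    embeddings, cluster_of, tag_set, n_clusters=4,
)
```

Each row would then be one time step of the input sequence for a bidirectional LSTM tagger. A learned embedding for the tag and cluster ids would be an equally plausible encoding.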

English abstract :

The goal of named entity recognition (NER) is to classify the proper nouns in a text into classes such as person, location, and organization. NER is an important preprocessing step in many natural language processing tasks, such as question answering and summarization. Although many studies have addressed this task in English, and state-of-the-art NER systems exceed 90 percent F1, very few studies exist for Persian. One of the main reasons may be the lack of a standard Persian NER dataset for training and testing NER systems. In this research we create a standard tagged Persian NER dataset, which will be distributed freely for research purposes. To construct it, we studied the existing standard English NER datasets and concluded that almost all of them are built from news data. We therefore collected documents from ten Persian news websites. Next, to give the annotators guidelines for tagging these documents, we studied the guidelines used to construct the CoNLL and MUC English datasets and wrote our own, taking Persian linguistic rules into account. Under these guidelines, every word in a document is labeled as person, location, organization, time, date, percent, currency, or other (words that belong to none of these seven classes). Like most existing English NER datasets, we use IOB encoding to annotate named entities: the first token of a named entity is labeled B, any following tokens of the same entity are labeled I, and words that are not part of any named entity are labeled O. The resulting corpus, named PAYMA, consists of 709 documents and 302,530 tokens, of which 41,148 are labeled as named entities and the rest as O.
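As an illustration of the IOB encoding described in this abstract, the following minimal sketch converts entity spans to tags and back. The tag strings (`B-PER`, `I-ORG`, and so on) and the example sentence are assumptions made for illustration; the abstract does not give PAYMA's actual label names or file format.

```python
from typing import List, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def spans_to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Tag the first token of each entity B-<type>, the rest I-<type>, others O."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from IOB tags; an I- tag with no open entity is ignored."""
    spans: List[Span] = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if tag.startswith("I-") and etype == tag[2:]:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


tokens = ["Azadeh", "Shakery", "teaches", "at", "University", "of", "Tehran"]
spans = [(0, 2, "PER"), (4, 7, "ORG")]
tags = spans_to_iob(len(tokens), spans)
# tags == ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG"]
```

Because every entity starts with an explicit B tag, two adjacent entities of the same type remain distinguishable after encoding.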
To measure inter-annotator agreement, 160 documents were labeled by a second annotator. The kappa statistic, computed over the words labeled as named entities, was estimated at 95%. We then used the dataset to design a hybrid system for named entity recognition. We trained a statistical system based on the CRF algorithm and used its output as a feature to train a bidirectional LSTM recurrent neural network. In addition, we clustered the words with k-means and fed each word's cluster number to the LSTM network. This way of combining a CRF with a neural network, together with the use of a cluster number for each word, is the novelty of this work. Experimental results show that the final model reaches an F1 score of 87% at the word level and 80% at the phrase level.
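The gap between word-level and phrase-level F1 can be seen in a small sketch. The abstract does not state the exact scoring protocol, so this assumes a common convention: at the word level a token counts as correct when its full tag matches the gold tag, and at the phrase level an entity counts only when its boundaries and type both match. All function names are illustrative.

```python
from typing import List, Set, Tuple


def f1(true_pos: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def word_level_f1(gold: List[str], pred: List[str]) -> float:
    """Score individual tokens that carry an entity tag (assumed convention)."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    return f1(true_pos, n_pred, n_gold)


def phrases(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end exclusive, type) for every entity in an IOB sequence."""
    found, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if tag.startswith("I-") and etype == tag[2:]:
            continue
        if start is not None:
            found.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return found


def phrase_level_f1(gold: List[str], pred: List[str]) -> float:
    """Score whole entities; a partial match earns no credit."""
    g, p = phrases(gold), phrases(pred)
    return f1(len(g & p), len(p), len(g))


gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER"]
pred = ["B-ORG", "I-ORG", "O", "O", "B-PER"]  # last ORG token missed
word_f1 = word_level_f1(gold, pred)      # 3 of 4 gold tokens found: about 0.857
phrase_f1 = phrase_level_f1(gold, pred)  # ORG boundary wrong: 0.5
```

A single missed token costs little at the word level but invalidates the whole entity at the phrase level, which is why phrase-level scores are typically the lower of the two.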

Publication year :

1398

Journal title :

Signal and Data Processing

PDF file :

7755313

Link to this document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=8&DC=1122927