A’laam Corpus: A Standard Named Entity Corpus for the Persian Language

Title as published in English

A’laam Corpus: A Standard Corpus of Named Entity for Persian Language

Authors

Hosseinnejad, Shadi (Khajeh Nasir al-Din Toosi Research Center for Development of Advanced Technologies, Tehran); Shekofteh, Yasser (Shahid Beheshti University, Tehran, Faculty of Computer Science and Engineering); Emami Azadi, Tahereh (Khajeh Nasir al-Din Toosi Research Center for Development of Advanced Technologies, Tehran)

Pages

From page

127

To page

140

Keywords

Natural language processing, named entity recognition, named entity corpus, machine learning, conditional random field

Persian abstract (translated)

Named entity recognition is one of the standing problems in natural language processing. Its main applications are in text summarization, information extraction, question answering, machine translation, and document classification systems. One way to build a named entity recognition system is to use corpus-based methods. This paper describes how the A’laam corpus, a standard corpus annotated with named entity tags for Persian, was built and the steps involved. With thirteen named entity tags and a size of 250 thousand words, the resulting collection meets the needs of automatic tagging systems in Persian natural language processing. Using this corpus and the conditional random field machine learning method, a system for recognizing named entities in Persian sentences was built that achieves a precision of 92.94 percent and a recall of 78.48 percent.

English abstract

Named entity recognition (NER) is a natural language processing (NLP) task used mainly in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system determines the boundaries of each named entity and classifies it into one of a set of predefined categories. These categories include the names of persons, organizations, and locations (e.g. cities and countries), as well as expressions of time, quantities, monetary values, and percentages. In general, corpus-based approaches have proved well suited to the NER problem. Given a NER corpus, named entities can be recognized with rule-based or machine-learning methods. Corpus-based NER systems need standard, appropriately annotated corpora. However, such corpora exist mainly for languages such as English; for Persian (Farsi) they are rare or limited in size. This paper therefore describes the procedure used to produce a standard named entity (NE) corpus for Persian, the A’laam corpus. The A’laam corpus contains about 250,000 tokens tagged with 13 NE tags. It was developed at the Research Center for Development of Advanced Technologies (RCDAT). Its tokens are drawn from the Farsi Text Corpus, a standard Farsi corpus of more than 100 million words developed by the Research Center of Intelligent Signal Processing (renamed the Research Center for Development of Advanced Technologies in 2013). The words of the Farsi Text Corpus, selected from diverse written and spoken sources, were tokenized and corrected manually. In addition, an 8-million-word part of the Farsi Text Corpus carries word-level part-of-speech (POS) tags. In total, about 8,400 sentences were randomly selected from the Farsi Text Corpus to obtain the roughly 250,000 tokens of the A’laam corpus.
The corpus includes words, POS tags, and named entity tags. To evaluate the A’laam corpus, a Persian NER system was trained on it. For this purpose the corpus was divided into training and test sections: the training section accounts for 90% of the corpus and the test section for the remaining 10%. Using the Conditional Random Fields (CRF) method, the Persian NER system achieved a precision of 92.94% and a recall of 78.48%.
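
The evaluation setup described in the abstract (a 90/10 split followed by precision and recall) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration rather than a detail from the paper: the record does not say how sentences were assigned to the two sections, whether scores were computed per entity or per token, or which tagging scheme was used. The sketch assumes a random sentence-level split, BIO-style tags, and exact-match entity scoring; the tag names `PER` and `LOC` and all function names are hypothetical.

```python
import random


def split_sentences(sentences, train_fraction=0.9, seed=0):
    """Shuffle sentences and split them into train and test lists (assumed 90/10)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall(gold_sequences, predicted_sequences):
    """Entity-level precision and recall with exact span-and-type matching."""
    true_pos = n_pred = n_gold = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        gold_spans = extract_entities(gold)
        pred_spans = extract_entities(pred)
        true_pos += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical gold and predicted tags for a single four-token sentence.
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(precision_recall(gold, pred))  # (1.0, 0.5)
```

In the toy example the tagger finds one of the two gold entities and predicts nothing wrong, so precision is higher than recall. The reported figures (92.94% precision, 78.48% recall) show the same direction of imbalance, although the sketch says nothing about why the actual system behaves that way.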

Publication year

1396 (Solar Hijri; 2017-2018)

Journal title

Signal and Data Processing

PDF file

7329326

Link to this record

https://search.isc.ac/dl/search/defaultta.aspx?DTC=8&DC=997227