A System for Recognizing and Classifying Names in Persian Texts

Title in another language

Persian name entity recognition and classification

Authors

Esfahani, Seyed Abdolhamid (author); Rahati Quchani, Saeed (author); Jahangiri, Nader (author)

Issue information

Biannual, year 1389, issue 13

Journal rank

Scientific-Research

Keywords

Natural language processing, feature selection, N-gram function, name recognition and classification

Persian abstract

A name recognition and classification system is a system that can recognize and classify one or more types of names in a text. These names can be names of people, organizations, companies, places (country, city, street, and the like), time expressions (date and time), monetary amounts, percentages, and the like. Although much work has been done in the last decade on name recognition and classification systems in various languages and domains, no system for classifying names has so far been built for Persian, because no complete data set with rich labels exists. This research uses the data set of the Research Center of Intelligent Signal Processing. The method is as follows: first, the preprocessing algorithm separates the names from the data using the part-of-speech tags of the words, and then removes infinitives, time names, counting names, and numbers from the data set. (The remainder of the text was not included because of a system error.)

English abstract

A named entity recognition (NER) system can identify one or more kinds of names in a text and classify them into specified categories. These categories can be names of people, organizations, companies, places (country, city, street, etc.), time-related names (date and time), financial values, percentages, etc. Although a great deal of research has been done on NER in different languages during the past decade, the lack of a system with acceptable performance on Farsi texts is quite noticeable. In this paper, the Corpus of Research Center of Intelligent Signal Processing is used to create a Farsi NER system. The proposed system has three stages: preprocessing, feature extraction, and classification. To prepare a data set in the preprocessing stage, names are extracted from the text using part-of-speech (POS) tags, and then infinitives, time-related names, counting names, and numbers are removed from the data. This gives a more balanced data set for learning and classification. In the feature extraction stage, N-grams are computed as features, and in the classification stage four classifiers (linear, KNN, Bayesian, and neural network) are trained. Because time-related names show little variety and rarely overlap with names in the other categories, an auxiliary list is used to identify them. The results show that the neural network performs best (99%) at distinguishing between the names of places and people. In general, the KNN and linear classifiers reach an F-measure of 91% in classifying the names of places, the names of people, and general names. In classifying time-related names with the auxiliary list, an F-measure of 96% was obtained.
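
The feature extraction and classification stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say whether the N-grams are computed over characters or words, so character bigrams are assumed here; cosine similarity is an assumed similarity measure for the KNN step; and the romanized toy word list and the function names `char_ngrams`, `cosine`, and `knn_predict` are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(word, n=2):
    """Count the character n-grams of a word, padded with boundary markers."""
    padded = "^" + word + "$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(word, training, k=3, n=2):
    """Label a word by majority vote among its k most similar training words."""
    query = char_ngrams(word, n)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, char_ngrams(item[0], n)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data (romanized); the paper's corpus is not reproduced here.
TRAINING = [
    ("tehran", "place"), ("esfahan", "place"),
    ("kerman", "place"), ("hamedan", "place"),
    ("hamid", "person"), ("saeed", "person"),
    ("nader", "person"), ("hamideh", "person"),
]

if __name__ == "__main__":
    print(knn_predict("hamidreza", TRAINING, k=3))
```

A real system would first apply the preprocessing stage (filtering by POS tag) and would work on Persian script rather than romanized words. The same n-gram vectors could feed any of the other classifier types the abstract lists.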

Publication year

1389

Journal title

Signal and Data Processing

Link to this document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=8&DC=472331