Regional Information Center for Science and Technology - A New Method for Extracting Named Entities in Classical Arabic

Record number:

997181

Article title:

A New Method for Extracting Named Entities in Classical Arabic

Title in another language:

A New Approach for Extracting Named Entity in Classical Arabic

Authors:

Sajjadi, Mohammad Bagher, Azad University, Central Tehran Branch, Tehran - Faculty of Computer Engineering; Rashidi, Hassan, Allameh Tabataba'i University, Tehran - Faculty of Mathematics and Computer Science; Minaei-Bidgoli, Behrouz, University of Science and Technology, Tehran - Faculty of Computer Engineering

Number of pages:

From page:

To page:

Keywords:

Named entity recognition, classifier ensemble, boosting method, Classical Arabic

Persian abstract:

Named entity recognition, one of the tasks of natural language processing, consists of identifying proper names and classifying them into one of four groups: person, location, organization, and time. Because it considerably improves the performance of other areas of natural language processing, such as machine translation, information retrieval, search result clustering, and question answering, it has in recent years also attracted researchers working on Arabic. Although most research in this area has been carried out on Modern Standard Arabic, this study focuses on Classical Arabic. To that end, a new method for named entity recognition in Arabic is presented. The study introduces a Classical Arabic text corpus called NoorCorp, consisting of 130 thousand words annotated by experts; in addition, a gazetteer of 18,000 person names extracted from Hadith books is used as an external resource. The prediction model is based on a classifier ensemble and a two-stage method: in the first stage, named entities are identified with the AdaBoost.M1 algorithm, and in the second stage they are classified into predefined groups with the AdaBoost.M2 algorithm. To overcome the challenges of Arabic, tokenization, part-of-speech tagging, and base phrase chunking are employed. Using a statistical method, some of the words that occur frequently in named entities were extracted as keywords. The proposed model achieves an F-measure of 86.85 percent, which indicates that it performs well. Finally, the proposed method is applied to a Modern Standard Arabic corpus called ANERCorp, and the results are compared with those on NoorCorp.
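The record reports only the final F-measure, not how it was computed. As a minimal sketch, assuming the standard balanced F-measure over exact-match entity spans (the span tuples, the type tags, and the `f_measure` name are illustrative and not taken from the paper), the score could be computed as follows:

```python
def f_measure(gold, predicted, beta=1.0):
    """F-measure over sets of (start, end, type) entity spans.

    A predicted span counts as correct only if its boundaries and
    its type both match a gold span exactly (an assumption here).
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy data: three gold entities, one of which is predicted with the wrong type.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "ORG"), (9, 10, "ORG")}
score = f_measure(gold, pred)
print(f"F1 = {score:.4f}")
```

With `beta=1.0` precision and recall are weighted equally; whether the paper scores at the token level or the entity level is not stated in this record.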

English abstract:

In natural language processing (NLP), developing resources and tools extends the reach and effectiveness of research in each language. In recent years, Arabic named entity recognition (ANER) has attracted NLP researchers because it significantly improves other NLP tasks such as machine translation, information retrieval, question answering, and query result clustering. While most of this research is based on Modern Standard Arabic (MSA), this paper focuses on Classical Arabic (CA) literature. We present a corpus called NoorCorp with 130k labeled words for research purposes, annotated manually by human experts. The corpus is based on a historical Islamic book written about 1200 years ago and contains 1843 sentences and 127550 words. We also collected about 18k proper names from old Hadith books as a gazetteer, called NoorGazet, which is used as a feature. We propose a new approach to extracting named entities (NEs) of four types: person, location, organization, and time. The approach is hybrid, combining the advantages of rule-based and machine learning approaches. We divided NoorCorp into training and test sets containing 80% and 20% of the data, respectively. The prediction model, based on the boosting method, was developed in two steps: AdaBoost.M1 identifies NEs and AdaBoost.M2 classifies them. Many methods use multiple classifiers as voters and combine their results; among them, ensemble methods are those that generate multiple hypotheses using the same base learner. We developed an ensemble of 50 members (classifiers), with a decision stump as the weak learner. Since only 17% of the text data carries named entity labels, we had to deepen the tree while restricting pruning.
We used tokenization, part-of-speech (POS) tagging, and base phrase chunking (BPC) to overcome linguistic obstacles in Arabic, including semantic ambiguity, optional diacritics, complex morphology, and nonstandard written text. Moreover, using a statistical technique, the most frequently used words were extracted as keywords. The results show that the method outperforms a decision tree used as the base classifier. The overall F-measure is 86.85, about 20% better than the baseline and about 12% better than the CART decision tree. Since the CA corpus consists of simpler linguistic patterns than MSA, we also applied the proposed approach to ANERCorp, a Modern Standard Arabic corpus. The results show that the proposed model performs about 19% better on the CA corpus than on the MSA corpus. This is because many NEs have entered MSA from other languages; these proper names follow no specific patterns and do not appear in the gazetteer. In addition, many NEs are not uniformly distributed in ANERCorp, which considerably reduces accuracy.
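The abstract names AdaBoost.M1 with decision stumps for the first stage (entity versus non-entity) but gives no implementation details. The following is a rough sketch of textbook AdaBoost.M1 for that binary case, not the authors' code: the single numeric feature, the toy data, and all function names are hypothetical, and the default of 50 rounds only mirrors the ensemble size mentioned in the abstract.

```python
import math


def stump_predict(stump, x):
    """A stump predicts `left` when x[j] <= thr and the other label otherwise."""
    j, thr, left = stump
    return left if x[j] <= thr else 1 - left


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best_err, best_stump = None, None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for left in (0, 1):
                stump = (j, thr, left)
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict(stump, x) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_stump, best_err


def adaboost_m1(X, y, rounds=50):
    """Return a list of (vote_weight, stump) pairs for labels in {0, 1}."""
    n = len(X)
    w = [1.0 / n] * n                      # uniform initial distribution
    ensemble = []
    for _ in range(rounds):
        stump, err = train_stump(X, y, w)
        if err >= 0.5:                     # weak learner no better than chance
            break
        err = max(err, 1e-10)              # avoid division by zero on a perfect stump
        beta = err / (1.0 - err)
        ensemble.append((math.log(1.0 / beta), stump))
        # Shrink the weight of correctly classified examples, then renormalize.
        w = [wi * (beta if stump_predict(stump, x) == yi else 1.0)
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    votes = {0: 0.0, 1: 0.0}
    for alpha, stump in ensemble:
        votes[stump_predict(stump, x)] += alpha
    return max(votes, key=votes.get)


# Toy data: 1 = token belongs to a named entity, 0 = it does not.
# No single stump separates these labels, but three boosted stumps do.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, 0, 0, 1, 1]
ensemble = adaboost_m1(X, y, rounds=3)
preds = [predict(ensemble, x) for x in X]
print(preds)
```

In the two-stage design described above, tokens flagged by this first stage would then go to a multi-class booster (AdaBoost.M2 in the paper) that assigns person, location, organization, or time; that second stage is not sketched here.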

Year of publication:

1396 (Solar Hijri)

Journal title:

Signal and Data Processing (پردازش علائم و داده ها)

PDF file:

7329164

Link to this record:

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=8&DC=997181