Regional Information Center for Science and Technology - The Problem of Multi-words in Syntactic Processing of Persian

Conference record number:

4163

Article title (Persian):

مسئلة چندواژگي در پردازش نحو رايانشي زبان فارسي

Title in English:

The Problem of Multi-words in Syntactic Processing of Persian

Authors:

Ghayoomi, Masood m.ghayoomi@ihcs.ac.ir Institute for Humanities and Cultural Studies

Number of pages:

Keywords:

Multi-words, natural language processing, syntax, language corpus.

Publication year:

1396 (Solar Hijri calendar)

Conference title:

Fourth National Conference on Computational Linguistics

Document language:

Persian

Persian abstract (English translation):

This paper examines the challenge of multi-words in the computational syntactic processing of Persian and proposes a solution to it. The challenge falls into two major groups: multi-unit tokens and multi-token units. The first arises when several words in a single string have been wrongly fused together, or when a clitic has been attached to its host. In the second, several strings must be combined to yield one word. Both challenges surface in syntactic processing and in the lemmatization of words. To resolve them, three algorithms are introduced and applied in order to the Bijankhan corpus. The algorithms use both rule-based and statistical methods so that they can coherently overcome the problems that multi-words cause in the computational syntactic processing of Persian. After the algorithms are applied, evaluation on test data shows that the first algorithm, which separates the clitic from its host, reaches an accuracy of 80.52 percent. The second algorithm, which resolves fused elements, reaches 75.43 percent. The third algorithm, which combines several strings to build one word, reaches 86.38 percent.

English abstract:

This paper studies the challenge of multi-words in the syntactic processing of Persian and proposes solutions to resolve it. The challenge is divided into two major groups: multi-unit tokens and multi-token units. The first arises when two or more words are wrongly fused into a single string, or when a clitic is attached to its host. In the second, two or more strings must be combined to form one word. To resolve both, three algorithms are introduced and applied to the Bijankhan Corpus in a specific order. The algorithms combine rule-based and statistical methods so that the problems resulting from multi-words are resolved coherently. In the evaluation, the first algorithm, which splits the clitic from its host, achieved an accuracy of 80.52%. The second algorithm, which resolves fused elements, achieved 75.43%. The third algorithm, which combines strings to create a word, achieved 86.38%.
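The three steps named in the abstract can be pictured with a toy sketch. This is not the paper's implementation: the romanized lexicon, clitic list, and prefix list are hypothetical, the function names are invented for illustration, the step order is only inferred from the algorithm numbering, and the sketch is purely rule-based, whereas the abstract says the actual algorithms also use statistical methods.

```python
# Toy illustration only. LEXICON, CLITICS, and PREFIXES are hypothetical,
# romanized stand-ins; none of them come from the paper or the Bijankhan Corpus.
LEXICON = {"ketab", "dar", "khane", "ravam"}
CLITICS = ("am", "at", "ash")   # assumed pronominal clitics
PREFIXES = {"mi", "nemi"}       # assumed verbal prefixes typed as separate tokens


def split_clitic(token):
    """Step 1 (multi-unit token): detach a clitic if the remaining host is a known word."""
    if token in LEXICON:
        return [token]
    for clitic in CLITICS:
        host = token[: -len(clitic)]
        if token.endswith(clitic) and host in LEXICON:
            return [host, clitic]
    return [token]


def split_fused(token):
    """Step 2 (multi-unit token): split two known words typed without a space."""
    if token in LEXICON:
        return [token]
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in LEXICON and right in LEXICON:
            return [left, right]
    return [token]


def merge_tokens(tokens):
    """Step 3 (multi-token unit): join a detached prefix with the token that follows it."""
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i] in PREFIXES and i + 1 < len(tokens):
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def normalize(tokens):
    """Run the three steps in a fixed order: clitics, fused words, then merging."""
    step1 = [part for tok in tokens for part in split_clitic(tok)]
    step2 = [part for tok in step1 for part in split_fused(tok)]
    return merge_tokens(step2)


print(normalize(["ketabam", "darkhane", "mi", "ravam"]))
```

With this toy lexicon, `ketabam` is split into host and clitic, `darkhane` into two words, and `mi` plus `ravam` are joined into one token. A realistic system would need a full lexicon and a way to rank competing splits, which is where statistical evidence would presumably come in.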

Country:

Iran

Link to this document:

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=36&DC=232731