Regional Information Center for Science and Technology (RICeST) - A Language-Model-Based Method for Tokenizing a Persian Corpus

Record number:

1028991

Article title:

ارائة يك روش مبتني بر مدل زباني براي واحدسازي پيكرۀ فارسي (literally: Presenting a Language-Model-Based Method for Tokenizing a Persian Corpus)

Title in another language:

A Tentative Method of Tokenizing Persian Corpus based on Language Modelling

Authors:

Ghayoomi, Masood (Institute for Humanities and Cultural Studies)

Number of pages:

From page:

To page:

Keywords:

natural language processing, data tokenization, statistical language modeling, corpus linguistics

Persian abstract:

Written Persian text has two simple but important problems. The first is multi-unit words, which result from attaching a word to the words that follow it. The second is multi-word units, which result from separating words that together form a single lexical unit. This paper introduces an algorithm that can automatically reduce these two problems in written Persian text and produce a standardized text. The proposed algorithm has three steps. In the first step, multi-unit words are split apart and multi-word units are joined together. For this step, a core algorithm based on language modeling is introduced that splits multi-unit words into independent words. This algorithm is then refined in light of the challenges encountered, in order to increase its performance. This step also uses a morphological analyzer to examine inflectional and derivational affixes, and a word-list matching method to resolve the problem of multi-word units. In the second step, the matching method is used to handle multi-word verbs. The third step repeats the first step in order to resolve new problems created in the text by running the second step. The proposed algorithm has been used to tokenize the linguistic data of the Persian Linguistic Database. Using this algorithm, 72.40 percent of the spelling errors in the words of the test data were corrected. The accuracy of this correction on the test data is 97.80 percent, and the spelling errors introduced by the algorithm in the test data amount to 0.02 percent.

English abstract:

Digital Persian text suffers from two simple but important problems. The first concerns multi-token units, which result from attaching individual words to one another. The second concerns multi-unit tokens, which result from detaching the elements of a word. This paper introduces an algorithm that reduces these problems automatically and produces a standard text. The proposed algorithm has three steps. In the first step, the multi-token units are split into individual words and the multi-unit tokens are joined together. For this step, a core algorithm based on language modeling is introduced to split multi-token units into independent words. The algorithm is then modified to address the challenges encountered and to improve its performance. Furthermore, this step uses a morphological analyzer to examine derivational and inflectional affixes, and exact matching against a word list to resolve the problem of the multi-unit tokens. In the second step, an exact word matching strategy is used to resolve the multi-unit token problem of verbs. The third step repeats the first step to fix new problems raised by running the second step. The algorithm was tested by tokenizing the data in the Persian Linguistic DataBase (PLDB). It corrected 72.04% of the spelling errors in the test set with 97.8% accuracy, and it introduced new spelling errors at a rate of 0.02%.
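
The abstract does not describe the core algorithm in detail, so the following is only a minimal sketch of one common way to split wrongly attached words with a language model: a unigram model searched with dynamic programming. It is not the paper's method. The toy counts in `COUNTS`, the penalty for unseen strings, and the names `log_prob` and `split_token` are all assumptions made for illustration.

```python
import math

# Hypothetical toy unigram counts. A real system would estimate these
# from a large, clean corpus rather than hard-coding them.
COUNTS = {
    "این": 50,       # "this"
    "کتاب": 30,      # "book"
    "را": 80,        # object marker
    "خواندم": 5,     # "I read"
    "کتابخانه": 8,   # "library"
    "خانه": 20,      # "house"
}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Unigram log-probability of `word`.

    Unseen strings get a penalty that grows with their length (an assumed
    smoothing scheme), so the search prefers known words where possible.
    """
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def split_token(token):
    """Return the most probable segmentation of `token` into words."""
    # best[i] holds (score, words) for the best segmentation of token[:i].
    best = [(0.0, [])]
    for end in range(1, len(token) + 1):
        candidates = []
        for start in range(end):
            score, words = best[start]
            piece = token[start:end]
            candidates.append((score + log_prob(piece), words + [piece]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]
```

With these toy counts, `split_token` separates the first two vocabulary entries when they are written without a space, but leaves "کتابخانه" intact, because the whole word is more probable under the model than its two parts. Joining wrongly separated words by word-list matching, as the abstract mentions, would need a separate pass that is not shown here.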

Publication year:

1397

Journal title:

زبان و زبان شناسي (Language and Linguistics)

PDF file:

7522854

Link to this record:

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=8&DC=1028991