Regional Information Center for Science and Technology - A new model for Persian multi-part words edition based on statistical machine translation

Title of article :

A new model for Persian multi-part words edition based on statistical machine translation

Author/Authors :

Zahedi, M. (author), School of Computer Engineering & Information Technology, Shahrood University of Technology, Shahrood, Iran; Arjomandzadeh, A. (author), School of Computer Engineering & Information Technology, Shahrood University of Technology, Shahrood, Iran

Issue Information :

Biannual journal, serial-numbered issue, year 2016

Abstract (translated from the Persian) :

In English, the parts of a multi-part word are separated by a hyphen. In Persian, a half-space is used to separate the parts of a multi-part word while keeping them together as a single word. In many cases a full space is wrongly placed between the parts of a multi-part word, which causes problems in Persian text processing and also reduces the readability of the text. This paper presents a method for replacing the space between the parts of multi-part words with a half-space. The proposed method applies the statistical machine translation paradigm to this editing task. Statistical machine translation uses a parallel corpus to extract statistical parameters and linguistic information, with source-language text on the source side of the corpus and target-language text on the target side. In the parallel corpus built for this paper, the source side contains text whose multi-part words have spaces between their parts, and on the target side those spaces have been edited to half-spaces. The results indicate that the proposed method can convert the spaces between the parts of multi-part words to half-spaces with considerable accuracy.
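
The parallel corpus described above can be illustrated with a minimal sketch. The record does not say how the authors built their corpus; the sketch assumes the source side is derived from correctly edited text by replacing each half-space with a space, and it assumes the half-space is encoded as the Unicode zero-width non-joiner (U+200C), which is the usual encoding in Persian text. The function names `make_parallel_pair` and `build_corpus` are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumption: the Persian half-space is stored as the zero-width non-joiner.
ZWNJ = "\u200c"


def make_parallel_pair(edited_sentence: str) -> Tuple[str, str]:
    """Return (source, target) for one correctly edited sentence.

    target: the sentence as given, with half-spaces inside multi-part words.
    source: the same sentence with every half-space replaced by a space,
            imitating the common spacing error.
    """
    target = edited_sentence
    source = edited_sentence.replace(ZWNJ, " ")
    return source, target


def build_corpus(edited_sentences: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep only sentences that contain at least one half-space."""
    return [make_parallel_pair(s) for s in edited_sentences if ZWNJ in s]
```

Each resulting pair has the same words on both sides; only the separator inside multi-part words differs.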

Abstract :

Multi-part words in English are hyphenated: a hyphen separates their parts. Persian has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common misuse of the space character leads to serious issues in Persian text processing and text readability. To address these issues, this work proposes a new model for correcting spacing in multi-part words. The proposed method is based on the statistical machine translation paradigm, in which text in a source language is translated into text in a destination language using statistical models whose parameters are derived from the analysis of bilingual text corpora. The proposed method applies statistical machine translation techniques, treating unedited multi-part words as the source language and space-edited multi-part words as the destination language. The results show that the proposed method can correct the spacing of Persian multi-part words with a statistically significant accuracy rate.
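
The idea of treating spacing correction as translation can be sketched with a much simpler stand-in than a full statistical machine translation system. The sketch below is not the authors' model: it learns, from the target side of (source, target) pairs, how often each adjacent word pair is joined by a half-space or by a space, and then picks the more frequent separator. It assumes words are separated by single spaces and that the half-space is the zero-width non-joiner (U+200C). The names `train` and `correct` are hypothetical.

```python
import re
from collections import Counter, defaultdict

ZWNJ = "\u200c"  # assumption: half-space encoded as zero-width non-joiner


def train(pairs):
    """Count separators between adjacent tokens on the target side.

    pairs: iterable of (source, target) sentences, where source is the
    target with half-spaces replaced by spaces.
    """
    counts = defaultdict(Counter)
    for _source, target in pairs:
        # The capture group keeps the separators in the split result.
        parts = re.split("([ \u200c])", target)
        tokens, seps = parts[0::2], parts[1::2]
        for left, sep, right in zip(tokens, seps, tokens[1:]):
            counts[(left, right)][sep] += 1
    return counts


def correct(source_sentence, counts):
    """Rewrite each space as the separator seen most often in training."""
    tokens = source_sentence.split(" ")
    out = [tokens[0]]
    for left, right in zip(tokens, tokens[1:]):
        seen = counts.get((left, right))
        sep = seen.most_common(1)[0][0] if seen else " "  # unseen pair: keep space
        out.append(sep + right)
    return "".join(out)
```

A real phrase-based system would additionally score candidates with a language model and handle word pairs never seen in training; this lookup table only shows how a parallel corpus of unedited and edited text can drive the correction.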

Journal title :

Journal of Artificial Intelligence and Data Mining

Serial Year :

2016

Record number :

2387954

Link To Document :

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=2387954