A Method for Keyword Extraction and Word Weighting to Improve the Classification of Persian Texts

Title in another language

An Approach for Extraction of Keywords and Weighting Words for Improvement Farsi Documents Classification

Authors

Rezaei, Vahideh (Islamic Azad University, Yasuj Branch, Department of Mathematics); Mohammadpour, Majid (Islamic Azad University, Yasuj Branch, Department of Computer Engineering); Parvin, Hamid (Islamic Azad University, Nourabad Mamasani Branch, Department of Computer Engineering); Nejatian, Samad (Islamic Azad University, Yasuj Branch, Department of Electrical Engineering)

Keywords

Thesaurus, information retrieval, keyword extraction, weighting

Persian abstract (translated)

Given the ever-growing volume of information and the large mass of unstructured text, keywords play an important role in information retrieval. Extracting keywords manually, however, poses many difficulties, so automatic keyword extraction has become an essential need of today's technology. This study uses a thesaurus, which has a structured organization, to extract more meaningful keywords from texts and to improve the classification of Persian texts with them. The steps taken to increase the comprehensiveness of search are as follows: first, stop words are removed and the remaining words are stemmed; next, with the help of the thesaurus, synonyms, broader terms, narrower terms, and related terms are found; then, to capture the relative importance of words, a numerical weight is assigned to each word, expressing how strongly the word bears on the subject of the text compared with the other words used in it. With these steps and the thesaurus, texts are classified more accurately. The method uses the k-nearest neighbor (KNN) algorithm for classification. KNN is widely used in text classification because of its simplicity and effectiveness. It works by comparing the given test text with the given training texts and measuring the similarity between them. Experimental results on several texts on different topics show the accuracy and capability of the proposed method in extracting keywords that match the user's needs and, consequently, in classifying texts more accurately.
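The preprocessing, thesaurus expansion, and weighting steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop list, the toy stemmer, the thesaurus entries, and the function names are all hypothetical, and TF-IDF is substituted here as a common weighting scheme because the abstract does not say which formula the authors use.

```python
import math
from collections import Counter

# Hypothetical toy resources (assumptions; the record does not give the real ones).
STOP_WORDS = {"the", "a", "of", "in", "and", "is"}
THESAURUS = {
    "car": {
        "synonyms": ["automobile"],
        "broader": ["vehicle"],
        "narrower": ["sedan"],
        "related": ["road"],
    },
}

def stem(word):
    """Placeholder stemmer: strips a trailing plural 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, stem the rest."""
    return [stem(t) for t in text.lower().split() if t not in STOP_WORDS]

def expand(tokens, thesaurus):
    """Append synonyms, broader, narrower, and related terms for each token."""
    expanded = list(tokens)
    for token in tokens:
        entry = thesaurus.get(token, {})
        for relation in ("synonyms", "broader", "narrower", "related"):
            expanded.extend(entry.get(relation, []))
    return expanded

def tfidf_weights(docs_tokens):
    """Return one {term: weight} dict per document (TF-IDF, an assumed scheme)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append({
            term: (count / total) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return weights

docs = [
    expand(preprocess("The cars in the road"), THESAURUS),
    preprocess("a sedan is a car"),
]
weights = tfidf_weights(docs)
```

In this toy run the terms added from the thesaurus receive weights alongside the original tokens, so a term that appears both in the text and as a related term (here `road`) ends up weighted more heavily than one that appears only through expansion.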

English abstract

Given the ever-increasing growth of information and the huge amount of unstructured documents, keywords play a very important role in information retrieval. Because manual keyword extraction faces various challenges, automated extraction seems inevitable. This research uses a thesaurus (a structured word-net) to extract keywords automatically. The authors claim that more meaningful keywords can be extracted from documents by employing a thesaurus than without one, and that keywords extracted this way can improve document classification. To increase the comprehensiveness of search, stop words are first removed and the remaining words are stemmed. Then, with the help of the thesaurus, equivalent, hierarchical, and dependent (related) words are found. Next, to determine the relative importance of words, a numerical weight is assigned to each word, representing the word's bearing on the subject matter in comparison with the other words used in the text. With these steps and the help of the thesaurus, texts are classified more accurately. The method uses the KNN algorithm for classification; because of its simplicity and effectiveness, KNN is widely used in text classification. The core of KNN is to compare the test text with the training texts and determine the similarity between them. The empirical results show that the quality and accuracy of the extracted keywords are satisfactory for users, and they confirm that document classification is improved.
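The KNN classification step can be illustrated with a minimal sketch. It assumes documents are already represented as sparse term-weight dictionaries and uses cosine similarity as the similarity measure; the abstract only says that similarity between the test and training texts is measured, so the choice of cosine, the value of `k`, the function names, and the toy training data are all assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(test_vec, training, k=3):
    """Label a test vector by majority vote among its k most similar training vectors.

    `training` is a list of (vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(test_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical weighted training documents (illustrative values only).
training = [
    ({"goal": 1.0, "team": 0.8}, "sports"),
    ({"match": 0.9, "team": 0.5}, "sports"),
    ({"election": 1.0, "vote": 0.7}, "politics"),
    ({"parliament": 0.9, "vote": 0.6}, "politics"),
]

predicted = knn_classify({"team": 1.0, "goal": 0.5}, training, k=3)
```

Here the test vector shares terms only with the two `sports` documents, so they rank first and win the vote. In practice an odd `k` and a larger training set make ties less likely.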

Publication year

1396 (Solar Hijri calendar)

Journal title

پردازش علائم و داده ها (Signal and Data Processing)

PDF file

7329382

Link to this record

https://search.isc.ac/dl/search/defaultta.aspx?DTC=8&DC=997261