A New Model of the Distance Between Query Words Based on Minimum Displacement

Original title (Persian)

ارائه يك مدل جديد از فاصله بين كلمات پرس و جو بر اساس حداقل جابجايي

Authors

Paksima, Javad, Payame Noor University - Department of Computer and Information Technology; Zareh Bidoki, Ali Mohammad, Yazd University - Faculty of Electrical and Computer Engineering; Derhami, Vali, Yazd University - Faculty of Electrical and Computer Engineering

Keywords

Search engine, ranking, distance, word dependency

Abstract

Studies of search engines show that most user queries contain more than one word. Multi-word queries can be modeled in two ways. The first model assumes that the query words are independent of one another; the second assumes that the position and order of the words are dependent. Experiments show that the words of most queries are dependent. One parameter that can capture this dependency is the distance between the query words in the document. This paper presents a new definition of distance based on the minimum displacement of document words needed to match the query. Moreover, most ranking algorithms score documents using the frequency of a word in the document (Term Frequency), and this parameter has no clear definition for queries with more than one word. The paper therefore defines Phrase Frequency and Inverse Document Frequency in terms of the new concept of distance and presents algorithms for computing them. The results of the proposed algorithm are also compared with the algorithm implemented by the open-source Lucene indexer and show a good increase in mean accuracy.
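The abstract does not state the formal definition of the distance, so the following is only an illustrative sketch of one possible reading of "minimum displacement", not the authors' algorithm. It assumes the distance is the smallest total number of positions by which one occurrence of each query word must be shifted so that the words appear contiguously and in query order somewhere in the document. The function name `min_displacement` and the brute-force search are hypothetical choices made for clarity.

```python
from itertools import product


def min_displacement(query, document):
    """Return the smallest total shift (in token positions) needed to line up
    one occurrence of each query term as a contiguous, ordered phrase.

    Illustrative assumption only: the source paper's exact definition is not
    given in the abstract. Returns None if a query term is absent.
    """
    k = len(query)
    # Positions at which each query term occurs in the document.
    positions = [[i for i, tok in enumerate(document) if tok == term]
                 for term in query]
    if k == 0 or len(document) < k or any(not p for p in positions):
        return None

    best = None
    # Try every way of picking one occurrence per query term ...
    for choice in product(*positions):
        # ... and every start index at which the phrase could be placed.
        for start in range(len(document) - k + 1):
            cost = sum(abs(p - (start + i)) for i, p in enumerate(choice))
            if best is None or cost < best:
                best = cost
    return best


doc = "open source search engine".split()
print(min_displacement(["search", "engine"], doc))  # 0: exact phrase match
print(min_displacement(["engine", "search"], doc))  # 2: words are swapped
print(min_displacement(["lucene"], doc))            # None: term not present
```

Under this assumed definition, an exact phrase match has distance 0 and the value grows as the query words become more scattered or reordered. The sketch counts only the shifts of the chosen query-word occurrences and is exponential in the number of repeated terms, so it is suitable for small examples only.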

Year of publication

1396 (Solar Hijri calendar; 2017-2018 CE)

Journal title

Soft Computing and Information Technology (رايانش نرم و فناوري اطلاعات)

PDF file

7498111

Link to this record

https://search.isc.ac/dl/search/defaultta.aspx?DTC=8&DC=1016126