Regional Information Center for Science and Technology (RICEST) - An Introduction to Noor Corpus and its Language Model

Conference record number:

2139

Article title:

An Introduction to Noor Corpus and its Language Model

Title in another language:

An Introduction to Noor Corpus and its Language Model

Authors:

Elahimanesh Mohammad Hossein (author), Minaei-Bidgoli Behrouz (author), Gholami Mohammad Javad (author), Juzi Hossein (author)

Number of pages:

Keywords:

Language model, Natural language processing, Islamic corpus

Publication year:

1391

Conference title:

First International Conference on Persian Script and Language Processing

Document language:

Persian

Abstract (English):

In linguistics, a text corpus is defined as a large collection of text documents. Text corpora are used to extract the hidden laws of languages. As one application of statistical research and hidden-law extraction, language models are built for use in information retrieval. In this paper we introduce one of the largest text corpora in the Islamic sciences, the Noor Corpus, and then present the language model of this corpus. The Noor Corpus is the result of a decade of effort by theological researchers and computer engineers at the Computer Research Center of Islamic Sciences (CRCIS). The corpus comprises thousands of Islamic books classified into different categories. Most of the texts are in Arabic and Persian: there are 1.2 billion Arabic words and 616 million Persian words. The bigram language models of the corpus contain 80 million distinct bigrams in Arabic and 44 million distinct bigrams in Persian.

Conference document number:

4474716

From page:

To page:

Link to this document:

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=36&DC=90093