Conference record number :
2139
Article title :
An Introduction to Noor Corpus and its Language Model
Title in another language :
An Introduction to Noor Corpus and its Language Model
Authors :
Elahimanesh Mohammad Hossein (author) , Minaei-Bidgoli Behrouz (author) , Gholami Mohammad Javad (author) , Juzi Hossein (author)
Keywords :
Language model , Natural language processing , Islamic corpus
Conference title :
First International Conference on Persian Script and Language Processing
Abstract (English) :
In linguistics, a text corpus is defined as a large collection of text documents. Text corpora are used to extract the hidden regularities of languages. As one application of statistical research and the extraction of such regularities, language models are built for use in information retrieval. In this paper we introduce Noor Corpus, one of the largest text corpora in Islamic science, and then present the language model of this corpus. The Noor Corpus is the result of a decade of effort by theological researchers and computer engineers at the Computer Research Center of Islamic Sciences (CRCIS). The corpus includes thousands of Islamic books classified into different categories. Most of the texts are in Arabic or Persian: there are 1.2 billion Arabic words and 616 million Persian words. The bigram language models of this corpus contain 80 million distinct bigrams in Arabic and 44 million distinct bigrams in Persian.
Conference document number :
4474716