Authors:
Hosseinnejad, Shadi (Khajeh Nasir al-Din Toosi Research Center for Development of Advanced Technologies, Tehran); Shekofteh, Yasser (Shahid Beheshti University, Tehran, Faculty of Computer Science and Engineering); Emami Azadi, Tahereh (Khajeh Nasir al-Din Toosi Research Center for Development of Advanced Technologies, Tehran)
Keywords:
Natural language processing, named entity recognition, named entity corpus, machine learning, conditional random field
Persian abstract:
Named entity recognition is one of the prominent problems in natural language processing. Its main applications are in text summarization, information extraction, question answering, machine translation, and document classification systems. One way to build a named entity recognition system is to use corpus-based methods. This paper describes how the A’laam corpus, a standard corpus annotated with named entity tags for the Persian language, was prepared and the stages involved. With thirteen named entity tags and a size of 250 thousand words, the resulting collection meets the needs of automatic tagging systems in Persian natural language processing. Using this corpus and the conditional random field machine learning method, a system for recognizing named entities in Persian sentences was built, with a precision of 92.94 percent and a recall of 78.48 percent.
English abstract:
Named entity recognition (NER) is a natural language processing (NLP) problem with applications mainly in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. An NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named entities include the names of persons, organizations, and locations (e.g. city and country), as well as expressions of time, quantities, monetary expressions, and percentages. In general, corpus-based approaches have proved well suited to the NER problem. Given an NER corpus, named entities can be recognized through rule-based or machine-learning methods. Corpus-based NER systems need standard, appropriately annotated corpora. However, such corpora exist mainly for languages such as English; for Persian/Farsi they are rare or limited in volume. This paper therefore describes the procedure for producing a standard named entity (NE) corpus for the Persian language, the A’laam corpus. The A’laam corpus contains about 250,000 tokens tagged with 13 NE tags. It was developed at the Research Center for Development of Advanced Technologies (RCDAT). Its tokens are drawn from the Farsi Text Corpus, a standard Farsi corpus of more than 100 million words developed by the Research Center of Intelligent Signal Processing (renamed the Research Center for Development of Advanced Technologies in 2013). The words of the Farsi Text Corpus, selected from diverse written and spoken sources, were tokenized and corrected manually. In addition, an 8-million-word part of the Farsi Text Corpus has part-of-speech (POS) tags at the word level. In total, about 8,400 sentences of the Farsi Text Corpus were randomly selected to obtain the roughly 250,000 tokens of the A’laam corpus.
The A’laam corpus includes words, POS tags, and named entity tags. To evaluate it, a Persian NER system was trained on the corpus, which was divided into training and test sections: the training section accounts for 90% of the corpus and the test section for the remaining 10%. Using the Conditional Random Fields (CRF) method, the Persian NER system achieved a precision of 92.94% and a recall of 78.48%.
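The evaluation protocol summarized above (a 90%/10% split, then precision and recall) can be illustrated with a minimal sketch. The abstract does not state how the split was drawn, which tag encoding was used, or whether scores were computed per token or per entity, so everything below is an assumption made for illustration: a seeded random split at sentence level, BIO-style tags with hypothetical labels such as `PER` and `LOC`, and exact-match scoring of whole entity spans. The function names are likewise hypothetical and are not taken from the paper.

```python
import random


def split_corpus(sentences, train_fraction=0.9, seed=0):
    """Shuffle sentences and split them into train and test parts.

    Assumption: a random sentence-level split; the abstract only gives
    the 90%/10% proportions.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def extract_spans(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence.

    Assumption: BIO encoding ("B-PER", "I-PER", "O"); `end` is exclusive.
    """
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def precision_recall(gold_sentences, predicted_sentences):
    """Entity-level precision and recall with exact span and type match."""
    true_pos = n_pred = n_gold = 0
    for gold, pred in zip(gold_sentences, predicted_sentences):
        gold_spans = extract_spans(gold)
        pred_spans = extract_spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data with hypothetical tags, not drawn from the A'laam corpus.
    gold = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
    pred = [["B-PER", "I-PER", "O", "O"], ["O", "B-ORG", "O"]]
    precision, recall = precision_recall(gold, pred)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the toy data the tagger misses one of three gold entities but makes no false predictions, so precision is higher than recall, the same direction as the gap between the reported 92.94% and 78.48%.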