Title :
Classification of personal names with application to DBLP
Author :
Biryukov, Maria ; Wang, Yafang
Author_Institution :
Dept. of Comput. Sci., Univ. of Luxembourg, Luxembourg City
Abstract :
In this paper we propose a new perspective on data analysis in digital libraries, bibliographic databases, and other databases containing personal names. Knowing the language or cultural background of a person can be beneficial in many applications; however, this information is often not explicitly present in the databases. We present a statistical tool for the automatic language detection of personal names. Our system does not require a dictionary of names for training and currently handles 14 different languages. General-purpose corpora for all Western European languages, Chinese, Japanese, and Turkish are used to build simple statistical models of the languages. The tool is fine-tuned to achieve precision and recall above 90% for many languages, outperforming some other systems that aim at the language identification of personal names. Using the bibliographic database DBLP as an example, we show how our tool can be used in tasks such as data cleaning and the discovery of trends.
Keywords :
bibliographic systems; digital libraries; document handling; statistical analysis; Chinese language; DBLP bibliographical database; Japanese language; Turkish language; Western European language; automatic language detection; data analysis; language-cultural background; personal name classification; statistical tool; Application software; Bibliographies; Cleaning; Computer science; Cultural differences; Databases; Dictionaries; Natural languages; Software libraries
Conference_Title :
Digital Information Management, 2008. ICDIM 2008. Third International Conference on
Conference_Location :
London
Print_ISBN :
978-1-4244-2916-5
Electronic_ISBN :
978-1-4244-2917-2
DOI :
10.1109/ICDIM.2008.4746754