Title :
Beyond probabilistic record linkage: Using neural networks and complex features to improve genealogical record linkage
Author :
Wilson, D. Randall
Author_Institution :
FamilySearch, Salt Lake City, UT, USA
Date :
July 31 2011-Aug. 5 2011
Abstract :
Probabilistic record linkage has been used for many years in a variety of industries, including medical, government, private sector and research groups. The formulas used for probabilistic record linkage have been recognized by some as being equivalent to the naïve Bayes classifier. While this method can produce useful results, it is not difficult to improve accuracy by using one of a host of other machine learning or neural network algorithms. Even a simple single-layer perceptron tends to outperform the naïve Bayes classifier, and thus traditional probabilistic record linkage methods, by a substantial margin. Furthermore, many record linkage systems use simple field comparisons rather than more complex features, partially due to the limits of the probabilistic formulas they use. This paper presents an overview of probabilistic record linkage, shows how to cast it in machine learning terms, and then shows that it is equivalent to a naïve Bayes classifier. It then discusses how to use more complex features than simple field comparisons, and shows how probabilistic record linkage formulas can be modified to handle this. Finally, it demonstrates a huge improvement in accuracy through the use of neural networks and higher-level matching features, compared to traditional probabilistic record linkage on a large (80,000 pair) set of labeled pairs of genealogical records used by FamilySearch.org.
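The equivalence the abstract mentions can be illustrated with a minimal sketch. It is not taken from the paper: the field names, the m- and u-probabilities, and both function names are hypothetical and chosen only for illustration. It assumes the usual agree/disagree weighting of probabilistic record linkage and shows that, with equal prior odds, the summed weights coincide with the log-odds of a naïve Bayes classifier.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from the paper).
# m = P(field agrees | pair is a match), u = P(field agrees | pair is a non-match)
M_PROBS = {"given_name": 0.90, "surname": 0.95, "birth_year": 0.80}
U_PROBS = {"given_name": 0.05, "surname": 0.02, "birth_year": 0.10}


def match_weight(agreements):
    """Probabilistic-linkage score: sum of log2 agreement/disagreement weights."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = M_PROBS[field], U_PROBS[field]
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def naive_bayes_log_odds(agreements, prior_match=0.5):
    """Naive Bayes log2 odds of 'match', assuming fields are conditionally independent."""
    p_match, p_non_match = prior_match, 1.0 - prior_match
    for field, agrees in agreements.items():
        m, u = M_PROBS[field], U_PROBS[field]
        p_match *= m if agrees else (1 - m)
        p_non_match *= u if agrees else (1 - u)
    return math.log2(p_match / p_non_match)


# One candidate pair: names agree, birth year disagrees.
pair = {"given_name": True, "surname": True, "birth_year": False}
print(match_weight(pair), naive_bayes_log_odds(pair))
```

With a prior other than 0.5, the two scores differ only by the constant log2 prior odds, which amounts to shifting the decision threshold.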
Keywords :
Bayes methods; learning (artificial intelligence); pattern matching; perceptrons; records management; genealogical record linkage method; high-level matching features; machine learning; naive Bayes classifier; neural network algorithms; probabilistic record linkage method; single-layer perceptron; Accuracy; Classification algorithms; Couplings; Fires; Neural networks; Probabilistic logic; Training
Conference_Title :
Neural Networks (IJCNN), The 2011 International Joint Conference on
Conference_Location :
San Jose, CA
Print_ISBN :
978-1-4244-9635-8
DOI :
10.1109/IJCNN.2011.6033192