DocumentCode :
566440
Title :
Using Random Forest classifiers to detect duplicate gazetteer records
Author :
Martins, Bruno ; Galhardas, Helena ; Goncalves, Nuno
Author_Institution :
INESC-ID, Tech. Univ. of Lisbon, Porto Salvo, Portugal
fYear :
2012
fDate :
20-23 June 2012
Firstpage :
1
Lastpage :
4
Abstract :
This paper presents an approach for detecting duplicate records in digital gazetteers using a state-of-the-art machine learning technique. It reports a thorough evaluation of a classifier, built with Random Forests, that labels pairs of gazetteer records as duplicates or non-duplicates, using different combinations of similarity scores as feature vectors. Experimental results show that feature vectors combining multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, lead to an accuracy of 97.45%.
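As a rough illustration of the kind of pairwise feature vector the abstract describes, the sketch below computes three similarity scores for a pair of records. It is not the authors' feature set. The record fields (`name`, `type`, `lat`, `lon`), the function names, the 10 km decay scale and the specific similarity measures are all assumptions chosen for illustration, and the semantic-relationship scores are omitted because the abstract does not say how they are computed. Vectors like these, with duplicate/non-duplicate labels, could then be used to train a Random Forest implementation such as scikit-learn's `RandomForestClassifier`.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Character-level similarity in [0, 1] between two place names (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def footprint_similarity(rec_a, rec_b, scale_km=10.0):
    """Map centroid distance to (0, 1]; scale_km is a hypothetical decay parameter."""
    d = haversine_km(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"])
    return math.exp(-d / scale_km)


def type_similarity(rec_a, rec_b):
    """1.0 if the place types match exactly, else 0.0 (a deliberately crude stand-in)."""
    return 1.0 if rec_a["type"] == rec_b["type"] else 0.0


def pair_features(rec_a, rec_b):
    """Feature vector for one candidate pair: [name, place type, footprint]."""
    return [
        name_similarity(rec_a["name"], rec_b["name"]),
        type_similarity(rec_a, rec_b),
        footprint_similarity(rec_a, rec_b),
    ]


# Hypothetical records, for illustration only.
lisboa = {"name": "Lisboa", "type": "city", "lat": 38.7223, "lon": -9.1393}
lisbon = {"name": "Lisbon", "type": "city", "lat": 38.7200, "lon": -9.1400}
madrid = {"name": "Madrid", "type": "city", "lat": 40.4168, "lon": -3.7038}

likely_duplicate = pair_features(lisboa, lisbon)
likely_distinct = pair_features(lisboa, madrid)
```

A pair of nearby records with similar names scores high on all three features, while a pair of distant cities scores near zero on the footprint feature; a classifier trained on labelled pairs would learn how to weigh such scores against each other.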
Keywords :
geography; learning (artificial intelligence); pattern classification; duplicate records; feature vectors; gazetteer records; geospatial footprints; machine learning technique; place names; random forest classifiers; semantic relationships; similarity scores; Conferences; Data mining; Geospatial analysis; Machine learning; Manuals; Semantics; Support vector machine classification; Digital Gazetteers; Duplicate Detection; Random Forests; Supervised Machine Learning;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Systems and Technologies (CISTI), 2012 7th Iberian Conference on
Conference_Location :
Madrid
ISSN :
2166-0727
Print_ISBN :
978-1-4673-2843-2
Type :
conf
Filename :
6263211