Title :
A Comparison of Approaches for Geospatial Entity Extraction from Wikipedia
Author :
Woodward, Daryl ; Witmer, Jeremy ; Kalita, Jugal
Author_Institution :
Comput. Sci. Dept., Univ. of Colorado Colorado Springs, Colorado Springs, CO, USA
Abstract :
In this paper, we target the challenge of extracting geospatial data from the article text of the English Wikipedia. We present the results of a Hidden Markov Model (HMM) based approach to identifying location-related named entities in our corpus of Wikipedia articles, which are primarily about battles and wars because of their high geospatial content. The HMM-based named entity recognition (NER) process drives a geocoding and resolution process whose goal is to determine the correct coordinates for each place name (often referred to as grounding). We compare our results to a previously developed data structure and algorithm for disambiguating place names that can have multiple coordinates. We demonstrate an overall F-measure of 79.63% in identifying and geocoding place names. Finally, we compare the results of the HMM-driven process to earlier work using a Support Vector Machine.
Keywords :
Internet; cartography; geographic information systems; hidden Markov models; information retrieval; English Wikipedia; Wikipedia article; data structure; geocoding; geospatial content; geospatial data extraction; geospatial entity extraction; hidden Markov model; location-related named entity identification; named entity recognition; support vector machine; Electronic publishing; Encyclopedias; Geospatial analysis; Hidden Markov models; Internet; Support vector machines; GIS; NER; geospatial extraction; wikipedia;
Conference_Title :
Semantic Computing (ICSC), 2010 IEEE Fourth International Conference on
Conference_Location :
Pittsburgh, PA
Print_ISBN :
978-1-4244-7912-2
Electronic_ISBN :
978-0-7695-4154-9
DOI :
10.1109/ICSC.2010.74