DocumentCode :
755639
Title :
Data alignment and integration [US government]
Author :
Pantel, Patrick ; Philpot, Andrew ; Hovy, Eduard
Author_Institution :
Inf. Sci. Inst., Univ. of Southern California, Marina del Rey, CA, USA
Volume :
38
Issue :
12
fYear :
2005
Firstpage :
43
Lastpage :
50
Abstract :
A general-purpose solution to the problem of matching entities within or across heterogeneous data sources can't depend on the presence or reliability of auxiliary data such as structural information or metadata. Instead, it must leverage the available data (or observations) that describe the entities. Our technology, based on information theory principles, measures the importance of observations and then leverages them to quantify the similarity between entities, improving accuracy and reducing the time required to find related entities in a population. Applying this purely data-driven paradigm, we've built two systems: Guspin for automatically identifying equivalence classes or aliases, and Sift for automatically aligning data across databases. The key to our underlying technology is identifying the most informative observations and then matching entities that share them. Given the right types of observations, our model can potentially solve several serious and urgent problems that governments face, such as terrorist detection, identity theft, and data integration.
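The idea in the abstract of weighting observations by how informative they are, then matching entities that share the informative ones, can be illustrated with a minimal sketch. This is not the Guspin or Sift implementation: the self-information weight, the weighted-Jaccard similarity, the function names, and the sample records are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def observation_weights(entities):
    """Weight each observation by its self-information, -log2(p).

    p is the fraction of entities that carry the observation, so an
    observation shared by every entity gets weight 0 and rare ones
    get high weights. (Assumed weighting, not the paper's formula.)
    """
    n = len(entities)
    counts = Counter(
        obs for observations in entities.values() for obs in set(observations)
    )
    return {obs: -math.log2(count / n) for obs, count in counts.items()}


def similarity(a, b, weights):
    """Weighted Jaccard similarity between two observation sets."""
    a, b = set(a), set(b)
    union = sum(weights[obs] for obs in a | b)
    if union == 0:
        return 0.0  # nothing informative to compare
    shared = sum(weights[obs] for obs in a & b)
    return shared / union


# Hypothetical records: entity name -> observations describing it.
entities = {
    "J. Smith": {"phone:555-0101", "city:Los Angeles"},
    "John Smith": {"phone:555-0101", "city:Los Angeles"},
    "Jane Doe": {"phone:555-0199", "city:Los Angeles"},
}

weights = observation_weights(entities)
alias_score = similarity(entities["J. Smith"], entities["John Smith"], weights)
other_score = similarity(entities["J. Smith"], entities["Jane Doe"], weights)
```

In this toy population the city appears on every record and so carries no weight, which means the two Smith records match on the phone number alone while "Jane Doe" shares nothing informative with either.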
Keywords :
Internet; distributed databases; government data processing; information theory; Guspin system; Sift system; US government; data alignment; data integration; data-driven paradigm; entity matching problem; government problem; heterogeneous data sources; identity theft; information theory principle; metadata; terrorist detection; Air pollution; Automatic control; Control systems; Databases; Electronic mail; Merging; Monitoring; Protection; Terrorism; US Government; CARB; CEIDARS; Data sharing; Digital government; Facilities Registry System; Guspin; Information modeling; National Emission Inventory; Sift;
fLanguage :
English
Journal_Title :
Computer
Publisher :
IEEE
ISSN :
0018-9162
Type :
jour
DOI :
10.1109/MC.2005.406
Filename :
1556484