Author_Institution :
Reed Elsevier LexisNexis Risk Solutions, Alpharetta, GA, USA
Abstract :
Summary form only given. Large-scale entity extraction, disambiguation and linkage in Big Data can challenge the traditional methodologies developed over the last three decades. Entity linkage, in particular, is a cornerstone for a wide spectrum of applications, such as Master Data Management, Data Warehousing, Social Graph Analytics, Fraud Detection and Identity Management. Traditional rules-based heuristic methods usually don't scale properly, are language-specific and require significant maintenance over time. This presentation will introduce the audience to the use of probabilistic record linkage, also known as specificity-based linkage, on Big Data to perform language-independent, large-scale entity extraction, resolution and linkage across diverse sources. The presentation also includes a live demonstration that reviews the different steps required during the data integration process (ingestion, profiling, parsing, cleansing, standardization and normalization) and shows the basic concepts behind probabilistic record linkage in a real-world application using the open source Big Data platform HPCC Systems [1] from LexisNexis.
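For illustration only, the core specificity idea can be sketched outside of HPCC Systems: rare field values (an uncommon surname or city) contribute more linking weight than common ones, and two records are linked when the summed weights of their matching fields exceed a threshold. The Python sketch below is a minimal, hypothetical example of that idea; the toy records, the IDF-style weighting formula and the threshold are assumptions made for illustration and are not the ECL/SALT implementation used in the demonstration.

    from collections import Counter
    from math import log2

    # Hypothetical toy records; in practice these come from the ingested,
    # cleansed, standardized and normalized sources.
    records = [
        {"id": 1, "first": "JOHN",  "last": "SMITH",    "city": "ATLANTA"},
        {"id": 2, "first": "JOHN",  "last": "SMYTHE",   "city": "ATLANTA"},
        {"id": 3, "first": "MARIA", "last": "GONZALEZ", "city": "ALPHARETTA"},
        {"id": 4, "first": "JOHN",  "last": "SMITH",    "city": "MARIETTA"},
    ]
    fields = ["first", "last", "city"]

    # Specificity: weight(value) = log2(N / count(value)) per field, an
    # IDF-style formulation assumed here; rarer values score higher.
    counts = {f: Counter(r[f] for r in records) for f in fields}
    n = len(records)

    def weight(field, value):
        return log2(n / counts[field][value])

    def match_score(a, b):
        # Sum the specificity weights of the field values two records share.
        return sum(weight(f, a[f]) for f in fields if a[f] == b[f])

    # Score every candidate pair and link those above a hypothetical threshold.
    THRESHOLD = 1.2
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            s = match_score(records[i], records[j])
            if s >= THRESHOLD:
                print(f"link {records[i]['id']} <-> {records[j]['id']} (score {s:.2f})")

On this toy data the sketch links records 1 and 2 (same first name and city) and records 1 and 4 (same first and last name), while the single shared first name "JOHN" is too common, and therefore too low in specificity, to link records 2 and 4 on its own. A production system would additionally block candidate pairs and learn field weights from the data rather than scoring all pairs exhaustively.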
Keywords :
Big Data; data handling; information retrieval; probability; HPCC Systems; LexisNexis; data cleansing; data ingestion; data integration process; data normalization; data parsing; data profiling; data standardization; data warehousing; fraud detection; identity management; large-scale entity extraction; master data management; open source big data platform; probabilistic record linkage; rules-based heuristic methods; social graph analytics; specificity-based linkage; Abstracts; Couplings; Data mining; Maintenance engineering; Probabilistic logic; Warehousing; disambiguation; entity extraction; identity fraud; public data; record linking