Title :
Web-based Arabic/English duplicate record detection with nested blocking technique
Author :
Higazy, Azza ; El Tobely, Tarek ; Yousef, Ahmed H. ; Sarhan, Amany
Author_Institution :
Comput. & Control Dept., Tanta Univ., Tanta, Egypt
Abstract :
Data accuracy and quality affect the success of any business intelligence or data mining solution. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset. This task becomes more complicated when entities are identified by string values, as is the case with person names. Such inaccuracies arise from misspellings and a wide range of typographical variations, especially in non-Latin languages such as Arabic. To the best of the authors' knowledge, previously proposed duplicate record detection (DRD) algorithms and frameworks do not support the Arabic language and are difficult to configure. In this paper, an English/Arabic-enabled web-based framework is designed and implemented, taking into account the wide range of variations in the Arabic language. Improved indexing/blocking techniques are used to allow fast processing. The framework is implemented and verified through several case studies. Results show that the framework offers substantial improvements over known techniques.
Keywords :
Internet; indexing; natural language processing; string matching; Arabic language; DRD algorithms; Web-based Arabic duplicate record detection; Web-based English duplicate record detection; blocking technique; data accuracy; data quality; indexing technique; nested blocking technique; nonLatin languages; person names; string value; Cleaning; Complexity theory; Couplings; Educational institutions; Indexing; Standardization; data integration; duplicate record detection; entity matching; indexing; matching data cleaning;
Conference_Title :
Computer Engineering & Systems (ICCES), 2013 8th International Conference on
Conference_Location :
Cairo
Print_ISBN :
978-1-4799-0078-7
DOI :
10.1109/ICCES.2013.6707225