Article title:
A New Method for Duplicate Record Detection Using Hierarchical Clustering
Title in another language:
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Authors:
Daneshpour, Negin (Shahid Rajaee Teacher Training University, Faculty of Computer Engineering, Tehran, Iran), Barzegari, Ali (Islamic Azad University, Science and Research Branch, Faculty of Engineering, Department of Computer Engineering, Tehran, Iran)
Keywords:
Duplicate detection, Data cleaning, Hierarchical clustering, Similarity function, Feature selection
Persian abstract:
Because data quality is critical to the performance of software systems, data cleaning, and duplicate record detection in particular, has become one of the most important areas of computer science in recent years. This paper presents a method for detecting duplicate records in which the similarity between records is estimated by hierarchically clustering them on a suitable feature at each level. As a result, the clusters obtained at the last level contain records that are very similar to one another. To discover duplicates, comparisons are performed only among records within the same last-level cluster. The paper also proposes a relative similarity function, based on the edit distance function, for comparing records; this function is highly precise. A comparison of the evaluation results shows that the proposed method detects 90% of the existing duplicates with 97% accuracy in less time, and that the results have improved.
English abstract:
The accuracy and validity of data are prerequisites for the correct operation of any software system. Errors can always occur in data because of human and system faults. One such error is the presence of duplicate records in data sources. Duplicate records refer to the same real-world entity. Only one of them should exist in a data source, but for reasons such as the aggregation of data sources and human error in data entry, several copies of an entity may appear. This problem causes errors in a system's operations or output, and it is costly for the organization or business concerned. Therefore, data cleaning, and duplicate record detection in particular, has become one of the most important areas of computer science in recent years. Many solutions have been presented for detecting duplicates in different situations, but almost all of them are time-consuming. Moreover, the volume of data grows every day, so previous methods no longer perform adequately. Incorrectly identifying two different records as duplicates is another problem that recent work faces. This matters because duplicates are usually deleted, so some correct data will be lost. New methods therefore appear to be necessary.
This paper proposes a method that reduces the amount of processing required by using hierarchical clustering with appropriate features. In this method, the similarity between records is estimated at several levels, and a different feature is used at each level. As a result, clusters containing very similar records are created at the last level. Comparisons for detecting duplicates are performed only on the records within these clusters. The paper also proposes a relative similarity function for comparing records, which determines similarity with high precision. The evaluation results show that the proposed method detects 90% of duplicate records with 97% accuracy in less time, and that the results have improved.
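The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: records are grouped by exact match on one feature per level (the abstract only says similarity is estimated with a different feature at each level), the relative similarity is taken to be one minus the edit distance divided by the longer string's length, and the field names, the 0.8 threshold, and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def relative_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def hierarchical_clusters(records, level_keys):
    """Refine clusters level by level; each level uses a different key function."""
    clusters = [list(records)]
    for key in level_keys:
        next_clusters = []
        for cluster in clusters:
            groups = defaultdict(list)
            for rec in cluster:
                groups[key(rec)].append(rec)
            next_clusters.extend(groups.values())
        clusters = next_clusters
    return clusters


def find_duplicates(records, level_keys, compare_field, threshold=0.8):
    """Compare only records that share a last-level cluster."""
    pairs = []
    for cluster in hierarchical_clusters(records, level_keys):
        for r1, r2 in combinations(cluster, 2):
            if relative_similarity(r1[compare_field], r2[compare_field]) >= threshold:
                pairs.append((r1["id"], r2["id"]))
    return pairs


# Hypothetical sample data: two levels (city, then zip), names compared last.
records = [
    {"id": 1, "name": "john smith", "city": "tehran", "zip": "111"},
    {"id": 2, "name": "jon smith", "city": "tehran", "zip": "111"},
    {"id": 3, "name": "jane doe", "city": "tehran", "zip": "111"},
    {"id": 4, "name": "john smith", "city": "shiraz", "zip": "222"},
]
level_keys = [lambda r: r["city"], lambda r: r["zip"]]
duplicates = find_duplicates(records, level_keys, "name")
print(duplicates)
```

In this sketch, record 4 is never compared with record 1 because they fall into different last-level clusters. This shows both the saving in comparisons and the risk of missing duplicates when a clustering feature is chosen poorly.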
Journal title:
Signal and Data Processing