Authors:
Behzad Vahedi Torghabeh (B. Vahedi), author, K. N. Toosi University of Technology; Ali Asghar Alesheikh (A. A. Alesheikh), author, K. N. Toosi University of Technology
Keywords:
Volunteered geographic information, OpenStreetMap, Levenshtein algorithm, data matching, spatial data quality, attribute accuracy
Persian abstract (translated):
Since the concept of volunteered geographic information (VGI) emerged, the quality of this information has been identified as its biggest problem. Various studies have therefore examined the quality of volunteered data and tried to estimate it. These studies, however, have paid less attention to attribute accuracy than to the other quality elements, even though this element is highly important in various spatial analyses and in different applications of VGI. In this study, a new method that uses the Levenshtein algorithm together with text pre-processing is applied to examine the attribute accuracy of volunteered features (in the form of the feature name) by comparing them with reference features. To compute attribute accuracy, it is assumed that matching has already been performed between the reference and volunteered features. The study area is the city of Tehran; data produced by the Municipality of Tehran serve as the reference data set and data from the OpenStreetMap site as the volunteered data set. According to the results, among the volunteered features that have a name, 33 percent have a correct name, 44 percent have an approximately correct name, and the remaining 23 percent have an incorrect name, so the overall attribute accuracy of the volunteered data is 77 percent.
English abstract:
Since the concept of Volunteered Geographic Information (VGI) emerged, data quality has been regarded as its biggest problem. The issue has therefore been addressed frequently in the literature, and researchers have tried to evaluate the quality of VGI. Spatial data quality has several elements, among the most important of which are positional accuracy, completeness, lineage, resolution, and temporal accuracy. Attribute accuracy, despite its important role in a variety of spatial analyses and applications of VGI, has received less attention in the literature than the other elements of quality.
In this study, a novel method that combines the Levenshtein algorithm with text pre-processing is used to examine the attribute accuracy of volunteered geographic features by comparing them with reference data. The Levenshtein algorithm measures the difference between two strings by counting the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change one into the other; this count is known as the Levenshtein distance.
The first step of the proposed method is to find corresponding features in the two data sets, which provide the basis for the comparison. This is done by applying an automatic data matching algorithm to the two sets. The algorithm consists of five stages, each applied to either the reference data set or the VGI data set.
After data matching, each VGI feature is compared with its match in the reference data set, and the Levenshtein distance between the “name” attributes of the two features is calculated. Features are then categorized as having correct (accurate), approximately correct, or incorrect names based on this distance, on the assumption that the names of the reference features are correct. For VGI features without a match in the reference data set, a search distance is defined, within which reference features with exactly the same name as the VGI feature are sought.
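The categorization step can be sketched as follows. The abstract does not state the pre-processing rules or the distance thresholds, so everything specific here is an assumption made for illustration: the `normalize` helper (case-folding and whitespace collapsing), the relative threshold `APPROX_MAX_RATIO`, and the function names are hypothetical and are not the authors' values.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    # Assumed pre-processing: trim, case-fold, collapse runs of whitespace.
    return re.sub(r"\s+", " ", name.strip().casefold())


# Assumed cut-off: names whose distance is at most 30% of the longer
# name's length count as "approximately correct". Not from the paper.
APPROX_MAX_RATIO = 0.3


def classify_name(vgi_name: str, reference_name: str) -> str:
    """Label a VGI name against its matched reference name,
    treating the reference name as correct."""
    a, b = normalize(vgi_name), normalize(reference_name)
    distance = edit_distance(a, b)
    if distance == 0:
        return "correct"
    ratio = distance / max(len(a), len(b))
    if ratio <= APPROX_MAX_RATIO:
        return "approximately correct"
    return "incorrect"


if __name__ == "__main__":
    for vgi, ref in [("Azadi  Street", "azadi street"),
                     ("Vali Asr", "Valiasr"),
                     ("Azadi", "Enghelab")]:
        print(vgi, "|", ref, "->", classify_name(vgi, ref))
```

Dividing the distance by the longer name's length is one way to keep the threshold comparable between short and long names; an absolute cut-off on the raw distance would be an equally plausible reading of the abstract.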
The study area is the city of Tehran, Iran. A data set produced by the Municipality of Tehran is used as the reference data set and OpenStreetMap data as the VGI data set. According to the results, 47 percent of VGI features have a name attribute; among these, 33 percent have a correct name, 44 percent have an approximately correct name, and the remaining 23 percent have an incorrect name. The overall attribute accuracy of the VGI data set used in this study is thus 77 percent, meaning that 77 percent of the features with a name attribute have either a correct or an approximately correct name. A future line of research, based on these findings, could be to develop methods for evaluating the attribute accuracy of a data set without comparing it with a reference data set.
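The overall figure is the sum of the "correct" and "approximately correct" shares among named features (33 + 44 = 77). A minimal sketch of that aggregation is given below; the function name `attribute_accuracy` and the label strings are assumptions made here, and the sample list merely reproduces the reported percentages rather than the study's actual data.

```python
from collections import Counter

LABELS = ("correct", "approximately correct", "incorrect")


def attribute_accuracy(categories):
    """Return the percentage share of each label and the overall accuracy,
    where overall accuracy = correct share + approximately correct share."""
    counts = Counter(categories)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        raise ValueError("no categorized features")
    shares = {label: 100 * counts[label] / total for label in LABELS}
    overall = shares["correct"] + shares["approximately correct"]
    return shares, overall


if __name__ == "__main__":
    # Illustrative input only: 100 named features split as in the abstract.
    sample = (["correct"] * 33
              + ["approximately correct"] * 44
              + ["incorrect"] * 23)
    shares, overall = attribute_accuracy(sample)
    print(shares, overall)
```

Note that the denominator is the set of features that have a name; features without a name attribute (53 percent of the VGI features, by the figures above) are outside this calculation.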