Regional Information Center for Science and Technology - Performance evaluation of distance measures for preprocessing of set-valued data in feature vector generated from LOD datasets

Abstract:

The linked open data (LOD) cloud has evolved into a huge repository of data from various domains. Much work has gone into generating these datasets and extending the LOD cloud, whereas comparatively little has been done on consuming the data it makes available. Several types of applications have been developed using data from the LOD cloud; among them, the area that has attracted researchers and developers most is the use of these data for machine learning and knowledge discovery. Using the available state-of-the-art knowledge discovery and machine learning algorithms requires converting the heterogeneous, interlinked RDF graph datasets available in the LOD cloud into a feature vector. This conversion takes the set of subjects as instances, the set of predicates as attributes, and the set of objects as attribute values. The resulting feature vector may contain set-valued attributes, since a subject can have more than one object for the same predicate. These set-valued attributes need to be pre-processed so that the feature vector contains only single-valued attributes and can be used directly by machine learning algorithms. The pre-processing approach computes distances between the set values and applies the FastMap algorithm to transform each set-valued attribute into k columns, which replace it in the feature vector, making the feature vector suitable as input for knowledge discovery and machine learning. However, choosing the most suitable of the available distance measures is a problem that needs to be addressed. This paper presents a performance study for selecting the most suitable distance measure for pre-processing: the feature vector is built with each of the different distance measures for the set-valued attributes and then transformed with FastMap.
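The pre-processing pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `ex:` identifiers, the numeric `ex:runtime` predicate, the absolute-difference base distance, and the pivot heuristic are all assumptions chosen for illustration. The sketch groups triples into set-valued attributes, measures the distance between two sets with the Hausdorff distance, and projects the instances onto k columns with a basic FastMap.

```python
from collections import defaultdict

def triples_to_table(triples):
    """Group (subject, predicate, object) triples: subject -> predicate -> set of objects."""
    table = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        table[s][p].add(o)
    return table

def hausdorff(a, b, base):
    """Hausdorff distance between two non-empty finite sets under a base distance."""
    forward = max(min(base(x, y) for y in b) for x in a)
    backward = max(min(base(x, y) for x in a) for y in b)
    return max(forward, backward)

def fastmap(n, dist, k):
    """Map n objects, given a pairwise distance dist(i, j), to k coordinates each."""
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        # Squared distance left over after removing the first `col` coordinates.
        r = dist(i, j) ** 2
        for c in range(col):
            r -= (coords[i][c] - coords[j][c]) ** 2
        return max(r, 0.0)  # clamp: the input distance need not be Euclidean

    for col in range(k):
        # Pivot heuristic: start anywhere, jump to the farthest object, then back.
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, col))
        a = max(range(n), key=lambda j: d2(b, j, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # no residual distance left; remaining columns stay 0
        dab = dab2 ** 0.5
        for i in range(n):
            # Cosine-law projection of object i onto the line through the pivots.
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords

# Hypothetical triples: one predicate with several objects per subject.
triples = [
    ("ex:film1", "ex:runtime", 90), ("ex:film1", "ex:runtime", 95),
    ("ex:film2", "ex:runtime", 90), ("ex:film2", "ex:runtime", 120),
    ("ex:film3", "ex:runtime", 150),
]
table = triples_to_table(triples)
subjects = sorted(table)
values = [table[s]["ex:runtime"] for s in subjects]

def numeric(x, y):
    return abs(x - y)

def set_distance(i, j):
    return hausdorff(values[i], values[j], numeric)

# Each subject's set-valued attribute becomes k = 2 single-valued columns.
columns = fastmap(len(subjects), set_distance, k=2)
```

Swapping `hausdorff` for another set distance changes only `set_distance`, which is what makes a side-by-side comparison of distance measures straightforward in a setup like this.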
The distance measures are evaluated by clustering the transformed feature vector table, for which class labels are identified in advance, and computing micro-precision values for the clustering results. Experimental analysis with LMDB data shows that the Hausdorff and RIBL distance measures are the most suitable for pre-processing a feature vector with set-valued data created from the linked open data cloud.
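The abstract does not define micro-precision. The sketch below assumes a common definition, in which each cluster is labelled with its majority class and the score is the fraction of instances that match the label of their cluster. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def micro_precision(cluster_ids, class_labels):
    """Fraction of instances whose class equals the majority class of their cluster."""
    if not class_labels or len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1
    # For each cluster, count the instances that belong to its majority class.
    correct = sum(max(counts.values()) for counts in by_cluster.values())
    return correct / len(class_labels)

# Toy example: cluster 0 is mostly "a", cluster 1 is entirely "b".
score = micro_precision([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
```

Under this definition a higher score means the clusters obtained from the transformed feature vector agree more closely with the pre-identified class labels, so distance measures can be ranked by the score they yield.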