Author :
Tauer, Gregory ; Rudnicki, Ronald ; Sudit, Moises
Abstract :
The Resource Description Framework (RDF), a language for describing resources, is increasingly used in information fusion systems. SPARQL is a standard query language that enables knowledge extraction from data encoded in RDF. A SPARQL query is, in essence, an exact subgraph matching problem. Unfortunately, many of the techniques that produce RDF data (such as manual data entry, social network analysis, and natural language processing) make annotation mistakes, which result in dirty RDF data. SPARQL performs suboptimally on RDF data containing errors because, as an exact graph matching tool, it is not designed to cope with noisy data. To improve knowledge extraction under these conditions, we propose an extension to SPARQL that permits approximate graph matches. This allows queries to cope with errors in the RDF graph at both the attribute level (such as misspelled names) and the structural level (missing or extra edges). We use the TruST heuristic algorithm to solve the underlying approximate graph matching problem and demonstrate the benefit it brings to answering questions on the DBpedia knowledge base.
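The contrast between exact and approximate matching can be illustrated with a minimal Python sketch. This is not the TruST algorithm and not the authors' SPARQL extension; it is a brute-force toy under stated assumptions: the triples, the node identifiers (`c1`, `c2`, `c3`), the function names, and the 0.8 similarity threshold are all hypothetical. It scores each candidate node by how well its edges satisfy a one-variable query, using fuzzy string similarity for attribute-level errors and a zero contribution (rather than outright rejection) for a missing edge.

```python
from difflib import SequenceMatcher

# Toy RDF-like graph of (subject, predicate, object) triples (hypothetical data).
DATA = [
    ("c1", "type", "City"),
    ("c1", "name", "Bufalo"),          # attribute-level error: misspelled name
    ("c1", "locatedIn", "New York"),
    ("c2", "type", "City"),
    ("c2", "name", "Albany"),
    ("c2", "locatedIn", "New York"),
    ("c3", "name", "Buffalo"),         # structural error: the "type" edge is missing
    ("c3", "locatedIn", "New York"),
]

# Query: (predicate, object) pairs that a single unknown node ?x should satisfy.
QUERY = [("type", "City"), ("name", "Buffalo"), ("locatedIn", "New York")]


def term_similarity(a, b, threshold=0.8):
    """Fuzzy string similarity in [0, 1]; values below the threshold count as 0."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio if ratio >= threshold else 0.0


def exact_matches(data, query):
    """Subjects for which every query edge is present verbatim (exact matching)."""
    triples = set(data)
    subjects = sorted({s for s, _, _ in data})
    return [s for s in subjects if all((s, p, o) in triples for p, o in query)]


def approximate_matches(data, query):
    """Rank subjects by the mean best-match score over all query edges.

    A missing edge contributes 0 instead of disqualifying the subject, and
    extra edges on a subject are simply ignored.
    """
    subjects = sorted({s for s, _, _ in data})
    ranked = []
    for s in subjects:
        total = 0.0
        for p, o in query:
            scores = [term_similarity(o, o2)
                      for s2, p2, o2 in data if s2 == s and p2 == p]
            total += max(scores, default=0.0)
        ranked.append((total / len(query), s))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked
```

On this toy graph, `exact_matches(DATA, QUERY)` returns no results because every candidate has either a misspelled name or a missing edge, while `approximate_matches(DATA, QUERY)` still ranks the misspelled node `c1` first. A practical system would need a heuristic such as the one the abstract names, since exhaustive scoring does not scale to multi-variable queries on a graph the size of DBpedia.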
Keywords :
SQL; database management systems; graph theory; knowledge acquisition; knowledge based systems; pattern matching; sensor fusion; DBpedia knowledge base; RDF graph; RDF language; Resource Description Framework; SPARQL query; TruST heuristic algorithm; annotation mistakes; approximate graph matching problem; attribute level; dirty RDF data; error tolerant queries; graph matching tool; information fusion systems; knowledge extraction; noisy data; standard query language; structural level; subgraph matching problem; Cities and towns; Inductors; Power generation; Rivers; Sociology; Statistics