Classification algorithms for relation prediction

Author

Boden, Christoph ; Häfele, Thomas ; Löser, Alexander

Author_Institution

Univ. of Technol. Berlin, Berlin, Germany

fYear

2011

fDate

11-16 April 2011

Firstpage

46

Lastpage

52

Abstract

Knowledge discovery from the Web is a cyclic process. In this paper we focus on the important part of transforming unstructured information from Web pages into structured relations. Relation extraction systems capture information from natural language text on Web pages, called Web text. However, extraction is costly and time-consuming. Worse, many Web pages may not contain a textual representation of a relation that the extractor can capture. As a result, many irrelevant pages are processed by relation extractors. We propose a relation predictor to filter out irrelevant pages and substantially speed up the overall information extraction process. As a classifier, we trained a support vector machine (SVM). We evaluate pages on a sentence level, where each sentence is transformed into a token representation of shallow text features. We evaluate our relation predictor on 18 different relation extractors. Extractors vary in their number of attributes and their extraction domain. Our evaluation corpus contains more than six million sentences from several hundred thousand pages. We report a prediction time of tens of milliseconds per page and observe high recall across domains. Our experimental study shows that the relation predictor effectively forwards only relevant pages to the relation extractor. We report a speedup of at least a factor of two while discarding only a minimal number of relations. If only a fixed share of the pages in the corpus, e.g. 10%, is processed, the predictor increases recall by a factor of five on average.
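
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature encoding (`shallow_features`), the toy training sentences, and the rule that a page is forwarded as soon as one sentence scores positive are all assumptions made for illustration. To stay dependency-free, a margin perceptron (hinge-loss updates without regularization) stands in for the SVM named in the abstract.

```python
import re

def shallow_features(sentence):
    """Encode a sentence as a set of shallow token features (hypothetical encoding)."""
    feats = {"BIAS"}  # constant feature acting as an intercept
    for tok in re.findall(r"\w+|[^\w\s]", sentence):
        if tok[0].isupper():
            feats.add("SHAPE=Cap")   # capitalized token, a cheap hint for entities
        elif tok.isdigit():
            feats.add("SHAPE=Num")   # numeric token, e.g. a year
        else:
            feats.add("TOK=" + tok.lower())
    return feats

def score(weights, feats):
    """Linear decision value: sum of the weights of the active features."""
    return sum(weights.get(f, 0.0) for f in feats)

def train(examples, max_epochs=500):
    """Margin perceptron: update whenever label * score < 1 (hinge loss active).

    Stand-in for an SVM trainer; there is no regularization term.
    examples: list of (sentence, label) pairs with label in {-1, +1}.
    """
    weights = {}
    for _ in range(max_epochs):
        updated = False
        for sentence, label in examples:
            feats = shallow_features(sentence)
            if label * score(weights, feats) < 1.0:
                for f in feats:
                    weights[f] = weights.get(f, 0.0) + label
                updated = True
        if not updated:  # every example satisfies the margin: stop early
            break
    return weights

def split_sentences(page_text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", page_text.strip()) if s]

def page_is_relevant(weights, page_text):
    """Assumed aggregation rule: forward a page if any sentence scores positive."""
    return any(score(weights, shallow_features(s)) > 0
               for s in split_sentences(page_text))

def filter_pages(weights, pages):
    """Keep only the pages that would be handed to the (costly) relation extractor."""
    return [p for p in pages if page_is_relevant(weights, p)]

# Toy training data (invented): +1 = sentence expresses an acquisition relation.
TRAIN = [
    ("Google acquired YouTube in 2006.", +1),
    ("Oracle bought Sun Microsystems.", +1),
    ("Adobe acquired Macromedia in 2005.", +1),
    ("Click here to subscribe to our newsletter.", -1),
    ("The weather was nice in Berlin.", -1),
    ("All rights reserved.", -1),
]
```

Calling `filter_pages(train(TRAIN), pages)` on a list of page texts returns the subset that the predictor would forward. The speedup described in the abstract comes from running the expensive extractor only on that subset.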

Keywords

Internet; data mining; information retrieval; pattern classification; support vector machines; Web pages; Web text; classification algorithms; information extraction process; knowledge discovery; natural language text; relation extraction systems; relation prediction; support vector machine; unstructured information transformation; Data mining; Encoding; Feature extraction; Pipelines; Semantics; Training

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Engineering Workshops (ICDEW), 2011 IEEE 27th International Conference on

Conference_Location

Hannover

Print_ISBN

978-1-4244-9195-7

Electronic_ISBN

978-1-4244-9194-0

Type

conf

DOI

10.1109/ICDEW.2011.5767644

Filename

5767644