DocumentCode :
3141992
Title :
Classification algorithms for relation prediction
Author :
Boden, Christoph ; Häfele, Thomas ; Löser, Alexander
Author_Institution :
Univ. of Technol. Berlin, Berlin, Germany
fYear :
2011
fDate :
11-16 April 2011
Firstpage :
46
Lastpage :
52
Abstract :
Knowledge discovery from the Web is a cyclic process. In this paper, we focus on one important step: transforming unstructured information from Web pages into structured relations. Relation extraction systems capture information from natural language text on Web pages, called Web text. However, extraction is costly and time-consuming. Worse, many Web pages may not contain a textual representation of a relation that the extractor can capture. As a result, many irrelevant pages are processed by relation extractors. We propose a relation predictor that filters out irrelevant pages and substantially speeds up the overall information extraction process. As a classifier, we trained a support vector machine (SVM). We evaluate pages at the sentence level, where each sentence is transformed into a token representation of shallow text features. We evaluate our relation predictor on 18 different relation extractors, which vary in their number of attributes and their extraction domain. Our evaluation corpus contains more than six million sentences from several hundred thousand pages. We report a prediction time of tens of milliseconds per page and observe high recall across domains. Our experimental study shows that the relation predictor effectively forwards only relevant pages to the relation extractor. We report a speedup of at least a factor of two while discarding only a minimal number of relations. If only a fixed fraction of the pages in the corpus, e.g. 10%, is processed, the predictor increases recall by a factor of five on average.
Keywords :
Internet; data mining; information retrieval; pattern classification; support vector machines; Web pages; Web text; classification algorithms; information extraction process; knowledge discovery; natural language text; relation extraction systems; relation prediction; support vector machine; unstructured information transformation; Data mining; Encoding; Feature extraction; Pipelines; Semantics; Training; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering Workshops (ICDEW), 2011 IEEE 27th International Conference on
Conference_Location :
Hannover
Print_ISBN :
978-1-4244-9195-7
Electronic_ISBN :
978-1-4244-9194-0
Type :
conf
DOI :
10.1109/ICDEW.2011.5767644
Filename :
5767644