DocumentCode :
2369628
Title :
Statistical relational learning for document mining
Author :
Popescul, Alexandrin ; Ungar, Lyle H. ; Lawrence, Steve ; Pennock, David M.
Author_Institution :
Dept. of Comput. & Inf. Sci., Pennsylvania Univ., Philadelphia, PA, USA
fYear :
2003
fDate :
19-22 Nov. 2003
Firstpage :
275
Lastpage :
282
Abstract :
A major obstacle to fully integrated deployment of many data mining algorithms is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. We propose an integrated approach to statistical modelling from relational databases. We structure the search space based on "refinement graphs", which are widely used in inductive logic programming for learning logic descriptions. The use of statistics allows us to extend the search space to include a richer set of features, including many that are not Boolean. Search and model selection are integrated into a single process, allowing information criteria native to the statistical model, for example logistic regression, to make feature selection decisions in a step-wise manner. We present experimental results for the task of predicting where scientific papers will be published based on relational data taken from CiteSeer. Our approach results in classification accuracies superior to those achieved when using classical "flat" features. The resulting classifier can be used to recommend where to publish articles.
Keywords :
data mining; decision theory; document handling; inductive logic programming; learning (artificial intelligence); regression analysis; relational databases; classical flat feature; complex relational structure; data mining algorithm; document mining; feature selection decision; inductive logic programming; learning logic description; logistic regression; real-world database; refinement graphs; relational database; scientific paper; search space; statistical model; statistical relational learning; Data mining; Information science; Laboratories; Logic programming; Logistics; National electric code; Relational databases; Spatial databases; Statistical learning; Statistics;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
Print_ISBN :
0-7695-1978-4
Type :
conf
DOI :
10.1109/ICDM.2003.1250930
Filename :
1250930