Regional Information Center for Science and Technology (RICEST) - Evaluating Statistical Tests for Within-Network Classifiers of Relational Data

DocumentCode :

2771543

Title :

Evaluating Statistical Tests for Within-Network Classifiers of Relational Data

Author :

Neville, Jennifer ; Gallagher, Brian ; Eliassi-Rad, Tina

Author_Institution :

Purdue Univ., West Lafayette, IN, USA

fYear :

2009

fDate :

6-9 Dec. 2009

Firstpage :

397

Lastpage :

406

Abstract :

Recently, a number of modeling techniques have been developed for data mining and machine learning in relational and network domains where the instances are not independent and identically distributed (i.i.d.). These methods specifically exploit the statistical dependencies among instances in order to improve classification accuracy. However, there has been little focus on how these same dependencies affect our ability to draw accurate conclusions about the performance of the models. More specifically, the complex link structure and attribute dependencies in network data violate the assumptions of many conventional statistical tests and make it difficult to use these tests to assess the models in an unbiased manner. In this work, we examine the task of within-network classification and the question of whether two algorithms will learn models that result in significantly different levels of performance. We show that the commonly used form of evaluation (paired t-test on overlapping network samples) can result in an unacceptable level of Type I error. Furthermore, we show that Type I error increases as (1) the correlation among instances increases and (2) the size of the evaluation set increases (i.e., the proportion of labeled nodes in the network decreases). We propose a method for network cross-validation that, combined with paired t-tests, produces more acceptable levels of Type I error while still providing reasonable levels of statistical power (i.e., keeping Type II error low).
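
For readers unfamiliar with the test named in the abstract, the following is a minimal sketch of a paired t-statistic computed over per-fold scores of two classifiers. It is not the authors' network cross-validation procedure, whose details this record does not give. The function name `paired_t_statistic` and the accuracy values are hypothetical and chosen only for illustration. The statistic assumes the paired differences are independent, which is the assumption the abstract says overlapping network samples violate.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for two lists of paired scores.

    Each position i holds the scores of classifier A and classifier B
    on the same evaluation fold (hypothetical setup for illustration).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two paired scores")

    # Per-fold differences between the two classifiers.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    # Sample standard deviation of the differences (n - 1 denominator).
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")

    # t = mean difference divided by its standard error.
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-fold accuracies, not taken from the paper.
acc_a = [0.81, 0.79, 0.84, 0.80, 0.83]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.80]
t_stat, dof = paired_t_statistic(acc_a, acc_b)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The resulting statistic would be compared against a t distribution with `dof` degrees of freedom. When folds share test nodes, the differences are correlated, so the standard error in the denominator tends to be underestimated; this is one way to read the inflated Type I error the abstract reports.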

Keywords :

data mining; learning (artificial intelligence); pattern classification; relational databases; statistical testing; complex link structure; data mining; machine learning; network cross-validation; relational data; statistical test; within-network classifier; Algorithm design and analysis; Data mining; Laboratories; Machine learning; Machine learning algorithms; Performance analysis; Probability; Standards development; Taxonomy; Testing;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on

Conference_Location :

Miami, FL

ISSN :

1550-4786

Print_ISBN :

978-1-4244-5242-2

Electronic_ISBN :

1550-4786

Type :

conf

DOI :

10.1109/ICDM.2009.50

Filename :

5360265

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2771543