Title :
Strong Compound-Risk Factors: Efficient Discovery Through Emerging Patterns and Contrast Sets
Author :
Li, Jinyan ; Yang, Qiang
Author_Institution :
Nanyang Technol. Univ., Singapore
Abstract :
Odds ratio (OR), relative risk (RR) (risk ratio), and absolute risk reduction (ARR) (risk difference) are biostatistical measures that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have often been used to assess simple risk factors. In this paper, we introduce the concept of compound-risk factors to broaden the applicability of these statistical tests to assessing factor interplays. We observe that compound-risk factors with a high risk ratio or a large risk difference have a one-to-one correspondence to strong emerging patterns or strong contrast sets, two types of patterns that have been extensively studied in the data mining field. This relationship has been unknown to researchers in the past, and efficient algorithms for discovering strong compound-risk factors have been lacking. We propose a theoretical framework and a new algorithm that unify the discovery of compound-risk factors that have a strong OR, risk ratio, or risk difference. Our method guarantees that all patterns meeting a given test threshold can be efficiently discovered. Our contribution thus represents the first of its kind in linking risk ratios and ORs to pattern mining algorithms, making it possible to find compound-risk factors in large-scale data sets. In addition, we show that using compound-risk factors can improve classification accuracy in probabilistic learning algorithms on several disease data sets, because these compound-risk factors capture the interdependency between important data attributes.
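As an illustration of the three measures named in the abstract, the following minimal sketch evaluates one compound-risk factor (a set of attribute values that must all be present) against a case group and a control group. It uses the standard textbook 2x2-table definitions, which may differ in detail from the paper's own formulation. The function names, the record encoding as Python sets, and the toy data are all assumptions made for this example; this is not the paper's discovery algorithm, which enumerates such factors efficiently rather than scoring a single one.

```python
from fractions import Fraction


def two_by_two(factor, cases, controls):
    """Count the 2x2 table for a compound factor (hypothetical helper).

    A subject is "exposed" when its record contains every item in `factor`.
    Returns (a, b, c, d) = (exposed cases, exposed controls,
                            unexposed cases, unexposed controls).
    """
    factor = frozenset(factor)
    a = sum(1 for record in cases if factor <= record)
    b = sum(1 for record in controls if factor <= record)
    c = len(cases) - a
    d = len(controls) - b
    return a, b, c, d


def risk_measures(a, b, c, d):
    """Textbook OR, risk ratio, and risk difference from a 2x2 table.

    Raises ZeroDivisionError when a needed cell or margin is zero;
    a real analysis would need a policy for such tables.
    """
    risk_exposed = Fraction(a, a + b)      # share of exposed subjects who are cases
    risk_unexposed = Fraction(c, c + d)    # share of unexposed subjects who are cases
    return {
        "odds_ratio": Fraction(a * d, b * c),
        "risk_ratio": risk_exposed / risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
    }


# Toy data (invented for illustration): each record is a set of attribute values.
cases = [{"smoker", "obese"}, {"smoker", "obese"}, {"smoker", "obese"}, {"smoker"}]
controls = [{"smoker", "obese"}, {"smoker"}, {"obese"}, set()]

table = two_by_two({"smoker", "obese"}, cases, controls)
measures = risk_measures(*table)
print(table, measures)
```

Exact fractions are used here only so that the small example avoids floating-point rounding; floats would serve equally well on real data.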
Keywords :
medicine; statistical analysis; absolute risk reduction; biostatistics measurements; compound risk factors; disease data sets; odds ratio; pattern mining algorithm; probabilistic learning algorithm; relative risk; risk difference; risk ratio; statistical tests; Biomedical measurements; Computer science; Data analysis; Data mining; Diseases; Isolation technology; Joining processes; Large-scale systems; Risk management; Testing; Compound-risk factors; emerging patterns; odds ratio (OR); relative risk (RR); Biometry; Computer Simulation; Data Interpretation, Statistical; Evidence-Based Medicine; Models, Statistical; Odds Ratio; Pattern Recognition, Automated; Risk Assessment;
Journal_Title :
Information Technology in Biomedicine, IEEE Transactions on
DOI :
10.1109/TITB.2007.891163