• DocumentCode
    1116055
  • Title
    Strong Compound-Risk Factors: Efficient Discovery Through Emerging Patterns and Contrast Sets
  • Author
    Li, Jinyan ; Yang, Qiang
  • Author_Institution
    Nanyang Technol. Univ., Singapore
  • Volume
    11
  • Issue
    5
  • fYear
    2007
  • Firstpage
    544
  • Lastpage
    552
  • Abstract
    Odds ratio (OR), relative risk (RR) (risk ratio), and absolute risk reduction (ARR) (risk difference) are biostatistics measurements that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have often been used to assess simple risk factors. In this paper, we introduce the concept of compound-risk factors to broaden the applicability of these statistical tests for assessing factor interplays. We observe that compound-risk factors with a high risk ratio or a large risk difference have a one-to-one correspondence to strong emerging patterns or strong contrast sets, two types of patterns that have been extensively studied in the data mining field. Such a relationship has been unknown to researchers in the past, and efficient algorithms for discovering strong compound-risk factors have been lacking. In this paper, we propose a theoretical framework and a new algorithm that unify the discovery of compound-risk factors that have a strong OR, risk ratio, or risk difference. Our method guarantees that all patterns meeting a certain test threshold can be efficiently discovered. Our contribution thus represents the first of its kind in linking the risk ratios and ORs to pattern mining algorithms, making it possible to find compound-risk factors in large-scale data sets. In addition, we show that using compound-risk factors can improve classification accuracy in probabilistic learning algorithms on several disease data sets, because these compound-risk factors capture the interdependency between important data attributes.
  • Keywords
    medicine; statistical analysis; absolute risk reduction; biostatistics measurements; compound risk factors; disease data sets; odds ratio; pattern mining algorithm; probabilistic learning algorithm; relative risk; risk difference; risk ratio; statistical tests; Biomedical measurements; Computer science; Data analysis; Data mining; Diseases; Isolation technology; Joining processes; Large-scale systems; Risk management; Testing; Compound-risk factors; emerging patterns; odds ratio (OR); relative risk (RR); Biometry; Computer Simulation; Data Interpretation, Statistical; Evidence-Based Medicine; Models, Statistical; Odds Ratio; Pattern Recognition, Automated; Risk Assessment
  • fLanguage
    English
  • Journal_Title
    Information Technology in Biomedicine, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1089-7771
  • Type
    jour
  • DOI
    10.1109/TITB.2007.891163
  • Filename
    4300838
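
The three measures named in the abstract (odds ratio, risk ratio, and risk difference) can be illustrated for a compound-risk factor with a small sketch. This is not the paper's discovery algorithm; it only evaluates one candidate factor on a 2x2 table. The record layout, the `disease` outcome key, the attribute names `smoker` and `obese`, and the helper names `contingency` and `risk_measures` are all hypothetical, and the toy records are made up for illustration. A compound factor is assumed here to be a conjunction of attribute-value conditions, and the risk difference is assumed to be taken as exposed risk minus unexposed risk.

```python
from typing import Dict, Iterable, Tuple


def contingency(records: Iterable[dict], factor: Dict[str, object],
                outcome_key: str = "disease") -> Tuple[int, int, int, int]:
    """Count the 2x2 table (a, b, c, d) for a compound factor.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    A record is "exposed" only if it matches every condition in `factor`.
    """
    a = b = c = d = 0
    for r in records:
        exposed = all(r.get(k) == v for k, v in factor.items())
        case = bool(r[outcome_key])
        if exposed and case:
            a += 1
        elif exposed:
            b += 1
        elif case:
            c += 1
        else:
            d += 1
    return a, b, c, d


def risk_measures(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Odds ratio, risk ratio, and risk difference from a 2x2 table.

    Zero cells are not handled in this sketch; a zero denominator
    raises ZeroDivisionError.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "odds_ratio": (a * d) / (b * c),
        "relative_risk": risk_exposed / risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
    }


# Made-up toy data: 8 subjects, outcome stored under "disease".
records = (
    [{"smoker": True, "obese": True, "disease": True}] * 3
    + [{"smoker": True, "obese": True, "disease": False}] * 1
    + [{"smoker": True, "obese": False, "disease": True}] * 1
    + [{"smoker": False, "obese": True, "disease": False}] * 3
)

# Hypothetical compound factor: both conditions must hold.
factor = {"smoker": True, "obese": True}

table = contingency(records, factor)
measures = risk_measures(*table)
print(table, measures)
```

A search over many candidate conjunctions would call these helpers once per candidate, which is the brute-force cost that a pattern-mining approach such as the one described in the abstract aims to avoid.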