Regional Information Center for Science and Technology - Modeling genetic heterogeneity in Hepatitis C Virus hyper-variable region 1 infers demographic characteristics of infected hosts

Abstract :

Hepatitis C Virus (HCV) is the most common etiological cause of non-A/non-B blood-borne viral hepatitis and the leading cause of liver transplantation. The population of HCV-infected individuals in the US is estimated to be over 3 million. There are 7 major HCV genotypes with worldwide distribution, which are further grouped into numerous sub-genotypes. HCV genotype 1a is the most common genotype in the US, followed by genotype 1b. Both genotypes are responsible for the most difficult-to-treat infections. Several host- and viral-related factors have been identified as risk factors for development of HCV chronic (HCH) infection, liver disease progression and therapy outcome. We previously reported that certain host demographic characteristics were associated with the genetic properties of HCV strains in a group of chronically infected patients undergoing combined interferon and ribavirin therapy. In this study we expanded the analysis to a larger dataset to further explore the association between the hosts' ethnic background and the genetic properties of the HCV hyper-variable region 1 (HVR1). The HCV data contained sequences of intra-host HVR1 variants of HCV1a and HCV1b (n=936 and n=630, respectively) obtained from a national survey and five independent state-wide outbreak investigations. Association between properties of HVR1 strains and hosts' ethnicity was examined using viral features derived from nucleotide (nt) and amino-acid (aa) sequence information. Nucleotide sequences of 87 nt at genome positions 1491-1577 and amino acid sequences of 29 aa at polyprotein positions 384-412 (GenBank reference sequence AF01175) were linked to host ethnicity data, Caucasian (CA) or African-American (AA).
To identify relevant viral nt- or aa-based features associated with host ethnicity, we applied a correlation feature selection (CFS) method, which searches for feature subsets that correlate highly with the variable of interest while correlating weakly with one another. In HCV1a data, the best HVR1 nt-based feature subset (merit=0.26) and aa-based subset (merit=0.20) consisted of 9 nt sites and 6 aa sites, respectively. In HCV1b data, the best nt-based feature subset (merit=0.35) and aa-based subset (merit=0.25) consisted of 13 nt and 8 aa sites, respectively. These findings indicate that the ethnicity variable is associated with the genetic heterogeneity of certain sets of genomic and polyprotein sites. They also indicate the absence of a strong correlation between variation at any single site and the ethnicity variable. Therefore, to account for interactions and/or dependencies among the features in the selected subsets, which are associated with host ethnicity as a group, we modeled genetic relationships to ethnicity using Bayesian network classifiers (BNCs). BNC models were initially constructed as naïve Bayesian networks and were then left to learn dependencies among the features. BNC performance was measured using the F-measure and classification accuracy during the training phase, by 10-fold cross-validation (10xCV), and during the testing phase, on out-of-sample data (validation). BNC evaluations were also carried out on 5 datasets generated by random sampling from the HCV data, in which sequences were randomly assigned to ethnicity classes. Remarkable accuracy (10xCV / validation) was observed for the HCV1a BNCs based on the 9 nt features (91.1% / 91.7%) and the 6 aa features (83.3% / 82.7%). Accuracy of BNCs on randomly labeled data was significantly lower (9nt-BNC_Rand=60.9% and 6aa-BNC_Rand=47.3%, avg. accuracy).
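The merit values quoted for the CFS subsets can be illustrated with a small sketch. It assumes the standard CFS merit, k·r_cf / sqrt(k + k(k-1)·r_ff), with symmetrical uncertainty as the correlation measure for categorical sites; the abstract does not state which correlation measure or software was used, so both are assumptions. The toy sequences, labels, and the helper name `site` are made up for illustration and are not HCV data.

```python
from collections import Counter
from itertools import combinations
from math import log2, sqrt


def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def cfs_merit(columns, labels):
    """Merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff).

    r_cf is the mean feature-class correlation, r_ff the mean
    feature-feature correlation (assumed here to be SU in both cases).
    """
    k = len(columns)
    r_cf = sum(symmetrical_uncertainty(c, labels) for c in columns) / k
    pairs = list(combinations(columns, 2))
    r_ff = (sum(symmetrical_uncertainty(a, b) for a, b in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy alignment: one string per sequence, one label per host.
sequences = ["ACGT", "ACGA", "TCGT", "TCCA", "ACCT", "TCGA"]
labels = ["CA", "CA", "AA", "AA", "CA", "AA"]


def site(i):
    """Column i of the toy alignment, i.e. one candidate feature."""
    return [s[i] for s in sequences]


print(cfs_merit([site(0)], labels))           # informative site alone
print(cfs_merit([site(0), site(0)], labels))  # plus a redundant copy
print(cfs_merit([site(0), site(1)], labels))  # plus an invariant site
```

In this toy case, adding a redundant copy of an informative site does not raise the merit, and adding an uninformative site lowers it, which is the trade-off the subset search exploits.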
Similar performance was observed for BNCs constructed from HCV1b data, where classification accuracy was further improved by integrating the learned 13 nt and 8 aa BNCs into a single combined 21-feature BNC (96.3% / 90.2%). Average accuracy of the BNC_Rand was 48.6%. In conclusion, findings in this study suggest that HVR1 sequence variants are strongly associat
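The evaluation protocol described in the abstract, cross-validated accuracy compared against a control with randomly assigned class labels, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a plain categorical naïve Bayes classifier with Laplace smoothing, which corresponds only to the starting point of the BNCs and does not learn dependencies among features. The data are synthetic, and all function names (`train_nb`, `predict_nb`, `cv_accuracy`) are hypothetical.

```python
import random
from collections import Counter, defaultdict
from math import log


def train_nb(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model (counts only)."""
    class_counts = Counter(labels)
    site_counts = defaultdict(Counter)  # (class, site) -> symbol counts
    alphabets = [set() for _ in rows[0]]
    for row, y in zip(rows, labels):
        for i, sym in enumerate(row):
            site_counts[(y, i)][sym] += 1
            alphabets[i].add(sym)
    return class_counts, site_counts, alphabets, alpha


def predict_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    class_counts, site_counts, alphabets, alpha = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for y, n_y in class_counts.items():
        score = log(n_y / total)
        for i, sym in enumerate(row):
            # The +1 reserves probability mass for symbols unseen in training.
            denom = n_y + alpha * (len(alphabets[i]) + 1)
            score += log((site_counts[(y, i)][sym] + alpha) / denom)
        if score > best_score:
            best, best_score = y, score
    return best


def cv_accuracy(rows, labels, k=10, seed=0):
    """Accuracy pooled over k cross-validation folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = train_nb([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_nb(model, rows[i]) == labels[i] for i in test)
    return correct / len(rows)


# Synthetic data (not HCV sequences): three weakly informative sites
# followed by three pure-noise sites, with balanced class labels.
rng = random.Random(42)
labels = ["CA", "AA"] * 100
rows = []
for y in labels:
    preferred, other = ("A", "G") if y == "CA" else ("G", "A")
    informative = [preferred if rng.random() < 0.8 else other for _ in range(3)]
    noise = [rng.choice("ACGT") for _ in range(3)]
    rows.append(tuple(informative + noise))

# Control: same sequences, labels randomly reassigned.
shuffled = labels[:]
rng.shuffle(shuffled)

acc_true = cv_accuracy(rows, labels)
acc_rand = cv_accuracy(rows, shuffled)
print(f"10xCV accuracy, true labels:     {acc_true:.3f}")
print(f"10xCV accuracy, shuffled labels: {acc_rand:.3f}")
```

A large gap between the two accuracies suggests the classifier is using a real sequence-label association rather than fitting noise. Repeating the shuffled run with several seeds would give an average comparable in spirit to the BNC_Rand figures.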

Keywords :

Bayes methods; data analysis; diseases; feature selection; genetics; genomics; liver; microorganisms; molecular biophysics; proteins; random processes; Bayesian network classifiers; HCV chronic infection; HCV data analysis; HCV genotype 1a; HCV genotype 1b; amino-acid sequence information; correlation feature selection method; demographic characteristics; genetic heterogeneity modeling; genetic properties; genomic sites; hepatitis C virus hypervariable region 1; interferon therapy; intrahost HVR1 variant sequences; liver disease; liver transplantation; nucleotide sequence information; polyprotein sites; random sampling; ribavirin therapy; viral strains; Accuracy; Bayes methods; Bioinformatics; Correlation; Genomics; Strain; Bayesian network; HVR1; biomarkers; coevolution; hepatitis C; prediction;