Recent technological advances make it possible to collect whole-genome transcription factor binding (TFB) profiles from multiple species using ChIP-Seq. These data provide rich information for understanding TFB evolution, yet few rigorous statistical models are available to infer TFB evolution from them. We have developed a phylogenetic-tree-based method to model the on/off rates of TFB events. Two features distinguish our method from existing models. First, we mask nucleotide substitutions and focus on INDEL disruption of TFB events; INDELs are rarer evolutionary events and are more appropriate for divergent species and non-coding regulatory regions. Second, we correct for ascertainment bias in ChIP-Seq data by maximizing the likelihood conditional on the observed (incomplete) data. Simulations show that our method performs well in model selection and parameter estimation when there are sufficient aligned TFB events. Applying the method to a ChIP-Seq data set from five vertebrates, we find that the instantaneous transition rates to INDELs are higher in TFB regions than in homologous non-binding regions. This difference is driven by an excess of alignment columns showing binding in one species but gaps in all other species. When we compare the inferred transition rates between conserved and non-conserved regions, the conserved regions are, as expected, estimated to have lower transition rates. The R package TFBphylo, which implements the described model, can be downloaded from http://bioinformatics.med.yale.edu/.
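The two ideas named in the abstract, on/off rates along a phylogenetic tree and a likelihood conditioned on what could be observed, can be illustrated with a minimal Python sketch. This is not the TFBphylo implementation. It assumes a simplified two-state chain (0 = off, 1 = on) with hypothetical rates `lam` (off to on) and `mu` (on to off), a toy three-species tree with made-up branch lengths, a stationary root distribution, and an ascertainment scheme in which a column is observed only if at least one species shows binding. The model in the paper, which also involves INDEL states, may differ on each of these points.

```python
import math

def transition_probs(lam, mu, t):
    """Transition matrix P(t) of a two-state chain: 0 = off, 1 = on."""
    total = lam + mu
    pi_on, pi_off = lam / total, mu / total   # stationary distribution
    decay = math.exp(-total * t)
    return [
        [pi_off + pi_on * decay, pi_on * (1.0 - decay)],
        [pi_off * (1.0 - decay), pi_on + pi_off * decay],
    ]

def partial_likelihood(node, states, lam, mu):
    """Felsenstein pruning: likelihood of the data below `node` given each node state."""
    if isinstance(node, str):                 # leaf: observed state is known
        s = states[node]
        return [1.0 if s == 0 else 0.0, 1.0 if s == 1 else 0.0]
    result = [1.0, 1.0]
    for child, branch_length in node:         # internal node: list of (child, length)
        p = transition_probs(lam, mu, branch_length)
        child_l = partial_likelihood(child, states, lam, mu)
        for parent in (0, 1):
            result[parent] *= sum(p[parent][c] * child_l[c] for c in (0, 1))
    return result

def column_likelihood(tree, states, lam, mu):
    """Unconditional likelihood of one alignment column (stationary root assumed)."""
    pi_on = lam / (lam + mu)
    l = partial_likelihood(tree, states, lam, mu)
    return (1.0 - pi_on) * l[0] + pi_on * l[1]

def conditional_log_likelihood(tree, leaves, columns, lam, mu):
    """Log-likelihood conditional on each column having at least one 'on' state.

    Assumption: all-off columns are never ascertained, so each column's
    likelihood is divided by 1 - P(all leaves off).
    """
    all_off = {leaf: 0 for leaf in leaves}
    log_norm = math.log(1.0 - column_likelihood(tree, all_off, lam, mu))
    return sum(
        math.log(column_likelihood(tree, col, lam, mu)) - log_norm
        for col in columns
    )

# Hypothetical toy tree: ((sp_b, sp_c), sp_a) with made-up branch lengths.
LEAVES = ["sp_a", "sp_b", "sp_c"]
TREE = [("sp_a", 0.10), ([("sp_b", 0.05), ("sp_c", 0.05)], 0.08)]

# Two illustrative columns: binding in one species only, and binding in all three.
COLUMNS = [
    {"sp_a": 1, "sp_b": 0, "sp_c": 0},
    {"sp_a": 1, "sp_b": 1, "sp_c": 1},
]

if __name__ == "__main__":
    print(conditional_log_likelihood(TREE, LEAVES, COLUMNS, lam=0.3, mu=0.7))
```

In practice the rates would be estimated by passing the negative of `conditional_log_likelihood` to a numerical optimizer. Dropping the `log_norm` term treats the ascertained columns as a complete sample and would be expected to bias the rate estimates.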
Contents
- Studying the evolution of transcription factor binding events using multi-species ChIP-Seq data (March 26, 2013)
- Approximate Bayesian computation with functional statistics (March 26, 2013)
- Monte Carlo estimation of total variation distance of Markov chains on large spaces, with application to phylogenetics (March 26, 2013)
- Higher order asymptotics for negative binomial regression inferences from RNA-sequencing data (March 26, 2013)
- Flexible pooling in gene expression profiles: design and statistical modeling of experiments for unbiased contrasts (March 26, 2013)
- On optimality of kernels for approximate Bayesian computation using sequential Monte Carlo (March 26, 2013)
- Inferring latent gene regulatory network kinetics (March 26, 2013)