A Variance-Components Model for Distance-Matrix Phylogenetic Reconstruction
Walter R. Gilks, Tom M.W. Nye and Pietro Lio
Phylogenetic trees describe evolutionary relationships between related organisms (taxa). One approach to estimating phylogenetic trees supposes that a matrix of estimated evolutionary distances between taxa is available. Agglomerative methods have been proposed in which closely related taxon-pairs are successively combined to form ancestral taxa. Several of these computationally efficient agglomerative algorithms involve steps to reduce the variance in estimated distances. We propose an agglomerative phylogenetic method which focuses on statistical modeling of variance components in distance estimates. We consider how these variance components evolve during the agglomerative process. Our method simultaneously produces two topologically identical rooted trees, one tree having branch lengths proportional to elapsed time, and the other having branch lengths proportional to underlying evolutionary divergence. The method models two major sources of variation which have been separately discussed in the literature: noise, reflecting inaccuracies in measuring divergences, and distortion, reflecting randomness in the amounts of divergence in different parts of the tree. The methodology is based on successive hierarchical generalized least-squares regressions. It involves only means, variances and covariances of distance estimates, thereby avoiding full distributional assumptions. Exploitation of the algebraic structure of the estimation leads to an algorithm with computational complexity comparable to the leading published agglomerative methods. A parametric bootstrap procedure allows full uncertainty in the phylogenetic reconstruction to be assessed. Software implementing the methodology may be freely downloaded from StatTree.
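To make the agglomerative idea concrete, the following is a minimal sketch of UPGMA, a classical agglomerative distance-matrix method. It is not the variance-components method proposed here: UPGMA assumes a molecular clock and combines distances by simple size-weighted averaging, with no modeling of noise or distortion. The function name `upgma`, the input layout, and the toy distance matrix are illustrative assumptions.

```python
from itertools import combinations


def upgma(names, D):
    """Agglomerate taxa by UPGMA (illustrative only, not the paper's method).

    names: list of taxon labels.
    D:     symmetric matrix (list of lists) of pairwise distances.
    Returns (newick_string, root_height).
    """
    # Active clusters: id -> (Newick text, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Pairwise distances between active clusters, keyed by unordered id pair.
    d = {frozenset((i, j)): D[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(clusters) > 1:
        # Select the closest pair of active clusters.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda p: d[frozenset(p)])
        h = d[frozenset((i, j))] / 2.0  # height of the new ancestral node
        ti, ni, hi = clusters.pop(i)
        tj, nj, hj = clusters.pop(j)
        text = f"({ti}:{h - hi:.3f},{tj}:{h - hj:.3f})"
        # Distance from the new ancestor to every remaining cluster:
        # leaf-count-weighted average of the two merged clusters' distances.
        for k in clusters:
            d[frozenset((next_id, k))] = (
                ni * d[frozenset((i, k))] + nj * d[frozenset((j, k))]
            ) / (ni + nj)
        clusters[next_id] = (text, ni + nj, h)
        next_id += 1

    text, _, height = next(iter(clusters.values()))
    return text + ";", height


# Toy, exactly clock-like distances between four hypothetical taxa.
toy = [[0, 2, 6, 6],
       [2, 0, 6, 6],
       [6, 6, 0, 4],
       [6, 6, 4, 0]]
newick, height = upgma(["A", "B", "C", "D"], toy)
print(newick, height)
```

Each pass merges the closest pair into an ancestral taxon and updates the distance matrix, which is the step where variance-aware methods differ: rather than a plain average, they weight the combined distances according to their estimated variances and covariances.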
©2011 Walter de Gruyter GmbH & Co. KG, Berlin/Boston