Abstract
A central statistical goal is to choose between alternative explanatory models of data. In many modern applications, such as population genetics, standard methods based on evaluating the likelihood functions of the models cannot be applied because these functions are numerically intractable. Approximate Bayesian computation (ABC) is a commonly used alternative in such situations. ABC simulates data x for many parameter values under each model and compares each simulated data set with the observed data x_obs. More weight is placed on models under which S(x) is close to S(x_obs), where S maps data to a vector of summary statistics. Previous work has shown that the choice of S is crucial to the efficiency and accuracy of ABC. This paper provides a method to select good summary statistics for model choice. It uses a preliminary step: many x values are simulated from all models, and regressions are fitted to these simulations with the model as the response. The resulting model weight estimators are used as S in an ABC analysis. Theoretical results are given to justify this as approximating low dimensional sufficient statistics. A substantive application is presented: choosing between competing coalescent models of demographic growth for Campylobacter jejuni in New Zealand using multi-locus sequence typing data.
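The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two toy Gaussian models, the uniform prior on the mean, the features (sample mean and log sample variance), the use of logistic regression fitted by plain gradient ascent, and the nearest-k rejection rule. All function names are hypothetical.

```python
import math
import random
import statistics


def simulate(model, n, rng):
    """Draw theta from its prior, then n observations under a toy model."""
    theta = rng.uniform(-2.0, 2.0)     # assumed prior on the mean
    sd = 1.0 if model == 0 else 2.0    # the toy models differ only in spread
    return [rng.gauss(theta, sd) for _ in range(n)]


def features(x):
    """Covariates f(x) for the regression: intercept, mean, log variance."""
    return [1.0, statistics.fmean(x), math.log(statistics.variance(x))]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))       # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def predict(beta, x):
    """Estimated weight of model 1 given data x; used as the summary S(x)."""
    return sigmoid(sum(b * v for b, v in zip(beta, features(x))))


def fit_summary(n_obs, n_pilot, rng, steps=200, rate=0.5):
    """Pilot stage: simulate from both models, regress model index on f(x)."""
    labels = [rng.randrange(2) for _ in range(n_pilot)]
    rows = [features(simulate(m, n_obs, rng)) for m in labels]
    beta = [0.0] * len(rows[0])
    for _ in range(steps):             # gradient ascent on the log-likelihood
        grad = [0.0] * len(beta)
        for f, y in zip(rows, labels):
            err = y - sigmoid(sum(b * v for b, v in zip(beta, f)))
            for j, v in enumerate(f):
                grad[j] += err * v
        beta = [b + rate * g / n_pilot for b, g in zip(beta, grad)]
    return beta


def abc_model_choice(x_obs, beta, n_sims, keep, rng):
    """Rejection ABC: keep the simulations whose S(x) is nearest S(x_obs)."""
    s_obs = predict(beta, x_obs)
    draws = []
    for _ in range(n_sims):
        m = rng.randrange(2)           # uniform prior over the two models
        x = simulate(m, len(x_obs), rng)
        draws.append((abs(predict(beta, x) - s_obs), m))
    accepted = sorted(draws)[:keep]
    # Fraction of accepted simulations that came from model 1.
    return sum(m for _, m in accepted) / keep


if __name__ == "__main__":
    rng = random.Random(0)
    beta = fit_summary(n_obs=50, n_pilot=1000, rng=rng)
    x_obs = simulate(1, 50, rng)       # pretend data, generated under model 1
    p1 = abc_model_choice(x_obs, beta, n_sims=2000, keep=100, rng=rng)
    print("estimated posterior probability of model 1:", p1)
```

The fitted probability is a one-dimensional summary, which keeps the ABC distance comparison cheap regardless of how many features enter the regression. In a real analysis the features, regression family, and acceptance threshold would all need to be chosen for the problem at hand.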
The authors acknowledge the Marsden Fund project 08-MAU-099 (Cows, starlings and Campylobacter in New Zealand: unifying phylogeny, genealogy, and epidemiology to gain insight into pathogen evolution) for funding this project. This publication made use of the Campylobacter Multi Locus Sequence Typing website (http://pubmlst.org/campylobacter/) developed by Keith Jolley and sited at the University of Oxford (Jolley and Maiden 2010, BMC Bioinformatics, 11:595). The development of this site has been funded by the Wellcome Trust. The paper has benefited from many helpful suggestions of two anonymous reviewers.
©2014 by Walter de Gruyter Berlin Boston
Articles in the same Issue
- Masthead
- Research Articles
- Modeling angles in proteins and circular genomes using multivariate angular distributions based on multiple nonnegative trigonometric sums
- Second order optimization for the inference of gene regulatory pathways
- Multiple comparisons in genetic association studies: a hierarchical modeling approach
- A Bayesian decision procedure for testing multiple hypotheses in DNA microarray experiments
- Semi-automatic selection of summary statistics for ABC model choice
- Detection of epistatic effects with logic regression and a classical linear regression model
- Bayesian clustering of DNA sequences using Markov chains and a stochastic partition model