Modeling Read Counts for CNV Detection in Exome Sequencing Data

Michael I. Love; Alena Myšičková; Ruping Sun; Vera Kalscheuer; Martin Vingron; Stefan A. Haas

doi:10.2202/1544-6115.1732


Published/Copyright: November 8, 2011


From the journal Statistical Applications in Genetics and Molecular Biology Volume 10 Issue 1

Varying depth of high-throughput sequencing reads along a chromosome makes it possible to observe copy number variants (CNVs) in a sample relative to a reference. In exome and other targeted sequencing projects, technical factors increase variation in read depth while reducing the number of observed locations, adding difficulty to the problem of identifying CNVs. We present a hidden Markov model for detecting CNVs from raw read count data, using background read depth from a control set as well as other positional covariates such as GC-content. The model, exomeCopy, is applied to a large chromosome X exome sequencing project, identifying a list of large unique CNVs. CNVs predicted by the model and experimentally validated are then recovered using a cross-platform control set from publicly available exome sequencing data. Simulations show high sensitivity for detecting heterozygous and homozygous CNVs, outperforming normalization and state-of-the-art segmentation methods.
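
As a rough illustration of the general idea (not the exomeCopy implementation), the following minimal sketch decodes a copy number path from per-target read counts with a hidden Markov model. Everything specific here is an assumption made for the example: the function names, the Poisson emission model, the fixed transition probabilities, and the use of a single background value per target in place of a fitted combination of control depth and covariates such as GC-content.

```python
import math


def log_poisson(k, mu):
    """Log-probability of observing count k under a Poisson with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def viterbi_copy_number(counts, background, copy_states=(0, 1, 2, 3),
                        normal_copy=2, stay_prob=0.95, eps=0.05):
    """Most likely copy number per target (illustrative assumptions only).

    counts      -- observed read counts, one per target region
    background  -- expected counts at normal copy number (must be > 0),
                   e.g. scaled from a control set
    eps         -- small ratio used for copy number 0 so the mean stays > 0
    """
    n = len(copy_states)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n - 1))

    def transition(prev, cur):
        return log_stay if prev == cur else log_move

    def emission(t, s):
        # Expected depth scales with copy number relative to the normal state.
        ratio = max(copy_states[s] / normal_copy, eps)
        return log_poisson(counts[t], background[t] * ratio)

    # Initialise with a uniform prior over states.
    score = [math.log(1.0 / n) + emission(0, s) for s in range(n)]
    backpointers = []

    for t in range(1, len(counts)):
        new_score, pointers = [], []
        for s in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + transition(p, s))
            new_score.append(score[best_prev] + transition(best_prev, s)
                             + emission(t, s))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [copy_states[s] for s in path]
```

Calling `viterbi_copy_number(counts, background)` with two equal-length lists returns one copy number per target. A run of targets with roughly half the background depth would be expected to decode as a heterozygous deletion under these assumptions.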

Keywords: exome sequencing; targeted sequencing; CNV; copy number variant; HMM; hidden Markov model

Author Notes:

We thank our collaborators on the XLID project, Prof. Dr. H.-Hilger Ropers, Wei Chen, Hao Hu, Reinhard Ullmann and the EUROMRX consortium for providing the XLID data, validation of CNVs and for helpful discussion. We also thank Ho-Ryun Chung for suggestions. Part of this work was financed by the European Union’s Seventh Framework Program under grant agreement number 241995, project GENCODYS.


Published Online: 2011-11-8
