Abstract
Third-generation sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore generate longer reads than next-generation sequencing, which makes the assembly process faster, simpler and more cost-effective. However, these long reads have higher error rates than short reads, so an error-correction step is needed before assembly, for example the use of Circular Consensus Sequencing (CCS) reads on PacBio sequencing machines. In this paper, we propose a probabilistic model for the occurrence of errors along CCS reads. We obtain the error probability of an arbitrary nucleotide, as well as the base-calling Phred quality score of the nucleotides along the CCS reads, in terms of the number of sub-reads. Furthermore, we derive the error rate distribution of the reads in relation to the pass number. It follows a binomial distribution, which can be approximated by a normal distribution for long reads. Finally, we evaluate the proposed model by comparing it with three real PacBio datasets: the Lambda and E. coli genomes and an Alzheimer’s disease targeted experiment.
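The quantities named in the abstract can be illustrated with a minimal sketch. This is not the model derived in the paper: it assumes, purely for illustration, that each sub-read miscalls a base independently with the same probability p and that the consensus is wrong whenever at least half of the passes are wrong. The function names and the sub-read error rate of 0.13 are hypothetical. Only the Phred definition Q = -10 log10(P) and the normal approximation to the binomial are standard results.

```python
import math


def consensus_error_prob(p: float, n: int) -> float:
    """Probability that at least half of n independent passes miscall a base.

    Assumption (not from the paper): each pass errs independently with
    probability p, and the consensus is wrong whenever wrong calls reach
    at least half of the passes (ties on even n are counted as errors).
    """
    k_min = (n + 1) // 2
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )


def phred(p_error: float) -> float:
    """Standard Phred quality score Q = -10 * log10(P(error))."""
    return -10.0 * math.log10(p_error)


def error_rate_normal_approx(p: float, read_length: int) -> tuple[float, float]:
    """Mean and standard deviation of the per-read error rate.

    With independent per-base errors, the error count is Binomial(L, p),
    so the error rate (count / L) is approximately Normal(p, p(1-p)/L)
    for large L.
    """
    return p, math.sqrt(p * (1 - p) / read_length)


if __name__ == "__main__":
    # 0.13 is a hypothetical sub-read error rate chosen for illustration.
    for n in (1, 3, 5, 9):
        pe = consensus_error_prob(0.13, n)
        print(f"passes={n}  P(error)={pe:.3e}  Q={phred(pe):.1f}")
    print(error_rate_normal_approx(0.01, 10_000))
```

Under these simplifying assumptions, the consensus error probability falls and the Phred score rises as the number of passes grows, which is the qualitative relationship the abstract describes.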
- Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
- Research funding: None declared.
- Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/ijb-2021-0091).
© 2023 Walter de Gruyter GmbH, Berlin/Boston
Articles in this issue
- Frontmatter
- Part-1: SMAC 2021 Webconference
- Statistics, philosophy, and health: the SMAC 2021 webconference
- Part-2: Regular Articles
- “Show me the DAG!”
- Causal inference for oncology: past developments and current challenges
- The EBM+ movement
- Bayesianism from a philosophical perspective and its application to medicine
- Bayesian inference for optimal dynamic treatment regimes in practice
- Agent-based modeling in medical research, virtual baseline generator and change in patients’ profile issue
- Agent based modeling in health care economics: examples in the field of thyroid cancer
- A copula-based set-variant association test for bivariate continuous, binary or mixed phenotypes
- Detection of atypical response trajectories in biomedical longitudinal databases
- Potential application of elastic nets for shared polygenicity detection with adapted threshold selection
- Error analysis of the PacBio sequencing CCS reads
- A SIMEX approach for meta-analysis of diagnostic accuracy studies with attention to ROC curves
- Statistical modelling of COVID-19 and drug data via an INAR(1) process with a recent thinning operator and cosine Poisson innovations
- The balanced discrete triplet Lindley model and its INAR(1) extension: properties and COVID-19 applications