Error analysis of the PacBio sequencing CCS reads

Reza Pourmohammadi; Jamshid Abouei; Alagan Anpalagan

doi:10.1515/ijb-2021-0091

Published/Copyright: May 8, 2023

From the journal The International Journal of Biostatistics, Volume 19, Issue 2

Abstract

Third-generation sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore generate longer reads than next-generation sequencing, which makes the assembly process faster, simpler, and more cost-effective. However, these long reads have higher error rates than short reads, so an error-correction step is needed before assembly, for example the use of Circular Consensus Sequencing (CCS) reads on PacBio sequencing machines. In this paper, we propose a probabilistic model for error occurrence along CCS reads. We obtain the error probability of an arbitrary nucleotide, as well as the base-calling Phred quality score of the nucleotides along the CCS reads, in terms of the number of sub-reads. Furthermore, we derive the distribution of the read error rate as a function of the pass number. This error rate follows a binomial distribution, which can be approximated by a normal distribution for long reads. Finally, we evaluate the proposed model by comparing it with three real PacBio datasets: the Lambda and E. coli genomes and an Alzheimer’s disease targeted experiment.
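The two quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the paper's expression for the per-base error probability as a function of the number of sub-reads is not reproduced on this page, so `p_error` is simply taken as an input here. The sketch assumes independent, identically distributed base errors along a read, uses the standard Phred definition Q = -10 log10(p), and compares the binomial error count with its normal approximation. All function names are hypothetical.

```python
import math


def phred_quality(p_error: float) -> float:
    """Standard Phred score for a base-calling error probability p."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must lie in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_rate_moments(p_error: float, read_length: int) -> tuple[float, float]:
    """Mean and standard deviation of the read error rate (errors / length).

    Assumption: each of the read_length bases is wrong independently with
    probability p_error, so the error count is Binomial(read_length, p_error).
    """
    mean = p_error
    std = math.sqrt(p_error * (1.0 - p_error) / read_length)
    return mean, std


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact probability of observing k errors in a read of n bases."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of the normal distribution used as the approximation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    p, n = 0.01, 10_000  # illustrative values, not taken from the paper
    print("Phred Q:", phred_quality(p))
    print("error-rate mean, std:", error_rate_moments(p, n))
    mu, sigma = n * p, math.sqrt(n * p * (1.0 - p))
    k = 100  # number of errors at which to compare the two distributions
    print("binomial pmf:", binomial_pmf(k, n, p))
    print("normal pdf:  ", normal_pdf(k, mu, sigma))
```

For a long read the two printed probabilities should nearly coincide, which is the sense in which the normal distribution approximates the binomial one. The independence assumption is a simplification made for this illustration only.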

Keywords: CCS reads accuracy; CCS reads quality; PacBio error model; sequencing noise

Corresponding author: Jamshid Abouei, WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran, E-mail: abouei@yazd.ac.ir

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/ijb-2021-0091).

Received: 2021-08-23

Accepted: 2022-09-07

Published Online: 2023-05-08
