Bayesian approach to discriminant problems for count data with application to multilocus short tandem repeat dataset

  • Koji Tsukuda, Shuhei Mano and Toshimichi Yamamoto
Published/Copyright: May 4, 2020

Abstract

Short Tandem Repeats (STRs) are a type of DNA polymorphism. This study considers discriminant analysis to determine the population of test individuals using an STR database containing the lengths of STRs observed at more than one locus. The discriminant method based on the Bayes factor is discussed and an improved method is proposed. The main issues are to develop a method that is relatively robust to sample size imbalance, identify a procedure to select loci, and treat the parameter in the prior distribution. A previous study achieved a classification accuracy of 0.748 for the g-mean (geometric mean of classification accuracies for two populations) and 0.867 for the AUC (area under the receiver operating characteristic curve). We improve the maximum values for the g-mean to 0.830 and the AUC to 0.935. Computer simulations indicate that the previous method is susceptible to sample size imbalance, whereas the proposed method is more robust while achieving almost identical classification accuracy. Furthermore, the results confirm that threshold adjustment is an effective countermeasure to sample size imbalance.


Corresponding author: Koji Tsukuda, Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro-ku, 153-8902, Tokyo, Japan; and Faculty of Mathematics, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka-shi, 819-0395, Fukuoka, Japan

Funding source: Japan Society for the Promotion of Science

Award Identifier / Grant number: 16H02791

Award Identifier / Grant number: 17H04148

Award Identifier / Grant number: 18K13454

Acknowledgments

The authors would like to express their gratitude to Professor Naruya Saitou, who provided part of the data used in this study, and to the referees, who provided many insightful comments. This work was partly supported by Japan Society for the Promotion of Science KAKENHI Grant Numbers 16H02791, 17H04148 and 18K13454. This study was partly carried out while the first author was a member of the Biostatistics Center, Kurume University.

Appendix A

Selecting prior distribution

We examine the following prior distributions:

  1. Bayes–Laplace uniform prior: γ=1;

  2. Jeffreys prior (Jeffreys 1946): γ=1/2;

  3. Perks prior (Perks 1947): γ=1/d;

  4. Minimax prior (Trybuła 1958): γ=√n/d;

where d denotes the number of classes and n denotes the sample size. For the Perks and minimax priors, we consider two ways of obtaining the class labels, as mentioned in Subsection 3.2. In the data analysis, Perks (C) and minimax (C) denote methods in which the class labels are taken as the union of the labels of both populations; Perks (NC) and minimax (NC) denote methods in which the class labels are taken separately for each population.
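The four choices of γ can be written down compactly. The following is a minimal sketch, assuming γ is the parameter of a symmetric Dirichlet prior on the class frequencies, so that the posterior mean of each frequency is (x_i + γ)/(n + dγ), and assuming the minimax value takes the √n/d form usually attributed to Trybuła (1958). The function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def symmetric_gamma(prior: str, d: int, n: int) -> float:
    """Return the symmetric Dirichlet parameter for a named prior.

    d is the number of classes and n is the sample size.
    """
    table = {
        "bayes-laplace": 1.0,
        "jeffreys": 0.5,
        "perks": 1.0 / d,
        # Assumption: the minimax prior uses sqrt(n)/d (Trybula's form).
        "minimax": math.sqrt(n) / d,
    }
    return table[prior]


def posterior_mean(counts, gamma):
    """Posterior mean of class frequencies under a symmetric Dirichlet(gamma) prior."""
    n = sum(counts)
    d = len(counts)
    return [(x + gamma) / (n + d * gamma) for x in counts]


# Hypothetical class counts at a single locus (not data from the study).
counts = [6, 3, 1, 0]
d, n = len(counts), sum(counts)
for prior in ("bayes-laplace", "jeffreys", "perks", "minimax"):
    g = symmetric_gamma(prior, d, n)
    probs = [round(p, 3) for p in posterior_mean(counts, g)]
    print(f"{prior:14s} gamma={g:.3f} probs={probs}")
```

Under this reading, the (C) and (NC) variants of the Perks and minimax priors would differ only in the value of d passed in, depending on whether the classes are counted over the pooled labels or per population.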

Figure 2 shows the AUC using the above prior distributions. Perks (C) is optimal with respect to PM1, and Jeffreys prior is optimal with respect to the other methods. In all other cases, the AUCs are not improved over the case in which the DP is estimated by maximizing the marginal likelihood. Overall, in our analysis, the results indicate that these prior distributions do not improve the accuracy.
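For readers who want to reproduce an AUC of this kind from classifier scores, the following is a minimal sketch. It uses the standard rank interpretation of the AUC: the probability that a randomly chosen individual from one population receives a higher score than a randomly chosen individual from the other, with ties counted as one half. The scores are hypothetical log Bayes factors, and the assumption that larger values favor population A is made for illustration only.

```python
def auc(scores_a, scores_b):
    """AUC as the fraction of (A, B) pairs in which the A score is larger.

    Ties contribute one half, as in the Mann-Whitney statistic.
    """
    wins = 0.0
    for a in scores_a:
        for b in scores_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_a) * len(scores_b))


# Hypothetical log Bayes factors; larger values are assumed to favor population A.
log_bf_a = [2.1, 0.7, 1.5, -0.2]  # individuals truly from population A
log_bf_b = [-1.3, 0.7, -0.4]      # individuals truly from population B

print(auc(log_bf_a, log_bf_b))  # 0.875
```

Because the AUC depends only on the ordering of the scores, it is unaffected by the choice of threshold, which is why it complements threshold-dependent measures such as the g-mean.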

Appendix B

Simulation results

This appendix presents the tables of results for the simulation described in Section 7.

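Table 10:

Simulation results (L̃=0). Numbers are the average true classification accuracies for population A (PopA) and population B (PopB) and the average g-means (GM). The threshold value of the Bayes factor is denoted by t.

| r | γ | n_2 | method | PopA (t=1) | PopB (t=1) | GM (t=1) | PopA (t=t_p) | PopB (t=t_p) | GM (t=t_p) | PopA (t=t_g) | PopB (t=t_g) | GM (t=t_g) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 100 | Y-C | 0.514 | 0.521 | 0.504 | | | | | | |
| 1 | 5 | 100 | P1-C | 0.492 | 0.529 | 0.495 | | | | | | |
| 1 | 5 | 25 | Y-C | 0.885 | 0.106 | 0.245 | 0.799 | 0.202 | 0.359 | 0.511 | 0.487 | 0.468 |
| 1 | 5 | 25 | P1-C | 0.656 | 0.358 | 0.459 | 0.480 | 0.520 | 0.475 | 0.503 | 0.520 | 0.482 |
| 1 | 1 | 100 | Y-C | 0.508 | 0.499 | 0.488 | | | | | | |
| 1 | 1 | 100 | P1-C | 0.497 | 0.489 | 0.478 | | | | | | |
| 1 | 1 | 25 | Y-C | 0.860 | 0.144 | 0.310 | 0.759 | 0.247 | 0.403 | 0.501 | 0.485 | 0.469 |
| 1 | 1 | 25 | P1-C | 0.653 | 0.354 | 0.458 | 0.461 | 0.515 | 0.468 | 0.504 | 0.489 | 0.475 |
| 1 | 1/2 | 100 | Y-C | 0.481 | 0.510 | 0.477 | | | | | | |
| 1 | 1/2 | 100 | P1-C | 0.494 | 0.510 | 0.485 | | | | | | |
| 1 | 1/2 | 25 | Y-C | 0.834 | 0.195 | 0.350 | 0.706 | 0.309 | 0.447 | 0.531 | 0.489 | 0.479 |
| 1 | 1/2 | 25 | P1-C | 0.659 | 0.349 | 0.456 | 0.480 | 0.538 | 0.489 | 0.508 | 0.519 | 0.492 |
| 0 | 5 | 100 | Y-C | 0.655 | 0.629 | 0.631 | | | | | | |
| 0 | 5 | 100 | P1-C | 0.653 | 0.627 | 0.628 | | | | | | |
| 0 | 5 | 25 | Y-C | 0.923 | 0.193 | 0.372 | 0.861 | 0.289 | 0.470 | 0.613 | 0.594 | 0.579 |
| 0 | 5 | 25 | P1-C | 0.736 | 0.456 | 0.561 | 0.572 | 0.635 | 0.588 | 0.591 | 0.611 | 0.579 |
| 0 | 1 | 100 | Y-C | 0.892 | 0.896 | 0.891 | | | | | | |
| 0 | 1 | 100 | P1-C | 0.892 | 0.887 | 0.887 | | | | | | |
| 0 | 1 | 25 | Y-C | 0.973 | 0.512 | 0.693 | 0.954 | 0.614 | 0.757 | 0.839 | 0.799 | 0.811 |
| 0 | 1 | 25 | P1-C | 0.897 | 0.732 | 0.804 | 0.827 | 0.822 | 0.819 | 0.846 | 0.806 | 0.820 |
| 0 | 1/2 | 100 | Y-C | 0.970 | 0.969 | 0.969 | | | | | | |
| 0 | 1/2 | 100 | P1-C | 0.972 | 0.969 | 0.970 | | | | | | |
| 0 | 1/2 | 25 | Y-C | 0.988 | 0.762 | 0.864 | 0.981 | 0.825 | 0.897 | 0.939 | 0.908 | 0.921 |
| 0 | 1/2 | 25 | P1-C | 0.966 | 0.866 | 0.912 | 0.929 | 0.920 | 0.922 | 0.926 | 0.915 | 0.918 |
| 1/2 | 5 | 100 | Y-C | 0.549 | 0.556 | 0.540 | | | | | | |
| 1/2 | 5 | 100 | P1-C | 0.558 | 0.541 | 0.540 | | | | | | |
| 1/2 | 5 | 25 | Y-C | 0.894 | 0.109 | 0.248 | 0.824 | 0.209 | 0.385 | 0.549 | 0.520 | 0.505 |
| 1/2 | 5 | 25 | P1-C | 0.680 | 0.360 | 0.478 | 0.527 | 0.547 | 0.514 | 0.527 | 0.532 | 0.501 |
| 1/2 | 1 | 100 | Y-C | 0.699 | 0.648 | 0.664 | | | | | | |
| 1/2 | 1 | 100 | P1-C | 0.689 | 0.628 | 0.648 | | | | | | |
| 1/2 | 1 | 25 | Y-C | 0.918 | 0.201 | 0.400 | 0.851 | 0.316 | 0.502 | 0.630 | 0.599 | 0.591 |
| 1/2 | 1 | 25 | P1-C | 0.740 | 0.443 | 0.556 | 0.598 | 0.610 | 0.590 | 0.601 | 0.610 | 0.588 |
| 1/2 | 1/2 | 100 | Y-C | 0.779 | 0.751 | 0.758 | | | | | | |
| 1/2 | 1/2 | 100 | P1-C | 0.787 | 0.760 | 0.767 | | | | | | |
| 1/2 | 1/2 | 25 | Y-C | 0.920 | 0.346 | 0.535 | 0.871 | 0.473 | 0.623 | 0.719 | 0.685 | 0.687 |
| 1/2 | 1/2 | 25 | P1-C | 0.807 | 0.597 | 0.679 | 0.667 | 0.726 | 0.682 | 0.709 | 0.699 | 0.691 |
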
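Because GM in Table 10 is the average of per-replicate g-means, it need not equal the g-mean of the averaged accuracies in the same row. For example, in the first n_2=25 row for Y-C, √(0.885 × 0.106) ≈ 0.306, whereas the reported GM is 0.245. The following minimal sketch illustrates the gap. The per-replicate accuracies are hypothetical and are not taken from the simulation.

```python
import math
from statistics import mean


def g_mean(acc_a, acc_b):
    """Geometric mean of the two per-population classification accuracies."""
    return math.sqrt(acc_a * acc_b)


# Hypothetical (PopA, PopB) accuracies from three simulation replicates.
replicates = [(0.95, 0.05), (0.80, 0.20), (0.90, 0.10)]

avg_a = mean(a for a, _ in replicates)
avg_b = mean(b for _, b in replicates)

# Average of the per-replicate g-means (the quantity the GM columns report).
avg_gm = mean(g_mean(a, b) for a, b in replicates)

# g-mean computed from the averaged accuracies (generally not smaller).
gm_of_avg = g_mean(avg_a, avg_b)

print(f"average g-mean:          {avg_gm:.3f}")
print(f"g-mean of the averages:  {gm_of_avg:.3f}")
```

The average g-mean cannot exceed the g-mean of the averages, since √(ab) is a concave function of (a, b); the gap widens when the two accuracies vary strongly across replicates.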
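Table 11:

Simulation results (L̃=7,17). γ is always 1. Optimization to determine t_g does not converge in the search range for several cases.

| r | n_2 | L̃ | method | PopA (t=1) | PopB (t=1) | GM (t=1) | PopA (t=t_p) | PopB (t=t_p) | GM (t=t_p) | PopA (t=t_g) | PopB (t=t_g) | GM (t=t_g) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 7 | Y-C | 0.551 | 0.490 | 0.505 | | | | | | |
| 1 | 100 | 7 | P1-C | 0.543 | 0.492 | 0.505 | | | | | | |
| 1 | 25 | 7 | Y-C | 0.978 | 0.037 | 0.101 | 0.960 | 0.058 | 0.147 | 0.871 | 0.167 | 0.324 |
| 1 | 25 | 7 | P1-C | 0.792 | 0.258 | 0.420 | 0.711 | 0.327 | 0.460 | 0.562 | 0.482 | 0.486 |
| 0 | 100 | 7 | Y-C | 0.809 | 0.827 | 0.814 | | | | | | |
| 0 | 100 | 7 | P1-C | 0.813 | 0.817 | 0.811 | | | | | | |
| 0 | 25 | 7 | Y-C | 0.993 | 0.113 | 0.277 | 0.985 | 0.149 | 0.338 | 0.942 | 0.332 | 0.534 |
| 0 | 25 | 7 | P1-C | 0.900 | 0.469 | 0.635 | 0.855 | 0.577 | 0.693 | 0.755 | 0.705 | 0.718 |
| 1 | 100 | 17 | Y-C | 0.491 | 0.529 | 0.498 | | | | | | |
| 1 | 100 | 17 | P1-C | 0.490 | 0.518 | 0.491 | | | | | | |
| 1 | 25 | 17 | Y-C | 0.995 | 0.003 | 0.009 | 0.994 | 0.003 | 0.009 | 0.992 | 0.003 | 0.009 |
| 1 | 25 | 17 | P1-C | 0.855 | 0.151 | 0.293 | 0.803 | 0.206 | 0.357 | 0.630 | 0.390 | 0.471 |
| 0 | 100 | 17 | Y-C | 0.761 | 0.761 | 0.754 | | | | | | |
| 0 | 100 | 17 | P1-C | 0.754 | 0.766 | 0.753 | | | | | | |
| 0 | 25 | 17 | Y-C | 0.998 | 0.013 | 0.041 | 0.997 | 0.017 | 0.052 | 0.997 | 0.021 | 0.063 |
| 0 | 25 | 17 | P1-C | 0.930 | 0.293 | 0.488 | 0.901 | 0.359 | 0.546 | 0.780 | 0.569 | 0.653 |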
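The effect of moving the Bayes factor threshold away from t=1 can be illustrated with a small grid search. The following is only a sketch of the general idea of threshold adjustment: the definitions of t_p and t_g are not given in this appendix, and the rule used here (pick the candidate threshold with the largest g-mean on labeled data) is an assumption that may not match either of them. The Bayes factor values, the candidate grid, and the convention that values above t are assigned to population A are all hypothetical.

```python
import math


def accuracies(bf_a, bf_b, t):
    """Assign an individual to A when its Bayes factor exceeds t.

    Returns the pair (accuracy for true A, accuracy for true B).
    """
    acc_a = sum(bf > t for bf in bf_a) / len(bf_a)
    acc_b = sum(bf <= t for bf in bf_b) / len(bf_b)
    return acc_a, acc_b


def best_threshold(bf_a, bf_b, candidates):
    """Return the candidate threshold with the largest g-mean."""
    def gm(t):
        acc_a, acc_b = accuracies(bf_a, bf_b, t)
        return math.sqrt(acc_a * acc_b)
    return max(candidates, key=gm)


# Hypothetical Bayes factors, shifted toward A as can happen under imbalance.
bf_a = [8.0, 5.0, 3.0, 2.5, 1.8, 1.2]  # individuals truly from population A
bf_b = [2.2, 1.6, 1.1, 0.4]            # individuals truly from population B
candidates = [1.0, 1.5, 2.0, 2.3, 3.0]

print("accuracies at t=1:", accuracies(bf_a, bf_b, 1.0))
print("selected threshold:", best_threshold(bf_a, bf_b, candidates))
```

In this toy example the default threshold t=1 classifies every A individual correctly but only one in four B individuals, while a larger threshold trades some accuracy on A for a better balance, which mirrors the pattern seen in the n_2=25 rows of the tables.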

References

Andersen, M.M., Eriksen, P.S., and Morling, N. (2013). The discrete Laplace exponential family and estimation of Y-STR haplotype frequencies. J. Theoret. Biol. 329: 39–51, https://doi.org/10.1016/j.jtbi.2013.03.009.

Balding, D.J. (1995). Estimating products in forensic identification using DNA profiles. J. Amer. Statist. Assoc. 90: 839–844, https://doi.org/10.1080/01621459.1995.10476582.

Chakraborty, R., Srinivasan, M.R., and Daiger, S.P. (1993). Evaluation of standard error and confidence interval of estimated multilocus genotype probabilities, and their implications in DNA forensics. Am. J. Hum. Genet. 52: 60–70.

Chakraborty, R., Kimmel, M., Stivers, D.N., Davison, L.J., and Deka, R. (1997). Relative mutation rates at di-, tri-, tetranucleotide microsatellite loci. Proc. Natl. Acad. Sci. USA 94: 1041–1046, https://doi.org/10.1073/pnas.94.3.1041.

Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W.P. (2002). SMOTE: synthetic minority over-sampling technique. J. Artificial Intelligence Res. 16: 321–357, https://doi.org/10.1613/jair.953.

Durrett, R. (2008). Probability Models for DNA Sequence Evolution, 2nd ed. Springer, New York, https://doi.org/10.1007/978-0-387-78168-6.

Edwards, A., Hammond, H.A., Jin, L., Caskey, T., and Chakraborty, R. (1992). Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics 12: 241–253, https://doi.org/10.1016/0888-7543(92)90371-x.

Egeland, T. and Mostad, P.F. (2002). Statistical genetics and genetical statistics: a forensic perspective. Scand. J. Statist. 29: 297–307, https://doi.org/10.1111/1467-9469.00284.

Ewens, W.J. (2004). Mathematical Population Genetics. I. Theoretical introduction, 2nd ed. Springer-Verlag, New York, https://doi.org/10.1007/978-0-387-21822-9.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning. Data mining, inference, and prediction, 2nd ed. Springer, New York, https://doi.org/10.1007/978-0-387-84858-7.

Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. London Ser. A 186: 453–461, https://doi.org/10.1098/rspa.1946.0056.

Kass, R.E. and Raftery, A.E. (1995). Bayes Factors. J. Amer. Statist. Assoc. 90: 773–795, https://doi.org/10.1080/01621459.1995.10476572.

Keener, R., Rothman, E., and Starr, N. (1987). Distributions on partitions. Ann. Statist. 15: 1466–1481, https://doi.org/10.1214/aos/1176350604.

Kononenko, I. and Bratko, I. (1991). Information-based evaluation criterion for classifier's performance. Mach. Learn. 6: 67–80, https://doi.org/10.1007/bf00153760.

Kubat, M., Holte, R.C., and Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30: 195–215, https://doi.org/10.1023/A:1007452223027.

Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the Fourteenth International Conference on Machine Learning. Nashville, Tennessee, Morgan Kaufmann, pp. 179–186.

Lange, K. (1995). Applications of the Dirichlet distribution to forensic match probabilities. Genetica 96: 107–117, https://doi.org/10.1007/bf01441156.

Lewis, D.D. and Gale, W.A. (1994). A sequential algorithm for training text classifiers. In: Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag, pp. 3–12, https://doi.org/10.1007/978-1-4471-2099-5_1.

Matthews, B.W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405: 442–451, https://doi.org/10.1016/0005-2795(75)90109-9.

Perks, W. (1947). Some observations on inverse probability including a new indifference rule. J. Inst. Actuar. 73: 285–312; discussion 313–334, https://doi.org/10.1017/s0020268100012270.

Pritchard, J.K., Seielstad, M.T., Perez-Lezaun, A., and Feldman, M.W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16: 1791–1798, https://doi.org/10.1093/oxfordjournals.molbev.a026091.

Raymond, M. and Rousset, F. (1994). An exact test for population differentiation. Evolution 49: 1280–1283, https://doi.org/10.1111/j.1558-5646.1995.tb04456.x.

Riebler, A., Held, L., and Stephan, W. (2008). Bayesian variable selection for detecting adaptive genomic differences among populations. Genetics 178: 1817–1829, https://doi.org/10.1534/genetics.107.081281.

Trybuła, S. (1958). Some problems of simultaneous minimax estimation. Ann. Math. Statist. 29: 245–253, https://doi.org/10.1214/aoms/1177706722.

Wilson, I.J., Weale, M.E., and Balding, D.J. (2003). Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities. J. Roy. Statist. Soc. Ser. A 166: 155–201, https://doi.org/10.1111/1467-985x.00264.

Yamamoto, T., Hiroshige, Y., Ogawa, H., Yoshimoto, T., and Mano, S. (2015). Potential statistical differentiation between Japanese and Korean populations with about 100 STRs. Forensic Sci. Int. Genet. Suppl. Series 5: e348–e349, https://doi.org/10.1016/j.fsigss.2015.09.138.

Zhao, H. (2008). Instance weighting versus threshold adjusting for cost-sensitive classification. Knowl. Inf. Syst. 15: 321–334, https://doi.org/10.1007/s10115-007-0079-1.

Received: 2018-08-03
Accepted: 2020-03-23
Published Online: 2020-05-04

© 2020 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 6.2.2026 from https://www.degruyterbrill.com/document/doi/10.1515/sagmb-2018-0044/html