Abstract
Schwarz’s criterion, also known as the Bayesian information criterion (BIC), is commonly used for model selection in logistic regression because of its simple, intuitive formula. For tests of nested hypotheses in independent and identically distributed data, as well as in normal linear regression, previous results have motivated the use of Schwarz’s criterion as a consistent approximation to the Bayes factor (BF), defined as the ratio of posterior to prior model odds. Furthermore, when an intuitive unit-information prior is constructed for the parameters whose inclusion in the nested models is being tested, previous results have shown that Schwarz’s criterion approximates the BF to higher order in the neighborhood of the simpler nested model. This paper extends these results to univariate and multivariate logistic regression, providing approximations to the BF for arbitrary prior distributions and definitions of the unit-information prior corresponding to Schwarz’s approximation. Simulations demonstrate the accuracy of the approximations for small sample sizes and compare the resulting conclusions with those of frequentist testing. We present an application in prostate cancer, the motivating setting for our work, which illustrates the approximation on large data sets in a practical example.
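As a point of reference for the quantity being approximated, the following R sketch computes the familiar Schwarz-criterion approximation to the Bayes factor, BF ≈ exp(−ΔBIC/2), for two nested logistic regression models. The data, covariate names, and effect sizes are simulated assumptions; the refined unit-information approximations developed in the paper are not reproduced here.

```r
## Minimal sketch with simulated data (not the paper's prostate-cancer data).
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1))   # x2 has no effect in truth

fit0 <- glm(y ~ x1,      family = binomial)   # nested (null) model
fit1 <- glm(y ~ x1 + x2, family = binomial)   # full model

## Schwarz-criterion approximation: log BF_10 ≈ -(BIC_1 - BIC_0) / 2
bf10 <- exp(-(BIC(fit1) - BIC(fit0)) / 2)
bf10   # values below 1 favour the nested model
```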
- Author contribution: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
- Research funding: None declared.
- Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
Appendix
7.1 Proof of Proposition 1
Under Assumption 1 we can apply the Laplace approximation (3) to the numerator and denominator of (4) [1], [21]. Recall that m_0 = dim(Θ) and m = dim(Θ × Ψ). With
and
we obtain the approximation
Applying
Thus a first approximation to the BF is
where
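For reference, a generic form of the Laplace approximation applied above (the normalization in the paper’s equation (3) is assumed; see [1], [21]):

```latex
% Laplace approximation to an integrated likelihood with log-likelihood l,
% prior density \pi, MLE \hat{\theta}, and m = dim(\theta):
\int \exp\{\ell(\theta)\}\,\pi(\theta)\,d\theta
  \;=\; (2\pi)^{m/2}\,\bigl|-D^{2}\ell(\hat{\theta})\bigr|^{-1/2}
        \exp\{\ell(\hat{\theta})\}\,\pi(\hat{\theta})\,
        \bigl\{1 + O_p\!\bigl(n^{-1}\bigr)\bigr\}.
```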
7.2 Proof of Proposition 2
We follow and extend the arguments of [66] and [67]. Using (7), we can rewrite the log-likelihood as follows
For the first- and second-order partial derivatives with respect to ξ and ψ, we consider their individual entries. Then,
by the chain rule for partial derivatives. Thus, the first derivative with respect to ψ is an (m − m_0)-dimensional vector.
The mixed second order partial derivatives are
using the chain rule and the product rule for partial derivatives.
For the expectation of the last part of the sum in (13) it holds that
as Φ (ξ, ψ) is constant with respect to X. Further,
where we could exchange the order of differentiation and integration since for each
where we denote by I* the expected Fisher information matrix for
We are interested in whether
where Id_{m_0} is the m_0 × m_0 identity matrix. This implies
Similarly, by taking the partial derivative with respect to ψ of (7)
so that (14) evaluated at ψ = ψ_0 yields
Next we note that
and this implies
With (12) it follows that
using the chain rule and the product rule for partial derivatives. With (15) and (17) we obtain
where e_k and e_j denote the unit vectors with a one at entry k and j, respectively, and zeros elsewhere. It follows that
Solving (16) for
Thus, θ and ψ are null orthogonal. □
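In summary, null orthogonality here means that the expected Fisher information is block diagonal at the null value (a standard formulation following [66]; the partition into θ- and ψ-blocks is assumed to match the paper’s notation):

```latex
% Null orthogonality of theta and psi: the cross block of the expected Fisher
% information of a single observation vanishes at psi = psi_0.
I_{\theta\psi}(\theta,\psi_{0}) \;=\; 0
\quad\Longleftrightarrow\quad
I(\theta,\psi_{0}) \;=\;
\begin{pmatrix}
  I_{\theta\theta}(\theta,\psi_{0}) & 0\\[2pt]
  0 & I_{\psi\psi}(\theta,\psi_{0})
\end{pmatrix}.
```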
7.3 Proof of Proposition 3
Substituting (5) into Proposition 1 yields
For the remaining approximation we separately consider the components of (18). First, we show that
For the first part, we show that
We expand each component of
With Definition S1 (Supplementary Material) and Assumption 2.4 we note that
We divide (19) by n and with
where we use Assumption 2.4 in the last step. Recall that D^2
with
and it follows d_kj = O_p(n). We apply this result to the entries
Thus, we obtain
and it follows that
Next, we show that
With
We note that with the additive properties of O_p it follows that
Thus, with Proposition S1 we obtain
Using
and (22), we approximate the diagonal elements of
We obtain for any entry
We note that the indices k and j only index the part of
I(θ, ψ) is the expected Fisher information matrix for a single observation X and thus
where we assume finite second moments for the distribution of data X. We obtain with I(θ, ψ) = O_p(1)
and it follows
Analogously to above, we can show
and
which yields
and thus
With Assumption 2.3 and (24) we obtain
Under Assumption 2.1 and with the determinant product rule it follows
With the determinant product rule, (25), and (26) this yields
We obtain with Hadamard’s inequality and the multiplicative and additive properties of O_p [69]
Then
and analogously
Substituting this into (27) yields with Proposition S1
and using
It is left to show that
Thus, with Assumption 2.2 it follows
Substituting (29) and (30) into (18) yields, with the additive and multiplicative properties of O_p, the following approximation for the BF
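One ingredient of the preceding bound is worth recording explicitly; Hadamard’s inequality [69] in the generic form used here (matrix dimensions are assumptions) reads:

```latex
% Hadamard's inequality: for an m x m matrix A with columns a_1, ..., a_m,
\lvert \det A \rvert \;\le\; \prod_{j=1}^{m} \lVert a_{j} \rVert_{2},
% so if every entry of A is O_p(n) (with m fixed), then det A = O_p(n^{m}).
```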
7.4 Proof of Corollary 1
We take the logarithm on both sides of the result from Proposition 3 and obtain
where we use
Using (23) we obtain
With
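For orientation, the standard Schwarz-criterion form of the resulting logarithmic approximation, in the spirit of [8] and [24] (the exact remainder order stated in Corollary 1 is assumed rather than reproduced), is:

```latex
% Schwarz criterion for the full model against the nested model (psi = psi_0);
% the direction of the BF and the notation are assumed to match Section 7.1.
S \;=\; \ell\bigl(\hat{\theta},\hat{\psi}\bigr) \;-\; \ell_{0}\bigl(\hat{\theta}_{0}\bigr)
        \;-\; \tfrac{1}{2}\,(m - m_{0})\log n,
\qquad
\log \mathrm{BF} \;=\; S \;+\; O_p(1).
```

Under a unit-information prior of the type considered in Theorem 1, the error of this approximation is known to improve near the null in the i.i.d. and normal linear settings [8].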
7.5 Proof of Theorem 1
The multivariate normal density of ψ|θ at
where we use (23) and
since
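The prior in Theorem 1 is of unit-information type; a generic sketch of such a prior (the precise centring, conditioning on θ, and information matrix in the paper’s definition are assumptions here, following [8]) is:

```latex
% Unit-information prior for psi given theta: normal, centred at the null
% value psi_0, with covariance equal to the inverse expected Fisher
% information contained in a single observation (hence "unit information").
\psi \mid \theta \;\sim\; \mathcal{N}\!\bigl(\psi_{0},\; I_{\psi\psi}(\theta,\psi_{0})^{-1}\bigr).
```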
7.6 Proof of Proposition 3 for the multivariate logistic regression model
As stated in the main text, it remains to prove the statements given in the following lemma.
Lemma 1
Under the assumptions of Proposition 3 applied to the multivariate logistic regression model, we have
Proof
The first two expressions follow from the assumptions of Laplace regularity for ℓ and ℓ_0. This implies that any partial derivative up to sixth order of
Using a Taylor expansion for
Thus,
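The Taylor step can be recorded generically (the particular function expanded and the point of expansion in Lemma 1 are assumptions):

```latex
% Second-order Taylor expansion with Lagrange remainder: for a twice
% continuously differentiable g and some point on the segment between
% theta and theta_0,
g(\theta) \;=\; g(\theta_{0}) \;+\; Dg(\theta_{0})(\theta-\theta_{0})
  \;+\; \tfrac{1}{2}\,(\theta-\theta_{0})^{\top} D^{2}g(\tilde{\theta})\,(\theta-\theta_{0}).
```

Bounds of this type, combined with the Laplace-regularity control of higher-order derivatives, are what drive the O_p statements above.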
References
1. Kass, RE, Vaidyanathan, SK. Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. J R Stat Soc B 1992;54:129–44, https://doi.org/10.1111/j.2517-6161.1992.tb01868.x.
2. Pauler, DK. The Schwarz criterion and related methods for normal linear models. Biometrika 1998;85:13–27, https://doi.org/10.1093/biomet/85.1.13.
3. Pauler, DK, Wakefield, JC, Kass, RE. Bayes factors and approximations for variance component models. J Am Stat Assoc 1999;94:1242–53, https://doi.org/10.1080/01621459.1999.10473877.
4. Raftery, AE. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 1996;83:251–66, https://doi.org/10.1093/biomet/83.2.251.
5. Volinsky, CT, Raftery, AE. Bayesian information criterion for censored survival models. Biometrics 2000;56:256–62, https://doi.org/10.1111/j.0006-341x.2000.00256.x.
6. Venables, WN, Ripley, BD. Modern applied statistics with S, 4th ed. New York, NY: Springer; 2010.
7. Kass, RE, Raftery, AE. Bayes factors. J Am Stat Assoc 1995;90:773–95, https://doi.org/10.1080/01621459.1995.10476572.
8. Kass, RE, Wasserman, L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J Am Stat Assoc 1995;90:928–34, https://doi.org/10.1080/01621459.1995.10476592.
9. Raftery, AE. Bayesian model selection in social research. Socio Methodol 1995;25:111–63, https://doi.org/10.2307/271063.
10. Cavanaugh, J, Neath, A. Generalizing the derivation of the Schwarz information criterion. Commun Stat Theor Methods 1999;28:49–66, https://doi.org/10.1080/03610929908832282.
11. Amin, A. Pitfalls of diagnosis of extraprostatic extension in prostate adenocarcinoma. Ann Clin Pathol 2016;4:1086.
12. Fischer, S, Lin, D, Simon, RM, Howard, LE, Aronson, WJ, Terris, MK, et al. Do all men with pathological Gleason score 8–10 prostate cancer have poor outcomes? Results from the SEARCH database. BJU Int 2016;118:250–7, https://doi.org/10.1111/bju.13319.
13. Datta, K, Muders, M, Zhang, H, Tindall, DJ. Mechanism of lymph node metastasis in prostate cancer. Future Oncol 2010;6:823–36, https://doi.org/10.2217/fon.10.33.
14. Mydlo, JH, Godec, CJ, editors. Prostate cancer: science and clinical practice, 2nd ed. London: Elsevier; 2016.
15. Epstein, JI, Feng, Z, Trock, BJ, Pierorazio, PM. Upgrading and downgrading of prostate cancer from biopsy to radical prostatectomy: incidence and predictive factors using the modified Gleason grading system and factoring in tertiary grades. Eur Urol 2012;61:1019–24, https://doi.org/10.1016/j.eururo.2012.01.050.
16. Selig, K. Bayesian information criterion approximations for model selection in multivariate logistic regression with application to electronic medical records, Dissertation. München: Technische Universität München; 2020.
17. D’Amico, AV, Chen, M-H, Roehl, KA, Catalona, WJ. Preoperative PSA velocity and the risk of death from prostate cancer after radical prostatectomy. N Engl J Med 2004;351:125–35, https://doi.org/10.1056/NEJMoa032975.
18. O’Brien, MF, Cronin, AM, Fearn, PA, Smith, B, Stasi, J, Guillonneau, B, et al. Pretreatment prostate-specific antigen (PSA) velocity and doubling time are associated with outcome but neither improves prediction of outcome beyond pretreatment PSA alone in patients treated with radical prostatectomy. J Clin Oncol 2009;27:3591–7, https://doi.org/10.1200/jco.2008.19.9794.
19. Collett, D. Modelling binary data, 2nd ed. Boca Raton, FL: Chapman and Hall/CRC; 2003. Available from: http://www.loc.gov/catdir/enhancements/fy0646/2002073648-d.html.
20. McCullagh, P, Nelder, JA. Generalized linear models, monographs on statistics and applied probability, 2nd ed. London: Chapman & Hall; 1999.
21. Kass, RE, Tierney, L, Kadane, JB. The validity of posterior expansions based on Laplace’s method. In: Geisser, S, Hodges, JS, Press, SJ, Zellner, A, editors. Essays in honor of George A. Barnard. Amsterdam: North-Holland; 1990. pp. 473–88.
22. Zehna, PW. Invariance of maximum likelihood estimators. Ann Math Stat 1966;37:744, https://doi.org/10.1214/aoms/1177699475.
23. Wasserman, L. All of statistics: a concise course in statistical inference, 2nd ed. New York, NY: Springer; 2005, https://doi.org/10.1007/978-0-387-21736-9.
24. Schwarz, G. Estimating the dimension of a model. Ann Stat 1978;6:461–4, https://doi.org/10.1214/aos/1176344136.
25. Kass, RE, Wasserman, L. The selection of prior distributions by formal rules. J Am Stat Assoc 1996;91:1343–70, https://doi.org/10.1080/01621459.1996.10477003.
26. Raftery, AE. Bayes factors and BIC. Socio Methods Res 1999;27:411–27, https://doi.org/10.1177/0049124199027003005.
27. Jeffreys, H. Theory of probability, 3rd ed. Oxford: Clarendon Press; 1998, https://doi.org/10.1093/oso/9780198503682.001.0001.
28. Neath, AA, Cavanaugh, JE. The Bayesian information criterion: background, derivation, and applications. WIREs Comput Stat 2012;4:199–203, https://doi.org/10.1002/wics.199.
29. R Core Team. R: a language and environment for statistical computing; 2019. Available from: https://www.R-project.org/.
30. Albert, A, Anderson, JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984;71:1–10, https://doi.org/10.1093/biomet/71.1.1.
31. Santner, TJ, Duffy, DE. A note on A. Albert and J. A. Anderson’s conditions for the existence of maximum likelihood estimates in logistic regression models. Biometrika 1986;73:755–8, https://doi.org/10.1093/biomet/73.3.755.
32. O’Brien, SM, Dunson, DB. Bayesian multivariate logistic regression. Biometrics 2004;60:739–46, https://doi.org/10.1111/j.0006-341X.2004.00224.x.
33. Albert, JH, Chib, S. Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc 1993;88:669–79, https://doi.org/10.1080/01621459.1993.10476321.
34. Nishimoto, K, Nakashima, J, Hashiguchi, A, Kikuchi, E, Miyajima, A, Nakagawa, K, et al. Prediction of extraprostatic extension by prostate specific antigen velocity, endorectal MRI, and biopsy Gleason score in clinically localized prostate cancer. Int J Urol 2008;15:520–3, https://doi.org/10.1111/j.1442-2042.2008.02042.x.
35. Chen, M-H, Ibrahim, JG, Yiannoutsos, C. Prior elicitation, variable selection and Bayesian computation for logistic regression models. J Roy Stat Soc B 1999;61:223–42, https://doi.org/10.1111/1467-9868.00173.
36. Elfadaly, FG, Garthwaite, PH. On quantifying expert opinion about multinomial models that contain covariates. J R Stat Soc 2020;20:845, https://doi.org/10.1111/rssa.12546.
37. Strobl, AN, Vickers, AJ, van Calster, B, Steyerberg, E, Leach, RJ, Thompson, IM, et al. Improving patient prostate cancer risk assessment: moving from static, globally-applied to dynamic, practice-specific risk calculators. J Biomed Inf 2015;56:87–93, https://doi.org/10.1016/j.jbi.2015.05.001.
38. Barber, RF, Drton, M. High-dimensional Ising model selection with Bayesian information criteria. Electron J Stat 2015;9:567–607, https://doi.org/10.1214/15-ejs1012.
39. Chen, J, Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 2008;95:759–71, https://doi.org/10.1093/biomet/asn034.
40. Chen, J, Chen, Z. Extended BIC for small-n-large-p sparse GLM. Stat Sin 2012;22, https://doi.org/10.5705/ss.2010.216.
41. Drton, M, Plummer, M. A Bayesian information criterion for singular models. J R Stat Soc B 2017;79:323–80, https://doi.org/10.1111/rssb.12187.
42. Foygel, R, Drton, M. Extended Bayesian information criteria for Gaussian graphical models. In: Lafferty, JD, Williams, CKI, Shawe-Taylor, J, Zemel, RS, Culotta, A, editors. Advances in neural information processing systems. Curran Associates, Inc.; 2010, vol. 23. pp. 604–12.
43. Jones, RH. Bayesian information criterion for longitudinal and clustered data. Stat Med 2011;30:3050–6, https://doi.org/10.1002/sim.4323.
44. Kawano, S. Selection of tuning parameters in bridge regression models via Bayesian information criterion. Stat Pap 2014;55:1207–23, https://doi.org/10.1007/s00362-013-0561-7.
45. Konishi, S, Ando, T, Imoto, S. Bayesian information criteria and smoothing parameter selection in radial basis function networks. Biometrika 2004;91:27–43, https://doi.org/10.1093/biomet/91.1.27.
46. Lee, ER, Noh, H, Park, BU. Model selection via Bayesian information criterion for quantile regression models. J Am Stat Assoc 2014;109:216–29, https://doi.org/10.1080/01621459.2013.836975.
47. Luo, S, Xu, J, Chen, Z. Extended Bayesian information criterion in the Cox model with a high-dimensional feature space. Ann Inst Stat Math 2015;67:287–311, https://doi.org/10.1007/s10463-014-0448-y.
48. Mehrjou, A, Hosseini, R, Nadjar Araabi, B. Improved Bayesian information criterion for mixture model selection. Pattern Recogn Lett 2016;69:22–7, https://doi.org/10.1016/j.patrec.2015.10.004.
49. Watanabe, S. A widely applicable Bayesian information criterion. J Mach Learn Res 2013;14:867–97.
50. Żak-Szatkowska, M, Bogdan, M. Modified versions of the Bayesian information criterion for sparse generalized linear models. Comput Stat Data Anal 2011;55:2908–24, https://doi.org/10.1016/j.csda.2011.04.016.
51. Ashford, JR, Sowden, RR. Multi-variate probit analysis. Biometrics 1970;26:535, https://doi.org/10.2307/2529107.
52. Bahadur, RR. A representation of the joint distribution of responses to n dichotomous items. In: Solomon, H, editor. Studies in item analysis and prediction. Stanford, California: Stanford University Press; 1961. pp. 158–68.
53. Bel, K, Fok, D, Paap, R. Parameter estimation in multivariate logit models with many binary choices. Econ Rev 2016;37:534–50, https://doi.org/10.1080/07474938.2015.1093780.
54. Bergsma, WP. Marginal models for categorical data, Dissertation. Tilburg: Tilburg University; 1997.
55. Bergsma, WP, Rudas, T. Marginal models for categorical data. Ann Stat 2002;30:140–59, https://doi.org/10.1214/aos/1015362188.
56. Bonney, GE. Logistic regression for dependent binary observations. Biometrics 1987;43:951–73, https://doi.org/10.2307/2531548.
57. Chib, S, Greenberg, E. Analysis of multivariate probit models. Biometrika 1998;85:347–61, https://doi.org/10.1093/biomet/85.2.347.
58. Cox, DR. The analysis of multivariate binary data. J R Stat Soc: Ser C (Appl Stat) 1972;21:113–20, https://doi.org/10.2307/2346482.
59. Dai, B. Multivariate Bernoulli distribution models, Dissertation. Madison, Wisconsin: University of Wisconsin; 2012.
60. Dai, B, Ding, S, Wahba, G. Multivariate Bernoulli distribution. Bernoulli 2013;19:1465–83, https://doi.org/10.3150/12-bejsp10.
61. Ekholm, A, Smith, PWF, McDonald, JW. Marginal regression analysis of a multivariate binary response. Biometrika 1995;82:847–54, https://doi.org/10.1093/biomet/82.4.847.
62. Fitzmaurice, GM, Laird, NM, Rotnitzky, AG. Regression models for discrete longitudinal responses. Stat Sci 1993;8:284–99, https://doi.org/10.1214/ss/1177010899.
63. Glonek, G, McCullagh, P. Multivariate logistic models. J R Stat Soc B 1995;57:533–46, https://doi.org/10.1111/j.2517-6161.1995.tb02046.x.
64. Joe, H, Liu, Y. A model for a multivariate binary response with covariates based on compatible conditionally specified logistic regressions. Stat Prob Lett 1996;31:113–20, https://doi.org/10.1016/s0167-7152(96)00021-1.
65. Russell, GJ, Petersen, A. Analysis of cross category dependence in market basket selection. J Retail 2000;76:367–92, https://doi.org/10.1016/s0022-4359(00)00030-0.
66. Cox, DR, Reid, N. Parameter orthogonality and approximate conditional inference. J R Stat Soc B 1987;49:1–39, https://doi.org/10.1111/j.2517-6161.1987.tb01422.x.
67. Huzurbazar, VS, Jeffreys, H. Probability distributions and orthogonal parameters. Math Proc Camb Philos Soc 1950;46:281–4, https://doi.org/10.1017/s0305004100025743.
68. Königsberger, K. Analysis 2, 4th ed. Berlin and Heidelberg: Springer; 2002, https://doi.org/10.1007/978-3-662-05699-8.
69. Horn, RA, Johnson, CR. Matrix analysis, 2nd ed. New York, NY: Cambridge University Press; 2012, https://doi.org/10.1017/CBO9781139020411.
Supplementary Material
The online version of this article offers supplementary material (https://doi.org/10.1515/ijb-2020-0045).
© 2020 Walter de Gruyter GmbH, Berlin/Boston