Abstract
We propose a full Bayesian subset selection method for reproductive dispersion linear models, which is based on expanding the usual link function to a function that incorporates all possible subsets of predictors by adding indicators as parameters. The vector of indicator variables dictates which predictors to delete. An efficient MCMC procedure combining the Gibbs sampler and the Metropolis-Hastings algorithm is presented to approximate the posterior distribution of the indicator variables. Promising subsets of predictors can then be identified as those with higher posterior probability. Several numerical examples are used to illustrate the newly developed methodology.
1 Introduction
The reproductive dispersion distribution family proposed by Jorgensen[1], a general distribution family that includes the normal distribution, the Gumbel distribution and the exponential family of distributions as special cases, has received a lot of attention in recent years. For example, Tang et al.[2] proposed a nonlinear reproductive dispersion model and studied its influence diagnostics; Chen et al.[3] discussed influence diagnostics of semiparametric nonlinear reproductive dispersion models; Fu et al.[4] investigated local influence analysis for nonlinear structural equation models with non-ignorable missing outcomes from a reproductive dispersion model. In particular, Chen and Tang[5] developed a Bayesian approach to analyzing semiparametric reproductive dispersion mixed-effects models on the basis of P-spline estimates of the nonparametric components, using the Gibbs sampler[6] and the Metropolis-Hastings algorithm[7–8]. Recently, Tang and Zhao[9] presented a semiparametric Bayesian analysis of nonlinear reproductive dispersion mixed models for longitudinal data.
Various Bayesian methods have been proposed for selecting suitable subset models; classical methods include the Bayesian information criterion[10], the Akaike information criterion[11], the Bayes factor[12] and the pseudo-Bayes factor[13], among others. In particular, Mitchell and Beauchamp[14] presented a Bayesian variable selection method that assumes the prior distribution of each regression coefficient is a mixture of a point mass at 0 and a diffuse uniform distribution elsewhere. Furthermore, George and McCulloch[15] developed a variable selection procedure via Gibbs sampling. Reversible jump MCMC[16] was proposed for investigating model uncertainty. Nott and Leonte[17] proposed sampling schemes for Bayesian variable selection in generalized linear models based on the Swendsen-Wang algorithm for the Ising model[18]. Hans[19] addressed model uncertainty for Bayesian lasso regression by computing marginal posterior probabilities over a small model space. Recently, Lykou and Ntzoufras[20] studied how to select variables using the Bayesian lasso. However, to our knowledge, little work has been done on Bayesian subset selection for the reproductive dispersion distribution family. Motivated by the idea of Kuo and Mallick[21], the main purpose of this paper is to develop a novel Bayesian approach to selecting promising subsets under reproductive dispersion linear models by using MCMC technology.
This article is structured as follows. A reproductive dispersion linear model which includes model uncertainty is defined in Section 2. In Section 3, we explicitly describe an MCMC procedure that can identify the promising subsets of predictors under reproductive dispersion linear models. Several numerical examples are used to illustrate the proposed methodology in Section 4. A discussion is given in the final section. Technical details are given in the Appendix.
2 Model and notation
Suppose yi (i = 1, 2, ⋯, n) is the observation for the ith individual. We assume that y1, y2, ⋯, yn are independently distributed with the following probability density function:

$$p(y_i; \mu_i, \sigma^2) = a(y_i; \sigma^2)\exp\left\{-\frac{1}{2\sigma^2}\,d(y_i; \mu_i)\right\}, \qquad (2.1)$$
where μi is the location parameter and may represent the mean of yi, σ2 ≜ τ−1 ∈ 𝔹(𝔹⊂ℝ+) is referred to as the dispersion parameter, which is known or can be estimated separately; a(⋅;⋅) is some suitable known function; d(y; μ) is a unit deviance on ℂ×Ω (Ω ⊆ ℂ ⊆ ℝ are open intervals) satisfying d(y; y) = 0, ∀y ∈ ℂ, and d(y; μ) > 0, ∀ y ≠ μ, and is twice continuously differentiable with respect to (y, μ) on ℂ × Ω. The model (2.1) is referred to as a reproductive dispersion distribution family[1] and includes the normal, double exponential, Gumbel and exponential family distributions as special cases. Motivated by the idea of Kuo and Mallick[21], we explore a simple method of subset selection. Instead of building a hierarchical model, we embed indicator variables in the predictors so as to incorporate all 2^p submodels. Specifically, let γj (j = 1, 2, ⋯, p) be an indicator variable supported at the two points 1 and 0. When γj = 1 we include the jth predictor in the model; when γj = 0, we omit it. Therefore we assume that the systematic part of the model is specified by

$$g(\mu_i) = \sum_{j=1}^{p}\gamma_j\beta_j x_{ij}, \qquad (2.2)$$
where g(⋅) is a known continuously differentiable link function. For simplicity, we only consider the case where g(⋅) is the canonical link in this article, i.e., g(μi) = μi; then (2.2) reduces to

$$\mu_i = \sum_{j=1}^{p}\gamma_j\beta_j x_{ij}. \qquad (2.3)$$
The model defined by equations (2.1) and (2.3) is referred to as a reproductive dispersion linear model which includes model uncertainty.
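For concreteness, the following minimal sketch (Python/NumPy; the function and variable names are ours, not from the paper) shows how the systematic part (2.3) selects a submodel through the indicator vector:

```python
import numpy as np

def linear_predictor(X, beta, gamma):
    """Location parameters mu_i = sum_j gamma_j * beta_j * x_ij under the
    canonical (identity) link; gamma is the 0/1 inclusion vector."""
    return X @ (gamma * beta)

# Toy check: with gamma = (0, 0, 0, 1, 1), only the last two predictors enter.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))
beta = np.array([0.3, -0.7, 0.5, 1.0, 1.2])
gamma = np.array([0, 0, 0, 1, 1])
print(linear_predictor(X, beta, gamma))  # equals X[:, 3] * 1.0 + X[:, 4] * 1.2
```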
Let β = (β1, β2, ⋯, βp)T, γ = (γ1, γ2, ⋯, γp)T, θ = (βT, τ)T and xi =(xi1, xi2, ⋯, xip)T. We assume that the prior of θ is independent of γ and that the γj are mutually independent for j = 1, 2, ⋯, p. Under the above assumptions, the prior density p(θ, γ) of (θ, γ) can be expressed as

$$p(\theta, \gamma) = p(\theta)\,p(\gamma) = p(\beta)\,p(\tau)\prod_{j=1}^{p}p(\gamma_j).$$
Based on the recommendations given by Chen and Tang[5] and Kuo and Mallick[21], we consider the following prior distributions for the parameters τ, β and γj (j = 1, 2, ⋯, p):

$$\tau \sim \Gamma(a_1, a_2), \qquad \beta \sim N(\beta_0, H_0), \qquad \gamma_j \sim B(1, p_j), \qquad (2.4)$$
where a1, a2, β0, H0 and pj are hyperparameters whose values are assumed to be given, Γ(a, b) denotes the gamma distribution with parameters a and b, N(β, Σ) denotes the multivariate normal distribution with mean β and covariance Σ, and B(1, q) is the Bernoulli distribution with parameter q. In this paper, we set a1 = a2 = 0.01, β0 = 0, H0 = 100Ip and pj = 0.5 in all numerical examples, which represent noninformative priors, where Ip is the p × p identity matrix.
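Under these defaults, a single joint draw from the prior (2.4) can be sketched as follows (a minimal NumPy sketch; we assume Γ(a1, a2) is parameterized by shape a1 and rate a2, so NumPy's scale argument is 1/a2):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
a1, a2 = 0.01, 0.01                         # Gamma(shape a1, rate a2) prior for tau
beta0, H0 = np.zeros(p), 100.0 * np.eye(p)  # N(beta0, H0 = 100 I_p) prior for beta
pj = 0.5                                    # Bernoulli(pj) prior for each gamma_j

# One joint draw from p(theta, gamma) = p(tau) p(beta) prod_j p(gamma_j);
# NumPy's gamma uses a scale parameter, so scale = 1/a2 under the rate convention.
tau = rng.gamma(shape=a1, scale=1.0 / a2)
beta = rng.multivariate_normal(beta0, H0)
gamma = rng.binomial(1, pj, size=p)
```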
3 Bayesian subset selection
In this section, we propose and develop an MCMC procedure that can identify the promising subsets of predictors under reproductive dispersion linear models. Denote y = {y1, y2, ⋯, yn} and x = {x1, x2, ⋯, xn}. Bayesian inference on θ and γ is based on the posterior distribution p(θ, γ|y, x) ∝ p(y, x|θ, γ)p(θ, γ), where p(y, x|θ, γ) is the likelihood function of the observed data {y, x}. This posterior density is analytically intractable, so it is rather difficult and tedious to work with it directly. Inspired by recent developments in statistical computing, the Gibbs sampler[6] is used to generate a sequence of random observations from the joint posterior distribution p(θ, γ|y, x), which is proportional to

$$\prod_{i=1}^{n}a(y_i; \tau)\exp\left\{-\frac{\tau}{2}\,d(y_i; \mu_i)\right\}\,p(\tau)\,p(\beta)\prod_{j=1}^{p}p(\gamma_j), \qquad (3.1)$$

where μi = ∑_{j=1}^{p} γjβjxij.
In this algorithm, observations {θ, γ} are iteratively simulated from the following conditional distributions: p(θ|y, x, γ) and p(γ|y, x, θ). Then we identify the promising subsets of predictors as those with high posterior probabilities of γ.
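The overall scheme can be sketched as the following two-block Gibbs loop (illustrative Python; draw_theta and draw_gamma are hypothetical placeholders for the conditional updates detailed in Sections 3.1 and 3.2):

```python
def gibbs(y, X, n_iter, draw_theta, draw_gamma, theta0, gamma0):
    """Two-block Gibbs sampler skeleton; draw_theta and draw_gamma are the
    conditional updates from p(theta | y, x, gamma) and p(gamma | y, x, theta)."""
    theta, gamma = theta0, gamma0
    gamma_draws = []
    for _ in range(n_iter):
        theta = draw_theta(y, X, gamma, theta)  # update (beta, tau)
        gamma = draw_gamma(y, X, theta, gamma)  # update the indicator vector
        gamma_draws.append(gamma.copy())
    # Tabulating gamma_draws estimates the posterior probability of each submodel.
    return gamma_draws
```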
3.1 Conditional distributions
In this subsection, we briefly present the conditional distributions required in implementing the Gibbs sampler.
For the conditional distribution of γ, we have

$$p(\gamma \mid y, x, \theta) \propto \exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}d(y_i; \mu_i)\right\}\prod_{j=1}^{p}p_j^{\gamma_j}(1-p_j)^{1-\gamma_j}. \qquad (3.2)$$
For the conditional distribution p(τ|y, x, β, γ), it follows from equations (3.1) and (2.4) that

$$p(\tau \mid y, x, \beta, \gamma) \propto a(y, \tau)\,\tau^{a_1-1}\exp\left\{-a_2\tau - \frac{\tau}{2}\sum_{i=1}^{n}d(y_i; \mu_i)\right\}, \qquad (3.3)$$

where a(y, τ) = ∏_{i=1}^{n} a(yi; τ) and μi = ∑_{j=1}^{p} γjβjxij.
For the conditional distribution p(β|y, x, τ, γ), it follows from equations (3.1) and (2.4) that

$$p(\beta \mid y, x, \tau, \gamma) \propto \exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}d(y_i; \mu_i) - \frac{1}{2}(\beta-\beta_0)^{\mathrm T}H_0^{-1}(\beta-\beta_0)\right\}. \qquad (3.5)$$
3.2 Implementation
Now, we describe how to update the parameter γ. We update the variate γj (j = 1, 2, ⋯, p) given y, x, γ−j, β, τ by a Bernoulli distribution

$$p(\gamma_j = 1 \mid y, x, \gamma_{-j}, \beta, \tau) = \frac{c_j}{c_j + d_j}, \qquad (3.4)$$

and

$$c_j = p_j\prod_{i=1}^{n}\exp\left\{-\frac{\tau}{2}\,d\big(y_i; \mu_i(\gamma^{*})\big)\right\}, \qquad d_j = (1-p_j)\prod_{i=1}^{n}\exp\left\{-\frac{\tau}{2}\,d\big(y_i; \mu_i(\gamma^{**})\big)\right\},$$
in which the vector γ* (γ**) is obtained from γ with the jth element replaced by 1 (0). Clearly, sampling observations from the conditional distribution given in (3.4) is straightforward and fast because it is a familiar standard distribution. However, it is rather difficult to sample observations from the conditional distributions given in (3.3) and (3.5) because these conditional distributions are nonstandard and complex. Therefore, the Metropolized independence sampler (MIS) and the Metropolis-Hastings (MH) algorithm are employed to overcome these difficulties. The technical details are given in the Appendix.
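As a sketch of the componentwise update (3.4), assuming the Laplace unit deviance d(y; μ) = 2|y − μ| used for the double exponential examples in Section 4 (names and code organization are ours):

```python
import numpy as np

def update_gamma(y, X, beta, tau, gamma, pj, deviance, rng):
    """One sweep of (3.4): gamma_j ~ Bernoulli(c_j / (c_j + d_j)).
    `deviance(y, mu)` is the unit deviance d(y; mu) of the dispersion model."""
    for j in range(len(gamma)):
        g1, g0 = gamma.copy(), gamma.copy()
        g1[j], g0[j] = 1, 0
        # log c_j and log d_j: work on the log scale for numerical stability.
        log_c = np.log(pj) - 0.5 * tau * deviance(y, X @ (g1 * beta)).sum()
        log_d = np.log(1.0 - pj) - 0.5 * tau * deviance(y, X @ (g0 * beta)).sum()
        prob = 1.0 / (1.0 + np.exp(log_d - log_c))  # = c_j / (c_j + d_j)
        gamma[j] = rng.binomial(1, prob)
    return gamma

# Unit deviance for the double exponential (Laplace) case: d(y; mu) = 2 |y - mu|.
laplace_deviance = lambda y, mu: 2.0 * np.abs(y - mu)
```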
4 Numerical examples
In this section, several numerical examples are used to illustrate the proposed method.
4.1 Simulation studies
In this subsection, simulation results with responses drawn from the double exponential distribution and the Gumbel distribution are used to investigate the performance of the proposed methodology.
We first consider the situation where the response yi (i = 1, 2, ⋯, n) is distributed as the double exponential distribution with probability density function

$$p(y_i; \mu_i, \tau) = \frac{\tau}{2}\exp\{-\tau|y_i - \mu_i|\}, \qquad (4.1)$$
and the location parameter μi is given by

$$\mu_i = \sum_{j=1}^{5}\gamma_j\beta_j x_{ij}. \qquad (4.2)$$
In other words, we consider this variable selection problem with p = 5. It is assumed that xi = (xi1, xi2, xi3, xi4, xi5)T is independently generated from the multivariate standard normal distribution N(0, I5). Furthermore, we assume that the true values are τ = 1.0 and ϑ = (ϑ1, ϑ2, ϑ3, ϑ4, ϑ5)T = (β1γ1, β2γ2, β3γ3, β4γ4, β5γ5)T = (0, 0, 0, 1, 1.2)T, i.e., (4.2) reduces to

$$\mu_i = x_{i4} + 1.2\,x_{i5}; \qquad (4.3)$$
then the model specified by (4.1) and (4.3) is a special case of reproductive dispersion linear models which includes model uncertainty. In all simulation studies, the sample size n equals 100. In the first simulation study, a data set is generated from (4.1) and (4.3), and the proposed method is applied to it (a minimal data-generation sketch is given after Table 1). Based on 15000 Gibbs samples, the frequencies of the four highest frequency models are displayed in Table 1. From Table 1 we can easily see that our analysis identifies the right model as expected. From the Gibbs samples, we can also calculate the posterior means and standard deviations of ϑ. The posterior means of ϑ4 and ϑ5 are 1.1127 and 1.2727, with standard deviations 0.1212 and 0.1215; each standard deviation incorporates our measure of model uncertainty.
Table 1  High frequency models in the first simulation study
| Model variables | Proportion |
|---|---|
| x4 x5 | 88.29% |
| x3 x4 x5 | 4.11% |
| x1 x4 x5 | 3.80% |
| x2 x4 x5 | 3.11% |
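As referenced above, a minimal sketch of how the data for this first simulation could be generated (assuming the Laplace form of (4.1), whose scale parameter is 1/τ in NumPy's convention):

```python
import numpy as np

rng = np.random.default_rng(2014)
n, p = 100, 5
X = rng.standard_normal((n, p))                   # x_i ~ N(0, I_5)
theta_true = np.array([0.0, 0.0, 0.0, 1.0, 1.2])  # true (gamma_j * beta_j) values
tau = 1.0
mu = X @ theta_true                               # equation (4.3)
# Double exponential responses: density (tau/2) exp(-tau |y - mu|),
# i.e., a Laplace distribution with scale b = 1/tau.
y = rng.laplace(loc=mu, scale=1.0 / tau)
```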
The second simulation study is identical to the first except that xi3 (i = 1, 2, ⋯, n) is replaced by a covariate strongly correlated with xi5. The corresponding high frequency models are reported in Table 2.
Table 2  High frequency models in the second simulation study
| Model variables | Proportion |
|---|---|
| x4 x5 | 33.43% |
| x3 x4 | 57.54% |
| x3 x4 x5 | 4.65% |
| x1 x4 x5 | 3.22% |
To demonstrate the practical potential of our procedure for data sets with a relatively large number of covariates, we conduct the following third simulation study. The responses come from the same double exponential distribution as in the first simulation study, but the location parameter μi now involves p = 15 candidate covariates, of which only a subset enters the true model.
The covariates xj (j = 1, 2, ⋯, 15) are generated accordingly, and the relative frequencies of falsely selected variables are summarized in Table 3.
Table 3  High frequency models for false choices in the third simulation study
| Relative frequency | False choice |
|---|---|
| 82.45% | None |
| 6.87% | x1 |
| 4.12% | x5 |
| 2.09% | x3 |
To investigate the performance under various response distributions, we consider another situation and conduct the following simulation studies. In the fourth simulation study, we assume that the response yi (i = 1, 2, ⋯, n) is distributed as the Gumbel distribution, which has the following probability density function

$$p(y_i; \mu_i, \tau) = \tau\exp\left\{-\tau(y_i-\mu_i) - e^{-\tau(y_i-\mu_i)}\right\} \qquad (4.5)$$
and belongs to the reproductive dispersion family[1]. The fourth simulation study uses the same settings as the first simulation study. The corresponding results are reported in Table 4. Furthermore, we conduct the fifth (sixth) simulation study, which is identical to the second (third) simulation study except that the data set is generated from equation (4.5). Summaries of these results are presented in Tables 5 and 6 (a Gumbel data-generation sketch is given after Table 6). All these results lead to the same conclusions as in the first situation, which indicates that the choice of response distribution has little effect on our procedure.
Table 4  High frequency models in the fourth simulation study
| Model variables | Proportion |
|---|---|
| x4 x5 | 96.45% |
| x3 x4 x5 | 1.13% |
| x1 x4 x5 | 1.08% |
| x2 x4 x5 | 0.88% |
Table 5  High frequency models in the fifth simulation study
| Model variables | Proportion |
|---|---|
| x4 x5 | 41.12% |
| x3 x4 | 26.35% |
| x2 x3 x4 | 10.13% |
| x2 x4 x5 | 9.75% |
Table 6  High frequency models for false choices in the sixth simulation study
| Relative frequency | False choice |
|---|---|
| 80.05% | None |
| 8.40% | x1 |
| 5.56% | x5 |
| 4.15% | x4 |
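As referenced above, the Gumbel-response data behind Tables 4-6 can be generated analogously; the sketch below assumes the standard maximum-Gumbel parameterization of (4.5), with location μi and scale 1/τ:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 5
X = rng.standard_normal((n, p))
theta_true = np.array([0.0, 0.0, 0.0, 1.0, 1.2])
tau = 1.0
# Gumbel responses with location mu_i and scale 1/tau; the maximum-Gumbel
# parameterization is our assumption about the form of (4.5).
y = rng.gumbel(loc=X @ theta_true, scale=1.0 / tau)
```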
4.2 Real examples
In this subsection, two real data examples are used to illustrate the proposed methodology.
4.2.1 Hald data
Our first real data example is the familiar Hald data, which was reported by Draper and Smith[22] and has been used by various authors to illustrate variable selection procedures. The data consist of n = 13 responses and p = 4 independent variables x1, x2, x3, x4. Therefore, 2^4 = 16 possible subset models are under consideration. Equation (4.1) and

$$\mu_i = \sum_{j=1}^{4}\gamma_j\beta_j x_{ij}$$
are applied to fit the data. 10000 Gibbs samples are simulated, and the highest frequency models are tabulated in Table 7. From Table 7 we can easily see that the results capture the μi = f(xi1, xi2, xi4) model, which is quite similar to the forward selection results of Draper and Smith[22].
Table 7  High frequency models in the Hald data analysis
| Model variables | Proportion |
|---|---|
| x1 x2 x4 | 90.35% |
| x2 x3 x4 | 4.99% |
| x1 x2 x3 x4 | 4.22% |
| x2 x4 | 0.44% |
4.2.2 Fitness data
The data for our second real example are the fitness data studied in the SAS User’s Guide: Statistics[23]. The data consist of n = 31 responses and p = 6 independent variables. Therefore, 2^6 = 64 possible subset models are under consideration. We employ equation (4.1) and

$$\mu_i = \sum_{j=1}^{6}\gamma_j\beta_j x_{ij}$$
to analyze the data, in which xi1, ⋯, xi6 denote age (years), weight (kg), time to run 1.5 miles in minutes (runtime), heart rate while resting (restpulse), heart rate while running (runpulse) and maximum heart rate recorded while running (maxpulse), respectively. 15000 Gibbs samples are generated, and the highest frequency models are tabulated in Table 8. From Table 8 we can easily conclude that the results capture the μi = f(xi1, xi3, xi5, xi6) model, which agrees with SAS on the variables to be deleted.
Table 8  High frequency models in the fitness data analysis
| Model variables | Proportion |
|---|---|
| age runtime runpulse maxpulse | 58.56% |
| runtime runpulse maxpulse | 37.19% |
| age runtime restpulse runpulse maxpulse | 2.91% |
| age runpulse maxpulse | 0.38% |
5 Conclusion and discussion
In this article, we extend the Bayesian variable selection method proposed in [21] to reproductive dispersion linear models which include model uncertainty. A full Bayesian approach is developed to investigate model uncertainty by using the Gibbs sampler and the MH algorithm. Several numerical examples are used to show the efficiency of the proposed Bayesian approach. The results show that the developed Bayesian procedure is highly efficient and computationally fast. Other Bayesian variable selection approaches for the reproductive dispersion distribution family will be of great research interest; the authors intend to investigate them in the near future.
Acknowledgements
The authors are grateful to Professor Niansheng Tang for many valuable suggestions during the preparation of the manuscript and to the Editor-in-Chief of the Journal of Systems Science and Information.
Appendix
The Metropolized independence sampler (MIS) is used to simulate observations from the conditional distribution p(τ|y, x, β, γ) given in (3.3). Following the arguments of Liu[24], the MIS is implemented as follows: Given the current state τ(m), a candidate τ is drawn from h(τ) and is accepted with probability

$$\min\left\{1,\ \frac{p(\tau \mid y, x, \beta, \gamma)\,h(\tau^{(m)})}{p(\tau^{(m)} \mid y, x, \beta, \gamma)\,h(\tau)}\right\},$$
where h(τ) is some trial density for τ. As pointed out by Liu[24], the efficiency of the MIS depends heavily on how close the trial density h(τ) is to the target density p(τ|y, x, β, γ).
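A minimal sketch of one MIS step (Python; log_target and the trial-density callables are illustrative names, not from the paper):

```python
import numpy as np

def mis_step(tau_curr, log_target, trial_draw, trial_logpdf, rng):
    """One Metropolized independence sampler step for tau.
    log_target(tau): log p(tau | y, x, beta, gamma) up to an additive constant;
    trial_draw / trial_logpdf: sampler and log-density of the trial h(tau)."""
    tau_prop = trial_draw(rng)
    log_ratio = (log_target(tau_prop) + trial_logpdf(tau_curr)
                 - log_target(tau_curr) - trial_logpdf(tau_prop))
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return tau_prop
    return tau_curr
```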
Similarly, the MH algorithm for simulating observations from the conditional distribution p(β|y, x, τ, γ) given in (3.5) is implemented as follows: Given the current value β(m), a new candidate β is generated from the symmetric proposal distribution N(β(m), σβ²Ωβ) and is accepted with probability

$$\min\left\{1,\ \frac{p(\beta \mid y, x, \tau, \gamma)}{p(\beta^{(m)} \mid y, x, \tau, \gamma)}\right\},$$

where Ωβ is a preassigned positive definite matrix and σβ² is chosen such that the average acceptance rate lies in a sensible range (roughly 0.25 to 0.45).
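A corresponding sketch of one MH step for β under a symmetric random-walk proposal as above (sigma2, Omega and log_post are illustrative names; with a symmetric proposal, the proposal density cancels in the acceptance ratio):

```python
import numpy as np

def mh_step_beta(beta_curr, log_post, sigma2, Omega, rng):
    """One random-walk MH step for beta with symmetric proposal
    N(beta_curr, sigma2 * Omega).
    log_post(beta): log p(beta | y, x, tau, gamma) up to an additive constant."""
    beta_prop = rng.multivariate_normal(beta_curr, sigma2 * Omega)
    log_ratio = log_post(beta_prop) - log_post(beta_curr)
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return beta_prop
    return beta_curr
```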
References
[1] Jorgensen B. The theory of dispersion models. London: Chapman and Hall, 1997.
[2] Tang N S, Wei B C, Wang X R. Influence diagnostics in nonlinear reproductive dispersion models. Statistics & Probability Letters, 2000, 46: 59–68.
[3] Chen X D, Tang N S, Wang X R. Influence diagnostics of semiparametric nonlinear reproductive dispersion models. Communications in Statistics — Theory and Methods, 2010, 39: 3021–3040.
[4] Fu Y Z, Tang N S, Chen X. Local influence analysis of non-linear structural equation models with non-ignorable missing outcomes from reproductive dispersion models. Computational Statistics and Data Analysis, 2009, 53: 3671–3684.
[5] Chen X D, Tang N S. Bayesian analysis of semiparametric reproductive dispersion mixed-effects models. Computational Statistics and Data Analysis, 2010, 54: 2145–2158.
[6] Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984, 6: 721–741.
[7] Metropolis N, Rosenbluth A W, Rosenbluth M N, et al. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 1953, 21: 1087–1091.
[8] Hastings W K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 1970, 57: 97–109.
[9] Tang N S, Zhao Y Y. Semiparametric Bayesian analysis of nonlinear reproductive dispersion mixed models for longitudinal data. Journal of Multivariate Analysis, 2013, 115: 68–83.
[10] Schwarz G. Estimating the dimension of a model. The Annals of Statistics, 1978, 6: 461–464.
[11] Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974, 19: 716–723.
[12] Kass R E, Raftery A E. Bayes factors. Journal of the American Statistical Association, 1995, 90: 773–795.
[13] Geisser S, Eddy W F. A predictive approach to model selection. Journal of the American Statistical Association, 1979, 74: 153–160.
[14] Mitchell T J, Beauchamp J J. Bayesian variable selection in linear regression (with discussion). Journal of the American Statistical Association, 1988, 83: 1023–1036.
[15] George E I, McCulloch R E. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 1993, 88: 881–889.
[16] Green P J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 1995, 82: 711–732.
[17] Nott D J, Leonte D. Sampling schemes for Bayesian variable selection in generalized linear models. Journal of Computational and Graphical Statistics, 2004, 13(2): 362–382.
[18] Swendsen R H, Wang J S. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 1987, 58: 86–88.
[19] Hans C. Model uncertainty and variable selection in Bayesian lasso regression. Statistics and Computing, 2010, 20(2): 221–229.
[20] Lykou A, Ntzoufras I. On Bayesian lasso variable selection and the specification of the shrinkage parameter. Statistics and Computing, 2013, 23(3): 361–390.
[21] Kuo L, Mallick B. Variable selection for regression models. Sankhyā, 1998, B60: 65–81.
[22] Draper N, Smith H. Applied regression analysis. 2nd ed. New York: John Wiley, 1981.
[23] SAS Institute. SAS User’s Guide: Statistics, Version 5. SAS Institute Inc, 1985.
[24] Liu J S. Monte Carlo strategies in scientific computing. New York: Springer-Verlag, 2001.