
Bayesian Subset Selection for Reproductive Dispersion Linear Models

  • Yuanying Zhao, Dengke Xu, Xingde Duan and Yicheng Pang
Published/Copyright: February 25, 2014

Abstract

We propose a full Bayesian subset selection method for reproductive dispersion linear models, which is based on expanding the usual link function to a function that incorporates all possible subsets of predictors by adding indicators as parameters. The vector of indicator variables dictates which predictors to delete. An efficient MCMC procedure combining the Gibbs sampler and the Metropolis-Hastings algorithm is presented to approximate the posterior distribution of the indicator variables. Promising subsets of predictors can then be identified as those with higher posterior probability. Several numerical examples are used to illustrate the newly developed methodology.

1 Introduction

The reproductive dispersion distribution family proposed by Jorgensen[1] is a general distribution family that includes the normal distribution, the Gumbel distribution and the exponential family as special cases, and it has received a lot of attention in recent years. For example, Tang et al.[2] proposed a nonlinear reproductive dispersion model and studied its influence diagnostics; Chen et al.[3] discussed influence diagnostics of semiparametric nonlinear reproductive dispersion models; Fu et al.[4] investigated local influence analysis for nonlinear structural equation models with non-ignorable missing outcomes from a reproductive dispersion model. In particular, Chen and Tang[5] developed a Bayesian approach to analyze semiparametric reproductive dispersion mixed-effects models on the basis of P-spline estimates of nonparametric components by using the Gibbs sampler[6] and the Metropolis-Hastings algorithm[7, 8]. Recently, Tang and Zhao[9] presented a semiparametric Bayesian analysis of nonlinear reproductive dispersion mixed models for longitudinal data.

Various Bayesian methods have been proposed for selecting suitable subset models; classical methods include the Bayesian information criterion[10], the Akaike information criterion[11], Bayes factors[12] and pseudo-Bayes factors[13], etc. In particular, Mitchell and Beauchamp[14] presented a Bayesian variable selection method assuming that the prior distribution of each regression coefficient is a mixture of a point mass at 0 and a diffuse uniform distribution elsewhere. Furthermore, George and McCulloch[15] developed a variable selection procedure via Gibbs sampling. The reversible jump MCMC[16] was proposed for investigating model uncertainty. Nott and Leonte[17] proposed sampling schemes for Bayesian selection in generalized linear models based on the Swendsen-Wang algorithm for the Ising model[18]. Hans[19] addressed model uncertainty for Bayesian lasso regression by computing marginal posterior probabilities over a small model space. Recently, Lykou and Ntzoufras[20] studied how to select variables using the Bayesian lasso. However, to our knowledge, little work has been done on Bayesian subset selection for the reproductive dispersion distribution family. Motivated by the idea of Kuo and Mallick[21], the main purpose of this paper is to develop a novel Bayesian approach to select promising subsets under reproductive dispersion linear models by using MCMC technology.

This article is structured as follows. A reproductive dispersion linear model which incorporates model uncertainty is defined in Section 2. In Section 3, we explicitly describe an MCMC procedure that can identify the promising subsets of predictors under reproductive dispersion linear models. Several numerical examples are used to illustrate our proposed methodology in Section 4. A discussion is given in the final section. Technical details are given in the Appendix.

2 Model and notation

Suppose yi (i = 1, 2, ⋯, n) is the observation for the ith individual. We assume that y1, y2, ⋯, yn are independently distributed and have the following probability density function

$$p(y_i;\mu_i)=a(y_i;\sigma^2)\exp\left\{-\frac{1}{2\sigma^2}d(y_i;\mu_i)\right\}\tag{2.1}$$

where μi is the location parameter and may represent the mean of yi; σ² = τ−1 ∈ 𝔹 (𝔹 ⊂ ℝ+) is referred to as the dispersion parameter, which is known or can be estimated separately; a(⋅;⋅) is some suitable known function; d(y; μ) is a unit deviance on ℂ × Ω (Ω ⊆ ℂ ⊆ ℝ are open intervals) that satisfies d(y; y) = 0, ∀y ∈ ℂ and d(y; μ) > 0, ∀y ≠ μ, and is twice continuously differentiable with respect to (y, μ) on ℂ × Ω. The model (2.1) is referred to as a reproductive dispersion distribution family[1] and includes the normal distribution, the double exponential distribution, the Gumbel distribution and the exponential family as special cases. Motivated by the idea of Kuo and Mallick[21], we explore a simple method of subset selection. Instead of building a hierarchical model, we embed indicator variables in the predictors so as to incorporate all 2p submodels. Specifically, let γj (j = 1, 2, ⋯, p) be the indicator variable supported at the two points 1 and 0. When γj = 1, we include the jth predictor in the model; when γj = 0, we omit it when building the model. Therefore we assume that the systematic part of the model is specified by

$$\eta_i=g(\mu_i)=\sum_{j=1}^{p}x_{ij}\beta_j\gamma_j\tag{2.2}$$

where g(⋅) is a known continuously differentiable link function. For simplicity, we only consider the case that g(⋅) is the canonical link in this article, i.e., g(μi) = μi, then (2.2) reduces to

$$\mu_i=\sum_{j=1}^{p}x_{ij}\beta_j\gamma_j\tag{2.3}$$

The model defined by equations (2.1) and (2.3) is referred to as a reproductive dispersion linear model which includes model uncertainty.
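As a concrete illustration (not part of the original formulation), the unit deviances obtained by matching the distributions named above to the form (2.1) can be sketched in Python; the Gumbel deviance below is derived under the assumption σ² = 1, and the function names are ours.

```python
import numpy as np

# Unit deviances d(y; mu) implied by matching each density to the
# reproductive dispersion form a(y; sigma^2) exp{-d(y; mu)/(2 sigma^2)}.

def d_normal(y, mu):
    # N(mu, sigma^2): d(y; mu) = (y - mu)^2
    return (y - mu) ** 2

def d_double_exponential(y, mu):
    # (tau/2) exp{-tau |y - mu|} with sigma^2 = 1/tau gives d(y; mu) = 2|y - mu|
    return 2.0 * np.abs(y - mu)

def d_gumbel(y, mu):
    # exp{(y - mu) - exp(y - mu)} with sigma^2 = 1 gives
    # d(y; mu) = 2[exp(y - mu) - (y - mu) - 1], so d(y; y) = 0 as required
    return 2.0 * (np.exp(y - mu) - (y - mu) - 1.0)
```

Each function satisfies the unit-deviance requirements d(y; y) = 0 and d(y; μ) > 0 for y ≠ μ.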

Let β = (β1, β2, ⋯, βp)T, γ = (γ1, γ2, ⋯, γp)T, θ = (βT, τ)T and xi = (xi1, xi2, ⋯, xip)T. We assume that the prior of θ is independent of γ and that the γj are mutually independent for j = 1, 2, ⋯, p. Under the above assumptions, the prior density p(θ, γ) of (θ, γ) can be expressed as

$$p(\theta,\gamma)=p(\beta,\tau,\gamma)=p(\tau)\,p(\beta|\tau)\prod_{j=1}^{p}p(\gamma_j).$$

Based on the recommendations given by Chen and Tang[5] and Kuo and Mallick[21], we consider the following prior distributions for parameters τ, β and γj(j = 1, 2, ⋯, p):

$$p(\tau)\sim\Gamma(a_1,a_2),\quad p(\beta|\tau)\sim N(\beta_0,\tau^{-1}H_0),\quad p(\gamma_j)\sim B(1,p_j)\tag{2.4}$$

where a1, a2, β0, H0 and pj are hyperparameters whose values are assumed to be given; Γ(a, b) denotes the gamma distribution with parameters a and b, N(β, Σ) denotes the multivariate normal distribution with mean β and covariance Σ, and B(1, q) is the Bernoulli distribution with parameter q. In this paper, we set a1 = a2 = 0.01, β0 = 0, H0 = 100Ip and pj = 0.5 in all numerical examples, which represent noninformative priors, where Ip is the p × p identity matrix.

3 Bayesian subset selection

In this section, we propose and develop an MCMC procedure that can identify the promising subsets of predictors under reproductive dispersion linear models. Denote y = {y1, y2, ⋯, yn} and x = {x1, x2, ⋯, xn}. Bayesian inference on θ and γ can be carried out via the posterior distribution p(θ, γ|y, x) ∝ p(y, x|θ, γ)p(θ, γ), where p(y, x|θ, γ) is the likelihood function of the observed data {y, x}. The likelihood p(y, x|θ, γ) may involve intractable high-dimensional integrals, so it is rather difficult and tedious to work directly with this posterior density. Inspired by recent developments in statistical computing, the Gibbs sampler[6] is used to generate a sequence of random observations from the joint posterior distribution p(θ, γ|y, x), which is proportional to $\prod_{i=1}^{n}p(y_i,x_i|\theta,\gamma)\,p(\theta,\gamma)$, i.e.,

$$\prod_{i=1}^{n}a(y_i;\tau^{-1})\exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}d(y_i;\mu_i)\right\}p(\theta,\gamma)\tag{3.1}$$

In this algorithm, observations {θ, γ} are iteratively simulated from the following conditional distributions: p(θ|y, x, γ) and p(γ|y, x, θ). Then we identify the promising subset of predictors with high posterior probabilities of γ.
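Schematically, one sweep of the sampler alternates these two conditional draws. The following skeleton is a sketch under our own naming conventions (the three update functions stand in for the concrete steps of Sections 3.1 and 3.2), not the authors' actual code:

```python
def gibbs(y, X, n_iter, init, update_tau, update_beta, update_gamma):
    # Alternate draws from p(theta | y, x, gamma) and p(gamma | y, x, theta);
    # the three update functions are placeholders for the MIS, MH and
    # Bernoulli steps described below.
    beta, tau, gamma = init
    draws = []
    for _ in range(n_iter):
        tau = update_tau(y, X, beta, gamma, tau)
        beta = update_beta(y, X, tau, gamma, beta)
        gamma = update_gamma(y, X, beta, tau, gamma)
        draws.append((beta.copy(), tau, gamma.copy()))
    return draws
```

The retained draws of γ are then tabulated to find the high-posterior-probability subsets.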

3.1 Conditional distributions

In this subsection, we briefly present the conditional distributions required in implementing the Gibbs sampler.

For the conditional distribution of γ, we have

$$p(\gamma|y,x,\tau,\beta)\propto\prod_{j=1}^{p}p(\gamma_j)\prod_{i=1}^{n}a(y_i;\tau^{-1})\exp\left\{-\frac{\tau}{2}\sum_{i=1}^{n}d(y_i;\mu_i)\right\}\tag{3.2}$$

For the conditional distribution p(τ|y, x, θ), it follows from equations (3.1) and (2.4) that

$$p(\tau|y,x,\theta)\propto\tau^{a_1+\frac{p}{2}-1}a(y;\tau)\exp\left\{-\tau\left[a_2+\frac{1}{2}\sum_{i=1}^{n}d(y_i;\mu_i)+\frac{1}{2}(\beta-\beta_0)^TH_0^{-1}(\beta-\beta_0)\right]\right\}\tag{3.3}$$

where $a(y;\tau)=\prod_{i=1}^{n}a(y_i;\tau)$. Clearly, if a(yi; τ) = f(yi)τq, in which f(yi) is a function of yi, then

$$p(\tau|y,x,\theta)\sim\Gamma(\nu_\tau,g_\tau)\tag{3.4}$$

where $\nu_\tau=a_1+\frac{p}{2}+nq$ and $g_\tau=a_2+\frac{1}{2}\left\{\sum_{i=1}^{n}d(y_i;\mu_i)+(\beta-\beta_0)^TH_0^{-1}(\beta-\beta_0)\right\}$.

For the conditional distribution p(β|y, x, τ, γ), it follows from equations (3.1) and (2.4) that

$$p(\beta|y,x,\tau,\gamma)\propto\exp\left\{-\frac{\tau}{2}\left[\sum_{i=1}^{n}d(y_i;\mu_i)+(\beta-\beta_0)^TH_0^{-1}(\beta-\beta_0)\right]\right\}\tag{3.5}$$

3.2 Implementation

Now, we describe how to update the parameter γ. We update the variate γj (j = 1, 2, ⋯, p) given y, x, γ−j, β, τ by a Bernoulli distribution B(1, cj/(cj + dj)), where γ−j is γ with the jth element deleted,

$$c_j=p_j\prod_{i=1}^{n}p(y_i,x_i|\beta,\tau,\gamma^{*}),$$

and

$$d_j=(1-p_j)\prod_{i=1}^{n}p(y_i,x_i|\beta,\tau,\gamma^{**}),$$

in which the vector γ* (γ**) is obtained from γ with the jth element replaced by 1 (0). Clearly, sampling observations from the conditional distribution given in (3.4) is straightforward and fast because it is a familiar standard distribution. However, it is rather difficult to sample observations from the conditional distributions given in (3.3) and (3.5) because these conditional distributions are nonstandard and complex. Therefore, the Metropolized independence sampler (MIS) and the Metropolis-Hastings (MH) algorithm are employed to overcome these difficulties. The technical details are given in the Appendix.
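To make the indicator update concrete, the following Python sketch implements one Gibbs sweep over γ under the double exponential likelihood (4.1) of Section 4, computing cj and dj on the log scale for numerical stability; the function names and the choice of likelihood are our illustrative assumptions.

```python
import numpy as np

def loglik(y, X, beta, tau, gamma):
    # Double exponential log-likelihood with mu_i = sum_j x_ij beta_j gamma_j
    mu = X @ (beta * gamma)
    return len(y) * np.log(tau / 2.0) - tau * np.sum(np.abs(y - mu))

def update_gamma(y, X, beta, tau, gamma, p_prior, rng):
    # One sweep: gamma_j ~ B(1, c_j / (c_j + d_j)), with c_j and d_j
    # evaluated at gamma* (j-th element set to 1) and gamma** (set to 0)
    for j in range(len(gamma)):
        g1, g0 = gamma.copy(), gamma.copy()
        g1[j], g0[j] = 1, 0
        log_c = np.log(p_prior[j]) + loglik(y, X, beta, tau, g1)
        log_d = np.log(1.0 - p_prior[j]) + loglik(y, X, beta, tau, g0)
        diff = np.clip(log_d - log_c, -700.0, 700.0)  # avoid overflow in exp
        gamma[j] = rng.binomial(1, 1.0 / (1.0 + np.exp(diff)))
    return gamma
```

Note that cj/(cj + dj) = 1/(1 + dj/cj), so only the log-ratio of the two likelihoods is needed.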

4 Numerical examples

In this section, several numerical examples are used to illustrate the preceding proposed method.

4.1 Simulation studies

In this subsection, numerical results with responses coming from the double exponential distribution and Gumbel distribution are used to investigate the performance of the proposed methodology.

We first consider the situation where the response yi (i = 1, 2, ⋯, n) follows the double exponential distribution with probability density function

$$p(y_i;\mu_i)=\frac{\tau}{2}\exp\{-\tau|y_i-\mu_i|\}\tag{4.1}$$

and the location parameter μi is given by

$$\mu_i=x_{i1}\beta_1\gamma_1+x_{i2}\beta_2\gamma_2+x_{i3}\beta_3\gamma_3+x_{i4}\beta_4\gamma_4+x_{i5}\beta_5\gamma_5\tag{4.2}$$

In other words, we consider this variable selection problem with p = 5. It is assumed that xi = (xi1, xi2, xi3, xi4, xi5)T is independently generated from the standard normal distribution N(0, I5). Furthermore, we assume that the true value of τ is 1.0 and ϑ = (ϑ1, ϑ2, ϑ3, ϑ4, ϑ5)T = (β1γ1, β2γ2, β3γ3, β4γ4, β5γ5)T = (0, 0, 0, 1, 1.2)T, i.e., (4.2) reduces to

$$\mu_i=x_{i4}+1.2x_{i5}\tag{4.3}$$

then the model specified by (4.1) and (4.3) is a special case of reproductive dispersion linear models which includes model uncertainty. In all simulation studies, we take the sample size n to be 100. In the first simulation study, a data set is generated from (4.1) and (4.3), and we apply the proposed method to this data set. Based on a sample of 15000 Gibbs draws, the frequencies of the four highest-frequency models are displayed in Table 1. From Table 1 we can easily see that our analysis identifies the right model as expected. From the Gibbs draws, we can also calculate the posterior mean and standard deviation of ϑ. The posterior means are 1.1127 and 1.2727 for ϑ4 and ϑ5 with standard deviations 0.1212 and 0.1215; each standard deviation incorporates our measure of model uncertainty.

Table 1

High frequency models in the first simulation study

Model variables      Proportion
x4  x5               88.29%
x3  x4  x5            4.11%
x1  x4  x5            3.80%
x2  x4  x5            3.11%
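The data-generating mechanism of this first simulation study can be sketched as follows (a hypothetical reconstruction; the seed and variable names are ours), noting that the double exponential density (4.1) is the Laplace distribution with scale 1/τ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, tau = 100, 5, 1.0
theta = np.array([0.0, 0.0, 0.0, 1.0, 1.2])  # true beta_j * gamma_j in (4.2)

X = rng.standard_normal((n, p))                # x_i ~ N(0, I_5)
mu = X @ theta                                 # mu_i = x_i4 + 1.2 x_i5, as in (4.3)
y = mu + rng.laplace(scale=1.0 / tau, size=n)  # double exponential errors
```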

The second simulation study is identical to the first except that xi3 (i = 1, 2, ⋯, n) is replaced by xi3 = xi5 + 0.15z, where z ∼ N(0, 1), yielding the correlation coefficient corr(xi3, xi5) = 0.989. This simulation is used to illustrate how our procedure performs in the presence of extreme collinearity. We conduct the same analysis as in the first simulation study. Summaries of the results are reported in Table 2, which implies that x4 together with either one of x3 or x5 is identified as a promising model.

Table 2

High frequency models in the second simulation study

Model variables      Proportion
x4  x5               33.43%
x3  x4               57.54%
x3  x4  x5            4.65%
x1  x4  x5            3.22%

To demonstrate the practical potential of our procedure for data sets with a relatively large number of covariates, we conduct the following third simulation study. The responses are coming from the double exponential distribution same as the first simulation study, but the location parameter μi is given as follows:

$$\mu_i=x_{i1}\beta_1\gamma_1+x_{i2}\beta_2\gamma_2+\cdots+x_{i,15}\beta_{15}\gamma_{15}\tag{4.4}$$

The covariates xj (j = 1, 2, ⋯, 15) are obtained as xj = xj* + U, where the xj* are i.i.d. N(0, I100) and U ∼ N(0, I100), which makes the pairwise correlation about 0.5. The coefficients ϑ = (ϑ1, ⋯, ϑ15)T are set at (ϑ1, ⋯, ϑ5) = (0, ⋯, 0), (ϑ6, ⋯, ϑ10) = (1, ⋯, 1) and (ϑ11, ⋯, ϑ15) = (2, ⋯, 2), in which ϑj = βjγj (j = 1, 2, ⋯, 15). A data set is generated from equations (4.1) and (4.4). Summaries of the results based on 40000 Gibbs draws are presented in Table 3, where a false choice is defined as at least one of the predictors 1 to 5 being included or at least one of the predictors 6 to 15 being deleted. From Table 3, 6.87% of the Gibbs draws contain a false choice with x1 included; 4.12% contain a false choice with x5 included; 2.09% contain a false choice with x3 included, etc. A high proportion, 82.45%, of the Gibbs draws capture the right promising model, and some variation is allowed for lower-frequency models too.

Table 3

High frequency models for false choices in the third simulation study

Relative frequency      False choice
82.45%                  None
6.87%                   x1
4.12%                   x5
2.09%                   x3

To investigate the performance under various response distributions, we consider another situation and conduct the following simulation studies. In the fourth simulation study, we assume that the response yi (i = 1, 2, ⋯, n) follows the Gumbel distribution, which has the probability density function

$$p(y_i;\mu_i)=\exp\{(y_i-\mu_i)-\exp(y_i-\mu_i)\}\tag{4.5}$$

and belongs to the reproductive dispersion family[1]. The fourth simulation study uses the same settings as the first. The corresponding results are reported in Table 4. Furthermore, we conduct the fifth (sixth) simulation study, which is identical to the second (third) except that the data set is generated from equation (4.5). Summaries of these results are presented in Tables 5 and 6. All these results lead to the same conclusions as in the first situation, which indicates that the response distribution has little effect on our procedure.
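For reference, draws from (4.5) can be generated by negating a standard maximum-type Gumbel variate: if G has density exp{−z − e^{−z}}, then −G has density exp{z − e^z}. A brief sketch (variable names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
mu = np.zeros(n)  # location parameters mu_i; zeros here purely for illustration
# numpy's gumbel() samples the maximum-type density exp{-z - exp(-z)};
# negating the draw yields the minimum-type density exp{z - exp(z)} in (4.5)
y = mu - rng.gumbel(loc=0.0, scale=1.0, size=n)
```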

Table 4

High frequency models in the fourth simulation study

Model variables      Proportion
x4  x5               96.45%
x3  x4  x5            1.13%
x1  x4  x5            1.08%
x2  x4  x5            0.88%

Table 5

High frequency models in the fifth simulation study

Model variables      Proportion
x4  x5               41.12%
x3  x4               26.35%
x2  x3  x4           10.13%
x2  x4  x5            9.75%

Table 6

High frequency models for false choices in the sixth simulation study

Relative frequency      False choice
80.05%                  None
8.40%                   x1
5.56%                   x5
4.15%                   x4

4.2 Real examples

In this subsection, two real data examples are used to illustrate the proposed methodology.

4.2.1 Hald data

Our first real data example is the familiar Hald data, which were reported by Draper and Smith[22] and have been used by various authors to illustrate variable selection procedures. The data consist of n = 13 responses and p = 4 independent variables x1, x2, x3, x4. Therefore, 2⁴ = 16 possible subset models are under consideration. Equation (4.1) and

$$\mu_i=x_{i1}\beta_1\gamma_1+x_{i2}\beta_2\gamma_2+x_{i3}\beta_3\gamma_3+x_{i4}\beta_4\gamma_4\tag{4.6}$$

are applied to fit the data. 10000 Gibbs draws are simulated and the higher-frequency models are tabulated in Table 7. From Table 7 we can easily see that the results capture the μi = f(xi1, xi2, xi4) model, which is quite similar to the forward-selection results in Draper and Smith[22].

Table 7

High frequency models in the Hald data analysis

Model variables          Proportion
x1  x2  x4               90.35%
x2  x3  x4                4.99%
x1  x2  x3  x4            4.22%
x2  x4                    0.44%

4.2.2 Fitness data

The data for our second real example are the fitness data, which have been studied in the SAS User's Guide: Statistics[23]. The data consist of n = 31 responses and p = 6 independent variables. Therefore, 2⁶ = 64 possible subset models are under consideration. We employ equation (4.1) and

$$\mu_i=x_{i1}\beta_1\gamma_1+x_{i2}\beta_2\gamma_2+x_{i3}\beta_3\gamma_3+x_{i4}\beta_4\gamma_4+x_{i5}\beta_5\gamma_5+x_{i6}\beta_6\gamma_6\tag{4.7}$$

to analyze the data, in which xi1, ⋯, xi6 denote age (years), weight (kg), time to run 1.5 miles in minutes (runtime), heart rate while resting (restpulse), heart rate while running (runpulse) and maximum heart rate recorded while running (maxpulse), respectively. 15000 Gibbs draws are generated and the higher-frequency models are tabulated in Table 8. From Table 8 we can easily conclude that the results capture the μi = f(xi1, xi3, xi5, xi6) model, which agrees with SAS on the variables to be deleted.

Table 8

High frequency models in the fitness data analysis

Model variables                                  Proportion
age  runtime  runpulse  maxpulse                 58.56%
runtime  runpulse  maxpulse                      37.19%
age  runtime  restpulse  runpulse  maxpulse       2.91%
age  runpulse  maxpulse                           0.38%

5 Conclusion and discussion

In this article, we extend the Bayesian variable selection method proposed in [21] to reproductive dispersion linear models which include model uncertainty. A full Bayesian approach is developed to investigate these models' uncertainty by using the Gibbs sampler and the MH algorithm. Several numerical examples are used to show the efficiency of the proposed Bayesian approach. The results show that the developed Bayesian procedure is highly efficient and computationally fast. Other Bayesian variable selection approaches under the reproductive dispersion distribution family will be of great research interest, and the authors intend to investigate them in the near future.


Supported by Grants from the National Natural Science Foundation of China (Grant No. 11301485), Scientific Research Project for Introduced Talents of Guiyang University, Scientific Research Project for Introduced Talents of Guizhou University of Finance and Economics in 2013


Acknowledgements

The authors are grateful to Professor Niansheng Tang for many valuable suggestions during the preparation of the manuscript and to the Editor-in-Chief of the Journal of Systems Science and Information.

Appendix

The Metropolized independence sampler (MIS) is used to simulate observations from the conditional distribution p(τ|y, x, β, γ) given in (3.3). Following the arguments of Liu[24], the Metropolized independence sampler is implemented as follows: Given the current state τ(m), a candidate τ* is drawn from the trial density h(τ) and is accepted with probability

$$\min\left\{1,\frac{p(\tau^{*}|y,x,\beta,\gamma)\,h(\tau^{(m)})}{p(\tau^{(m)}|y,x,\beta,\gamma)\,h(\tau^{*})}\right\},$$

where h(τ) is some density with respect to τ. As pointed out by Liu[24], the efficiency of the MIS depends heavily on how close the trial density h(τ) is to the target density p(τ|y, x, β, γ).
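A minimal sketch of this MIS step in Python, assuming a fixed Gamma(shape, rate) trial density h and working on the log scale (the function names and the particular trial family are our illustrative choices):

```python
import math
import numpy as np

def gamma_logpdf(t, shape, rate):
    # log density of the Gamma(shape, rate) trial distribution h
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(t) - rate * t)

def mis_update_tau(tau_curr, log_target, shape, rate, rng):
    # Propose tau* ~ h = Gamma(shape, rate), then accept with probability
    # min{1, p(tau*) h(tau_curr) / (p(tau_curr) h(tau*))}
    tau_prop = rng.gamma(shape, 1.0 / rate)
    log_ratio = (log_target(tau_prop) + gamma_logpdf(tau_curr, shape, rate)
                 - log_target(tau_curr) - gamma_logpdf(tau_prop, shape, rate))
    if math.log(rng.uniform()) < log_ratio:
        return tau_prop
    return tau_curr
```

The closer the Gamma trial density is to p(τ|y, x, β, γ), the higher the acceptance rate.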

Similarly, the MH algorithm for simulating observations from the conditional distribution p(β|y, x, τ, γ) given in (3.5) is implemented as follows: Given the current value β(m), a new candidate β* is generated from N(β(m), σβ²Ωβ) and is accepted with probability

$$\min\left\{1,\frac{p(\beta^{*}|y,x,\tau,\gamma)}{p(\beta^{(m)}|y,x,\tau,\gamma)}\right\},$$

where

$$\Omega_\beta=\tau^{-1}\left[\frac{1}{2}\sum_{i=1}^{n}x_iV_ix_i^T+H_0^{-1}\right]^{-1}$$

where $V_i=E_{y_i}\left(\partial^2 d(y_i;\mu_i)/\partial\mu_i\partial\mu_i^T\right)\big|_{\beta=0}$, and σβ² is chosen such that the average acceptance rate is about 0.25 or more.
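The corresponding random-walk MH step can be sketched as follows (a minimal illustration, with names ours; the symmetric normal proposal means the proposal density cancels from the acceptance ratio):

```python
import numpy as np

def mh_update_beta(beta_curr, log_target, sigma_beta, Omega_beta, rng):
    # Propose beta* ~ N(beta_curr, sigma_beta^2 Omega_beta) via a Cholesky
    # factor, then accept with probability min{1, p(beta*)/p(beta_curr)}
    L = np.linalg.cholesky(Omega_beta)
    beta_prop = beta_curr + sigma_beta * (L @ rng.standard_normal(len(beta_curr)))
    if np.log(rng.uniform()) < log_target(beta_prop) - log_target(beta_curr):
        return beta_prop
    return beta_curr
```

In practice σβ is tuned during burn-in until the acceptance rate settles near the target level.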

References

[1] Jorgensen B. The theory of dispersion models. Chapman and Hall, London, 1997.

[2] Tang N S, Wei B C, Wang X R. Influence diagnostics in nonlinear reproductive dispersion models. Statistics & Probability Letters, 2000, 46: 59–68.

[3] Chen X D, Tang N S, Wang X R. Influence diagnostics of semiparametric nonlinear reproductive dispersion models. Communications in Statistics — Theory and Methods, 2010, 39: 3021–3040.

[4] Fu Y Z, Tang N S, Chen X. Local influence analysis of non-linear structural equation models with non-ignorable missing outcomes from reproductive dispersion models. Computational Statistics and Data Analysis, 2009, 53: 3671–3684.

[5] Chen X D, Tang N S. Bayesian analysis of semiparametric reproductive dispersion mixed-effects models. Computational Statistics and Data Analysis, 2010, 54: 2145–2158.

[6] Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984, 6: 721–741.

[7] Metropolis N, Rosenbluth A W, Rosenbluth M N, et al. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 1953, 21: 1087–1091.

[8] Hastings W K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 1970, 57: 97–109.

[9] Tang N S, Zhao Y Y. Semiparametric Bayesian analysis of nonlinear reproductive dispersion mixed models for longitudinal data. Journal of Multivariate Analysis, 2013, 115: 68–83.

[10] Schwarz G. Estimating the dimension of a model. The Annals of Statistics, 1978, 6: 461–464.

[11] Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 1974, 19: 716–723.

[12] Kass R E, Raftery A E. Bayes factors. Journal of the American Statistical Association, 1995, 90: 773–795.

[13] Geisser S, Eddy W F. A predictive approach to model selection. Journal of the American Statistical Association, 1979, 74: 153–160.

[14] Mitchell T J, Beauchamp J J. Bayesian variable selection in linear regression (with discussion). Journal of the American Statistical Association, 1988, 83: 1023–1036.

[15] George E I, McCulloch R E. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 1993, 88: 881–889.

[16] Green P J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 1995, 82: 711–732.

[17] Nott D J, Leonte D. Sampling schemes for Bayesian variable selection in generalized linear models. Journal of Computational and Graphical Statistics, 2004, 13(2): 362–382.

[18] Swendsen R H, Wang J S. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 1987, 58: 86–88.

[19] Hans C. Model uncertainty and variable selection in Bayesian lasso regression. Statistics and Computing, 2010, 20(2): 221–229.

[20] Lykou A, Ntzoufras I. On Bayesian lasso variable selection and the specification of the shrinkage parameter. Statistics and Computing, 2013, 23(3): 361–390.

[21] Kuo L, Mallick B. Variable selection for regression models. Sankhyā, 1998, B60: 65–81.

[22] Draper N, Smith H. Applied regression analysis. 2nd ed. John Wiley, New York, 1981.

[23] SAS Institute. SAS User's Guide: Statistics, Version 5. SAS Institute Inc, 1985.

[24] Liu J S. Monte Carlo strategies in scientific computing. Springer-Verlag, New York, 2001.

Received: 2013-11-27
Accepted: 2013-12-26
Published Online: 2014-2-25

© 2014 Walter de Gruyter GmbH, Berlin/Boston
