
Combinatorial Mixtures of Multiparameter Distributions: An Application to Bivariate Data

  • Valeria Edefonti and Giovanni Parmigiani
Published/Copyright: February 16, 2017

Abstract:

We introduce combinatorial mixtures – a flexible class of models for inference on mixture distributions whose components have multidimensional parameters. The key idea is to allow each element of the component-specific parameter vectors to be shared by a subset of other components. This approach allows for mixtures that range from very flexible to very parsimonious and unifies inference on component-specific parameters with inference on the number of components. We develop Bayesian inference and computational approaches for this class of distributions, and illustrate them in an application. This work was originally motivated by the analysis of cancer subtypes: in terms of biological measures of interest, subtypes may be characterized by differences in location, scale, correlations, or any combination of these. We illustrate our approach using publicly available data on molecular subtypes of lung and prostate cancers.

1 Introduction

Since the beginning of the last century [1, 2], finite mixture distributions have received attention as tools for modelling population heterogeneity or for building flexible finite-parameter distributions. Monographs on finite mixtures include the classical works of Titterington, Smith, and Makov [3], McLachlan and Basford [4], and McLachlan and Peel [5]. Böhning and Seidel [6] is a review with emphasis on nonparametric maximum likelihood, whereas Marin et al. [7] is an introduction from a Bayesian perspective. For a complete treatment of several aspects of finite mixture models from both perspectives, see Frühwirth-Schnatter [8].

One of the important remaining challenges of mixture modeling is to develop approaches that achieve a practical compromise between flexibility and parsimony, especially for mixtures whose component distributions are themselves characterized by multiple parameters. In this setting, one has the option of allowing each component to have its own parameter vector, or of sharing a subset of the vector elements across components. For example, in the context of normal mixtures, it is common to assume either component-specific locations and variances, or a common variance, or a common mean. These three choices are extremes of a richer and useful set of patterns in which one shares some of the parameters in some of the components. Here we develop this idea formally, proposing a parametric solution to this class of problems.

Our approach is to allow each element of the component-specific parameter vector to be either different from or equal to that of other components. A positive probability is put on every possible combination of equalities, whence the name combinatorial mixtures. This partial sharing allows for greater generality and flexibility in comparison with traditional approaches to mixture modeling, while still assigning mass to models that are more parsimonious than the general mixture case, in which no sharing takes place. One of the implications of our setting is that, once a maximum number of components is specified, inference on the parameters and the number of components is subsumed by the inference on combinatorial patterns. If there is complete sharing between two components, then the effective number of components is reduced by one. Therefore, assigning a prior on sharing patterns implies assigning a prior on the effective number of mixture components.

This development was originally motivated by applications in molecular biology, where one deals with continuous measures, such as RNA or protein levels, which vary across unknown biological subtypes. In some cases, subtypes are characterized by an increase in the level of the marker measured, while in others they are characterized by variability in otherwise tightly controlled processes, or by the lack of otherwise strong correlations [9, 10]. Also, several mechanisms can coexist. In addition, in recent years, it has been recognized that bimodally distributed genes can identify promising candidates for clinically relevant prognostic or predictive signatures [11]. In breast cancer, the ER (estrogen receptor), PR (progesterone receptor), and HER2 (human epidermal growth factor receptor 2) genes can be considered molecular switches, with distinct on/off expression states defining clinically unique subtypes with different prognoses and different responses to different therapeutic strategies [12]. Another relevant application of mixture models is the modelling of the shape of the hazard function via two- or three-component mixtures in tumor dormancy phenomena [13]. In these contexts, the main goals of a mixture model analysis are to: (a) estimate the number of subgroups in a sample; (b) make inferences about the assignment of samples to these subgroups; and (c) generate hypotheses about which mechanisms are likely to characterize the subgroups. Our paper adds a new tool to Bayesian mixture models that contributes to answering these questions.

Combinatorial mixtures relate to product partition [14, 15, 16, 17] and Dirichlet process mixture (DPM) [18, 19] models, which allow for parameters to be clustered in subsets. Variants of the standard formulations of these models may be considered that are similar in spirit to combinatorial mixtures. However, our approach is more oriented toward providing experts with direct tools for eliciting a priori information and visualizing a posteriori evidence from the fitted mixture models. This is a key feature, as mixture models have been increasingly used in recent years to model problems with a direct application in clinical practice.

The remainder of the paper is organized as follows. In Section 2 we introduce the notation and missing data formulation for Bayesian mixtures. In Section 3 we propose a general formulation for combinatorial mixtures. In Section 4, combinatorial mixtures are introduced in the context of normal mixture models. The performance of the methodology is then illustrated in an application to gene expression data from lung and prostate cancer patients in Section 5.

2 Bayesian methods for mixtures: notation and missing data formulation

In a finite mixture model, observations $x^n = (x_1, \ldots, x_n)$ are assumed to be conditionally independent, as follows:

$$x_i \mid K, \theta, \omega \;\sim\; p(x_i \mid K, \theta, \omega) = \sum_{k=1}^{K} \omega_k \, p(x_i \mid \theta_k), \qquad i = 1, \ldots, n, \tag{1}$$

where $K$ is the number of components, $\omega = (\omega_1, \ldots, \omega_K)$ are the mixture weights – constrained to be non-negative and to sum to unity – and $\theta = (\theta_1, \ldots, \theta_K)$ is the component-specific parameter vector. A fully Bayesian analysis [48] requires the specification of a prior distribution on the parameters $(K, \theta, \omega)$. The number of components and the component-specific parameters are modeled jointly, and inference about these quantities is based on their posterior distributions.

In the missing data formulation of mixture models, each observation $x_i$, $i = 1, \ldots, n$, comes from a specific, but unknown, component, $z_i$, of the mixture. Model (1) can be written in terms of the missing data, with $z_1, \ldots, z_n$ assumed to be realizations of conditionally independent and identically distributed discrete random variables with probability mass function:

$$p(z_i = k \mid \theta, \omega) = \omega_k, \qquad i = 1, \ldots, n, \quad k = 1, \ldots, K.$$

Conditional on the $z$s, $x_1, \ldots, x_n$ are independent observations from the densities:

$$p(x_i \mid z_i = k, \theta, \omega) = p(x_i \mid \theta_k), \qquad i = 1, \ldots, n, \quad k = 1, \ldots, K.$$

Integrating out $z_1, \ldots, z_n$ yields model (1).
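To make the missing data formulation concrete, here is a minimal R sketch that simulates from a finite normal mixture by first drawing the latent allocations; the component values are illustrative, not taken from the paper.

```r
## A minimal sketch of the missing data formulation: draw z_i first,
## then x_i | z_i = k ~ N(mu_k, sigma_k^2). Values are illustrative.
set.seed(1)
n     <- 200
omega <- c(0.5, 0.3, 0.2)   # mixture weights (non-negative, sum to one)
mu    <- c(0, 3, 6)         # component-specific means
sigma <- c(1.0, 0.5, 1.5)   # component-specific standard deviations
z <- sample(1:3, n, replace = TRUE, prob = omega)  # latent allocations
x <- rnorm(n, mean = mu[z], sd = sigma[z])         # observations
hist(x, breaks = 30)        # integrating out z yields the mixture density
```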

Group belonging indicators may be known, with the so-called supervised model representing K distinct subpopulations with specified distributions, as opposed to the general unsupervised case with unknown class indicators. Both settings are amenable to the modelling strategy that we propose. In the supervised case, the main goal of the analysis is simply that of generating hypotheses about which mechanisms are likely to characterize different, known subpopulations in the data.

3 Combinatorial mixtures

With the term combinatorial mixtures we refer to a general class of mixture models in which elements of the parameter vector can be shared across any subset of the components, and positive mass is put on every possible combination of sharing patterns. To define this class, assume the following mixture model for the $(J \times 1)$ column vector $x_i$, $J \ge 1$:

$$x_i \mid \theta, \omega \;\sim\; p(x_i \mid \theta, \omega) = \sum_{k=1}^{K^*} \omega_k \, p(x_i \mid \theta_k), \qquad i = 1, \ldots, n, \tag{4}$$

where the $k$-th component-specific parameter $\theta_k = (\theta_{k1}, \ldots, \theta_{kd}, \ldots, \theta_{kD})^T$ is a column vector listing all the $D$ parameters characterizing the $k$-th component distribution. Here, $K^*$ represents the maximum number of components, and it is fixed, though, as we will see, the actual number of components $K$ is still an unknown parameter.

After assuming prior independence between the parameters $\omega$ and $\theta$, and the standard Dirichlet distribution as the prior for the weights, $\omega \sim \mathrm{Dir}(a_0)$, $a_0 = (a_{1,0}, \ldots, a_{K^*,0})$, one may specify the following prior distribution on any $\theta^d = (\theta_{1d}, \ldots, \theta_{kd}, \ldots, \theta_{K^*d})$, $d = 1, \ldots, D$, that allows for degeneracy along equality constraints:

$$\theta^d = (\theta_{1d}, \ldots, \theta_{kd}, \ldots, \theta_{K^*d}) \mid \gamma^d, H \;\overset{\text{i.i.d.}}{\sim}\; \prod_{k:\, \theta_{kd} \in \tilde{\theta}^d(\gamma^d)} p(\theta_{kd} \mid H), \qquad d = 1, \ldots, D, \tag{5}$$

and

$$\gamma^d \mid \pi^d \;\sim\; \mathrm{Multi}(1, \pi^d), \qquad \pi^d:\ \pi_l^d \ge 0, \quad \sum_{l=0}^{B_{K^*}-1} \pi_l^d = 1, \qquad d = 1, \ldots, D, \tag{6}$$

where $\tilde{\theta}^d \in \mathrm{Unique}(\theta^d)$ – meaning that $\tilde{\theta}^d$ is (one of) the $U_d$-dimensional row vectors similar to $\theta^d$ but with the duplicate elements suppressed – $\tilde{\theta}^d(\gamma^d)$ indicates that the product is carried out over the distinct elements of $\theta^d$, as induced by the $\theta^d$-sharing pattern variable $\gamma^d$, $H$ is a set of suitable hyperparameters, $\mathrm{Multi}(\cdot \mid 1, \pi^d)$ indicates a multinomial distribution with parameters equal to 1 and $\pi^d$, and $B_{K^*}$ is the Bell exponential number, representing the number of possible degeneracy patterns. Note that:

  1. The product is carried out over $U_d$ probability distributions;

  2. Each value of $\gamma^d$ is associated with more than one $U_d$-dimensional $\mathrm{Unique}(\theta^d)$ vector; we indicate with $\tilde{\theta}^d$ one of these vectors (see Section 4.1 for details on our specific choice).

With an alternative notation, if we indicate with $\gamma_l^d$ any value of $\gamma^d$, eqs (5) and (6) may be jointly written as:

$$\theta^d = (\theta_{1d}, \ldots, \theta_{kd}, \ldots, \theta_{K^*d}) \mid H \;\sim\; \sum_{l=0}^{B_{K^*}-1} \pi_l^d \prod_{k:\, \theta_{kd} \in \tilde{\theta}^d(\gamma_l^d)} p(\theta_{kd} \mid H), \qquad d = 1, \ldots, D, \tag{7}$$

which makes the mixture form of the prior distribution evident.

Well-known mixture modeling scenarios included in our formulation are the following:

  1. No sharing: for every $d$, $U_d = K^*$;

  2. Complete sharing: for every $d$, $U_d = 1$, and therefore $K = 1$;

  3. Partial sharing (a mix of the previous two cases): $U_d = K^*$ for some $d$'s and $U_d = 1$ for the others.

However, our approach provides extra scenarios within the "Partial sharing" set-up, in the following directions:

  1. For every $d$, $U_d$ may take any value between 1 and $K^*$;

  2. For different $d$'s, the subgroups of shared parameters may be different.

Consider the following example. Let the maximum number of groups, $K^*$, be equal to 5 and the dimensionality of the parameter space, $D$, be equal to 2. An element of the combinatorial mixtures class may put some mass a priori on the event that the parameter vectors $\theta^d$ and $\theta^{d'}$ assume the following values: $\theta^d = (\lambda_1, \lambda_2, \lambda_1, \lambda_3, \lambda_1)$ and $\theta^{d'} = (\lambda'_1, \lambda'_2, \lambda'_3, \lambda'_2, \lambda'_4)$. In this case:

$$\tilde{\theta}^d = (\lambda_1, \lambda_2, \lambda_3) \qquad \text{and} \qquad \tilde{\theta}^{d'} = (\lambda'_1, \lambda'_2, \lambda'_3, \lambda'_4),$$

meaning that some elements are shared in each dimension. The collections of sets of indices of shared components, $E_h^d = \{k \in \{1, \ldots, K^*\} : \theta_{kd} = \tilde{\theta}_{hd}\}$, are given by:

$$E_1^d = \{1, 3, 5\}, \qquad E_2^d = \{2\}, \qquad E_3^d = \{4\},$$

and:

$$E_1^{d'} = \{1\}, \qquad E_2^{d'} = \{2, 4\}, \qquad E_3^{d'} = \{3\}, \qquad E_4^{d'} = \{5\},$$

respectively. Dimensions $d$ and $d'$ have a different number of shared elements, and the sharing occurs in different mixture components.

The Bell number, $B_n$, is defined as the number of partitions of a set of size $n$; therefore, $B_{K^*}$ counts all the possible sharing patterns for the elements of $\theta^d$ across the $K^*$ groups. For any number of groups, the Bell number is finite, though it can be very large.
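As an illustration of how quickly the number of sharing patterns grows, the following R sketch computes the Bell numbers via the Bell triangle recurrence; the function name is ours.

```r
## Bell numbers via the Bell triangle recurrence: bell(n) returns
## B_0, ..., B_{n-1}. For K* = 3 components there are B_3 = 5 sharing
## patterns per dimension; the count grows quickly with K*.
bell <- function(n) {
  row <- 1                                    # first row of the Bell triangle
  B <- numeric(n)
  for (i in 1:n) {
    B[i] <- row[1]                            # first entry of row i is B_{i-1}
    row <- cumsum(c(row[length(row)], row))   # build the next row
  }
  B
}
bell(7)  # 1 1 2 5 15 52 203
```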

Prior independence between parameters θkd, for fixed d, is assumed, but it is not a key aspect of combinatorial mixtures. Details on variance-covariance matrix decompositions and on the prior distributions of the remaining parameters are given in the context of normal mixtures in the next section.

4 Combinatorial mixtures of normal distributions

4.1 Model specification

Data $x^n = (x_1, \ldots, x_n)$ are assumed to be independent observations from a mixture density with $K^*$ bivariate normal components:

$$x_i \mid \mu, \Sigma, \omega \;\sim\; \sum_{k=1}^{K^*} \omega_k \, N_2(x_i \mid \mu_k, \Sigma_k), \qquad x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix}, \quad i = 1, \ldots, n,$$

where:

$$\mu = (\mu_1, \ldots, \mu_{K^*}), \quad \Sigma = (\Sigma_1, \ldots, \Sigma_{K^*}), \quad \mu_k = \begin{pmatrix} \mu_{k1} \\ \mu_{k2} \end{pmatrix}, \quad \Sigma_k = \begin{pmatrix} \sigma_{11k}^2 & \sigma_{12k} \\ \sigma_{21k} & \sigma_{22k}^2 \end{pmatrix}, \quad k = 1, \ldots, K^*,$$

and $N_2(\cdot \mid \mu_k, \Sigma_k)$ indicates a bivariate normal distribution with expectation vector $\mu_k$ and variance-covariance matrix $\Sigma_k$, $\Sigma_k$ positive definite.

In this case, $J = 2$, $D = 5$, and $d \in \{m_1, m_2, v_1, v_2, c\}$, meaning that we potentially allow all $\gamma^d$-induced patterns of sharing of means (indicated in the following with $m_1$ and $m_2$ for $j = 1$ and $j = 2$, respectively), variances (similarly indicated with $v_1$ and $v_2$), and covariances (indicated with $c$) among the $K^*$ components.

For the modeling of the variance-covariance structure, we adopt the following direct decomposition proposed by Barnard et al. [20]:

$$\Sigma_k = \mathrm{diag}(S_k) \, R_k \, \mathrm{diag}(S_k), \qquad S_k = \begin{pmatrix} S_{k1} \\ S_{k2} \end{pmatrix}, \quad R_k = \begin{pmatrix} 1 & r_k \\ r_k & 1 \end{pmatrix}, \quad k = 1, \ldots, K^*,$$

where each $S_k$ is a vector of standard deviations, $\mathrm{diag}(S_k)$ is the diagonal matrix with diagonal elements in $S_k$, each $R_k$ is a correlation matrix, and accordingly:

$$S = (S_1, \ldots, S_{K^*}), \qquad R = (R_1, \ldots, R_{K^*}).$$

This separation has a relevant practical motivation as most practitioners are trained to think in terms of standard deviations and correlations; the standard deviations are on the original scale, and the correlations are scale free.
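As a minimal R illustration of this separation for one component (the numeric values are ours, purely illustrative):

```r
## One component's variance-covariance matrix from its standard
## deviations and correlation matrix, and back.
S_k <- c(1.2, 0.8)                          # standard deviations, original scale
R_k <- matrix(c(1, 0.4,
                0.4, 1), nrow = 2)          # correlation matrix, scale-free
Sigma_k <- diag(S_k) %*% R_k %*% diag(S_k)  # Sigma_k = diag(S_k) R_k diag(S_k)
## recover the two ingredients from Sigma_k
S_back <- sqrt(diag(Sigma_k))
R_back <- diag(1 / S_back) %*% Sigma_k %*% diag(1 / S_back)
```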

After the following prior independence assumptions on the new parameter set $(\omega, \mu, S, R)$:

$$\omega \perp (\mu, S, R), \qquad \mu \perp (S, R), \qquad S \perp R,$$

combinatorial mixtures for normal mixture models may be specified through the following priors on the component-specific parameters:

$$\mu^j = (\mu_{1j}, \ldots, \mu_{K^*j}) \mid \gamma_j^m, \eta^2 \;\overset{\text{i.i.d.}}{\sim}\; \prod_{k:\, \mu_{kj} \in \tilde{\mu}^j(\gamma_j^m)} N(\mu_{kj} \mid 0, \eta^2), \qquad j = 1, 2,$$
$$\log S^j = (\log S_{1j}, \ldots, \log S_{K^*j}) \mid \gamma_j^v, e^2 \;\overset{\text{i.i.d.}}{\sim}\; \prod_{k:\, S_{kj} \in \tilde{S}^j(\gamma_j^v)} N(\log S_{kj} \mid 0, e^2), \qquad j = 1, 2,$$
$$r = (r_1, \ldots, r_{K^*}) \mid \gamma^c \;\overset{\text{i.i.d.}}{\sim}\; \prod_{k:\, r_k \in \tilde{r}(\gamma^c)} U_{(-1,1)}(r_k),$$

and on the γd indicators of sharing of means, standard deviations, and/or correlations among the mixture components:

$$\gamma_j^m \mid \pi^m \;\overset{\text{i.i.d.}}{\sim}\; \mathrm{Multi}(1, \pi^m), \qquad \gamma_j^v \mid \pi^v \;\overset{\text{i.i.d.}}{\sim}\; \mathrm{Multi}(1, \pi^v), \qquad \gamma^c \mid \pi^c \sim \mathrm{Multi}(1, \pi^c), \qquad j = 1, 2,$$

or, alternatively, in the univariate case, we opt for a slightly different specification:

$$\sigma^2 \mid \gamma^v, c, d \;\sim\; \prod_{k:\, \sigma_k^2 \in \tilde{\sigma}^2(\gamma^v)} IG(\sigma_k^2 \mid c, d) \qquad \text{and} \qquad \gamma^v \mid \pi^v \sim \mathrm{Multi}(1, \pi^v).$$

We assume prior independence between each component-specific parameter, $\mu$, $S$, and $r$, and the $\gamma$s corresponding to the other ones. In addition, $\log S_{kj} \sim N(0, e^2)$ indicates that $S_{kj}$ follows a lognormal distribution with parameters 0 and $e^2$; $IG(\cdot \mid c, d)$ indicates an inverse-gamma distribution with parameters $c$ and $d$, $c, d > 0$ (using the parameterization in which the mean is $d/(c-1)$, $c > 1$); $U_{(-1,1)}(\cdot)$ indicates a uniform distribution on $(-1, 1)$; $\eta^2$, $c$, $d$, and $e^2$ are known; and $\pi^m$, $\pi^v$, $\pi^c$ are known, with non-negative elements summing to unity. Note that, for $R_k$ to be positive definite, $r_k \neq \pm 1$. We specify our priors on the standard deviations and correlations following the practical suggestions of Barnard et al. [20], who found independent lognormal priors to be more flexible than scaled inverted chi-squared distributions in one of their applications. In addition, prior means of 0 are assumed for the normal and lognormal distributions of the component-specific means and standard deviations, whereas the corresponding variances are chosen to express vague prior information. However, our approach applies equally to different forms of the component-specific distributions and different choices of the hyperparameters.
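To make the degenerate prior concrete, here is a minimal R sketch of a single draw from the combinatorial prior on $(\mu, \gamma^m)$ in the univariate case with $K^* = 3$; it assumes the vague hyperparameter values used later in Section 5, and the function and object names are ours.

```r
## One draw from the combinatorial prior on (mu, gamma^m), univariate
## case with K* = 3; eta2 and pi_m as in Section 5, other names ours.
draw_prior_mu <- function(eta2 = 5000, pi_m = rep(0.2, 5)) {
  labels <- c("000", "110", "101", "011", "111")
  groups <- list(c(1, 1, 1), c(1, 2, 2), c(1, 2, 1), c(1, 1, 2), c(1, 2, 3))
  g <- sample(1:5, 1, prob = pi_m)              # gamma^m ~ Multi(1, pi_m)
  u <- rnorm(max(groups[[g]]), 0, sqrt(eta2))   # unique means, i.i.d. N(0, eta2)
  list(gamma_m = labels[g], mu = u[groups[[g]]])  # expand to (mu_1, mu_2, mu_3)
}
draw_prior_mu()
```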

This model specification derives from that of Section 3 after setting, for $j = 1, 2$ and $k = 1, \ldots, K^*$:

$$\theta^d = \theta^{m_j} = \mu^j, \quad \theta_{kd} = \theta_{km_j} = \mu_{kj}, \qquad \theta^d = \theta^{v_j} = S^j, \quad \theta_{kd} = \theta_{kv_j} = S_{kj},$$
$$\theta^d = \theta^c = r, \quad \theta_{kd} = \theta_{kc} = r_k,$$
$$\gamma^d = \gamma^{m_j} = \gamma_j^m, \quad \gamma^d = \gamma^{v_j} = \gamma_j^v, \qquad \pi^d = \pi^{m_j} = \pi_j^m = \pi^m, \quad \pi^d = \pi^{v_j} = \pi_j^v = \pi^v.$$

Now, assume $K^* = 3$. For any $d$, we can re-express the corresponding $\gamma^d$ values by looking at the $(3 \times 3)$ matrix, $P^d$, of all the possible pairwise comparisons between the elements of $\theta^d$ across the three components, where $p_{k,k'} = 0$ iff $\theta_{kd} = \theta_{k'd}$ and $p_{k,k'} = 1$ iff $\theta_{kd} \neq \theta_{k'd}$, for $k, k' = 1, 2, 3$, $d$ fixed. From the corresponding upper triangular matrix, one obtains a $(1 \times 3)$ vector after deleting the diagonal elements (which are all equal to 0) and juxtaposing the remaining rows of the matrix. For instance, we represent the matrix

$$\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \qquad \text{with the vector} \qquad (1, 1, 1),$$

and, consistently, we set (see Section 3):

$$\gamma_l^d = 4 \quad \text{iff} \quad \{p_{1,2},\, p_{1,3},\, p_{2,3}\} = \{1, 1, 1\} = 111,$$

and, similarly, for the remaining $\gamma_l^d$ values:

$$0 = 000, \qquad 1 = 110, \qquad 2 = 101, \qquad 3 = 011.$$
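As an illustration of this encoding, the following R sketch maps a sharing pattern, written as a component assignment, to its pairwise-comparison code; the helper name is hypothetical.

```r
## Map a sharing pattern, given as a component assignment, to the
## pairwise-comparison code (p_12, p_13, p_23) described above (K* = 3).
pairwise_code <- function(assign) {
  pairs <- combn(3, 2)   # the pairs (1,2), (1,3), (2,3), one per column
  paste(as.integer(assign[pairs[1, ]] != assign[pairs[2, ]]), collapse = "")
}
pairwise_code(c(1, 1, 1))  # "000": complete sharing
pairwise_code(c(1, 2, 2))  # "110": components 2 and 3 collapsed
pairwise_code(c(1, 2, 1))  # "101": components 1 and 3 collapsed
pairwise_code(c(1, 2, 3))  # "111": no sharing
```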

For example, in the univariate set-up, we have:

$$\mu_1, \mu_2, \mu_3 \mid \gamma^m, \eta^2 \;\sim\; \begin{cases} N(\mu_1 \mid 0, \eta^2) & \text{if } \gamma^m = 000, \\ N(\mu_1 \mid 0, \eta^2)\, N(\mu_2 \mid 0, \eta^2) & \text{if } \gamma^m = 110, \\ N(\mu_1 \mid 0, \eta^2)\, N(\mu_2 \mid 0, \eta^2) & \text{if } \gamma^m = 101, \\ N(\mu_1 \mid 0, \eta^2)\, N(\mu_3 \mid 0, \eta^2) & \text{if } \gamma^m = 011, \\ N(\mu_1 \mid 0, \eta^2)\, N(\mu_2 \mid 0, \eta^2)\, N(\mu_3 \mid 0, \eta^2) & \text{if } \gamma^m = 111, \end{cases}$$
$$\sigma_1^2, \sigma_2^2, \sigma_3^2 \mid \gamma^v, c, d \;\sim\; \begin{cases} IG(\sigma_1^2 \mid c, d) & \text{if } \gamma^v = 000, \\ IG(\sigma_1^2 \mid c, d)\, IG(\sigma_2^2 \mid c, d) & \text{if } \gamma^v = 110, \\ IG(\sigma_1^2 \mid c, d)\, IG(\sigma_2^2 \mid c, d) & \text{if } \gamma^v = 101, \\ IG(\sigma_1^2 \mid c, d)\, IG(\sigma_3^2 \mid c, d) & \text{if } \gamma^v = 011, \\ IG(\sigma_1^2 \mid c, d)\, IG(\sigma_2^2 \mid c, d)\, IG(\sigma_3^2 \mid c, d) & \text{if } \gamma^v = 111, \end{cases}$$
$$\gamma^m \mid \pi^m \sim \mathrm{Multi}(1, \pi^m), \qquad \gamma^v \mid \pi^v \sim \mathrm{Multi}(1, \pi^v),$$

where the following shorthand notation:

$$y_1, y_2 \mid y_3 \;\sim\; \begin{cases} p(y_1) & \text{if } y_3 = 0, \\ p(y_1)\, p(y_2) & \text{if } y_3 = 1, \end{cases}$$

stands for:

$$\text{if } y_3 = 0:\quad y_1 = y_2, \quad y_1 \mid y_3 \sim p(y_1); \qquad \text{if } y_3 = 1:\quad y_1, y_2 \mid y_3 \sim p(y_1)\, p(y_2),$$

with y1,y2,y3 univariate random variables, y3 assuming values in {0,1}.

Note that we implicitly opt for the following specification of $\tilde{\theta}^d$ (see Section 3 for details):

$$\begin{aligned} &\text{if } \gamma^d = 000 \text{ then } \tilde{\theta}^d = (\theta_{1d}), \\ &\text{if } \gamma^d = 110 \text{ then } \tilde{\theta}^d = (\theta_{1d}, \theta_{2d}), \\ &\text{if } \gamma^d = 101 \text{ then } \tilde{\theta}^d = (\theta_{1d}, \theta_{2d}), \\ &\text{if } \gamma^d = 011 \text{ then } \tilde{\theta}^d = (\theta_{1d}, \theta_{3d}), \\ &\text{if } \gamma^d = 111 \text{ then } \tilde{\theta}^d = (\theta_{1d}, \theta_{2d}, \theta_{3d}), \end{aligned}$$

although other specifications of a priori collapsed components, like:

$$\begin{aligned} &\text{if } \gamma^d = 000 \text{ then } \tilde{\theta}^d = (\theta_{1d}), \\ &\text{if } \gamma^d = 110 \text{ then } \tilde{\theta}^d = (\theta_{1d}, \theta_{3d}), \\ &\text{if } \gamma^d = 101 \text{ then } \tilde{\theta}^d = (\theta_{1d}, \theta_{2d}), \\ &\text{if } \gamma^d = 011 \text{ then } \tilde{\theta}^d = (\theta_{1d}, \theta_{3d}), \\ &\text{if } \gamma^d = 111 \text{ then } \tilde{\theta}^d = (\theta_{1d}, \theta_{2d}, \theta_{3d}), \end{aligned}$$

are equivalent with respect to our formulation of combinatorial mixtures. This renaming of the $\gamma^d$ values applies equally to the bivariate case. Although its usefulness is limited to $K^* = 3$ or $K^* = 4$, it has a relevant practical motivation, as practitioners may directly identify the number and indices of collapsed components for any variable.

4.2 Computation

Bayesian inference for mixture models is carried out via Markov chain Monte Carlo (MCMC) methods (see Gottardo and Raftery [21] for a general framework for MCMC algorithms when the target distribution is a mixture of mutually singular distributions, i.e. of several distributions of different dimensions, as in our case). A sampled realization of the Markov chain $\Phi^{(t)}$, with the posterior distribution $p(\phi \mid x^n)$ as its stationary distribution, is produced by iteratively generating the parameters, $\phi^{(t)}$, and the missing data, $z^{n(t)} = (z_1^{(t)}, \ldots, z_n^{(t)})$, according to $p(\phi \mid x^n, z^{n(t)})$ and $p(z^n \mid x^n, \phi^{(t+1)})$, respectively.

In the bivariate normal mixture model, we have $p(\phi \mid x^n) = p(\mu^1, \gamma_1^m, \mu^2, \gamma_2^m, S^1, \gamma_1^v, S^2, \gamma_2^v, r, \gamma^c, \omega \mid x^n)$ and $\phi^{(t)} = (\mu^{1(t)}, \gamma_1^{m(t)}, \mu^{2(t)}, \gamma_2^{m(t)}, S^{1(t)}, \gamma_1^{v(t)}, S^{2(t)}, \gamma_2^{v(t)}, r^{(t)}, \gamma^{c(t)}, \omega^{(t)})$, and we make use of seven moves, grouped into five types:

  1. Updating the vectors $(\mu^j, \gamma_j^m)$, $j = 1, 2$;

  2. Updating the vectors $(S^j, \gamma_j^v)$, $j = 1, 2$;

  3. Updating the vector $(r, \gamma^c)$;

  4. Updating the weights $\omega$;

  5. Updating the allocation variables vector $z^n$.

Full conditional distributions of the variables given all the others exist in closed form for $(\mu^j, \gamma_j^m)$, $j = 1, 2$, $\omega$, and $z^n$ (see Appendix Sections 8.1 and 8.2).

Move types 1, 2, and 3 follow from the combinatorial mixtures assumptions and involve a change in dimension. We applied the Gibbs sampler for drawing $(\mu^j, \gamma_j^m)$, $j = 1, 2$. At each iteration, we updated $\gamma_j^m$ after integrating $\mu^j$ out of the posterior distribution of $(\mu^j, \gamma_j^m)$; we then updated $\mu^j$ conditional on $\gamma_j^m$ using conjugacy [22]. We applied three Metropolis-Hastings steps for drawing $(S^j, \gamma_j^v)$, $j = 1, 2$, and $(r, \gamma^c)$ (see Appendix Sections 8.1 and 8.2 for details). Move types 4 and 5 follow [22].

In the univariate normal mixture model, the vector $(\sigma^2, \gamma^v)$ is updated via the Gibbs sampler, as described for $(\mu^j, \gamma_j^m)$ in the bivariate set-up.
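For concreteness, here is a minimal R sketch of the collapsed Gibbs update for $(\mu, \gamma^m)$ in the univariate model with $K^* = 3$, conditioning on the current allocations $z$ and variances $\sigma^2$; the helper names are illustrative and not those of our actual implementation.

```r
## Collapsed Gibbs update for (mu, gamma^m), univariate case, K* = 3,
## conditional on allocations z and current variances sigma2.
patterns <- list("000" = list(1:3),
                 "110" = list(1, 2:3),
                 "101" = list(2, c(1, 3)),
                 "011" = list(3, 1:2),
                 "111" = list(1, 2, 3))   # blocks of components sharing a mean

# log marginal likelihood of one block, with the shared mean integrated
# out under its N(0, eta2) prior
log_marg <- function(x, s2, eta2) {
  A <- 1 / eta2 + sum(1 / s2)   # posterior precision of the shared mean
  B <- sum(x / s2)
  -0.5 * sum(log(2 * pi * s2)) - 0.5 * log(eta2 * A) -
    0.5 * (sum(x^2 / s2) - B^2 / A)
}

update_mu_gamma <- function(x, z, sigma2, eta2, log_prior) {
  lp <- sapply(seq_along(patterns), function(l)
    log_prior[l] + sum(sapply(patterns[[l]], function(b) {
      idx <- z %in% b
      log_marg(x[idx], sigma2[z[idx]], eta2)
    })))
  g <- sample(seq_along(patterns), 1, prob = exp(lp - max(lp)))
  mu <- numeric(3)
  for (b in patterns[[g]]) {      # conjugate normal draw per block
    idx <- z %in% b
    A <- 1 / eta2 + sum(1 / sigma2[z[idx]])
    B <- sum(x[idx] / sigma2[z[idx]])
    mu[b] <- rnorm(1, B / A, sqrt(1 / A))
  }
  list(gamma_m = names(patterns)[g], mu = mu)
}
```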

4.3 Label switching

Meaningful parametric inference in mixture models requires tackling the label switching problem [8, 23, 24, 25], i.e. the invariance of representation (1) under permutations of the class labels. In a Bayesian context, this issue becomes more acute because class labels may permute during the simulation run. If symmetric priors (under which, a priori, all components have the same distribution) are placed upon the parameters of a mixture model, then the resulting posterior distribution is invariant to permutations in the labelling of the parameters. As a result, the usual practices of summarizing joint posterior distributions by marginal distributions and estimating quantities of interest by their posterior means are inappropriate.

An elementary solution to this issue [26] is to summarize inference by looking at pairs of observations and counting how often they are assigned to the same mixture component. In this way, one obtains a nearness measure that can be used descriptively or provide the basis for clustering. As a by-product, we implement a graphical representation of the corresponding matrix of relative frequencies (named the “O’Hagan” matrix in the following), to visualize clusters in the data.
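A minimal R sketch of this computation, assuming the allocation draws are stored as a $(T \times n)$ matrix `zmat` with one MCMC iteration per row (names are ours):

```r
## Co-occurrence ("O'Hagan") matrix from stored allocation draws.
cooccurrence <- function(zmat) {
  n <- ncol(zmat)
  M <- matrix(0, n, n)
  for (t in seq_len(nrow(zmat)))
    M <- M + outer(zmat[t, ], zmat[t, ], "==")  # pairs in the same component
  M / nrow(zmat)                                # relative frequencies
}
## e.g. image(cooccurrence(zmat[, order(x)])) reproduces the sorted display
```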

However, when the goals of the analysis include both parameter estimation and clustering and the prior structure is symmetric (as in our case), effective posterior inference requires using one of the relabelling methods proposed to solve the label switching issue in Bayesian inference (see Jasra et al. [27], Zhu and Fan [28]). In the current paper, we implement four available relabelling methods: the "Data-based" [29], the "ECR iterative" versions 1 and 2 [30], and the "Stephens" Kullback-Leibler based relabelling algorithm [24]. We also propose a fifth relabelling algorithm, the "Class president", which adapts an idea of Chung et al. [31]. A priori, we identify $K^* - 1$ "class president" observations. When all $K^*$ classes have at least one observation, each president is assigned to a different class with probability one. When postprocessing, we relabel observations consistently with this constraint: at each iteration, if a president's allocation variable has changed, we replace it, as well as the allocation variables of all samples that are in the same group at that iteration, with the original label of the president. Chung et al. [31] consider a fixed number of components; we also need to handle the cases in which the number of components becomes smaller than $K^*$ and the cases in which presidents are assigned to the same class. To choose presidents, if $K^* = 3$, we use two samples at the extremes of the empirical distribution of the data. This labels the two most extreme groups, while the third, if present, is labeled by exclusion.

In our application, we separately apply each of the relabelling algorithms to each subset of the parameter space that is jointly sampled in the MCMC procedure (i.e. $(\mu^j, \gamma_j^m)$ and $(S^j, \gamma_j^v)$, with $j = 1, 2$, $(r, \gamma^c)$, and $\omega$). In the case of a component-specific parameter subset, we apply the relabelling algorithm to the component-specific parameter set (say, $\mu^1$) and consistently re-define the corresponding $\gamma^d$ indicator (say, $\gamma_1^m$) in accordance with the permuted component-specific parameters. We finally derive the posterior estimates of the parameters from the MCMC chains after each relabelling algorithm is applied.

Combinatorial mixtures apply, as a special case, to the supervised context, where, obviously, no label switching issues arise.

4.4 Implications of combinatorial mixtures

In the following, we describe two aspects that follow from our definition of combinatorial mixtures. We refer to the univariate case with $K^* = 3$ for simplicity.

The parameters $\gamma^m$ and $\gamma^v$ are connected with the unknown number of mixture components $K$. Indeed, we derive the corresponding prior distribution on $K$ from the joint prior distribution of $(\gamma^m, \gamma^v)$ by listing the $(\gamma^m, \gamma^v)$ combinations associated with each number of components and summing their probabilities. For the case of this section, this leads to:

$$P\{K = 1\} = \pi_{000}^m \pi_{000}^v,$$
$$\begin{aligned} P\{K = 2\} = {} & \pi_{000}^m \pi_{110}^v + \pi_{000}^m \pi_{101}^v + \pi_{000}^m \pi_{011}^v + \pi_{110}^m \pi_{000}^v + \pi_{110}^m \pi_{110}^v + {} \\ & \pi_{101}^m \pi_{000}^v + \pi_{101}^m \pi_{101}^v + \pi_{011}^m \pi_{000}^v + \pi_{011}^m \pi_{011}^v, \end{aligned}$$
$$\begin{aligned} P\{K = 3\} = {} & \pi_{000}^m \pi_{111}^v + \pi_{110}^m \pi_{101}^v + \pi_{110}^m \pi_{011}^v + \pi_{110}^m \pi_{111}^v + \pi_{101}^m \pi_{110}^v + {} \\ & \pi_{101}^m \pi_{011}^v + \pi_{101}^m \pi_{111}^v + \pi_{011}^m \pi_{110}^v + \pi_{011}^m \pi_{101}^v + \pi_{011}^m \pi_{111}^v + {} \\ & \pi_{111}^m \pi_{000}^v + \pi_{111}^m \pi_{110}^v + \pi_{111}^m \pi_{101}^v + \pi_{111}^m \pi_{011}^v + \pi_{111}^m \pi_{111}^v. \end{aligned}$$
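The same enumeration is easy to automate. The following R sketch derives the induced prior on $K$ from $(\pi^m, \pi^v)$ for $K^* = 3$ by counting distinct (mean group, variance group) pairs; the names are ours.

```r
## Induced prior on the effective K from (pi_m, pi_v), K* = 3.
## Patterns are encoded as groupings of the three components.
pats <- list("000" = c(1, 1, 1), "110" = c(1, 2, 2), "101" = c(1, 2, 1),
             "011" = c(1, 1, 2), "111" = c(1, 2, 3))
prior_K <- function(pi_m, pi_v) {
  pK <- c(0, 0, 0)
  for (i in 1:5) for (j in 1:5) {
    # effective K = number of distinct (mean group, variance group) pairs
    K <- nrow(unique(cbind(pats[[i]], pats[[j]])))
    pK[K] <- pK[K] + pi_m[i] * pi_v[j]
  }
  setNames(pK, paste0("K=", 1:3))
}
prior_K(rep(0.2, 5), rep(0.2, 5))  # uniform gammas: 0.04, 0.36, 0.60
```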

In addition, in the presence of a symmetric prior structure, and with no constraints posed by relabelling algorithms on single parameters or combinations of them, label switching applies equally to inference on $\gamma^m$ and $\gamma^v$: combinations of $(\gamma^m, \gamma^v)$ with different names actually correspond to the same mixture model for the data. Figure 1 depicts in a stylized fashion all the available combinations of $(\gamma^m, \gamma^v)$ for $K^* = 3$ and marks the equivalent ones with the same color: the original twenty-five possible $(\gamma^m, \gamma^v)$ combinations are mapped into ten effective ones. The corresponding density estimation plots and numbers of components are superimposed. Corresponding prior (posterior) probabilities are summed when reporting results.

Figure 1: Mixture model templates corresponding to each $(\gamma^m, \gamma^v)$ pair for combinatorial mixtures priors in a normal mixture model with $K^*$ equal to 3. The number of components and a typical shape are superimposed. (a) Color-coded cells identify the ten mixture models with the same shape; (b) the grey cells identify the twelve extra models that combinatorial mixtures allow to fit, in comparison with traditional approaches to mixture models.

4.5 Comparison with available mixture models

Combinatorial mixtures allow for greater generality and flexibility in comparison with traditional approaches to mixture modeling, while still assigning mass to models that are more parsimonious than the general mixture case in which no sharing takes place. In the univariate set-up, Figure 1 shows the thirteen white cells available to traditional approaches and highlights in grey the twelve extra models that combinatorial mixtures allow. The thirteen white cells correspond to the following scenarios:

  1. Complete sharing: (γm,γv)=(000,000);

  2. Partial sharing: means:

    1. K=2: (γm,γv)=(000,110), (γm,γv)=(000,101), (γm,γv)=(000,011);

    2. K=3: (γm,γv)=(000,111);

  3. Partial sharing: variances:

    1. K=2: (γm,γv)=(110,000), (γm,γv)=(101,000), (γm,γv)=(011,000);

    2. K=3: (γm,γv)=(111,000);

  4. No sharing:

    1. K=2: (γm,γv)=(110,110), (γm,γv)=(101,101), (γm,γv)=(011,011);

    2. K=3: (γm,γv)=(111,111).

The twelve grey cells correspond to the following scenarios:

  1. Three-component mixture, with two components sharing the means and the other two sharing the variances (light-blue cells in Figure 1);

  2. Three-component mixture, with two different means and three different variances (red cells in Figure 1);

  3. Three-component mixture, with three different means and two different variances (purple cells in Figure 1).

The same argument applies to single variables in the bivariate set-up.

Traditional mixture models with an unknown number of components generally specify a prior directly on the unknown number of mixture components $K$. We showed above that the parameters $\gamma^m$ and $\gamma^v$ are connected with $K$ (see Section 4.4 for an example in the univariate set-up), so there is no difference between specifying a prior on $K$ and specifying a prior on the $\gamma$ indicators. In addition, traditional bivariate mixture models do not generally assume any decomposition of the variance-covariance matrix. The available options are sharing/no sharing of the variance-covariance matrix across components. The former option implies sharing of both variances and covariances: $\gamma_1^v = 000$, $\gamma_2^v = 000$, and $\gamma^c = 000$ in our context. The latter option includes all the (very different) remaining cases. The adoption of a decomposition allows one to share the variances without necessarily sharing the covariances (or vice versa), and represents the easiest route to free modeling of the variance-covariance structure in traditional mixture models too. However, even when a decomposition is assumed for the variance-covariance matrix, one still has only the following possibilities: complete sharing/no sharing of the variances, or complete sharing/no sharing of the covariances, across components. In any of these cases, all the variances (or covariances) must be either equal or different across components. Combinatorial mixtures go further in the direction of modeling each variance or covariance more freely, at the expense of extra complexity in model specification. In particular, the bivariate normal extension allows modeling an interesting phenomenon observed in microarray analysis, in which two variables have the same means and variances but opposite correlations in diseased and normal samples [9].

Combinatorial mixtures relate to product partition and DPM models. Both parametric and nonparametric product partition models are probability models for the estimation of a set of parameters, a subset of which is allowed to be equal. However, they are usually proposed and implemented focusing on one dimension only. Our class of combinatorial normal mixture models can be seen as a multi-partition generalization of the product partition model [14, 15] of Crowley [16], where we have two or three sets of partitions: one for the means, one for the variances/standard deviations, and, possibly, one for the correlations. Our priors were specified directly on equality events, though they induce priors on the partition sets $E_h^d$. A similar comment applies to nonparametric product partition models as proposed by Dahl [17], which, for the univariate normal-normal DPM model, require the variance to be constant and known for each component. However, this requirement is imposed to satisfy a technical condition (Condition 1) concerning the partition likelihood and related to the mode-finding algorithm applicable to those models; the paper does not investigate the general case of non-constant, unknown variances.

Mixtures of Dirichlet process priors [32], as typically implemented for normal models (see Escobar and West [18] and West and Turner [19]), allow for parameters to be clustered in subsets [33]. In the original form, the number of possible patterns is smaller than that of combinatorial mixtures, as only the bidimensional parameters $\theta_i = (\mu_i, \sigma_i^2)$ are potentially shared. However, one can specify (say, in the univariate case) two separate Dirichlet process priors, $G$ and $H$, with $\mu_i \mid G \sim G$ and $\sigma_i^2 \mid H \sim H$, $i = 1, \ldots, n$, a formulation that results in two separate partitions of the $n$ observations, one based on equality of means and the other based on equality of variances.

5 Gene expression in lung and prostate cancers

Figure 2: Histograms of the expression levels for the genes TRIM29, MSN, and ITGB5 in the lung cancer dataset.

Next, we fit the normal mixture model of Section 4.1 to data on the molecular classification of lung and prostate cancers [34, 35, 36, 37]. The former dataset comes from the web-based information supporting the published manuscript of Garber et al. [38]. The study was performed by scientists at the Dana Farber Cancer Center and the Massachusetts Institute of Technology and used Affymetrix Hu95A oligonucleotide arrays representing 12,600 transcripts to profile 203 samples, including 186 lung tumor samples of various histologic patterns and 17 normal samples. The latter dataset comes from the effort of The Cancer Genome Atlas Research Network, which comprehensively characterized 333 primary prostate carcinomas using seven genomic platforms, to gain further insight into the molecular-genetic heterogeneity of primary prostate cancer and to establish a molecular taxonomy of the disease [39]. Here, we consider the mRNA expression data available for download at https://tcga-data.nci.nih.gov (last access: October 30th, 2016). For these data, gene-level RPKM values from RNA-seq were log2 transformed and filtered to remove low-variability genes (bottom 25% removed, based on interdecile range). For both datasets, our objectives are: (a) estimating the number of subgroups in the sample; (b) making inference about the assignment of samples to these subgroups; and (c) generating hypotheses about which mechanisms are likely to characterize the subgroups. As we were seeking a novel molecular classification of tumor subtypes, we removed the 17 normal samples from the lung cancer dataset.

We searched for genes whose expression data provide interesting scenarios for fitting combinatorial mixtures. After visual inspection of the heatmap, we carried out some preliminary analyses according to the Fraley and Raftery approach [40], as implemented in the R package mclust [41]. In addition, we plotted the scatter graphs of a few promising pairs of genes to improve the presentation of the bivariate application of the mixture model. In the lung cancer dataset, we selected the genes with HUGO names TRIM29, MSN, ITGB5, and CSTB. In the prostate cancer dataset, we selected the ERG and STOM genes. In the univariate analyses, the selected genes showed evidence of one, two, or three groups in the data, with possibly unequal variances across groups. In the bivariate analyses, both pairs of genes showed evidence of two groups.
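For reference, the preliminary screen can be reproduced along these lines with mclust; `expr` is a placeholder for the expression vector of one gene.

```r
## Preliminary model-based clustering screen (placeholder data name).
library(mclust)
fit <- Mclust(expr, G = 1:3)  # BIC-based choice among 1 to 3 components
summary(fit)                  # best model; "V" denotes unequal variances
plot(fit, what = "BIC")
```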

In the following, we present separate univariate models for the TRIM29, MSN, and ITGB5 genes from the lung cancer dataset. Univariate results for the CSTB and STOM expression data are equivalent to those for the ITGB5 gene and of limited interest for our analysis; univariate results for the ERG expression data are equivalent to those for the MSN gene (data not shown). They are therefore not described in the current application. We then present two applications of the bivariate model: to the MSN and CSTB genes from the lung cancer dataset, and to the ERG and STOM genes from the prostate cancer dataset. In the former case, we expect the component-specific correlation coefficients between genes to be close to 0, whereas, in the latter case, we expect them to be markedly different in sign.

Figure 3: Two-dimensional kernel density estimation plot for the bivariate distribution of the MSN and CSTB genes in the lung cancer dataset and of the ERG and STOM genes in the prostate cancer dataset.

In the univariate analysis, Figure 2 shows the histograms of the expression levels for the three genes selected from the lung cancer dataset. TRIM29 and MSN genes are promising for classification of lung cancer patients, showing long tails on the right and on the left side of the distribution, respectively. Accordingly, results from the Fraley and Raftery approach identify as the best BIC (Bayesian Information Criterion) model for TRIM29 a three-component normal mixture with unequal variances, and for MSN a two-component normal mixture with unequal variances. On the other hand, the ITGB5 distribution seems more similar to a single normal distribution, as suggested by the BIC values in the Fraley and Raftery approach too. In the bivariate application, Figure 3 shows the two-dimensional kernel density estimation plots representing the joint distribution of the MSN and CSTB genes in the lung cancer dataset (left) and that of the ERG and STOM genes in the prostate cancer dataset (right). For both pairs of genes, the plot suggests the existence of two groups. Results from the Fraley and Raftery approach point to a two-component “diagonal, varying volume and shape” mixture model for the former pair and to a two-component “ellipsoidal, varying volume, shape, and orientation” for the latter one, according to the best BIC values [41]. In addition, the assumption of a maximum number of components equal to 3 is not a stringent limitation in the examined set-ups.

For each application, we ran one chain at a time, performing approximately 50 simulations that appear to converge to the same distribution. We report results from a representative chain among them, corresponding to a total of 2,000,000 iterations with a burn-in of 500,000 iterations. Standard convergence diagnostics, including the autocorrelation function and the diagnostics of Geweke, Heidelberger and Welch, and Raftery and Lewis, guided our final choice of the total number of iterations and the number of burn-in iterations [42].
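These checks can be run, for instance, with the coda package; `mu_chain` below is a placeholder for a stored chain with one column per parameter.

```r
## Convergence checks of the kind mentioned above (placeholder data name).
library(coda)
mc <- mcmc(mu_chain, start = 500001)  # discard the burn-in
geweke.diag(mc)         # Geweke z-scores
heidel.diag(mc)         # Heidelberger-Welch stationarity test
raftery.diag(mc)        # Raftery-Lewis run-length diagnostic
autocorr.plot(mc)       # autocorrelation functions
```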

In both applications, all our runs start with the $\gamma$s equal to 000 and with the starting values for the means, standard deviations, and correlations equal to the corresponding sample values, under the hypothesis of a single group in the data.

We consider Bayesian estimation in the case where we do not have strong prior information on the parameters.

In the univariate application, the following choice of hyperparameters results in weakly informative priors on the means $\mu_1, \mu_2, \mu_3$ and variances $\sigma_1^2, \sigma_2^2, \sigma_3^2$: $\eta^2 = 5000$, $c = 0.75$, $d = 0.15$. Similarly, in the bivariate application, we choose the following hyperparameters for the prior distributions of the means and standard deviations: 1. $\eta^2 = 5000$, $e^2 = 4$, for the lung cancer dataset; 2. $\eta^2 = 3600$, $e^2 = 2.25$, for the prostate cancer dataset. The hyperparameters of the priors on the standard deviations (variances) are chosen by looking at the range of the standard deviations (variances) of typical microarray (Affymetrix Hu95A) and RNA-Seq (Illumina HiSeq) gene expression data and adding some extra variability. In both applications, the Dirichlet prior on the weights, $\omega$, is symmetric with $a_0 = (1, 1, 1)$, and the multinomial priors on the $\gamma$s are vague in the sense of giving the same a priori probability to all their possible values: $\pi^m = \pi^v = \pi^c = (0.2, 0.2, 0.2, 0.2, 0.2)$. The effect of different a priori scenarios for $\omega$ and the $\gamma$s is investigated in the sensitivity analysis presented in Section 5.3 and Appendix Section 8.3 for the univariate model.

For the implementation of the “Class president” relabelling algorithm, we choose the following class presidents among the available samples:

  1. Univariate application:

    1. TRIM29: AD119=5.18; AD295=8.00;

    2. MSN: COID3580=8.46; SQ8T1=8.98;

    3. ITGB5: AD183=7.42; AD10=7.69.

  2. Bivariate application:

    1. MSN and CSTB: COID3580=(8.46, 8.43); SQ13T1=(10.10,10.05);

    2. ERG and STOM: ID183=(9.99, 27.36); ID159=(86.35, 58.17),

where the samples are indicated as in the original dataset, the corresponding gene expression levels are provided for each gene under consideration, and, in the bivariate application, the former expression level is for gene MSN (or ERG) and the latter for gene CSTB (or STOM).

We present the estimated joint distributions (%) of $(\gamma^m, \gamma^v)$ for each gene, together with the marginal distribution of $\gamma^c$ in the bivariate case. We mark in boldface the frequencies of those combinations that are only available with combinatorial mixtures, in comparison with traditional approaches. We plot the marginal posterior distributions of the component-specific parameters, together with those of the corresponding $\gamma$ indicators. We summarize posterior inference on the relabelled chains through the posterior medians of the parameters, calculated on the entire chain and conditional on the ten effective cells showing the same color. Moreover, we report the "O'Hagan" matrices to present clusters of subjects in the data. Relative frequencies of the co-occurrence of two samples in the same group are plotted in a black-to-white color scale and sorted according to a non-decreasing ordering of the raw data (of the first variable, in the case of bivariate data). Dark and light blocks, with proportions similar to the estimated weights of the mixture, identify different groups of samples.

Calculations are performed using the open-source statistical computing environment R [43, 44], the packages MCMCpack [45], label.switching [46], and coda [47], and specialized code reflecting the procedure described in Section 4.2 and the results provided in Appendix Sections 8.1 and 8.2.

5.1 Application of the univariate model

Figure 4: Joint posterior probabilities (%) of $(\gamma^m, \gamma^v)$ for each gene in the univariate application to the lung cancer dataset, as obtained by one run of the simulation. For a correct interpretation, one has to sum the frequencies indicated in cells with the same color. The frequencies of the extra combinations fitted using combinatorial mixtures, in comparison with traditional approaches, are emphasized in boldface.

Figure 4 shows the estimated joint posterior probabilities of $(\gamma^m, \gamma^v)$ for the selected genes from the lung cancer dataset. The TRIM29 chain spends 51% of the sweeps on (111,111) (grey cell), which represents a three-component mixture with different means and different variances. The second most probable combination (39% of the sweeps) is given by (111,110) or (111,101) or (111,011) (purple cells). The corresponding mixture model has three components, with three different means and two variances. The third most probable combination (5% of the sweeps), (110,111) or (101,111) or (011,111) (red cells), implies a mixture model with three components, with two different means and three different variances. A posteriori, a mixture model with three components and at least two different means or variances is suitable for TRIM29. The chain spends 47% of the sweeps on the twelve combinations of $(\gamma^m, \gamma^v)$ that it would not be possible to select using traditional approaches to mixture models. The majority of this 47% is given by two (purple and red) of the three top combinations. As traditional approaches would place a higher portion of the overall mass on the (111,111) combination, as compared to combinatorial mixtures, our approach allows for a gain in parsimony in this case. This gain is represented by the purple and red cells, where either a mean or a variance is collapsed to another one.

The MSN chain spends 33% of the sweeps on a mixture model with two components, with one shared mean and two different variances (orange cells). The second most probable combination (19%) is given by (111,110) or (111,101) or (111,011) (purple cells), which represent a three-component mixture model with three different means and two different variances. The third most probable combination (19%) is given by the six light-blue cases: $(\gamma^m, \gamma^v)$ = (110,101), (110,011), (101,110), (101,011), (011,110), or (011,101). The corresponding mixture has three components, with two of them sharing the means and the other two sharing the variances. A posteriori, a two- or three-component mixture model is suitable for MSN. The chain spends 50% of the iterations on the twelve combinations of $(\gamma^m, \gamma^v)$ that would not be possible to select using traditional approaches. The majority of this 50% is due to the second and third most probable combinations (purple and light-blue cells). Compared to traditional three-component mixtures, combinatorial mixtures still allow for a more parsimonious solution.

The ITGB5 chain spends 55% of the sweeps on (000,110) or (000,101) or (000,011), which imply a mixture model with two components having the same mean but two different variances (orange cells). The second most probable combination (19%) is given by the light-blue cells ((110,101), (110,011), (101,110), (101,011), (011,110), or (011,101)), corresponding to a three-component mixture model with two components sharing the means and the other two sharing the variances. The third most probable combination (11%) is (000,111) (brown cell), corresponding to a three-component mixture model with one shared mean and three different variances. The chain spends 66% of the sweeps on $\gamma^m$ = 000, so groups, if any, are created by differences in variances. Combinations where $\gamma^v$ = 000 are not supported by our model. The chain spends 31% of the sweeps on the twelve combinations of $(\gamma^m, \gamma^v)$ that would not be possible to select using traditional approaches, with most of these sweeps represented by the light-blue combinations. Traditional approaches are not restrictive in this case.

Figure 5: Trace plots of the parameters for the TRIM29 gene in the lung cancer dataset, as estimated by one run of the simulation. Component means and corresponding standard deviations for each iteration are color-coded in the same way across plots.

Figure 6: Estimated posterior medians of the parameters for each of the five relabelling algorithms implemented in the univariate application to the lung cancer dataset for the TRIM29 gene. These estimates were calculated on the overall chain and conditional on those cases where the chain spends more simulation time (cumulative % of simulation time spent ≥ 90%).

Figure 5 and Figure 6 provide additional results for the TRIM29 gene. Figure 5 shows the trace plots of the parameters $\mu_1, \mu_2, \mu_3$ and $\sigma_1^2, \sigma_2^2, \sigma_3^2$, plotted along with those of the corresponding $\gamma$s. Every one-hundredth iteration is reported, and points smaller than the 5th percentile or bigger than the 95th percentile are suppressed to allow efficient display of the results. Component means and corresponding standard deviations are color-coded across plots. Figure 6 shows the estimated posterior medians of the parameters for each of the five relabelling algorithms implemented. These estimates were calculated on the overall chain and conditional on those cases where the chain spends more simulation time (cumulative % of simulation time spent ≥ 90%). The estimated posterior distributions are highly compatible with the exploratory plots presented in this section. Suppose $(\gamma^m, \gamma^v)$ at iteration $t$ implies a three-component mixture, with three different means and standard deviations ($(\gamma^m, \gamma^v)$ = (111,111) – grey cell in Figure 4). According to Figure 5 and Figure 6, the three means have estimated posterior medians at 5.32, 6.25, and 8.11, while the corresponding estimated medians for the standard deviations are 0.08, 0.17, and 0.14, respectively. If the combination (111,110) or (111,101) or (111,011) (purple cells in Figure 4) is given at the next iteration, the posterior medians of the means are similar to those of the previous case, but one standard deviation is shared between two components (posterior median 0.06), and the corresponding mixture model has three components with two different standard deviations. Similarly, if the combination (110,111) or (101,111) or (011,111) (red cells in Figure 4) is given at the next iteration, two means collapse, with a shared posterior median at 5.50. The corresponding standard deviations are fairly different, with posterior medians of 0.03 and 0.80. The remaining mean has a median slightly smaller than the 8.11 of the grey case, but still close to 8.00, and the median of the corresponding standard deviation is still 0.14. The component (5.32, 0.08) has a weight close to 0.70. The remaining ones have weights ranging from 0.10 to 0.20. The alternative relabelling algorithms provide similar posterior medians of the parameters. Similar considerations hold for the estimated posterior distributions of the parameters for the MSN and ITGB5 genes (data not shown).

Figure 7: "O'Hagan" matrices from the univariate analysis of the lung cancer dataset. Relative frequencies of occurrence of two samples being in the same group are plotted in a black-to-white color scale and sorted according to a non-decreasing ordering of the raw data. Dark and light blocks, with proportions similar to the estimated weights of the mixture, identify different groups of observations.

Figure 7 depicts the "O'Hagan" matrix for each of the three genes. For the TRIM29 expression data, we identify three groups of patients. The first, on the left, is the one with the smallest mean and standard deviation and the highest weight (posterior medians of the mean, standard deviation, and weight of 5.32, 0.08, and 0.70 in Figure 5 and Figure 6). The intermediate block represents the component with a mean of approximately 6 (posterior medians of the mean, standard deviation, and weight of 6.20, 0.12, and 0.20 in Figure 5 and Figure 6). The transition from the first to the second component is very smooth, indicating uncertainty in the classification of intermediate points. The component on the right of the picture is the one with a mean around 8 (posterior medians of the mean, standard deviation, and weight of 8.11, 0.12, and 0.13 in Figure 5 and Figure 6). The transition from the second to the third component is less smooth than the previous one.

For the MSN expression data, we identify three alternative scenarios. In the first one (dark and light cross in the center of the plot), there is one central group of patients characterized by a posterior median of the shared mean of 9.71, posterior medians of the standard deviations of 0.01 and 1, and posterior medians of the weights of 0.90 and 0.10, respectively (orange and brown cells in Figure 4). In a few iterations, a third group is fitted with the same shared mean, a posterior median of the standard deviation of 0.10 (brown cell) or more (orange cells); its weight ranges between 0.05 and 0.25, and the weight of the (9.71, 0.01) component is correspondingly lower than in the previous case. In the second scenario (three separate dark and light blocks), there are three groups of patients characterized by posterior medians of the three means given by 8.25, 9.65, and 10.37, posterior medians of the standard deviations of 0.10, 0.05, and 0.05, and posterior medians of the weights of 0.20, 0.40, and 0.40, respectively (purple cells in Figure 4). In the third scenario (dark and light cross to the right of the previous cross), there is still a central group of patients characterized by the highest posterior median of the shared mean, 9.88 (sharing between two components), posterior medians of the standard deviations of 0.01 and 0.70 (light-blue cells) or 1 (red cells), and posterior medians of the weights of 0.80 (or 0.90) and 0.10, respectively (see Figure 4 for cell colors). Another group is rarely fitted, with a smaller posterior median of the mean, which may be either between 8.25 and 9.00 or around 9.70 (data not shown).

For ITGB5 expression data, we identify one central group of patients characterized by a posterior median of the mean of 7.50, a posterior median of the standard deviation ranging between 0.63 and 0.73 and the smallest weight of 0.10. Partially overlapping with this one, there is a second central group of patients characterized by a similar posterior median of the mean, a definitely smaller posterior median of the standard deviation of 0.02 and the highest posterior median of the weight. The posterior median of this weight ranges from 0.50 to 0.90 depending on the possible presence/absence of a third group characterized by a different mean (posterior median of the mean equal to 7.60 or 7.70, with corresponding standard deviation being equal to 0.02) or standard deviation (posterior median of the standard deviation equal to 0.10) (data not shown).

In conclusion, for each univariate set-up, combinatorial mixtures allow one to: 1. identify a variable number of distinct groups of samples, with sensible posterior medians of the component-specific parameters and corresponding weights, and 2. list a small set of visited competing scenarios of groups and corresponding parameters, which account for uncertainty in classification when present (see the MSN gene). The main conclusions are in accordance with those from the Fraley and Raftery approach. However, with combinatorial mixtures, the chains are allowed to spend some simulation time on more parsimonious solutions. For instance, the TRIM29 chain spends about 40% of the simulation time on the purple cells, where one of the three variances is shared, in addition to about 50% of the simulation time on the grey cell representing the more general model with three different means and variances. We also gain some extra flexibility in the modelling of the component-specific variances when both approaches agree that the mean is shared across one or two components, as with the ITGB5 gene.

5.2 Application of the bivariate model

Figure 8: Joint posterior probabilities (%) of $(\gamma_j^m, \gamma_j^v)$, $j = 1, 2$, and posterior probability (%) of $\gamma^c$ for the MSN and CSTB genes in the bivariate application to the lung cancer dataset, as obtained by one run of the simulation. For a correct interpretation, one has to sum the frequencies indicated in cells with the same color. The frequencies of the extra combinations fitted using combinatorial mixtures, in comparison with traditional approaches, are emphasized in boldface.

In this subsection, we report results derived from the application of the normal mixture model of Section 4.1 to data on the molecular classification of lung and prostate cancers. We start by reporting detailed results for the joint modelling of the MSN and CSTB genes in the lung cancer dataset.

Figure 8 shows the estimated joint posterior probabilities (%) of $(\gamma_j^m, \gamma_j^v)$, $j = 1, 2$, and the estimated posterior probability (%) of $\gamma^c$. The MSN marginal chain spends 80% of the iterations on (110,111) or (101,111) or (011,111) (red cells). The second most probable combination (9%) is (111,111) (grey cell). A posteriori, a mixture model with three components, with two or three different means and three different standard deviations, is suitable for MSN. The CSTB marginal chain spends 65% of the iterations on (110,111) or (101,111) or (011,111) (red cells). The second most probable combination (20%) is given by the six light-blue cases: $(\gamma^m, \gamma^v)$ = (110,101), (110,011), (101,110), (101,011), (011,110), or (011,101). The third most probable combination (7%) is (111,111) (grey cell). A posteriori, a mixture model with three components, with two or three different means and two or three different standard deviations, is suitable for CSTB.

In both cases, there is little evidence in favor of the cases allowed by traditional mixture models. The chain spends more than 85% of the iterations on nine out of the twelve combinations of $(\gamma_1^m, \gamma_1^v)$ and $(\gamma_2^m, \gamma_2^v)$ that would not be possible to select using traditional approaches to mixture models. The majority of this 85% is due to the most probable combinations (red and light-blue cells). Combinatorial mixtures may end up either with extra flexibility in modeling the standard deviations or with a more parsimonious solution, depending on whether one assumes the traditional two-component or the three-component mixture for comparison.

Finally, the chain spends almost all of the iterations either on the case of one correlation shared across components (70%, dark-brown cells) or on the intermediate cases of two different correlations (25%, ocher cells). Traditional mixture models in which a decomposition of the variance-covariance matrix is assumed can fit all five cases listed in Figure 8(c), depending on how many different components are assumed; hence, no cases are emphasized in boldface.

Figure 9: Trace plots of the parameters from the bivariate mixture model applied to the MSN and CSTB genes from the lung cancer dataset, as estimated by one run of the simulation. Component means, corresponding standard deviations, and correlations for each iteration are color-coded in the same way across plots.

Figure 10: Estimated posterior medians of the parameters for each of the five relabelling algorithms implemented in the bivariate application to the lung cancer dataset. These estimates were calculated on the overall chain and conditional on those cases where the chain spends more simulation time (cumulative % of simulation time spent ≥ 90%).

Figure 9 shows the trace plots of the parameters $\mu_1$, $\mu_2$, $S_1$, $S_2$, and $r$, along with those of the corresponding $\gamma$s, for the MSN and CSTB genes in the lung cancer dataset. Every hundredth iteration is reported, and points smaller than the 5th percentile or greater than the 95th percentile are removed from the plot to allow an efficient display of the results. Corresponding estimates for each iteration are represented with the same color across plots. Figure 10 shows the estimated posterior medians of the parameters for each of the five relabelling algorithms implemented in the bivariate application to the lung cancer dataset. These estimates were calculated on the overall chain and conditional on those cases where the chain spends more simulation time (cumulative % of simulation time spent ≥ 90%). According to Figure 9 and Figure 10, the MSN gene has a marginal posterior distribution characterized by two means, with posterior medians around 8.50 and 10.00. The posterior medians of the corresponding standard deviations are 0.90 and 0.50, respectively. For most of the simulation time (80%, red cells in Figure 8), the mean centered at 10.00 is shared between two components and is associated with two similar standard deviations. The corresponding weight has a posterior median of 0.80, which comes from the sum of the two corresponding medians of the posterior weights. The posterior median of the weight of the other component is 0.20. For a few iterations (10%, grey cells in Figure 8), a third, different mean is fitted, with a posterior median of 9.00, but the corresponding weight is close to 0.

Similarly, the CSTB gene has a marginal posterior distribution characterized by two means, with posterior medians around 8.70 and 9.90, and corresponding posterior medians of the standard deviations of 0.55 and 0.60, respectively. For most of the simulation time (85%, red and light-blue cells in Figure 8), the mean centered at 9.90 is shared between two components and is associated with slightly higher (but similar) medians of the standard deviation, as compared with the component with posterior median of the mean at 8.70. The corresponding weight has a posterior median of 0.80. The posterior median of the weight of the other component is 0.20. For a few iterations (7%, grey cells in Figure 8), a third, different mean is still fitted, with a posterior median of 9.20, but the corresponding weight is equal to 0. As for the corresponding correlation coefficients: if $\gamma^c = 000$, the component correlations collapse to a posterior median of 0.01; if $\gamma^c = 110$, $101$, or $011$, the component correlations have posterior medians of 0.04 and 0.05, with 0.04 coming more frequently from sharing of components. In either case, the correlation coefficients between the two genes are approximately 0 in every component of the mixture model. A posteriori, the component with posterior medians of the mean vector around (8.50, 8.75), posterior medians of the standard deviations of (0.90, 0.55), and posterior median of the correlation equal to 0 has a posterior median of the weight of 0.20. The component with posterior medians of the mean vector around (10.00, 9.90), posterior medians of the standard deviations around (0.50, 0.60), and posterior median of the correlation equal to 0 has a posterior median of the weight of 0.80. There is no difference in the posterior medians estimated by the different relabelling methods.

Figure 11: “O’Hagan” matrix for the bivariate normal mixture model fitted to the lung cancer dataset. Relative frequencies of occurrence of two samples being in the same group are plotted in a black-to-white color scale and sorted according to a non-decreasing ordering of the MSN gene. Dark and light blocks, with proportions similar to the estimated weights of the mixture, identify different groups of observations.

The “O’Hagan” matrix is reported in Figure 11. The plot identifies two distinct groups with different mean vectors. The group on the left has the smaller mean vector (posterior medians around (8.50, 8.75) in Figure 9 and Figure 10) and weight (posterior median equal to 0.20 in Figure 9 and Figure 10). The transition from the first to the second group is characterized by some uncertainty in the classification of intermediate points. The group on the right of the picture is the one with posterior medians of the mean vector around (10.00, 9.90) and the larger weight of 0.80 (see Figure 9 and Figure 10). In conclusion, combinatorial mixtures identify two groups of samples with very different weights – 0.20 and 0.80, respectively – located at around (8.50, 8.75) and (10.00, 9.90), with corresponding medians of the standard deviations equal to (0.90, 0.55) and (0.50, 0.60). The posterior correlation coefficients between the two genes are close to 0 in both groups for most of the simulation time.
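A sketch of how such a co-clustering display can be computed from the MCMC output is given below, for readers wishing to reproduce it; z_draws (an iterations-by-samples matrix of sampled group labels) and x_msn (the vector of MSN expression values used for sorting) are hypothetical names, not objects from our implementation.

```r
## Minimal sketch (not our implementation): build an "O'Hagan"-type matrix
## from MCMC group labels. 'z_draws' is a hypothetical (iterations x n) matrix
## whose (t, i) entry is the label of sample i at sweep t.
ohagan_matrix <- function(z_draws) {
  n_iter <- nrow(z_draws)
  n <- ncol(z_draws)
  M <- matrix(0, n, n)
  for (t in seq_len(n_iter)) {
    # 1 where samples i and i' share a component at sweep t, 0 otherwise
    M <- M + outer(z_draws[t, ], z_draws[t, ], "==")
  }
  M / n_iter  # relative frequency of co-assignment
}

## Display sorted by a non-decreasing ordering of one gene (here a
## hypothetical vector 'x_msn' of MSN expression values):
## ord <- order(x_msn)
## image(ohagan_matrix(z_draws)[ord, ord], col = gray.colors(64))
```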

Figure 12: Estimated posterior medians of the parameters for each of the five relabelling algorithms implemented in the bivariate application to the prostate cancer dataset. These estimates were calculated on the overall chain and conditional on those cases where the chain spends more simulation time (cumulative % of simulation time spent ≥ 90%).

We now briefly present the results of applying combinatorial mixtures to the joint distribution of the ERG and STOM genes.

Figure 12 shows the estimated posterior medians of the parameters for each of the five relabelling algorithms implemented in the bivariate application to the prostate cancer dataset. The ERG gene has a marginal posterior distribution characterized by three means. Regardless of the visited color, two of them have posterior medians equal to 14.55 and 83. The third mean has a posterior estimate of 43 or 48, depending on whether the purple or the grey cell is visited. For 56% of the simulation time (purple cells), the highest standard deviation (posterior median equal to 16.90) is shared between the components with the two highest posterior medians of the mean, whereas the lowest one is smaller (posterior median equal to 3.56). For 43% of the simulation time (grey cells), a third standard deviation is fitted (posterior median equal to 20), corresponding to the posterior estimate of the mean of 48; the shared posterior median of the standard deviation around 16.90 is slightly smaller (16.10) in this case. The component with posterior medians of the mean and standard deviation equal to 14.55 and 3.60 has a posterior median of the weight close to 0.55; the component with posterior medians of the mean and standard deviation around 83 and 16 (or 17) has a posterior median of the weight close to 0.40. The component with the intermediate posterior estimate of the mean has the remaining weight of 0.10.

The STOM gene has a marginal posterior distribution characterized by one mean, with posterior median around 45.30. Depending on the visited cell combination, there are either one, two, or three distinct standard deviations (11%, 65%, and 12.30% of the simulation time, respectively). When there is one standard deviation, its posterior estimate is 12.38; when there are two, the posterior estimates are 7.85 and 12.73; when there are three, their posterior estimates are still similar to the previous ones, being equal to 7.60, 12.60, and 12.80. The component with a posterior median of the standard deviation around 7.85 has a posterior median of the weight slightly smaller than 0.10, whereas the weights of the remaining components (with posterior medians of the standard deviations around 12.73) sum up and contribute to the highest weight. For a few iterations (4%, green cells), two different means are fitted, with posterior medians of 45.02 and 47.60, but the corresponding estimates of the standard deviations and weights are still similar to the previous cases. Two components share the same correlation coefficient between the two genes, with a posterior median of 0.60; the remaining component has a strong positive correlation, with a posterior median of 0.75. A posteriori, the component with posterior medians of the mean vector around (14.55, 45.32), posterior medians of the standard deviations of (3.57, 12.68), and posterior median of the correlation equal to 0.74 has a posterior median of the weight of 0.54. The component with posterior medians of the mean vector around (82.63, 45.31), posterior medians of the standard deviations around (16.60, 12.65), and posterior median of the correlation equal to 0.56 has a posterior median of the weight of 0.38. The component with posterior medians of the mean vector around (43.98, 45.40), posterior medians of the standard deviations around (17.15, 8.35), and posterior median of the correlation equal to −0.56 has a posterior median of the weight of 0.08. There are limited differences in the posterior medians estimated by the different relabelling methods.

Figure 13: “O’Hagan” matrix for the bivariate normal mixture model fitted to the prostate cancer dataset. Relative frequencies of occurrence of two samples being in the same group are plotted in a black-to-white color scale and sorted according to a non-decreasing ordering of the ERG gene. Dark and light blocks, with proportions similar to the estimated weights of the mixture, identify different groups of observations.

In addition, we present the “O’Hagan” matrix in Figure 13. The plot suggests the existence of two extreme groups of prostate cancer samples. The group on the left has the smallest mean for the ERG gene (posterior medians of the mean vector around (14.55, 45.32) in Figure 12), a positive correlation of 0.74, and the highest weight (posterior median equal to 0.54 in Figure 12). The group on the right of the picture has the highest mean for the ERG gene (posterior medians of the mean vector around (82.63, 45.32)), a negative correlation of −0.56, and a posterior median of the weight of 0.38 (see Figure 12). The transition between the two groups is managed by fitting the third group, with a posterior median of the weight of 0.08. In conclusion, combinatorial mixtures identify two groups of samples with similar weights – 0.54 and 0.38, respectively – located at around (14.55, 45.32) and (82.63, 45.32), with corresponding medians of the standard deviations equal to (3.57, 12.68) and (16.60, 12.65). These groups are characterized by very different posterior correlation coefficients between the two genes. A third group with a negligible weight models the expression data of the samples lying between the previous two groups.

5.3 Sensitivity to model specification and other issues

Reporting substantive results requires exploring sensitivity to model specification, especially regarding the prior assumptions, and verifying the proper convergence of the MCMC.

The sensitivity of the posterior distribution of the parameters to the prior specification needs to be investigated. For an extensive treatment of sensitivity analysis in the context of normal mixture models with an unknown number of components, see Richardson and Green [48] and Stephens [49].

Here, we are concerned with the sensitivity of the number of components to different values of the πs. We present the results for the univariate application. To improve prior elicitation and interpretation of the sensitivity analysis, we express this in terms of the equivalent prior on the number of components, K, introduced in Section 4.4. The default vague set-up proposed in the analysis, $\boldsymbol{\pi}^m = (0.2, 0.2, 0.2, 0.2, 0.2)$ and $\boldsymbol{\pi}^v = (0.2, 0.2, 0.2, 0.2, 0.2)$, is equivalent to assuming: P{1 subgroup} = 0.04, P{2 subgroups} = 0.36, and P{3 subgroups} = 0.60 (case 3 hereafter). Three groups are a priori suggested by our default set-up. For this reason, in the sensitivity analysis we assess the effects on the posterior estimates of two other scenarios, which a priori favor either the case of one group or all three possibilities in almost equal measure (a sketch of how these implied priors can be computed follows the list):

  1. $\boldsymbol{\pi}^m = \boldsymbol{\pi}^v = (0.8, 0.05, 0.05, 0.05, 0.05)$, which implies P{1 subgroup} = 0.64, P{2 subgroups} ≈ 0.25, and P{3 subgroups} ≈ 0.11 (case 1 hereafter);

  2. $\boldsymbol{\pi}^m = \boldsymbol{\pi}^v = (0.6, 0.1, 0.1, 0.1, 0.1)$, which implies P{1 subgroup} = 0.36, P{2 subgroups} = 0.39, and P{3 subgroups} = 0.25 (case 2 hereafter).
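The implied priors above can be checked by enumerating the 5 × 5 combinations of mean- and variance-sharing patterns and counting the blocks of their common refinement. The following is a minimal sketch of this computation (a hypothetical helper, not part of our implementation), using the pattern coding of the Appendix, under which, for example, 110 corresponds to the partition {1},{2,3}:

```r
## Sketch (hypothetical helper): implied prior on the effective number of
## groups K from independent priors pi_m and pi_v on the sharing patterns.
## Pattern order matches the paper: 000, 110, 101, 011, 111; each pattern is
## written as a group-membership vector over the three components.
patterns <- list(c(1, 1, 1),   # 000: one shared parameter
                 c(1, 2, 2),   # 110: component 1 apart, 2 and 3 share
                 c(1, 2, 1),   # 101: components 1 and 3 share
                 c(1, 1, 2),   # 011: components 1 and 2 share
                 c(1, 2, 3))   # 111: all parameters distinct

implied_prior_K <- function(pi_m, pi_v) {
  pK <- numeric(3)
  for (a in 1:5) for (b in 1:5) {
    # effective groups = blocks of the common refinement of the two patterns
    K <- length(unique(paste(patterns[[a]], patterns[[b]])))
    pK[K] <- pK[K] + pi_m[a] * pi_v[b]
  }
  setNames(round(pK, 4), paste0("P{", 1:3, " subgroup(s)}"))
}

implied_prior_K(rep(0.2, 5), rep(0.2, 5))                    # case 3: 0.04 0.36 0.60
implied_prior_K(c(0.8, rep(0.05, 4)), c(0.8, rep(0.05, 4)))  # case 1: 0.64 ~0.25 ~0.11
implied_prior_K(c(0.6, rep(0.1, 4)), c(0.6, rep(0.1, 4)))    # case 2: 0.36 0.39 0.25
```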

Appendix Figure 14 summarizes the results of the sensitivity analysis to different values of $\boldsymbol{\pi}^m$ and $\boldsymbol{\pi}^v$ in the univariate application for the available genes. Each table presents the posterior probabilities of observing one, two, or three groups, respectively, given the three different a priori set-ups. The prior probabilities corresponding to each scenario are added in the left column for comparison. The general conclusion of our analysis is that the number of components appears to be robust to the prior specification. For each of the three genes, the a posteriori most probable number of groups is the same, regardless of the prior distribution chosen. Deviations of the probabilities from each other are within a sensible range and are compatible with the prior structure imposed.

In addition, we report results from the sensitivity analysis assessing the effect of different choices of $\boldsymbol{a}_0$, the hyperparameter of the Dirichlet prior on the weights, in the univariate case. The alternative scenarios are $a_0 = 0.1$ and $a_0 = 4$ [50, 51], while the baseline is $a_0 = 1$ (see, for example, Diebolt and Robert [22]). Appendix Figure 15 shows the percentage of simulation time spent by the different chains on the most-visited colors (cumulative % of simulation time spent ≥ 90%) for each gene and each $a_0$ value. Our analysis also appears to be robust to the prior specification for the weights. Even in the case of overfitting mixtures, as with the MSN and ITGB5 genes, the color ranking is essentially unchanged across the different scenarios. The percentage of time spent on each color and the posterior estimates of the parameters reflect the following results [50, 51]:

  1. $a_{k,0} = 0.1$ favors empty groups for overfitting mixtures;

  2. $a_{k,0} = \frac{D}{2} + 3 = 4$ favors similar non-empty components.

In detail, as $a_{k,0}$ increases, the MSN and ITGB5 chains spend an increasingly higher percentage of time on the most-visited orange cell (a two-component mixture model with one mean and two different variances). When the other colors are visited and $a_0 = 0.1$, the third component with an extra mean or standard deviation is generally empty; when $a_0 = 4$, there generally exist similar components that share part of the overall weight.

An essential element of the performance of the MCMC is its ability to move between different values of each γ, that is, to mix over the number of components. Plots of the changes in the γs against the number of sweeps were produced in our application. They show that the MCMC mixes well over the γs. In addition, the percentage of simulation time spent by the chains is similar for cells indicated with the same color, which reassures against potential convergence issues. We detected no influence of different sets of starting values in either the univariate or the bivariate analysis.

6 Discussion

We introduce a class of parametric mixture models that we call combinatorial mixtures for mixture distributions whose components have multidimensional parameters. The approach allows each element of the component-specific parameter vector to be shared by any subset of other components. For any dimension of the vector, it is thus possible that some components share the same value of the parameter, while others do not. Moreover, for different dimensions, the subgroups of shared parameters may differ, either because they have different cardinalities or because they involve different components. From a Bayesian perspective, prior specification has a key role in building combinatorial mixtures: inference can proceed by assigning a prior directly to the space of parameters, allowing for degeneracy along equality constraints for each element. Although our focus is on Bayesian analysis, combinatorial mixtures can also be useful for non-Bayesian modeling.

Bayesian inference and computation approaches were illustrated in a setting based on the normal model and applied to data on molecular subtypes of cancer. Because cancer subtypes may be characterized by differences in location, scale, correlations, or any combination of these, effective procedures for simultaneously carrying out variable selection and clustering (see, for instance, Kim et al. [52]) may benefit from a flexible set of criteria to address these issues. Combinatorial mixtures cover all the possible comparisons between relevant parameters across groups, and are potentially useful in a range of application areas. For example, in nutritional epidemiology, the identification of clusters of subjects at increased cancer risk might benefit from methods where differences in the correlation structure of the nutrients are accounted for directly in the clustering procedure. The usual practice of performing a principal component analysis before a cluster analysis to account for known higher correlations between nutrients [53] may be suboptimal [54], and the cancer risk/protection associated with the identified clusters may be underestimated as a result.

Although it would be possible to achieve a combinatorial mixture similar to ours by using DPMs based on separate Dirichlet process priors for the mean and variance parameters, our approach is a more direct and, we hope, intuitive extension of popular mixture approaches. In addition, the γ indicators (as presented in Figure 1) can be a tool to guide experts in their elicitation of prior information. In detail, specifying a prior on the different cell colors, instead of on the number of components, may help to focus the available prior knowledge on single gamma combinations. Similarly, the visualization of the corresponding fitted mixture models through the same figure, together with the posterior estimates, may facilitate the communication of results [13].

A challenge to the implementation of combinatorial mixtures is that the number of K-way comparisons between elements of the component-specific parameter vectors increases rapidly with the maximum number of groups: for K = 3 the Bell number is B(K) = 5, for K = 4 it is B(K) = 15, and for K = 7 it is B(K) = 877. It follows that combinatorial mixtures are an effective tool for clustering in potentially complicated scenarios, but only when the number of potential groups in the data is relatively small. In addition, the paper currently deals with the bivariate case (D = 2) only. This is a limitation, as a generalization of combinatorial mixtures to higher dimensions would enhance applicability [55].
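These counts are easy to verify: the following stand-alone sketch evaluates B(K) via the Bell-triangle recurrence.

```r
## Sketch: Bell numbers B(K), the number of sharing patterns for one
## parameter dimension with at most K components, via the Bell triangle.
bell <- function(K) {
  row <- 1
  if (K > 1) for (n in 2:K) {
    new_row <- numeric(n)
    new_row[1] <- row[n - 1]                      # last entry of previous row
    for (j in 2:n) new_row[j] <- new_row[j - 1] + row[j - 1]
    row <- new_row
  }
  row[length(row)]                                # B(K) = last entry of row K
}
sapply(c(3, 4, 7), bell)  # 5 15 877
```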

We consider Bayesian estimation in the case where we do not have strong prior information on the parameters. There are cases where subjective priors are preferable, and our prior setting could be modified accordingly. In addition, we suggest considering the simplest independence prior structure for the means, variances/standard deviations, and correlations. It seems to us that for most purposes of the model there is a case for keeping to this simplest independence prior structure and defining weakly informative priors. However, modelling dependence might be relevant in biological applications. For example, if the mean in one group is bigger than that of another, one might expect their variances to be different as well. Accounting for this kind of dependence poses challenges that are not fully explored in the paper. We previously explored dependence among component-specific parameters, conditional on $(\gamma^m, \gamma^v) \neq (000, 000)$, in a univariate supervised set-up with two known groups. Our prior set-up relied on a re-parametrisation of the dependent parameters (say, means or variances) in terms of two independent parameters: a general parameter and a difference (or, for the variances, a ratio) between the original dependent parameters. We then assumed a mixture of singular distributions for the prior on the independent parameters and derived the corresponding posterior distribution via MCMC. For further details on how to deal with dependence among component-specific parameters and how to apply combinatorial mixtures in a supervised context, we refer to Edefonti's Ph.D. thesis [56], which is available upon request.

Funding statement: Università degli Studi di Milano (Grant / Award Number: ‘Young Investigator Grant Program 2014’, ‘Young Investigator Grant Program 2015’), National Cancer Institute (Grant / Award Number: ‘P30CA006516’).


7 Acknowledgement

The results shown here are in whole or in part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/ (last accessed October 30th, 2016).

8 Appendix

8.1 Markov Chain Monte Carlo algorithm: details for the univariate model

Full conditional distributions for the univariate normal mixture model are presented in the following. The shorthand notations:

$$-(\boldsymbol{\mu}, \gamma^m) = \{\boldsymbol{\sigma}^2, \gamma^v, \boldsymbol{\omega}\}, \qquad -(\boldsymbol{\sigma}^2, \gamma^v) = \{\boldsymbol{\mu}, \gamma^m, \boldsymbol{\omega}\},$$
$$-(\boldsymbol{\mu}) = \{\gamma^m, \boldsymbol{\sigma}^2, \gamma^v, \boldsymbol{\omega}\}, \qquad -(\boldsymbol{\sigma}^2) = \{\gamma^v, \boldsymbol{\mu}, \gamma^m, \boldsymbol{\omega}\},$$
$$-(\boldsymbol{\omega}) = \{\theta\} = \{\boldsymbol{\mu}, \gamma^m, \boldsymbol{\sigma}^2, \gamma^v\},$$
$$n_k = \#\{i : z_i = k\}, \qquad \bar{x}_k = \frac{1}{n_k} \sum_{i : z_i = k} x_i, \qquad k = 1, 2, 3,$$
$$0 = 000, \quad 1 = 110, \quad 2 = 101, \quad 3 = 011, \quad 4 = 111,$$
$$\pi_0^d = \pi_{000}^d, \quad \pi_1^d = \pi_{110}^d, \quad \pi_2^d = \pi_{101}^d, \quad \pi_3^d = \pi_{011}^d, \quad \pi_4^d = \pi_{111}^d, \qquad d = m, v,$$

are used hereafter.

From the factorization of the joint distribution of all the variables:

$$p(\boldsymbol{\omega}, \theta, z^n, x^n) = p(\boldsymbol{\omega})\, p(\theta)\, p(z^n \mid \boldsymbol{\omega})\, p(x^n \mid \theta, z^n) = p(\boldsymbol{\omega})\, p(\boldsymbol{\mu} \mid \gamma^m)\, p(\gamma^m)\, p(\boldsymbol{\sigma}^2 \mid \gamma^v)\, p(\gamma^v)\, p(z^n \mid \boldsymbol{\omega})\, p(x^n \mid \theta, z^n),$$

we get:

$$p(\boldsymbol{\mu}, \gamma^m \mid -(\boldsymbol{\mu}, \gamma^m), x^n, z^n) = p(\boldsymbol{\mu} \mid -(\boldsymbol{\mu}), x^n, z^n)\, p(\gamma^m \mid -(\boldsymbol{\mu}, \gamma^m), x^n, z^n),$$

with:

$$p(\boldsymbol{\mu} \mid -(\boldsymbol{\mu}), x^n, z^n) = \begin{cases} N(\mu_1 \mid \mu^*_{123}, \eta^{*2}_{123}) & \text{if } \gamma^m = 0 \\ N(\mu_1 \mid \mu^*_1, \eta^{*2}_1)\, N(\mu_2 \mid \mu^*_{23}, \eta^{*2}_{23}) & \text{if } \gamma^m = 1 \\ N(\mu_1 \mid \mu^*_{13}, \eta^{*2}_{13})\, N(\mu_2 \mid \mu^*_2, \eta^{*2}_2) & \text{if } \gamma^m = 2 \\ N(\mu_1 \mid \mu^*_{12}, \eta^{*2}_{12})\, N(\mu_3 \mid \mu^*_3, \eta^{*2}_3) & \text{if } \gamma^m = 3 \\ N(\mu_1 \mid \mu^*_1, \eta^{*2}_1)\, N(\mu_2 \mid \mu^*_2, \eta^{*2}_2)\, N(\mu_3 \mid \mu^*_3, \eta^{*2}_3) & \text{if } \gamma^m = 4, \end{cases}$$

where:

$$\mu^*_{123} = \eta^{*2}_{123} \sum_{k=1}^{3} \frac{n_k}{\sigma_k^2}\, \bar{x}_k, \qquad \eta^{*2}_{123} = \left(\frac{1}{\eta^2} + \sum_{k=1}^{3} \frac{n_k}{\sigma_k^2}\right)^{-1},$$
$$\mu^*_k = \eta^{*2}_k\, \frac{n_k}{\sigma_k^2}\, \bar{x}_k, \qquad \eta^{*2}_k = \left(\frac{1}{\eta^2} + \frac{n_k}{\sigma_k^2}\right)^{-1}, \qquad k = 1, 2, 3,$$
$$\mu^*_{k'k''} = \eta^{*2}_{k'k''} \left(\frac{n_{k'}}{\sigma_{k'}^2}\, \bar{x}_{k'} + \frac{n_{k''}}{\sigma_{k''}^2}\, \bar{x}_{k''}\right), \qquad \eta^{*2}_{k'k''} = \left(\frac{1}{\eta^2} + \frac{n_{k'}}{\sigma_{k'}^2} + \frac{n_{k''}}{\sigma_{k''}^2}\right)^{-1}, \qquad k', k'' = 1, 2, 3, \; k' \neq k'',$$

and:

$$p(\gamma^m \mid -(\boldsymbol{\mu}, \gamma^m), x^n, z^n) = \text{Multi}(1, \tilde{\boldsymbol{\pi}}^m),$$

where $\tilde{\boldsymbol{\pi}}^m$ is such that:

$$\tilde{\pi}_0^m = \frac{\pi_0^m}{A}\, \eta^*_{123} \exp\left\{\frac{1}{2} \left(\frac{\mu^*_{123}}{\eta^*_{123}}\right)^2\right\},$$
$$\tilde{\pi}_k^m = \frac{\pi_k^m}{A}\, \frac{1}{\eta}\, \eta^*_k \exp\left\{\frac{1}{2} \left(\frac{\mu^*_k}{\eta^*_k}\right)^2\right\} \eta^*_{k'k''} \exp\left\{\frac{1}{2} \left(\frac{\mu^*_{k'k''}}{\eta^*_{k'k''}}\right)^2\right\}, \qquad k, k', k'' = 1, 2, 3, \; k' \neq k'', \; k', k'' \neq k,$$
$$\tilde{\pi}_4^m = \frac{\pi_4^m}{A}\, \frac{1}{\eta^2} \prod_{k=1}^{3} \eta^*_k \exp\left\{\frac{1}{2} \left(\frac{\mu^*_k}{\eta^*_k}\right)^2\right\},$$

where:

$$\begin{aligned} A = {} & \pi_0^m\, \eta^*_{123} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{123}}{\eta^*_{123}}\right)^2\right\} + \pi_1^m\, \frac{1}{\eta}\, \eta^*_1 \exp\left\{\frac{1}{2}\left(\frac{\mu^*_1}{\eta^*_1}\right)^2\right\} \eta^*_{23} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{23}}{\eta^*_{23}}\right)^2\right\} \\ & + \pi_2^m\, \frac{1}{\eta}\, \eta^*_2 \exp\left\{\frac{1}{2}\left(\frac{\mu^*_2}{\eta^*_2}\right)^2\right\} \eta^*_{13} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{13}}{\eta^*_{13}}\right)^2\right\} + \pi_3^m\, \frac{1}{\eta}\, \eta^*_3 \exp\left\{\frac{1}{2}\left(\frac{\mu^*_3}{\eta^*_3}\right)^2\right\} \eta^*_{12} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{12}}{\eta^*_{12}}\right)^2\right\} \\ & + \pi_4^m\, \frac{1}{\eta^2} \prod_{k=1}^{3} \eta^*_k \exp\left\{\frac{1}{2}\left(\frac{\mu^*_k}{\eta^*_k}\right)^2\right\}. \end{aligned}$$
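In practice, ratios of this form are best evaluated on the log scale, subtracting the maximum before exponentiating, to avoid numerical overflow in the exponential terms. The following is a minimal sketch of this standard device (the inputs – lists of pooled posterior moments and the count of distinct means per pattern – are hypothetical names, not part of the notation above):

```r
## Sketch (hypothetical helper): evaluate pi-tilde^m stably on the log scale.
## pi_m: prior probabilities of the five patterns; mu_star, eta_star: lists
## giving, for each pattern, the vectors of pooled posterior means/sds;
## n_means: number of distinct means per pattern (1, 2, 2, 2, 3); eta: prior sd.
pattern_probs <- function(pi_m, mu_star, eta_star, n_means, eta) {
  log_w <- sapply(1:5, function(g) {
    log(pi_m[g]) - (n_means[g] - 1) * log(eta) +
      sum(log(eta_star[[g]]) + 0.5 * (mu_star[[g]] / eta_star[[g]])^2)
  })
  w <- exp(log_w - max(log_w))  # subtract the maximum before exponentiating
  w / sum(w)                    # the normalisation by A happens here
}
```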

Moreover, from the joint distribution it follows for the variances:

$$p(\boldsymbol{\sigma}^2, \gamma^v \mid -(\boldsymbol{\sigma}^2, \gamma^v), x^n, z^n) = p(\boldsymbol{\sigma}^2 \mid -(\boldsymbol{\sigma}^2), x^n, z^n)\, p(\gamma^v \mid -(\boldsymbol{\sigma}^2, \gamma^v), x^n, z^n),$$

with:

$$p(\boldsymbol{\sigma}^2 \mid -(\boldsymbol{\sigma}^2), x^n, z^n) = \begin{cases} IG(\sigma_1^2 \mid c^*_{123}, d^*_{123}) & \text{if } \gamma^v = 0 \\ IG(\sigma_1^2 \mid c^*_1, d^*_1)\, IG(\sigma_2^2 \mid c^*_{23}, d^*_{23}) & \text{if } \gamma^v = 1 \\ IG(\sigma_1^2 \mid c^*_{13}, d^*_{13})\, IG(\sigma_2^2 \mid c^*_2, d^*_2) & \text{if } \gamma^v = 2 \\ IG(\sigma_1^2 \mid c^*_{12}, d^*_{12})\, IG(\sigma_3^2 \mid c^*_3, d^*_3) & \text{if } \gamma^v = 3 \\ IG(\sigma_1^2 \mid c^*_1, d^*_1)\, IG(\sigma_2^2 \mid c^*_2, d^*_2)\, IG(\sigma_3^2 \mid c^*_3, d^*_3) & \text{if } \gamma^v = 4, \end{cases}$$

where:

$$c^*_{123} = c + \frac{n}{2}, \qquad d^*_{123} = d + \frac{1}{2} \sum_{k=1}^{3} \sum_{i : z_i = k} (x_i - \mu_k)^2,$$
$$c^*_k = c + \frac{n_k}{2}, \qquad d^*_k = d + \frac{1}{2} \sum_{i : z_i = k} (x_i - \mu_k)^2, \qquad k = 1, 2, 3,$$
$$c^*_{k'k''} = c + \frac{n_{k'} + n_{k''}}{2}, \qquad d^*_{k'k''} = d + \frac{1}{2} \left[\sum_{i : z_i = k'} (x_i - \mu_{k'})^2 + \sum_{i : z_i = k''} (x_i - \mu_{k''})^2\right], \qquad k', k'' = 1, 2, 3, \; k' \neq k'',$$

and:

$$p(\gamma^v \mid -(\boldsymbol{\sigma}^2, \gamma^v), x^n, z^n) = \text{Multi}(1, \tilde{\boldsymbol{\pi}}^v),$$

where $\tilde{\boldsymbol{\pi}}^v$ is such that:

$$\tilde{\pi}_0^v = \frac{\pi_0^v}{F}\, \frac{\Gamma(c^*_{123})}{(d^*_{123})^{c^*_{123}}},$$
$$\tilde{\pi}_k^v = \frac{\pi_k^v}{F}\, \frac{d^c}{\Gamma(c)}\, \frac{\Gamma(c^*_k)}{(d^*_k)^{c^*_k}}\, \frac{\Gamma(c^*_{k'k''})}{(d^*_{k'k''})^{c^*_{k'k''}}}, \qquad k, k', k'' = 1, 2, 3, \; k' \neq k'', \; k', k'' \neq k,$$
$$\tilde{\pi}_4^v = \frac{\pi_4^v}{F} \left(\frac{d^c}{\Gamma(c)}\right)^2 \prod_{k=1}^{3} \frac{\Gamma(c^*_k)}{(d^*_k)^{c^*_k}},$$

and:

$$\begin{aligned} F = {} & \pi_0^v\, \frac{\Gamma(c^*_{123})}{(d^*_{123})^{c^*_{123}}} + \pi_1^v\, \frac{d^c}{\Gamma(c)}\, \frac{\Gamma(c^*_1)}{(d^*_1)^{c^*_1}}\, \frac{\Gamma(c^*_{23})}{(d^*_{23})^{c^*_{23}}} + \pi_2^v\, \frac{d^c}{\Gamma(c)}\, \frac{\Gamma(c^*_2)}{(d^*_2)^{c^*_2}}\, \frac{\Gamma(c^*_{13})}{(d^*_{13})^{c^*_{13}}} \\ & + \pi_3^v\, \frac{d^c}{\Gamma(c)}\, \frac{\Gamma(c^*_3)}{(d^*_3)^{c^*_3}}\, \frac{\Gamma(c^*_{12})}{(d^*_{12})^{c^*_{12}}} + \pi_4^v \left(\frac{d^c}{\Gamma(c)}\right)^2 \prod_{k=1}^{3} \frac{\Gamma(c^*_k)}{(d^*_k)^{c^*_k}}. \end{aligned}$$

Finally, for the weights one has:

$$p(\boldsymbol{\omega} \mid -(\boldsymbol{\omega}), x^n, z^n) = \text{Dir}(a_{1,0} + n_1,\, a_{2,0} + n_2,\, a_{3,0} + n_3),$$

and for the generic group label, $z_i$, $i = 1, \ldots, n$:

$$p(z_i = k \mid \omega_k, -(\boldsymbol{\omega}), x_i) = \frac{\omega_k\, \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma_k^2} (x_i - \mu_k)^2\right\}}{\sum_{k'=1}^{3} \omega_{k'}\, \frac{1}{\sigma_{k'} \sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma_{k'}^2} (x_i - \mu_{k'})^2\right\}}, \qquad k = 1, 2, 3.$$
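To fix ideas, the following is a minimal sketch (not our implementation) of one Gibbs sweep under the simplest pattern $\gamma^m = \gamma^v = 111$, where all means and variances are distinct; the shared patterns simply pool the corresponding sums, as in the displays above. The hyperparameter values shown are illustrative only.

```r
## Minimal sketch of one Gibbs sweep for the univariate model, pattern 111.
## x: data; z: current labels in {1,2,3}; mu, sigma2: current draws;
## eta2, c0, d0, a0: illustrative hyperparameters (prior mean of mu is 0).
library(MCMCpack)  # provides rdirichlet()

gibbs_sweep_111 <- function(x, z, mu, sigma2,
                            eta2 = 100, c0 = 2, d0 = 1, a0 = rep(1, 3)) {
  for (k in 1:3) {
    xk <- x[z == k]; nk <- length(xk)
    # mu_k | ... ~ N(mu*_k, eta*2_k), as in the display for gamma^m = 4
    eta2_star <- 1 / (1 / eta2 + nk / sigma2[k])
    mu[k] <- rnorm(1, mean = eta2_star * sum(xk) / sigma2[k],
                   sd = sqrt(eta2_star))
    # sigma2_k | ... ~ IG(c*_k, d*_k), as in the display for gamma^v = 4
    sigma2[k] <- 1 / rgamma(1, shape = c0 + nk / 2,
                            rate = d0 + 0.5 * sum((xk - mu[k])^2))
  }
  # omega | ... ~ Dir(a_{k,0} + n_k); z_i | ... as in the last display above
  omega <- as.numeric(rdirichlet(1, a0 + tabulate(z, nbins = 3)))
  z <- sapply(x, function(xi) {
    w <- omega * dnorm(xi, mean = mu, sd = sqrt(sigma2))
    sample(1:3, size = 1, prob = w / sum(w))
  })
  list(mu = mu, sigma2 = sigma2, omega = omega, z = z)
}
```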

8.2 Markov Chain Monte Carlo algorithm: details for the bivariate model

The available full conditional distributions for the bivariate normal mixture model are presented in the following. The shorthand notations:

$$-(\boldsymbol{\mu}_j, \gamma_j^m) = \{\boldsymbol{\mu}_{j'}, \gamma_{j'}^m, \boldsymbol{S}_j, \gamma_j^v, \boldsymbol{S}_{j'}, \gamma_{j'}^v, \boldsymbol{r}, \gamma^c, \boldsymbol{\omega}\},$$
$$-(\boldsymbol{S}_j, \gamma_j^v) = \{\boldsymbol{\mu}_j, \gamma_j^m, \boldsymbol{\mu}_{j'}, \gamma_{j'}^m, \boldsymbol{S}_{j'}, \gamma_{j'}^v, \boldsymbol{r}, \gamma^c, \boldsymbol{\omega}\},$$
$$-(\boldsymbol{\mu}_j) = \{\gamma_j^m, \boldsymbol{\mu}_{j'}, \gamma_{j'}^m, \boldsymbol{S}_j, \gamma_j^v, \boldsymbol{S}_{j'}, \gamma_{j'}^v, \boldsymbol{r}, \gamma^c, \boldsymbol{\omega}\},$$
$$-(\boldsymbol{S}_j) = \{\boldsymbol{\mu}_j, \gamma_j^m, \boldsymbol{\mu}_{j'}, \gamma_{j'}^m, \gamma_j^v, \boldsymbol{S}_{j'}, \gamma_{j'}^v, \boldsymbol{r}, \gamma^c, \boldsymbol{\omega}\},$$
$$j, j' = 1, 2, \; j \neq j',$$
$$-(\boldsymbol{r}, \gamma^c) = \{\boldsymbol{\mu}_1, \gamma_1^m, \boldsymbol{\mu}_2, \gamma_2^m, \boldsymbol{S}_1, \gamma_1^v, \boldsymbol{S}_2, \gamma_2^v, \boldsymbol{\omega}\},$$
$$-(\boldsymbol{r}) = \{\boldsymbol{\mu}_1, \gamma_1^m, \boldsymbol{\mu}_2, \gamma_2^m, \boldsymbol{S}_1, \gamma_1^v, \boldsymbol{S}_2, \gamma_2^v, \gamma^c, \boldsymbol{\omega}\},$$
$$-(\boldsymbol{\omega}) = \{\theta\} = \{\boldsymbol{\mu}_1, \gamma_1^m, \boldsymbol{\mu}_2, \gamma_2^m, \boldsymbol{S}_1, \gamma_1^v, \boldsymbol{S}_2, \gamma_2^v, \boldsymbol{r}, \gamma^c\},$$
$$n_k = \#\{i : z_i = k\}, \qquad \bar{x}_{kj} = \frac{1}{n_k} \sum_{i : z_i = k} x_{ij}, \qquad k = 1, 2, 3, \; j = 1, 2,$$
$$0 = 000, \quad 1 = 110, \quad 2 = 101, \quad 3 = 011, \quad 4 = 111,$$
$$\pi_0^d = \pi_{000}^d, \quad \pi_1^d = \pi_{110}^d, \quad \pi_2^d = \pi_{101}^d, \quad \pi_3^d = \pi_{011}^d, \quad \pi_4^d = \pi_{111}^d, \qquad d = m, v, c,$$

are used hereafter.

From the factorization of the joint distribution of all the variables:

$$p(\boldsymbol{\omega}, \theta, z^n, x^n) = p(\boldsymbol{\omega})\, p(\theta)\, p(z^n \mid \boldsymbol{\omega})\, p(x^n \mid \theta, z^n) = p(\boldsymbol{\omega})\, p(\boldsymbol{\mu}_1 \mid \gamma_1^m)\, p(\gamma_1^m)\, p(\boldsymbol{\mu}_2 \mid \gamma_2^m)\, p(\gamma_2^m)\, p(\boldsymbol{S}_1 \mid \gamma_1^v)\, p(\gamma_1^v)\, p(\boldsymbol{S}_2 \mid \gamma_2^v)\, p(\gamma_2^v)\, p(\boldsymbol{r} \mid \gamma^c)\, p(\gamma^c)\, p(z^n \mid \boldsymbol{\omega})\, p(x^n \mid \theta, z^n),$$

we get:

$$p(\boldsymbol{\mu}_j, \gamma_j^m \mid -(\boldsymbol{\mu}_j, \gamma_j^m), x^n, z^n) = p(\boldsymbol{\mu}_j \mid -(\boldsymbol{\mu}_j), x^n, z^n)\, p(\gamma_j^m \mid -(\boldsymbol{\mu}_j, \gamma_j^m), x^n, z^n), \qquad j = 1, 2,$$

with:

$$p(\boldsymbol{\mu}_j \mid -(\boldsymbol{\mu}_j), x^n, z^n) = \begin{cases} N(\mu_{1j} \mid \mu^*_{123j}, \eta^{*2}_{123j}) & \text{if } \gamma_j^m = 0 \\ N(\mu_{1j} \mid \mu^*_{1j}, \eta^{*2}_{1j})\, N(\mu_{2j} \mid \mu^*_{23j}, \eta^{*2}_{23j}) & \text{if } \gamma_j^m = 1 \\ N(\mu_{1j} \mid \mu^*_{13j}, \eta^{*2}_{13j})\, N(\mu_{2j} \mid \mu^*_{2j}, \eta^{*2}_{2j}) & \text{if } \gamma_j^m = 2 \\ N(\mu_{1j} \mid \mu^*_{12j}, \eta^{*2}_{12j})\, N(\mu_{3j} \mid \mu^*_{3j}, \eta^{*2}_{3j}) & \text{if } \gamma_j^m = 3 \\ N(\mu_{1j} \mid \mu^*_{1j}, \eta^{*2}_{1j})\, N(\mu_{2j} \mid \mu^*_{2j}, \eta^{*2}_{2j})\, N(\mu_{3j} \mid \mu^*_{3j}, \eta^{*2}_{3j}) & \text{if } \gamma_j^m = 4, \end{cases}$$

where:

$$q_k = \frac{S_{kj'}\, \bar{x}_{kj} + S_{kj}\, r_k\, (\mu_{kj'} - \bar{x}_{kj'})}{S_{kj'}}, \qquad p_{kj} = \frac{n_k}{(1 - r_k^2)\, S_{kj}^2}, \qquad k = 1, 2, 3, \; j, j' = 1, 2, \; j \neq j',$$

$$\mu^*_{123j} = \eta^{*2}_{123j} \sum_{k=1}^{3} p_{kj}\, q_k, \qquad \eta^{*2}_{123j} = \left(\frac{1}{\eta^2} + \sum_{k=1}^{3} p_{kj}\right)^{-1}, \qquad j = 1, 2,$$
$$\mu^*_{kj} = \eta^{*2}_{kj}\, p_{kj}\, q_k, \qquad \eta^{*2}_{kj} = \left(\frac{1}{\eta^2} + p_{kj}\right)^{-1}, \qquad k = 1, 2, 3, \; j = 1, 2,$$
$$\mu^*_{k'k''j} = \eta^{*2}_{k'k''j}\, (p_{k'j}\, q_{k'} + p_{k''j}\, q_{k''}), \qquad \eta^{*2}_{k'k''j} = \left(\frac{1}{\eta^2} + p_{k'j} + p_{k''j}\right)^{-1}, \qquad k', k'' = 1, 2, 3, \; k' \neq k'', \; j = 1, 2,$$

and:

$$p(\gamma_j^m \mid -(\boldsymbol{\mu}_j, \gamma_j^m), x^n, z^n) = \text{Multi}(1, \tilde{\boldsymbol{\pi}}_j^m), \qquad j = 1, 2,$$

where $\tilde{\boldsymbol{\pi}}_j^m$ is such that:

$$\tilde{\pi}_{0j}^m = \frac{L_j\, \pi_0^m}{M}\, \eta^*_{123j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{123j}}{\eta^*_{123j}}\right)^2\right\},$$
$$\tilde{\pi}_{kj}^m = \frac{L_j\, \pi_k^m}{M}\, \frac{1}{\eta}\, \eta^*_{kj} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{kj}}{\eta^*_{kj}}\right)^2\right\} \eta^*_{k'k''j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{k'k''j}}{\eta^*_{k'k''j}}\right)^2\right\}, \qquad k, k', k'' = 1, 2, 3, \; k' \neq k'', \; k', k'' \neq k,$$
$$\tilde{\pi}_{4j}^m = \frac{L_j\, \pi_4^m}{M}\, \frac{1}{\eta^2} \prod_{k=1}^{3} \eta^*_{kj} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{kj}}{\eta^*_{kj}}\right)^2\right\},$$

where:

$$\begin{aligned} M = L_j \Bigg[ & \pi_0^m\, \eta^*_{123j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{123j}}{\eta^*_{123j}}\right)^2\right\} + \pi_1^m\, \frac{1}{\eta}\, \eta^*_{1j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{1j}}{\eta^*_{1j}}\right)^2\right\} \eta^*_{23j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{23j}}{\eta^*_{23j}}\right)^2\right\} \\ & + \pi_2^m\, \frac{1}{\eta}\, \eta^*_{2j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{2j}}{\eta^*_{2j}}\right)^2\right\} \eta^*_{13j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{13j}}{\eta^*_{13j}}\right)^2\right\} + \pi_3^m\, \frac{1}{\eta}\, \eta^*_{3j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{3j}}{\eta^*_{3j}}\right)^2\right\} \eta^*_{12j} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{12j}}{\eta^*_{12j}}\right)^2\right\} \\ & + \pi_4^m\, \frac{1}{\eta^2} \prod_{k=1}^{3} \eta^*_{kj} \exp\left\{\frac{1}{2}\left(\frac{\mu^*_{kj}}{\eta^*_{kj}}\right)^2\right\} \Bigg], \end{aligned}$$

and:

$$L_j = \frac{1}{\eta} \left(\frac{1}{2\pi}\right)^n \prod_{k=1}^{3} \left(\frac{1}{S_{k1} S_{k2} \sqrt{1 - r_k^2}}\right)^{n_k} \exp\left\{-\sum_{k=1}^{3} \frac{1}{2(1 - r_k^2)} \sum_{i : z_i = k} \left[\frac{(x_{ij'} - \mu_{kj'})^2}{S_{kj'}^2} + \frac{S_{kj'}\, x_{ij}^2 - 2 S_{kj}\, r_k\, (x_{ij} x_{ij'} - x_{ij}\, \mu_{kj'})}{S_{kj}^2\, S_{kj'}}\right]\right\}, \qquad j, j' = 1, 2, \; j \neq j'.$$

For the weights one has:

$$p(\boldsymbol{\omega} \mid -(\boldsymbol{\omega}), x^n, z^n) = \text{Dir}(a_{1,0} + n_1,\, a_{2,0} + n_2,\, a_{3,0} + n_3),$$

and for the generic group label, $z_i$, $i = 1, \ldots, n$:

$$p(z_i = k \mid \omega_k, \{\theta\}, x_i) = \frac{\omega_k\, \frac{1}{2\pi S_{k1} S_{k2} \sqrt{1 - r_k^2}} \exp\left\{-\frac{1}{2(1 - r_k^2)} \left[\frac{(x_{i1} - \mu_{k1})^2}{S_{k1}^2} + \frac{(x_{i2} - \mu_{k2})^2}{S_{k2}^2} - \frac{2 r_k (x_{i1} - \mu_{k1})(x_{i2} - \mu_{k2})}{S_{k1} S_{k2}}\right]\right\}}{\sum_{k'=1}^{3} \omega_{k'}\, \frac{1}{2\pi S_{k'1} S_{k'2} \sqrt{1 - r_{k'}^2}} \exp\left\{-\frac{1}{2(1 - r_{k'}^2)} \left[\frac{(x_{i1} - \mu_{k'1})^2}{S_{k'1}^2} + \frac{(x_{i2} - \mu_{k'2})^2}{S_{k'2}^2} - \frac{2 r_{k'} (x_{i1} - \mu_{k'1})(x_{i2} - \mu_{k'2})}{S_{k'1} S_{k'2}}\right]\right\}}, \qquad k = 1, 2, 3.$$

We resort to the Metropolis-Hastings algorithm for the update of $(\boldsymbol{S}_j, \gamma_j^v)$, $j = 1, 2$, and $(\boldsymbol{r}, \gamma^c)$. For $(\boldsymbol{S}_j, \gamma_j^v)$, $j = 1, 2$, the target distributions $\pi(\boldsymbol{S}_j^{(t)}, \gamma_j^{v(t)})$, where $(\boldsymbol{S}_j^{(t)}, \gamma_j^{v(t)})$ indicates the current position of the chain at the $t$-th iteration, $t > 1$, are represented by the marginal posterior distributions of $(\boldsymbol{S}_j, \gamma_j^v)$:

$$p(\boldsymbol{S}_j, \gamma_j^v \mid -(\boldsymbol{S}_j, \gamma_j^v), x^n, z^n) \propto p(\boldsymbol{S}_j \mid \gamma_j^v)\, p(\gamma_j^v)\, p(x^n \mid \{\theta\}, \boldsymbol{\omega}, z^n), \qquad j = 1, 2,$$

which assume five different expressions for each $j$ depending on the value of $\gamma_j^v$. The proposal distribution is built accordingly. If $(\boldsymbol{y}_j, y_{4j})$ indicates the proposed candidate for the $(t+1)$-th iteration, with $\boldsymbol{y}_j = (y_{1j}, y_{2j}, y_{3j})$, one may sample $y_{4j}$ from a multinomial distribution with parameters $N = 1$ and $p = (0.2, 0.2, 0.2, 0.2, 0.2)$. Conditional on $y_{4j}$, one may sample the corresponding vector $\boldsymbol{y}_j$ using the usual structure of degeneracy along equality constraints proposed with the prior distributions. The building-block distribution within this structure is assumed to be a normal distribution centered at the current position of the chain, $\boldsymbol{S}_j^{(t)}$, with standard deviation $\epsilon = 0.05$ for the lung cancer dataset or $\epsilon = 0.10$ for the prostate cancer dataset. A constraint is added to guarantee that the proposed standard deviations, $\boldsymbol{y}_j$, are non-negative. This gives the following expression:

$$q((\boldsymbol{S}_j^{(t)}, \gamma_j^{v(t)}), (\boldsymbol{y}_j, y_{4j})) = p(y_{4j})\, p(\boldsymbol{y}_j \mid y_{4j}),$$

with:

$$p(\boldsymbol{y}_j \mid y_{4j}) \propto \begin{cases} N(y_{1j} \mid S_{1j}^{(t)}, \epsilon^2) & \text{if } y_{4j} = 0 \\ N(y_{1j} \mid S_{1j}^{(t)}, \epsilon^2)\, N(y_{2j} \mid S_{2j}^{(t)}, \epsilon^2) & \text{if } y_{4j} = 1 \\ N(y_{1j} \mid S_{1j}^{(t)}, \epsilon^2)\, N(y_{2j} \mid S_{2j}^{(t)}, \epsilon^2) & \text{if } y_{4j} = 2 \\ N(y_{1j} \mid S_{1j}^{(t)}, \epsilon^2)\, N(y_{3j} \mid S_{3j}^{(t)}, \epsilon^2) & \text{if } y_{4j} = 3 \\ N(y_{1j} \mid S_{1j}^{(t)}, \epsilon^2)\, N(y_{2j} \mid S_{2j}^{(t)}, \epsilon^2)\, N(y_{3j} \mid S_{3j}^{(t)}, \epsilon^2) & \text{if } y_{4j} = 4, \end{cases} \qquad j = 1, 2.$$
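The following sketch (a hypothetical helper, not our implementation) illustrates this proposal: given a sampled pattern $y_{4j}$, only the distinct entries of the vector of standard deviations are perturbed, the tied entries are then copied, and negative candidates are redrawn.

```r
## Sketch (hypothetical helper): random-walk proposal for the component
## standard deviations under a sampled pattern y4 in {0,...,4}, mirroring
## the prior's structure of degeneracy along equality constraints.
propose_S <- function(S_t, y4, eps = 0.05) {
  # free coordinates for each pattern (000, 110, 101, 011, 111),
  # matching the cases of p(y_j | y_4j) above
  free <- list(1, c(1, 2), c(1, 2), c(1, 3), 1:3)[[y4 + 1]]
  y <- numeric(3)
  for (k in free) {
    repeat {                         # redraw non-positive candidates
      cand <- rnorm(1, mean = S_t[k], sd = eps)
      if (cand > 0) { y[k] <- cand; break }
    }
  }
  if (y4 == 0) y[2:3] <- y[1]        # 000: one shared standard deviation
  if (y4 == 1) y[3] <- y[2]          # 110: components 2 and 3 share
  if (y4 == 2) y[3] <- y[1]          # 101: components 1 and 3 share
  if (y4 == 3) y[2] <- y[1]          # 011: components 1 and 2 share
  y
}
```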

For the update of $(\boldsymbol{r}, \gamma^c)$, the target distribution $\pi(\boldsymbol{r}^{(t)}, \gamma^{c(t)})$, with $(\boldsymbol{r}^{(t)}, \gamma^{c(t)})$ indicating the current position of the chain at the $t$-th iteration, $t > 1$, is represented by the marginal posterior distribution of $(\boldsymbol{r}, \gamma^c)$:

$$p(\boldsymbol{r}, \gamma^c \mid -(\boldsymbol{r}, \gamma^c), x^n, z^n) \propto p(\boldsymbol{r} \mid \gamma^c)\, p(\gamma^c)\, p(x^n \mid \{\theta\}, \boldsymbol{\omega}, z^n),$$

which assumes five different expressions depending on the value of $\gamma^c$. The proposal distribution is built as before. If $(\boldsymbol{y}, y_4)$ indicates the proposed candidate for the $(t+1)$-th iteration, with $\boldsymbol{y} = (y_1, y_2, y_3)$, $y_4$ is sampled from a multinomial distribution with parameters $N = 1$ and $p = (0.2, 0.2, 0.2, 0.2, 0.2)$. Conditional on $y_4$, $\boldsymbol{y}$ is sampled analogously to $p(\boldsymbol{y}_j \mid y_{4j})$ above. However, we now assume the following mixture distribution as our building-block distribution for $y_i$:

$$p(y_i \mid \alpha_0, \beta_0) = \left[\beta_0\, \frac{1}{\alpha_0 - \frac{1}{\alpha_0}}\, \mathbf{1}\left\{\frac{y_i}{r_i^{(t)}} \in \left(\frac{1}{\alpha_0}, \alpha_0\right)\right\} + (1 - \beta_0)\, \frac{1}{\alpha_0 - \frac{1}{\alpha_0}}\, \mathbf{1}\left\{-\frac{y_i}{r_i^{(t)}} \in \left(\frac{1}{\alpha_0}, \alpha_0\right)\right\}\right] \frac{1}{\left|r_i^{(t)}\right|}. \tag{6}$$

This comes from the choice of updating each correlation as $y_i = \alpha\, \beta\, r_i^{(t)}$, where $\alpha \sim U(1/\alpha_0, \alpha_0)$ and

$$\beta = \begin{cases} 1 & \text{with probability } \beta_0 \\ -1 & \text{with probability } 1 - \beta_0, \end{cases}$$

with $i = 1, 2, 3$, $\alpha_0 = 1.10$ for the application to the lung cancer dataset or $\alpha_0 = 1.01$ for the prostate cancer dataset, and $\beta_0 = 0.95$ for both sets of data. A constraint is also added to guarantee that the proposed correlations, $\boldsymbol{y}$, lie in $(-1, 1)$. This gives the following expression:

$$q((\boldsymbol{r}^{(t)}, \gamma^{c(t)}), (\boldsymbol{y}, y_4)) = p(y_4)\, p(\boldsymbol{y} \mid y_4),$$

with:

$$p(\boldsymbol{y} \mid y_4) \propto \begin{cases} p(y_1 \mid \alpha_0, \beta_0) & \text{if } y_4 = 0 \\ p(y_1 \mid \alpha_0, \beta_0)\, p(y_2 \mid \alpha_0, \beta_0) & \text{if } y_4 = 1 \\ p(y_1 \mid \alpha_0, \beta_0)\, p(y_2 \mid \alpha_0, \beta_0) & \text{if } y_4 = 2 \\ p(y_1 \mid \alpha_0, \beta_0)\, p(y_3 \mid \alpha_0, \beta_0) & \text{if } y_4 = 3 \\ p(y_1 \mid \alpha_0, \beta_0)\, p(y_2 \mid \alpha_0, \beta_0)\, p(y_3 \mid \alpha_0, \beta_0) & \text{if } y_4 = 4, \end{cases}$$

where $p(y_i \mid \alpha_0, \beta_0)$, $i = 1, 2, 3$, is defined in eq. (6).
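The following stand-alone sketch illustrates the resulting proposal for a single correlation coefficient; redrawing candidates that fall outside $(-1, 1)$ is one simple way to impose the constraint, and may differ in detail from our implementation.

```r
## Sketch (stand-alone illustration): multiplicative proposal for one
## correlation, y = alpha * beta * r_current, with alpha ~ U(1/alpha0, alpha0)
## and beta = +1 with probability beta0, -1 otherwise.
propose_r <- function(r_t, alpha0 = 1.10, beta0 = 0.95) {
  repeat {
    alpha <- runif(1, min = 1 / alpha0, max = alpha0)
    beta  <- ifelse(runif(1) < beta0, 1, -1)
    y <- alpha * beta * r_t
    if (abs(y) < 1) return(y)  # keep proposals inside (-1, 1)
  }
}
# e.g., propose_r(0.60)  # small multiplicative step around 0.60, rare sign flips
```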

8.3 Results of the sensitivity analysis

Figure 14: Results of the sensitivity analysis to different values of the hyperparameters of the priors on the sharing pattern indicators $\gamma^m$ and $\gamma^v$, $\boldsymbol{\pi}^m$ and $\boldsymbol{\pi}^v$: prior and posterior probabilities (%) of observing one, two, or three groups for the TRIM29, MSN, and ITGB5 genes in the lung cancer dataset under cases 1, 2, and 3. Case 1 is such that: P{1 subgroup} = 64%, P{2 subgroups} ≈ 25%, and P{3 subgroups} ≈ 11%; case 2 is such that: P{1 subgroup} = 36%, P{2 subgroups} = 39%, and P{3 subgroups} = 25%; case 3 is such that: P{1 subgroup} = 4%, P{2 subgroups} = 36%, and P{3 subgroups} = 60%.

Figure 15: Results of the sensitivity analysis to different values of the hyperparameter of the Dirichlet prior on the weights, $\boldsymbol{a}_0$: percentages of simulation time spent on the most-visited colors (cumulative % of simulation time spent ≥ 90%) for the TRIM29, MSN, and ITGB5 genes in the lung cancer dataset under the following scenarios: 1. $a_0 = 0.1$, which favours empty groups for overfitting mixtures; 2. $a_0 = 1$, which is the standard uniform prior; 3. $a_0 = \frac{D}{2} + 3 = 4$, which favours similar non-empty components.

References

1. Newcomb S. A generalized theory of the combination of observations so as to obtain the best result. Am J Math. 1886;8:343–366. doi:10.2307/2369392.

2. Pearson K. Contribution to the mathematical theory of evolution. Phil Trans R Soc Lond A. 1894;185:71–110. doi:10.1098/rsta.1894.0003.

3. Titterington DM, Smith AF, Makov UE. Statistical analysis of finite mixture distributions. New York: Wiley; 1985.

4. McLachlan GJ, Basford KE. Mixture models: inference and applications to clustering. New York: Marcel Dekker; 1988.

5. McLachlan GJ, Peel D. Finite mixture models. New York: John Wiley & Sons; 2000. doi:10.1002/0471721182.

6. Böhning D, Seidel W. Editorial: recent developments in mixture models. Comput Stat Data Anal. 2003;41:349–357. doi:10.1016/S0167-9473(02)00161-5.

7. Marin J, Mengersen K, Robert CP. Bayesian modelling and inference on mixtures of distributions. In: Dey D, Rao CR, editors. Handbook of statistics, Vol. 25. Amsterdam: Elsevier-Sciences; 2005:459–507. doi:10.1016/S0169-7161(05)25016-2.

8. Frühwirth-Schnatter S. Finite mixture and Markov switching models. New York: Springer Science & Business Media; 2006.

9. Searching for differentially expressed gene combinations. Genome Biol. 2005;6:R88. doi:10.1186/gb-2005-6-10-r88.

10. Shedden K, Taylor J. Differential correlation detects complex associations between gene expression and clinical outcomes in lung adenocarcinomas. In: Shoemaker JS, Lin SM, editors. Methods of microarray data analysis IV. New York: Springer; 2004:121–132. doi:10.1007/0-387-23077-7_10.

11. Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes. BMC Bioinformatics. 2010;11:276. doi:10.1186/1471-2105-11-276.

12. Genes with bimodal expression are robust diagnostic targets that define distinct subtypes of epithelial ovarian cancer with different overall survival. J Mol Diagn. 2012;14:214–222. doi:10.1016/j.jmoldx.2012.01.007.

13. Tumor dormancy and surgery-driven interruption of dormancy in breast cancer: learning from failures. Nat Rev Clin Oncol. 2007;4:699–710. doi:10.1038/ncponc0999.

14. Hartigan JA. Partition models. Commun Stat Theory Methods. 1990;19:2745–2756. doi:10.1080/03610929008830345.

15. Barry D, Hartigan JA. Product partition models for change point problems. Ann Stat. 1992;20:260–279. doi:10.1214/aos/1176348521.

16. Crowley EM. Product partition models for normal means. J Am Stat Assoc. 1997;92:192–198. doi:10.1080/01621459.1997.10473616.

17. Dahl DB. Modal clustering in a class of product partition models. Bayesian Anal. 2009;4:243–264. doi:10.1214/09-BA409.

18. Escobar MD, West M. Bayesian density estimation and inference using mixtures. J Am Stat Assoc. 1995;90:577–588. doi:10.1080/01621459.1995.10476550.

19. West M, Turner DA. Deconvolution of mixtures in analysis of neural synaptic transmission. Statistician. 1994;43:31–43. doi:10.2307/2348930.

20. Barnard J, McCulloch RE, Meng XL. Modeling covariance matrices in terms of standard deviations and correlations, with applications to shrinkage. Stat Sin. 2000;10:1281–1311.

21. Gottardo R, Raftery AE. Markov Chain Monte Carlo with mixtures of mutually singular distributions. J Comput Graph Stat. 2008;17:949–975. doi:10.1198/106186008X386102.

22. Diebolt J, Robert CP. Estimation of finite mixture distributions through Bayesian sampling. J R Stat Soc Ser B. 1994;56:363–375. doi:10.1111/j.2517-6161.1994.tb01985.x.

23. Kadane JB. The role of identification in Bayesian theory. In: Fienberg SE, Zellner A, editors. Studies in Bayesian econometrics and statistics. Amsterdam: North-Holland; 1975:175–191.

24. Stephens M. Dealing with label-switching in mixture models. J R Stat Soc Ser B. 2000;62:795–809. doi:10.1111/1467-9868.00265.

25. Frühwirth-Schnatter S. Markov Chain Monte Carlo estimation of classical and dynamic switching and mixture models. J Am Stat Assoc. 2001;96:194–209. doi:10.1198/016214501750333063.

26. O'Hagan A. Contribution to the discussion of Richardson and Green (1997), on Bayesian analysis of mixtures with an unknown number of components. J R Stat Soc Ser B. 1997;59:772.

27. Jasra A, Holmes CC, Stephens DA. Markov Chain Monte Carlo methods and the label switching problem in Bayesian mixture modelling. Stat Sci. 2005;20:50–67. doi:10.1214/088342305000000016.

28. Zhu W, Fan Y. Relabelling algorithms for mixture models with applications for large data sets. J Stat Comput Simul. 2016;86:394–413. doi:10.1080/00949655.2015.1015129.

29. Rodriguez CE, Walker S. Label switching in Bayesian mixture models: deterministic relabeling strategies. J Comput Graph Stat. 2014;23:25–45. doi:10.1080/10618600.2012.735624.

30. Papastamoulis P, Iliopoulos G. An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. J Comput Graph Stat. 2010;19:313–331. doi:10.1198/jcgs.2010.09008.

31. Chung H, Loken E, Schafer JL. Difficulties in drawing inferences with finite-mixture models. Am Stat. 2004;58:152–158. doi:10.1198/0003130043286.

32. Antoniak CE. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann Stat. 1974;2:1152–1174. doi:10.1214/aos/1176342871.

33. Blackwell D, MacQueen JB. Ferguson distributions via Pólya urn schemes. Ann Stat. 1973;1:353–355. doi:10.1214/aos/1176342372.

34. Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med. 2002;8:816–824. doi:10.1038/nm733.

35. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA. 2001;98:13790–13795. doi:10.1073/pnas.191502998.

36. Miura K, Bowman ED, Simon R, Peng AC, Robles AI, Jones RT, et al. Laser capture microdissection and microarray expression analysis of lung adenocarcinoma reveals tobacco smoking- and prognosis-related molecular profiles. Cancer Res. 2002;62:3244–3250.

37. Wigle DA, Jurisica I, Radulovich N, Pintilie M, Rossant J, Liu N, et al. Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res. 2002;62:3005–3008.

38. Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, et al. Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA. 2001;98:13784–13789. doi:10.1073/pnas.241500798.

39. The Cancer Genome Atlas Research Network, et al. The molecular taxonomy of primary prostate cancer. Cell. 2015;163:1011–1025. doi:10.1016/j.cell.2015.10.025.

40. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002;97:611–631. doi:10.1198/016214502760047131.

41. Fraley C, Raftery AE, Murphy TB, Scrucca L. mclust Version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation; 2012.

42. Robert C, Casella G. Introducing Monte Carlo methods with R. New York: Springer Science & Business Media; 2009. doi:10.1007/978-1-4419-1576-4.

43. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2015. http://www.R-project.org/.

44. Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comput Graph Stat. 1996;5:299–314.

45. Martin AD, Quinn KM, Park JH. MCMCpack: Markov Chain Monte Carlo in R. J Stat Softw. 2011;42(9). http://www.jstatsoft.org/v42/i09/. doi:10.18637/jss.v042.i09.

46. Papastamoulis P. label.switching: relabelling MCMC outputs of mixture models. R package version 1.3; 2014. Available at: http://CRAN.R-project.org/package=label.switching.

47. Plummer M, Best N, Cowles K, Vines K. CODA: convergence diagnosis and output analysis for MCMC. R News. 2006;6:7–11.

48. Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components, with discussion. J R Stat Soc Ser B. 1997;59:731–792. doi:10.1111/1467-9868.00095.

49. Stephens M. Bayesian analysis of mixture models with an unknown number of components – an alternative to reversible jump methods. Ann Stat. 2000;28:40–74. doi:10.1214/aos/1016120364.

50. Rousseau J, Mengersen K. Asymptotic behaviour of the posterior distribution in overfitted mixture models. J R Stat Soc Ser B Stat Methodol. 2011;73:689–710. doi:10.1111/j.1467-9868.2011.00781.x.

51. Frühwirth-Schnatter S. Dealing with label switching under model uncertainty. In: Mengersen K, Robert CP, Titterington DM, editors. Mixtures: estimation and applications. New York: John Wiley & Sons; 2011:213–239. doi:10.1002/9781119995678.ch10.

52. Kim S, Tadesse MG, Vannucci M. Variable selection in clustering via Dirichlet process mixture models. Biometrika. 2006;93:877–893. doi:10.1093/biomet/93.4.877.

53. Edefonti V, Randi G, Decarli A, La Vecchia C, Bosetti C, Franceschi S, et al. Clustering dietary habits and the risk of breast and ovarian cancers. Ann Oncol. 2009;20:581–590. doi:10.1093/annonc/mdn594.

54. Chang WC. On using principal components before separating a mixture of two multivariate normal distributions. Appl Stat. 1983;32:267–275. doi:10.2307/2347949.

55. McLachlan G, Flack L, Ng S, Wang K. Clustering of gene expression data via normal mixture models. In: Yakovlev YA, Klebanov L, Gaile D, editors. Statistical methods for microarray data analysis: methods and protocols. New York: Springer; 2013:103–119. doi:10.1007/978-1-60327-337-4_7.

56. Edefonti V. Integrating supervised and unsupervised learning in genomics applications. Ph.D. thesis, Bocconi University, Milan, Italy; 2006.

Published Online: 2017-2-16

© 2017 Walter de Gruyter GmbH, Berlin/Boston
