
Leveraging external information by guided adaptive shrinkage to improve variable selection in high-dimensional regression settings

  • Mark A. van de Wiel and Wessel N. van Wieringen
Published/Copyright: September 8, 2025

Abstract

Variable selection is challenging for high-dimensional data, in particular when the sample size is low. It is widely recognized that external information in the form of complementary data on the variables, ‘co-data’, may improve results. Examples are known variable groups or p-values from a related study. Such co-data are ubiquitous in genomics settings due to the availability of public repositories, and are likely equally relevant for other applications. Yet, the uptake of prediction methods that structurally use such co-data is limited. We review guided adaptive shrinkage methods: a class of regression-based learners that use co-data to adapt the shrinkage parameters, crucial for the performance of those learners. We discuss technical aspects, but also the applicability in terms of types of co-data that can be handled. This class of methods is contrasted with several others. In particular, group-adaptive shrinkage is compared with the better-known sparse group-lasso by evaluating variable selection. Moreover, we demonstrate the versatility of the guided shrinkage methodology by showing how to ‘do-it-yourself’: we integrate implementations of a co-data learner and the spike-and-slab prior for the purpose of improving variable selection in genetics studies. We conclude with a real data example.

1 Introduction

The advance of high-throughput technology has led to a vast increase in data sources of a high-dimensional nature. Two challenges of such high-dimensional data are low signal-to-noise ratio and multicollinearity. All prediction and variable selection models face those challenges. Here, we focus on regression-based models, which employ regularization by introducing shrinkage either through a penalty or an informative prior. We first elaborate on these two challenges and their implications before discussing potential solutions.

First, many variables are likely irrelevant for the response. As, by default, shrinkage parameters are shared by all variables, this may lead to overshrinkage of relevant variables, which in turn may harm the predictive accuracy of the resulting model. The second challenge is multicollinearity, which is particularly relevant for high-dimensional data, as variables are often highly correlated due to shared properties. In a predictive model, variables compete with one another: selecting one variable will likely de-select another if the two are strongly correlated. Then, small fluctuations in the data set drive the variable selection, rendering it unstable.

Three solutions for the two challenges come to mind. First, the low signal-to-noise ratio is countered by enforcing sparsity, e.g. by a lasso penalty or a horseshoe prior, to benefit variable selection. This certainly helps, although posterior selection for dense methods like ridge regression can be competitive [1]. Second, the unstable variable selection due to multicollinearity may be countered by stability selection [2]. In a nutshell, one generates many random copies of the data set, e.g. by bootstrapping, and then composes the ultimate set of selected variables from those which are selected in a large proportion of those copies. The third solution is to bring in external information. The promises of this solution are clear: external information allows for better modelling of the shrinkage and can break the strong competition between variables. Naturally, this solution can be combined with the former two. External information is under-used for variable selection, which is why we focus on it in this review.

One reason for the under-use is pragmatic: compiling external data takes time and requires thought about what to include. Moreover, leveraging external knowledge can be achieved in many different ways, making it difficult to keep an overview and pick an algorithm for one’s needs. Our aim is two-fold: on the one hand, to convince potential users that leveraging external information may be well worth the effort, and on the other, to provide guidance on which algorithms to use in which settings.

Our focus lies on methods that allow guided adaptive shrinkage, which models and adapts the shrinkage parameter(s) as a function of the external information on the variables. We refer to the latter as ‘co-data’ [3], [4], short for ‘complementary data’. Alternatively, the term ‘features of features’ has been coined [5]. Examples of such co-data are: variable-specific p-values from a related, external study or a pre-defined grouping, such as data modalities or genes grouped according to chromosome. The central aim of this paper is to convince researchers of the usefulness of guided adaptive shrinkage. For that, this paper consists of three parts. First, a short review of available guided adaptive shrinkage methods. Second, a comparison with a very well-known competitor: the (non-adaptive) group lasso, to show the potential benefit of adaptivity for the purpose of variable selection. Third, a do-it-yourself solution to enhance the use of guided adaptive shrinkage for other priors/penalties than those provided by current software. We start by presenting a general formulation of guided adaptive shrinkage, and reviewing several methods. In particular, we consider what types of co-data can be handled, what types of response are accommodated, and what strategy is used to estimate hyperparameters. In terms of applications, we focus on genomics, but much of the methodology is relevant to other fields in which use of penalized regression methods has become common, such as economic forecasting [6] and psychometrics [7].

The guided adaptive shrinkage approach is part of a larger family of methods that account for external information. Popular alternatives are methods based on (structured) penalties, in particular group penalties, such as the (sparse) group-lasso. Therefore, the contrast between group-adaptive lasso and group-lasso is particularly relevant, and so we use simulations to compare the two approaches for a varying number of variable groups in terms of variable selection performance. Throughout, we focus on evaluating variable selection, as the potential benefit of using co-data for the purpose of prediction has already been demonstrated by many of the discussed works.

We show the versatility of the approach by discussing a ‘do-it-yourself’ solution for one’s favorite model. We use it to illustrate the benefit of co-data for variable selection in a spike-and-slab model, which is a popular Bayesian model for selecting variables in large scale genomics studies [8]. Finally, we present a co-data example that predicts therapy response for colorectal cancer patients from genomics data.

2 Guided adaptive shrinkage

Here, we introduce and illustrate the general framework before reviewing a variety of guided adaptive shrinkage methods.

2.1 Examples of co-data

Co-data are properties of the variables, not of the samples. That is, each co-data source is represented as a vector of length p, the number of variables. Below we list several co-data examples that have been used in the literature, with examples from genomics applications.

  1. Modality. Multi-modal data consists of multiple types of variables, which defines a categorical co-data source. A canonical example in medicine is the mix of two types of variables: clinical and genomic ones, e.g. for prediction of survival in breast cancer patients [9]. Here, adaptation of the penalty to the type of variable may be very beneficial, as the amount of shrinkage adapts to the predictive strength of each of the two types.

  2. Grouping within one type of variable. Often, variables can be grouped in multiple ways according to external information. In genomics, example groupings are: biological pathway or chromosome, used in [10] to aid in distinguishing cancer types.

  3. External numerical summaries, such as p-values or regression coefficients. Often, data from similar studies are available, but cannot be directly used for training, also because it is unknown how relevant these co-data are. Then, guided adaptive shrinkage is useful, because it adapts to the informativeness of such co-data for the data at hand. An example is prediction for two different time horizons with two sets of patients using the same type of proteomics data. Here, regression coefficients from one setting are used as co-data for the other one [5].

  4. A prior set of potentially relevant variables, often called a ‘signature’ in genomics. Such information implies a binary co-data source, with entries that indicate whether a variable is in this set or not. An example is tumor-specificity (yes/no) of molecular variables for predicting therapy response in colon cancer patients [11].

Guided adaptive shrinkage methods differ in what types of co-data (grouped, numerical, multiple) they can incorporate, what type of shrinkage they employ, and how they incorporate the co-data to adapt the shrinkage. Below we cast these methods in a common framework.

2.2 General framework

Let $\mathbf{y} = (y_i)_{i=1}^{n}$ be the response of interest, $X = (x_{ij})_{i=1,j=1}^{n,p}$ the data matrix with $x_{ij}$ the value of variable $j$ for sample $i$, $\boldsymbol{\beta} = (\beta_j)_{j=1}^{p}$ the regression coefficients, and $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$ the nuisance parameters (such as the noise variance $\sigma^2$). We assume that the regression model linking $X$ to $\mathbf{y}$ by $\boldsymbol{\beta}$ defines a likelihood function $\mathcal{L}$. Other fit functions could be used as well. Then, the guided adaptive shrinkage framework is summarized by:

$$
\begin{aligned}
&\mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\theta}; X, \mathbf{y}) && \text{(Likelihood)}\\
&P_{\boldsymbol{\lambda}}(\boldsymbol{\beta}) && \text{(Shrinkage)}\\
&\lambda_j = f_{\boldsymbol{\alpha}}(Z_{\cdot j}) && \text{(Guided adaptation)},
\end{aligned}
$$

with penalty vector $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_p)$, and matrix $Z = (z_{cj})_{c,j=1}^{C,p}$ containing the information of $C$ co-data sources. Here, $Z$ comprises rows $Z_{c\cdot}$ that correspond to the $c$th co-data source and columns $Z_{\cdot j}$ corresponding to variable $j$. $P_{\boldsymbol{\lambda}}(\boldsymbol{\beta})$ is either a penalty function in a classical setting, or a prior in a Bayesian setting. In fact, when the penalty function is formulated as a log-prior, maximization of the penalized likelihood renders the Bayesian posterior mode estimate. This equivalence facilitates switching between the two paradigms. A crucial ingredient of adaptive shrinkage methods is the dependence of the penalties $\lambda_j$, $j = 1, \ldots, p$, on the co-data, the strength of which is modeled through the hyperparameter vector $\boldsymbol{\alpha}$. Besides the choice of paradigm and type of shrinkage, methods differ in how they estimate these penalties as a function $f$ of the co-data matrix or vector $Z$. We emphasize the relatively low dimension of $\boldsymbol{\alpha}$ (with respect to $p$): this facilitates stable estimation of the high-dimensional penalty vector $\boldsymbol{\lambda}$. Figure 1 illustrates guided adaptive shrinkage.
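To make the guided adaptation step concrete, the minimal R sketch below assembles a co-data matrix from a variable grouping and external p-values and maps it to variable-specific penalties through an assumed log-linear link. The link function and the hyperparameter values are purely illustrative: each method reviewed in Section 2.3 specifies its own $f$ and estimates $\boldsymbol{\alpha}$ from the data.

```r
set.seed(1)
p <- 2000

## Co-data source 1: a grouping of the variables (e.g. three data modalities),
## coded as indicator columns; co-data source 2: -log10 p-values from an external study.
modality <- sample(c("clinical", "mRNA", "methylation"), p, replace = TRUE)
ext_pval <- runif(p)
Z <- cbind(model.matrix(~ 0 + modality), neglogp = -log10(ext_pval))   # p x 4

## Hypothetical hyperparameters alpha, one per co-data column (4 << p);
## in practice these are estimated, e.g. by cross-validation or empirical Bayes.
alpha <- c(0.5, 2, 3, -0.4)

## Guided adaptation with an assumed log-linear link: lambda_j = exp(Z_j' alpha), so
## variables in weaker modalities or with large external p-values are penalized more.
lambda <- exp(drop(Z %*% alpha))
summary(lambda)
```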

Figure 1: 
Illustration of guided adaptive shrinkage using co-data $Z = \begin{pmatrix} Z_{1\cdot} \\ Z_{2\cdot} \end{pmatrix}$. Categorical co-data source $Z_{1\cdot}$ contains prior grouping information on the variables (e.g. three different types of genomic variables); continuous co-data source $Z_{2\cdot}$ contains prior numerical information (visualized by grey-scale) on the variables, such as external p-values; $\boldsymbol{\alpha}$: hyper-parameters that tune the importance of $Z_{1\cdot}$ and $Z_{2\cdot}$; $f$: adaptation function; $\boldsymbol{\lambda}$: penalty vector; $P$: shrinkage via penalty or prior; $\boldsymbol{\beta}$: regression parameters; $X$: data (visualized by heatmap); $\boldsymbol{\theta}$: nuisance parameters; $\mathcal{L}$: likelihood; $\mathbf{y}$: response.

2.3 Methods

Guided adaptive shrinkage methods may be classified in multiple ways. The likelihood $\mathcal{L}$, largely determined by the type of response $\mathbf{y}$ (e.g. continuous/binary/survival), and the type of penalty/prior $P_{\boldsymbol{\lambda}}$ are obvious classifiers. Moreover, the type(s) of co-data $Z$ allowed by the co-data function $f_{\boldsymbol{\alpha}}$ classifies methods in two ways: first, the measurement scale (grouped, continuous) and second, the ability to handle multiple co-data sources. Finally, the methods crucially differ in how hyperparameters are estimated. Here, we distinguish (1) Cross-validation: hyperparameters are tuned by cross-validation; (2) Empirical Bayes: hyperparameters are tuned by empirical Bayes techniques; (3) Full Bayes: hyperparameters are endowed with a hierarchical prior; (4) Joint estimation: hyperparameters and regression parameters are estimated jointly. Table 1 classifies several co-data guided shrinkage methods according to these criteria; we focus on methods that incorporate variable selection. Below we provide more specific details on each of these methods.

Table 1:

Co-data adaptive shrinkage methods.

Method | Reference (a) | Software | Likelihood (b) | Shrinkage | Co-data (c) | Hyp. (d)
Multi-pen. PLS | Tai et al. [12] | – | Bin | Lasso | Group, Uni | CV
Weighted lasso | Bergersen et al. [13] | – | Gauss, Bin, Cox | Lasso | Cont, Uni | CV
Group-regul. ridge | vd Wiel et al. [4] | GRridge | Gauss, Bin, Cox | Ridge | Group, Multi | EB
Integrated pen. factors lasso | Boulesteix et al. [9] | ipflasso | Gauss, Bin, Cox | Lasso | Group, Uni | CV
Group-adapt. pen. regr. | Velten et al. [14] | graper | Gauss, Bin | Lasso, Ridge, Spike & Slab | Group, Uni | FB
Co-data adapt. ridge | Van Nee et al. [11] | ecpc | Gauss, Bin, Cox | Ridge | Mixed, Multi | EB
Incorp. prior info pen. regr. | Zeng et al. [15] | xtune | Gauss, Bin, Mult | Elastic net | Mixed, Multi | EB
Group-adapt. elastic net | Van Nee et al. [16] | squeezy | Gauss, Bin, Cox | Elastic net | Group, Uni | EB
Feature-weight. elastic net | Tay et al. [5] | fwelnet | Gauss, Bin | Elastic net | Mixed, Multi | Joint
Co-data adapt. horseshoe regr. | Busatto et al. [17] | infHS | Gauss, Probit | Horseshoe | Mixed, Multi | FB

(a) First author only. (b) Likelihood indicates the accommodated outcome type(s): Gauss, Gaussian (continuous); Bin, binomial (binary); Mult, multinomial (multi-class); Cox, proportional hazards (survival); Probit, probit (binary). (c) Co-data refers to the type(s) of co-data that are accommodated: grouped (Group), continuous (Cont) or mixed; one or multiple sources (Uni/Multi). (d) Hyp. refers to the type of hyperparameter estimation: CV, cross-validation; EB, empirical Bayes; FB, full Bayes; Joint, joint estimation of hyperparameters and regression parameters.

Multi-penalty PLS [12] is a pioneering method for adapting penalties based on grouped co-data. It models the co-data function $f_{\boldsymbol{\alpha}}$ simply by $\boldsymbol{\alpha} = (\lambda_1, \ldots, \lambda_G)$. The authors cast their method in a partial least squares setting, but show the correspondence to ordinary regression. The method uses soft-thresholding with adaptive thresholds, which is strongly related to group-adaptive lasso regression. It accommodates only one grouped co-data source. It introduces a heuristic, based on a weighting function, to reduce the computing time for tuning $\boldsymbol{\lambda}$ by cross-validation.

Weighted lasso [13] models the co-data function $f_{\boldsymbol{\alpha}} = f_{\lambda,q}$, which depends on a priori defined similarity weights between $X$ and $Z$ or between $\mathbf{y}$ and $Z$. The weighting function is somewhat arbitrary and limited in flexibility. It relies on two tuning parameters: $\lambda$, the lasso penalty, and $q$, which tunes the importance of the weights. An example of a similarity weight is the correlation between an mRNA gene expression variable ($X$) and its DNA counterpart ($Z$). Here, the advantage over simply adding co-data $Z$ to $X$ is that $Z$ is not needed for new samples, as it only impacts the regularization during training. This method is simple, and easy to extend and implement. Moreover, it is relatively fast, because only two hyper-parameters are tuned. It accommodates only one co-data source, though, which needs to be continuous.

GRridge and ecpc [4], [11] are both based on ridge regression, but the latter extends the former by allowing non-grouped co-data via a regression parametrization, $f_{\boldsymbol{\alpha}} = Z_{\cdot j}^{\top}\boldsymbol{\alpha}$. In addition, ecpc implements hyperparameter shrinkage, which is useful when a co-data source consists of many variable groups. The methods share the methodology for hyperparameter estimation by moment-based empirical Bayes. The estimation procedure is modular, which provides computational efficiency and flexibility in terms of implementation, but does not propagate uncertainty as full Bayesian procedures do. The setting is not sparse, but posterior variable selection is implemented and shown to be competitive with lasso-based methods.

ipflasso [9] is a group-adaptive method, so $f_{\boldsymbol{\alpha}}$ is simply parameterized by $\boldsymbol{\alpha} = (\lambda_1, \ldots, \lambda_G)$, with $G$ the number of groups of variables. The method is simple and easy to extend. As it tunes $\boldsymbol{\lambda}$ by full-blown multi-grid cross-validation, it may be slow when the number of variable groups increases. It accommodates only one grouped co-data source. The method is extended in [18] to allow for hierarchical groups.
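The shared core of these group-adaptive lasso approaches, one lasso penalty per group of variables, can also be mimicked directly with the penalty.factor argument of glmnet once (relative) group penalties are available. The sketch below is not the implementation used by ipflasso or squeezy; it merely assumes that relative group penalties have already been obtained, e.g. by cross-validation or empirical Bayes, and illustrates how they translate into variable-specific penalty factors.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 300
X <- matrix(rnorm(n * p), n, p)
beta <- c(rnorm(100, sd = 0.3), rep(0, p - 100))   # signal concentrated in group 1
y <- drop(X %*% beta) + rnorm(n)
groups <- rep(1:3, each = 100)                     # grouped co-data: three variable groups

## Assumed relative group penalties (in practice estimated from the data);
## a smaller factor means less shrinkage for the variables in that group.
group_penalty <- c(0.5, 1, 2)
pf <- group_penalty[groups]                        # expand to one penalty factor per variable

## glmnet rescales penalty.factor internally; the global lambda is tuned by cross-validation.
fit <- cv.glmnet(X, y, penalty.factor = pf)
selected <- which(coef(fit, s = "lambda.min")[-1] != 0)
```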

graper [14] is the first fully Bayesian co-data method that is flexible in terms of penalization: next to lasso and ridge priors, it incorporates the spike-and-slab prior. It only allows for grouped co-data, so it is less versatile than the methods that allow multiple, mixed co-data types. Computational scalability is achieved by a variational Bayes approximation. As a fully Bayesian method it provides uncertainty quantification, although this was not evaluated by the authors and may be compromised by the use of variational Bayes.

xtune [15] supports the use of generic, mixed-type co-data by modeling $f_{\boldsymbol{\alpha}} = Z_{\cdot j}^{\top}\boldsymbol{\alpha}$. It provides efficient estimation of the hyperparameters $\boldsymbol{\alpha}$ by a Gaussian approximation of the elastic net prior. Moreover, it uses a majorization technique to speed up the optimization of the marginal likelihood used to tune $\boldsymbol{\alpha}$ by empirical Bayes. Hence, it is computationally efficient and versatile in its use. Results are presented for linear response only, but the software supports binary and multi-class response as well.

squeezy [16] is similar in spirit to xtune, as it also approximates the marginal likelihood by a Gaussian one. It complements this approximation, however, with a proof based on the multivariate central limit theorem. It is computationally very efficient, as it makes use of many shortcuts available in the Gaussian setting. The main implementation supports grouped co-data only, but it can directly use the output of the ridge-based ecpc (discussed above) to handle mixed co-data and hyperparameter shrinkage.

fwelnet [5] models, conditionally on a global penalty parameter $\lambda$, variable-specific penalties by $f_{\boldsymbol{\alpha}} = \lambda w_j(\boldsymbol{\alpha})$, with weights $w_j(\boldsymbol{\alpha}) = \left[\exp(Z_{\cdot j}^{\top}\boldsymbol{\alpha}) \big/ \sum_{j'=1}^{p} \exp(Z_{\cdot j'}^{\top}\boldsymbol{\alpha})\right]^{-1}$. The approach is unique in optimizing the hyperparameters $\boldsymbol{\alpha}$ and regression parameters $\boldsymbol{\beta}$ jointly. The optimization algorithm alternates between updating $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, using gradient search for $\boldsymbol{\alpha}$ and an elastic net solver for $\boldsymbol{\beta}$. A potential identification problem is circumvented by normalizing the weights. Hence, it is a non-hierarchical formulation, which adapts weights instead of shrinkage.
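A base-R sketch of this weighting scheme, taking the (inverse) softmax form above at face value and ignoring fwelnet's exact internal scaling:

```r
## fwelnet-style weights: variables whose co-data score Z_j' alpha is high receive a
## small weight (little shrinkage); the weights are the inverse of a softmax over variables.
fwelnet_weights <- function(Z, alpha) {
  score <- exp(drop(Z %*% alpha))
  (score / sum(score))^(-1)
}

Z <- cbind(1, rnorm(10))                 # toy co-data: intercept plus one continuous source
w <- fwelnet_weights(Z, alpha = c(0, 1)) # hypothetical hyperparameters
```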

infHS [17] is a fully Bayesian method that uses the popular horseshoe prior to encode sparsity into the predictive model. It uses a regression parametrization (Z·j α ) to allow mixed co-data types. It modifies the prior mean of the local regularization parameters, thereby particularly facilitating high-dimensional settings with many small signals and a few outlying large ones. It is suitable for variable selection, and the variational Bayes approximation of the posteriors provides computational scalability of the method. Binary outcome is accommodated by a probit formulation.

We note that a direct comparison of these methods is somewhat hindered by their differences in functionality, in particular in terms of what outcome and co-data types they can handle. Nonetheless, [14], [16], [17] in particular provide extensive comparisons with several of the other methods.

3 A comparison with the (sparse) group-lasso

Guided adaptive shrinkage relates to structured regularization, as both frameworks allow one to incorporate external information in the regularization of regression models. Well-known examples of the latter are the group-lasso and the hierarchical lasso, but many more methods are available; see [19] for an extensive review, and [20] for a Bayesian perspective.

As both the sparse group-lasso (and variations thereof) [21] and the group-adaptive lasso [15], [16] may be applied to co-data groups, and are therefore used interchangeably in practice, we performed a comparison between these two types of methods. The essential difference between the two is adaptivity. The sparse group-lasso extends the lasso by augmenting the $\ell_1$ penalty function on the regression coefficients of the $p$ variables, $(\beta_j)_{j=1}^{p}$, with a group-penalty on the $G$ groups of variables, whereas the group-adaptive lasso employs different penalties across groups of variables. The penalty functions $P$ for the sparse group-lasso and the group-adaptive lasso are:

$$P_{\lambda,\lambda'}(\boldsymbol{\beta}) = \lambda \sum_{j=1}^{p} |\beta_j| + \lambda' \sum_{g=1}^{G} \|\boldsymbol{\beta}_g\|_2 \qquad \text{(sparse group-lasso)} \tag{1}$$

$$P_{\boldsymbol{\lambda}}(\boldsymbol{\beta}) = \sum_{g=1}^{G} \lambda_g \sum_{j \in \mathcal{G}_g} |\beta_j| \qquad \text{(group-adaptive lasso)} \tag{2}$$

with norm $\|\boldsymbol{\beta}_g\|_2 = \left(\sum_{k \in \mathcal{G}_g} \beta_k^2\right)^{1/2}$. Here, the co-data matrix $Z$ consists of only one row vector, $Z_{1\cdot}$, containing categorical entries that correspond to the $G$ groups of variables. Penalized regression can also be cast in an equivalent constrained optimization setting, where the constraints are one-to-one linked to the penalties. Figure 2 illustrates these constraints for a toy example with two groups of two variables. Clearly, the constraint adapts to the strength of a group of variables in the group-adaptive setting, whereas the constraint is essentially the same for both groups in the sparse group-lasso setting. Those constraints are very important in high-dimensional settings, because variables compete to be selected. Hence, depending on the situation, the two methods may perform very differently.
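To make the contrast concrete, the toy computation below evaluates penalties (1) and (2) for the two-group example of Figure 2; the penalty values are arbitrary.

```r
## Penalty (1): sparse group-lasso, a global lasso penalty plus a group-norm penalty.
pen_sgl <- function(beta, groups, lambda, lambda2) {
  lambda * sum(abs(beta)) +
    lambda2 * sum(tapply(beta, groups, function(b) sqrt(sum(b^2))))
}

## Penalty (2): group-adaptive lasso, a separate lasso penalty per group.
pen_gal <- function(beta, groups, lambda_g) {
  sum(lambda_g[groups] * abs(beta))
}

beta   <- c(1, -0.5, 0.2, 0)                   # two groups of two variables, as in Figure 2
groups <- c(1, 1, 2, 2)
pen_sgl(beta, groups, lambda = 1, lambda2 = 1)
pen_gal(beta, groups, lambda_g = c(0.5, 2))    # group 1 penalized less than group 2
```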

Figure 2: 
Parameter constraints for two groups of two variables, G1 = {1, 2}, G2 = {3, 4}. The admissible regions for (β1, β2) and (β3, β4) are depicted. Top-row: group-adaptive lasso; bottom-row: the sparse group lasso.

The sparse group-lasso focuses on selecting groups and variables, whereas the group-adaptive lasso selects only variables while adapting to different group strengths. Hence, the former may be more suitable in settings with many groups, of which a large part is not relevant, while the latter is more flexible in settings with few groups. As an illustration, we briefly study this claim in a simulation setting.

In a linear regression setting, $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$, we simulate n = 200 samples for p = 2,000 variables divided over G = 3, 6, 9, 15, 24, 39, 60, 99 equally-sized groups. Of these groups, one third contain non-zero coefficients. To allow some variation between the non-zero groups, the proportion of non-zero coefficients in those groups is sampled from a Beta(2, 6) distribution, averaging 1/4 non-zeros per group. Non-zero $\beta_j$'s are generated from a scaled $t_3$ distribution, such that the total explained variation of the variables equals that of the Gaussian noise, $\epsilon_i \sim N(0, \sigma^2 = 1)$. The variables $x_{ij}$ are independently sampled from a standard Gaussian.
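The snippet below sketches one way to generate a single training set with these characteristics; the exact scripts used for the reported results are available in the repository referenced in Section 6.

```r
set.seed(1)
n <- 200; p <- 2000; G <- 15                     # one of the numbers of groups considered
groups <- sort(rep(1:G, length.out = p))         # (approximately) equally-sized groups
signal_groups <- 1:(G / 3)                       # one third of the groups carry signal

## Within each signal group, the proportion of non-zero coefficients follows Beta(2, 6);
## non-zero coefficients are drawn from a t_3 distribution (scaled below).
beta <- rep(0, p)
for (g in signal_groups) {
  idx <- which(groups == g)
  nz  <- sample(idx, max(1, round(rbeta(1, 2, 6) * length(idx))))
  beta[nz] <- rt(length(nz), df = 3)
}

## Independent standard Gaussian variables; scale beta so that the explained variation
## matches the noise variance (sigma^2 = 1), one way to implement the description above.
X <- matrix(rnorm(n * p), n, p)
beta <- beta / sd(drop(X %*% beta))
y <- drop(X %*% beta) + rnorm(n)
```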

On the simulated training data, the group-adaptive lasso and the sparse group-lasso were fitted using the R packages squeezy and SGL, respectively, using the known variable groups as input and defaults for the other parameters. We focus on variable selection by evaluating the Matthews correlation coefficient (MCC) and the F1-score, the harmonic mean of precision and recall. As such scores are somewhat incomparable for models of different size, we opt to fix the number of selected non-zero variables, $p_{\text{sel}}$. Both methods allow this, as they produce a regularization path that may be used for this purpose. For each number of groups G, the simulations were repeated 25 times. Figure 3 shows the correlation coefficients for $p_{\text{sel}} = 25, 50$.
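Given a selected set of fixed size, the two scores can be computed as below; this is a generic sketch of the metrics and is agnostic to how the $p_{\text{sel}}$-variable model is extracted from each method's regularization path.

```r
## MCC and F1 for a selected variable set versus the true non-zero set (p variables in total).
selection_scores <- function(selected, truth, p) {
  tp <- length(intersect(selected, truth))
  fp <- length(setdiff(selected, truth))
  fn <- length(setdiff(truth, selected))
  tn <- p - tp - fp - fn
  denom <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  c(mcc = (tp * tn - fp * fn) / denom,
    f1  = 2 * tp / (2 * tp + fp + fn))
}

## Example with the data generated above: take the point on a lasso path closest to
## psel = 25 selected variables (here a plain glmnet path, for illustration only).
# library(glmnet)
# fit <- glmnet(X, y)
# s25 <- fit$lambda[which.min(abs(fit$df - 25))]
# sel <- which(coef(fit, s = s25)[-1] != 0)
# selection_scores(sel, truth = which(beta != 0), p = p)
```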

Figure 3: 
Matthews correlation coefficients (MCC; y-axis) for variable selection ($p_{\text{sel}} = 25, 50$) for the group-adaptive lasso (green) and the sparse group-lasso (blue) for G = 3, …, 99 variable groups (x-axis). Based on 25 simulations per G.

The simulations clearly support the claim: for a small to intermediate number of groups, the group-adaptive lasso outperforms the sparse group-lasso, whereas the latter becomes superior when the number of groups is large. Within one setting, the group-adaptive lasso is generally somewhat more variable in performance across repeats, possibly due to the larger number of hyperparameters that need to be estimated. The results for the F1-scores (Supplementary Figure 1) support these claims.

Supplementary Figure 2 shows an extra simulation scenario that is more ‘group-sparse’: p = 10,000, G = 60, 99 and 5 groups with non-zero coefficients. These results support our conclusion above: for a large number of groups (G = 99) sparse group-lasso is somewhat superior to group-adaptive lasso, but the latter is competitive for an intermediate number of groups (G = 60), even in this fairly group-sparse scenario. Finally, the Supplementary Material also provides a solution to shrink hyperparameters in the group-adaptive lasso setting. This is shown to be particularly useful when the number of groups is large and when these groups are not informative (the ‘null-setting’).

4 Do-it-yourself for your favorite model

The guided adaptive shrinkage papers reviewed in Table 1 focus on the most popular penalties/priors, in particular variations of the elastic net. Many other penalties and priors have been proposed, which triggers the question whether these can easily be modified to allow for adaptive shrinkage. For many methods, this is indeed the case. For MCMC-based methods, one generic solution is extensively discussed in [22], building upon an algorithm in [23]. They show that the hyperparameters that adapt the shrinkage by linking the co-data to the priors can be estimated by alternating MCMC with likelihood-based optimization. The method is conceptually straightforward, but can be very time-consuming, as it requires multiple MCMC runs. Below, we focus on a ‘do-it-yourself’ solution that is computationally much more efficient.

Our solution applies when (a) the penalty corresponds to a prior with finite variance, such as the lasso penalty corresponding to a Laplace prior; and (b) the data matrix $X$ is non-sparse. Under these conditions a Gaussian approximation of the prior is proven to be asymptotically appropriate to estimate the hyperparameters $\boldsymbol{\alpha}$ [16]. This corresponds to ridge regression, for which very efficient algorithms are available to determine the hyperparameters by maximizing the marginal likelihood. This method has been implemented for the elastic net in the R-packages xtune [15] and squeezy [16]. It can, however, also be used to estimate other penalties/priors using the following algorithm:

  1. Determine co-data sources $Z$ and define Gaussian (= ridge) variances $v_j^R = \lambda_j^{-1}$, with $f(\lambda_j) = Z_{\cdot j}^{\top}\boldsymbol{\alpha}$

  2. Estimate $\boldsymbol{\alpha}$, and hence $v_j^R$, in the Gaussian setting using xtune or squeezy

  3. Equate the theoretical variance $v_j$ of the desired prior to $\hat{v}_j^R$ to obtain variable-specific prior parameters (see the sketch after this list for a Laplace example)

  4. Estimate the high-dimensional model using those variable-specific prior parameters
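For instance, for the lasso penalty the corresponding Laplace prior with scale $b_j$ has variance $2b_j^2$, so step 3 reduces to $b_j = (\hat{v}_j^R/2)^{1/2}$. A minimal sketch, assuming the ridge variances from step 2 are already available:

```r
## Step 3 for a Laplace (lasso) prior: Var(Laplace(0, b)) = 2 b^2, so equating the prior
## variance to the estimated ridge variance yields a variable-specific Laplace scale.
v_ridge  <- c(0.02, 0.10, 0.50)   # hypothetical ridge variances from step 2
b_j      <- sqrt(v_ridge / 2)     # Laplace scales b_j
lambda_j <- 1 / b_j               # equivalently, variable-specific lasso penalties (rates)
```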

Next, we give an example for the spike-and-slab prior. This is a versatile, natural and powerful prior for high-dimensional settings, in particular useful for Bayesian variable selection [8], [14], [24]. We illustrate the ‘do-it-yourself’ principle in this setting: the co-data guided spike-and-slab prior. For this purpose, we focus on the simplest formulation of the spike-and-slab prior:

$$\beta_j \sim (1-q)\,\delta_0 + q\,N(0, \tau^2), \tag{3}$$

with an appropriate prior on the global parameter $\tau^2$. A computationally efficient variational Bayes algorithm to approximate the posteriors of $\boldsymbol{\beta}$ is derived in [8], rendering its implementation, varbvs, very suitable for variable selection in very high-dimensional genetic studies. Now suppose one wishes to incorporate co-data sources $Z_{1\cdot} = (z_{1j})_{j=1}^{p}$ and $Z_{2\cdot} = (z_{2j})_{j=1}^{p}$ to modify the prior inclusion probability $q$. For example, $Z_{1\cdot}$ represents a prior grouping of the genetic variables, while $Z_{2\cdot}$ represents log p-values for those variables in a previous study. Then, we may modify the prior to:

$$\beta_j \sim (1-q_j)\,\delta_0 + q_j\,N(0, \tau^2), \tag{4}$$

where the variable-specific inclusion probability depends on the co-data for variable $j$: $Z_{\cdot j} = (z_{1j}, z_{2j})$. This prior is available in varbvs, but requires specification of $(q_j)_{j=1}^{p}$. We now explain how to estimate this quantity with the existing software tools, following the algorithm above.

First, we model the ridge penalties $\lambda_j = f_{\boldsymbol{\alpha}}(Z_{\cdot j}) = \alpha_1 z_{1j} + \alpha_2 z_{2j}$. Second, we estimate $(\alpha_1, \alpha_2)$ using the R-package xtune. The reciprocal penalties render ridge-based variances $\hat{v}_j^R$. Third, we compute the theoretical prior variances $v_j$ from (4). For that, denote the latent indicator $1\{\beta_j \neq 0\}$ by $I_j$. Then, we have for the prior variance of $\beta_j$:

$$v_j = \mathbb{V}(\beta_j) = \mathbb{E}_{I_j}\!\left[\mathbb{V}[\beta_j \mid I_j]\right] + \mathbb{V}_{I_j}\!\left[\mathbb{E}[\beta_j \mid I_j]\right] = \left\{(1-q_j)\cdot 0 + q_j\tau^2\right\} + 0 = q_j\tau^2.$$

Equating $v_j$ to $\hat{v}_j^R$ renders relative estimates of $q_j$, as we have $q_j = C v_j$, with $C$ an unknown constant. Our benchmark is the prior model (3), fitted by varbvs [8], which requires one to specify $q$. We set $q = 0.01$, implying that we expect a fairly sparse signal, with a prior 99 % probability for $\beta_j$ to equal 0. Then, to ensure a meaningful comparison with the benchmark model, we set $\bar{q} = p^{-1}\sum_{j=1}^{p} q_j = 0.01$, which determines $C$ and hence results in absolute estimates of $q_j$. Fourth, these estimates are then used to define the variable-specific priors (4), and the spike-and-slab model is fit using the varbvs package.
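A minimal sketch of steps three and four for this example, assuming the ridge variances $\hat{v}_j^R$ have already been obtained with xtune (step 2); the placeholder values and the clipping are illustrative, and the resulting $q_j$ are then supplied to varbvs on the scale its prior-inclusion argument expects.

```r
## Step 3: relative estimates q_j = C * v_j, with C fixed by the prior mean inclusion
## probability q_bar = 0.01 (matching the benchmark model without co-data).
v_ridge <- rexp(10000, rate = 50)        # placeholder for the xtune-based ridge variances
q_bar   <- 0.01
q_j     <- q_bar * v_ridge / mean(v_ridge)
q_j     <- pmin(q_j, 0.99)               # guard against probabilities close to or above 1

## Step 4: pass these variable-specific prior inclusion probabilities to the spike-and-slab
## fit (in varbvs via its prior log-odds argument; see the package documentation for the scale).
```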

As a proof of concept, we show the benefit of moderating the prior inclusion probabilities by co-data in a high-dimensional genetics setting, simulated as in [8]. Here, single nucleotide polymorphisms (SNPs), a much-studied type of genetic variant, are simulated in two steps. First, the frequency of the polymorphism on either of the two parental alleles is generated. Then, the count over both alleles is simulated, which defines the variables. We focus on simulated data, as this allows us to evaluate variable selection. For samples i = 1, …, 500 and variables j = 1, …, p = 10,000, generate:

$$
\begin{aligned}
M_j &= 0.05 + 0.45\,U_j, \quad U_j \overset{\text{iid}}{\sim} U[0,1] && \text{(Frequencies)}\\
x_{ij} &\overset{\text{iid}}{\sim} \text{Bin}(2, M_j) && \text{(Allele counts)}\\
(\beta_j)_{j=1}^{150} &\overset{\text{iid}}{\sim} N(0, \tau^2 = 0.25), \quad (\beta_j)_{j=151}^{p} = 0 && \text{(Coefficients)}\\
Y_i &= \sum_{j=1}^{10000} \beta_j x_{ij} && \text{(Response)}
\end{aligned}
$$

Furthermore, the co-data is generated as follows. The grouping $Z_{1\cdot}$ contains two groups: a small group of 500 variables, of which 100 correspond to variables with non-zero $\beta_j$'s and 400 to variables with $\beta_j$'s equalling 0; and a large group of the remaining 9,500 variables, 50 of which correspond to non-zero $\beta_j$'s. Therefore, the first group is enriched in signal. The second co-data source $Z_{2\cdot}$ consists of log(external) p-values. The first 150 p-values are generated from a balanced mixture of two beta distributions, $\mathcal{B}(0.1, 10)$ and $\mathcal{B}(1, 5)$, and the remaining ones were generated from a uniform distribution. Hence, this co-data source should also be informative, as the non-zero $\beta_j$'s tend to correspond to relatively small external p-values.
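One way to implement this data- and co-data-generating scheme is sketched below; the scripts used for the reported results are in the repository of Section 6.

```r
set.seed(1)
n <- 500; p <- 10000; p_nz <- 150

## SNP-like variables: allele frequencies, then counts over two alleles per sample.
freq <- 0.05 + 0.45 * runif(p)
X    <- sapply(freq, function(m) rbinom(n, size = 2, prob = m))
beta <- c(rnorm(p_nz, sd = 0.5), rep(0, p - p_nz))        # tau^2 = 0.25
y    <- drop(X %*% beta)

## Co-data source 1: grouping enriched for signal (small group: 100 non-zero + 400 zero).
Z1 <- rep(2, p)
Z1[c(1:100, sample((p_nz + 1):p, 400))] <- 1

## Co-data source 2: log external p-values, informative for the first 150 variables.
p_ext <- c(ifelse(runif(p_nz) < 0.5, rbeta(p_nz, 0.1, 10), rbeta(p_nz, 1, 5)),
           runif(p - p_nz))
Z2 <- log(p_ext)
```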

For this simulated data set the co-data is indeed very informative. The estimated hyper-parameters $(\hat{\alpha}_1, \hat{\alpha}_2)$ render estimated prior inclusion probabilities $(\hat{q}_j)_{j=1}^{150}$ with median 0.136 and quartiles (0.0096, 0.306) for the relevant variables. These are considerably higher than the summaries for the irrelevant ones, $(\hat{q}_j)_{j=151}^{p}$, with median 0.0031 and quartiles (0.0028, 0.0078). Figure 4 shows the results for one simulated data set (as results are qualitatively very similar among multiple data sets). Not surprisingly, the improvement in variable selection performance with respect to the benchmark (no co-data) model is noticeable when either or both co-data sources are used.

Figure 4: 
ROC curves for variable selection using spike-and-slab models. X-axis: specificity, y-axis: sensitivity.

5 Data example

Here, we discuss a data example to show that the use of co-data can select genomic features that are supported by multiple data sources. The data and co-data were extensively discussed in [25]; here, we summarize them. The data X contains p = 1,689 variables, which are microRNA expression features, for n = 88 patients with metastasized colorectal cancer. The binary therapy response Y contrasts clinical benefit versus progressive disease. The co-data consists of three continuous sources: first, ‘abundance’ of the feature, which is simply the mean of the feature across samples. Here, the hypothesis is that more abundant features may be more relevant for explaining the response. The two other co-data sources are measures of significance (quantified by false discovery rate) of the features in two related comparisons on different patients. The first (FDR1) contrasts the features between metastases and adjacent normal tissue, while the second (FDR2) contrasts the features between the primary tumor and adjacent normal colorectal tissue. Their incorporation as co-data stems from the belief that tumor- or metastasis-specific features (i.e., significant in either comparison) may be prognostically informative for therapy response.

We show the results using ecpc [25], as this package can handle multiple co-data sources and allows flexible modelling of the three continuous co-data sources using monotone splines with hyperparameter shrinkage. Figure 5 shows (a) the estimated inverse ridge penalty factors $\boldsymbol{\lambda}^{-1} = (\lambda_j^{-1})_{j=1}^{p}$ as a function of each of the continuous co-data sources; (b) the estimated density of the numerical co-data across all features and across 50 selected features, using the posthoc selection procedure based on the elastic net described in [25]. The figure shows that all three co-data sources are relevant, and that the inverse penalties are in line with the hypotheses: features that are more abundant or more significant in the external data (larger value of −log FDRk, k = 1, 2) are penalized less. Moreover, the bottom row of Figure 5 shows that the selected features are indeed enriched in terms of beneficial co-data properties, and hence are supported by both the data and the co-data. Note that for real data the ‘true set’ of non-zero features is of course unknown, so the accuracy of the selection cannot be evaluated. For several data sets, including the current one, it was shown, however, that co-data guided selection leads to more stable feature sets [25], which is a desirable property.

Figure 5: 
Top row: Inverse ridge penalty factors as a function of the three co-data sources. Bottom row: Densities of co-data for all (black) and 50 selected (grey) features.

Supplementary Figure 5 corresponds to Figure 5, but with the three co-data sources randomized by shuffling the indices, thereby breaking the link with the data. This acts as a ‘null-case’ for which the co-data should have relatively little impact on the estimated penalties. Indeed, we observe this for the first (abundance) and the third co-data source (−log FDR2), as the penalty factors are very flat. Moreover, the density of selected features resembles that of all features in terms of co-data characteristics. For the second co-data source (−log FDR1), one may argue that the co-data model slightly overfits, although the penalty factors are much less extreme than for the real co-data. The added value of co-data for feature selection cannot be assessed for real data, as the truth is unknown. We advise to verify whether the co-data adaptation improves or deteriorates the prediction as compared to the baseline method. In this example, the real (non-randomized) co-data reduces the cross-validated MSE of the prediction by 6 %, whereas the randomized co-data increases it by 5 %. This indicates that the real co-data is potentially relevant, whereas the ‘fake’, randomized co-data is indeed not useful and can be discarded.

6 Software and data availability

R-scripts to reproduce the results in this manuscript are available via: https://github.com/markvdwiel/CodataReview/, which also includes the real data and the simulations to produce the synthetic data underlying this article.

7 Discussion

We reviewed methods that implement guided adaptive shrinkage, including group-adaptive methods. The latter framework was contrasted with group-regularized methods such as the sparse group-lasso. Multiple sources of co-data of mixed types, such as external p-values and groupings of variables, are often available. This setting is nowadays conveniently accommodated by several methods via a regression-type parametrization that links those co-data to the hyperparameters. We emphasize that the combination of multiple co-data sources is a very powerful tool to improve variable selection, as was illustrated for the simulated SNP data. The intuition behind this improvement is that high-dimensional data is, by definition, highly collinear. While sparse regularization may help to pinpoint the needles in the haystack, these needles become more ‘visible’ when they are also highlighted by co-data sources, as opposed to their collinear competitors. So, co-data makes the needles stick out more, which improves variable selection [17], [25].

Therefore, tools that accommodate automatic retrieval of co-data are a welcome addition to the methodology. In the genomics setting, Comprior [26] was developed for this purpose. It provides tools for the extraction of gene and/or pathway scores or lists from several well-known genomic databases. Moreover, it integrates with xtune to use such co-data for lasso-based variable selection. In addition, a module for automatic retrieval of gene-centered co-data from scientific articles is available [27]. In short, the authors use a convolutional neural network-based text analysis to learn a gene score for its relation to the disease of interest. This score may then be used as co-data for a given study. The latter approach is generic, so it likely also applies to non-genomics applications.

An important criticism of guided adaptive shrinkage methods is that they may be prone to overfitting when the number of hyperparameters (the size of $\boldsymbol{\alpha}$) is large. Fully Bayesian methods like graper [14] and infHS [17] may counter this by using a (weakly) informative prior, whereas ecpc [25] allows one to regularize the empirical Bayes moment equations to stabilize the estimation of $\boldsymbol{\alpha}$. For the group-adaptive lasso setting we provide a potential solution in the Supplementary Material, based on targeted hyperparameter shrinkage. We advise comparing the predictive accuracy of the guided adaptive shrinkage method with its baseline counterpart, i.e. the same model without co-data guided shrinkage. Inferior predictive performance of the former may indicate overfitting. Then, one should either use a simpler co-data model with fewer hyper-parameters, or consider discarding the co-data.

We limited our study to guided adaptive shrinkage methods and their main ‘competitor’: the non-adaptive (sparse) group-lasso. We should note that these are part of a much bigger ecosystem of methods that allow one to account for prior or structural information. For example, transfer learning can be applied to transfer knowledge on high-dimensional regression coefficients learnt in an external domain to the primary domain of interest [28]. Prioritization may be relevant when the structural information consists of groups of variables representing data modalities: priority-lasso [29] handles such groups of variables, giving priority to some groups over others for practical reasons. Moreover, graphical structure, either known or estimated, can be incorporated in prediction algorithms to account for associations between variables [30]. Finally, variable selection may benefit from accounting for measurement error, for example in boosting algorithms [31], [32].

We primarily focused on evaluating the accuracy of variable selection, as most of the reviewed guided adaptive shrinkage papers contain extensive results on the potential of these methods for improving prediction. Moreover, it has been demonstrated that the stability of the selected variable set also improves with the use of co-data [25].

Finally, we note that the use of prior information is fundamental to science: ‘science builds on science’. The plethora of guided shrinkage methods reviewed here provides researchers the means to structurally do so in high-dimensional settings.


Corresponding author: Mark A. van de Wiel, Department of Epidemiology and Data Science, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands, E-mail: 

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The author states no conflict of interest.

  6. Research funding: None declared.

  7. Data availability: The data that support the findings of this study are openly available in github at https://github.com/markvdwiel/CodataReview/.

References

1. Bondell, HD, Reich, BJ. Consistent high-dimensional Bayesian variable selection via penalized credible regions. J Am Stat Assoc 2012;107:1610–24. https://doi.org/10.1080/01621459.2012.716344.

2. Meinshausen, N, Bühlmann, P. Stability selection. J Roy Stat Soc B Stat Methodol 2010;72:417–73. https://doi.org/10.1111/j.1467-9868.2010.00740.x.

3. Neuenschwander, B, Capkun-Niggli, G, Branson, M, Spiegelhalter, DJ. Summarizing historical information on controls in clinical trials. Clin Trials 2010;7:5–18. https://doi.org/10.1177/1740774509356002.

4. Van de Wiel, MA, Lien, TG, Verlaat, W, van Wieringen, WN, Wilting, SM. Better prediction by use of co-data: adaptive group-regularized ridge regression. Stat Med 2016;35:368–81. https://doi.org/10.1002/sim.6732.

5. Tay, JK, Aghaeepour, N, Hastie, T, Tibshirani, R. Feature-weighted elastic net: using “features of features” for better prediction. Stat Sin 2023;33:259. https://doi.org/10.5705/ss.202020.0226.

6. Smeekes, S, Wijler, E. Macroeconomic forecasting using penalized regression methods. Int J Forecast 2018;34:408–30. https://doi.org/10.1016/j.ijforecast.2018.01.001.

7. Greenwood, CJ, Youssef, GJ, Letcher, P, Macdonald, JA, Hagg, LJ, Sanson, A, et al. A comparison of penalised regression methods for informing the selection of predictive markers. PLoS One 2020;15:e0242730. https://doi.org/10.1371/journal.pone.0242730.

8. Carbonetto, P, Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal 2012;7:73–108. https://doi.org/10.1214/12-ba703.

9. Boulesteix, AL, De Bin, R, Jiang, X, Fuchs, M. IPF-LASSO: integrative-penalized regression with penalty factors for prediction based on multi-omics data. Comput Math Methods Med 2017;2017:1–14. https://doi.org/10.1155/2017/7691937.

10. Novianti, PW, Snoek, BC, Wilting, SM, van de Wiel, MA. Better diagnostic signatures from RNAseq data through use of auxiliary co-data. Bioinformatics 2017;33:1572–4. https://doi.org/10.1093/bioinformatics/btw837.

11. van Nee, MM, Wessels, LFA, van de Wiel, MA. Flexible co-data learning for high-dimensional prediction. Stat Med 2021;40:5910–25. https://doi.org/10.1002/sim.9162.

12. Tai, F, Pan, W. Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics 2007;23:1775–82. https://doi.org/10.1093/bioinformatics/btm234.

13. Bergersen, LC, Glad, IK, Lyng, H. Weighted lasso with data integration. Stat Appl Genet Mol Biol 2011;10:1–29. https://doi.org/10.2202/1544-6115.1703.

14. Velten, B, Huber, W. Adaptive penalization in high-dimensional regression and classification with external covariates using variational Bayes. Biostatistics 2021;22:348–64. https://doi.org/10.1093/biostatistics/kxz034.

15. Zeng, C, Campbell Thomas, D, Lewinger, JP. Incorporating prior knowledge into regularized regression. Bioinformatics 2021;37:514–21. https://doi.org/10.1093/bioinformatics/btaa776.

16. van Nee, MM, van de Brug, T, van de Wiel, MA. Fast marginal likelihood estimation of penalties for group-adaptive elastic net. J Comput Graph Stat 2023;32:950–60. https://doi.org/10.1080/10618600.2022.2128809.

17. Busatto, C, van de Wiel, MA. Informative co-data learning for high-dimensional horseshoe regression. arXiv preprint arXiv:2303.05898; 2023.

18. Zhao, Z, Zucknick, M. Structured penalized regression for drug sensitivity prediction. J Roy Stat Soc C Appl Stat 2020;69:525–45. https://doi.org/10.1111/rssc.12400.

19. Vinga, S. Structured sparsity regularization for analyzing high-dimensional omics data. Briefings Bioinf 2021;22:77–87. https://doi.org/10.1093/bib/bbaa122.

20. Zhu, L, Huo, Z, Ma, T, Oesterreich, S, Tseng, GC. Bayesian indicator variable selection to incorporate hierarchical overlapping group structure in multi-omics applications. Ann Appl Stat 2019;13:2611–36. https://doi.org/10.1214/19-aoas1271.

21. Simon, N, Friedman, J, Hastie, T, Tibshirani, R. A sparse-group lasso. J Comput Graph Stat 2013;22:231–45. https://doi.org/10.1080/10618600.2012.681250.

22. Van de Wiel, MA, te Beest, DE, Münch, M. Learning from a lot: empirical Bayes in high-dimensional prediction settings. Scand J Stat 2018:1–24. https://doi.org/10.1111/sjos.12335.

23. Casella, G, George, EI. Empirical Bayes Gibbs sampling. Biostatistics 2001;2:485–500. https://doi.org/10.1093/biostatistics/2.4.485.

24. Newcombe, PJ, Raza Ali, H, Blows, FM, Provenzano, E, Pharoah, PD, Caldas, C, et al. Weibull regression with Bayesian variable selection to identify prognostic tumour markers of breast cancer survival. Stat Methods Med Res 2014;26:1–23. https://doi.org/10.1177/0962280214548748.

25. van Nee, MM, Wessels, LFA, van de Wiel, MA. ecpc: an R-package for generic co-data models for high-dimensional prediction. BMC Bioinf 2023;24:172. https://doi.org/10.1186/s12859-023-05289-x.

26. Perscheid, C. Comprior: facilitating the implementation and automated benchmarking of prior knowledge-based feature selection approaches on gene expression data sets. BMC Bioinf 2021;22:1–15. https://doi.org/10.1186/s12859-021-04308-z.

27. Wang, F, Liang, D, Li, Y, Ma, S. Prior information-assisted integrative analysis of multiple datasets. Bioinformatics 2023;39:btad452. https://doi.org/10.1093/bioinformatics/btad452.

28. Rauschenberger, A, Landoulsi, Z, van de Wiel, MA, Glaab, E. Penalized regression with multiple sources of prior effects. Bioinformatics 2023;39:btad680. https://doi.org/10.1093/bioinformatics/btad680.

29. Klau, S, Jurinovic, V, Hornung, R, Herold, T, Boulesteix, AL. Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinf 2018;19:322. https://doi.org/10.1186/s12859-018-2344-6.

30. Chen, LP, Tsao, HS. GUEST: an R package for handling estimation of graphical structure and multiclassification for error-prone gene expression data. Bioinformatics 2024;40:btae731. https://doi.org/10.1093/bioinformatics/btae731.

31. Chen, LP. BOOME: a Python package for handling misclassified disease and ultrahigh-dimensional error-prone gene expression data. PLoS One 2022;17:e0276664. https://doi.org/10.1371/journal.pone.0276664.

32. Chen, LP, Qiu, B. SIMEXBoost: an R package for analysis of high-dimensional error-prone data based on boosting method. R J 2024;15:5–20. https://doi.org/10.32614/rj-2023-080.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/ijb-2024-0108).


Received: 2024-11-28
Accepted: 2025-05-19
Published Online: 2025-09-08

© 2025 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
