
Highly adaptive Lasso for estimation of heterogeneous treatment effects and treatment recommendation

Sohail Nizam, Allison Codi, Elizabeth Rogawski McQuade and David Benkeser
Published/Copyright: August 20, 2025

Abstract

The estimation of conditional average treatment effects (CATEs) is an important problem in many applications. Many machine learning-based frameworks for such estimation have been proposed, including meta-learning, causal trees, and causal forests. However, few of these methods are interpretable, and those that emphasize interpretability often sacrifice performance. Here, we propose several methods that build on existing meta-learning algorithms to produce CATE estimates that can be represented as trees. We also describe new methods for the estimation of optimal treatment policies (OTPs), an area where interpretable, auditable treatment decision rules are often desirable. We introduce these methods for settings with an arbitrary number of treatment arms. We provide regret rates for the proposed methods and show that they outperform popular methods, both interpretable and not. Finally, we demonstrate the use of our method on both simulated and real data from the Antibiotics for Children with severe Diarrhea trial to create OTPs for antibiotic treatment.

MSC 2010: 62D20; 62G20; 68T99

1 Introduction

Methods for the estimation of heterogeneous treatment effects have been widely studied in recent years [1–3]. The development of such methods is vital in fields like healthcare, where treatments may have different effects when administered to people with different characteristics. Developing treatment policies based on population-averaged effects could lead to unnecessary or even harmful treatment decisions for certain individuals. Ideally, we would like to derive optimal guidelines to help practitioners recommend the best treatment based on the clinical presentation of the patient.

A common estimand for addressing this question is the conditional average treatment effect (CATE), which measures the difference in counterfactual outcomes under two treatments conditional on a set of covariates. A natural extension of the CATE is the optimal treatment policy (OTP), which gives the treatment that will maximize the counterfactual outcome given a set of covariates. The OTP can be easily estimated given an estimate of the CATE.

One popular set of methods for estimating the CATE is meta-learning. Meta-learning encompasses methods that repurpose off-the-shelf machine-learning algorithms for estimation of the CATE. These include the S, T, X, R, and DR learners [4–6]. Athey and Imbens [7] introduced causal trees, a method for CATE estimation that involves recursive partitioning of the covariate space. That partitioning can then be represented as an interpretable decision tree. They extended this method to causal forests [8], which improve performance but lose interpretability.

With the exception of causal trees, there are few nonparametric methods for CATE estimation that emphasize interpretability. However, interpretability of such estimates is often of paramount importance, especially if they are to be used to guide treatment decisions. If an estimated policy suggests to a doctor that a patient should (or should not) be treated, the doctor should be able to discern why that decision was rendered. If the policy is a black box, it is impossible to tell what characteristics the decisions are based on and whether those decisions are fair and logical.

In this article, we introduce methods for CATE and OTP estimation that are both interpretable and high performing. Our methods combine the S-learner and DR-learner with the highly adaptive Lasso (HAL) algorithm. We show that the resulting estimates achieve fast regret rates. Previous work has also shown that HAL regression fits can be represented as trees [9]. We adapt those representation algorithms to the cases of CATE and OTP estimation.

Our contributions can thus be summarized as new algorithms for developing flexible, yet interpretable estimates of heterogeneous treatment effects and OTP. An additional strength of our proposed methods is that we allow for any number of treatment arms and estimation based on any subset of covariates.

2 Problem setup and estimation frameworks

2.1 Setup

Assume we observe an independent and identically distributed sample of observations $O_i = (W_i, A_i, Y_i) \sim P_0$, where $P_0$ is an unknown distribution that lies in a nonparametric model $\mathcal{M}$. Here, $W \in \mathbb{R}^p$ are covariates, $A \in \mathcal{A} = \{0, 1, \ldots, K\}$ is a treatment, and $Y \in \mathbb{R}$ is a real-valued outcome. Furthermore, let $V \subseteq W$ be some subset of covariates with $V \in \mathbb{R}^q$ for $q \le p$. For each $a \in \mathcal{A}$, we define a counterfactual data unit $X_i(a) = (W_i, Y_i(a)) \sim P_{0,a}$, corresponding to the data that would have been observed on unit $i$ if, possibly counter to fact, the unit had been assigned treatment $a$. The conditional average treatment effect (CATE) is defined as $\psi_{0,a}(v) \equiv E_{0,a}[Y(a) \mid V = v] - E_{0,a}[Y(0) \mid V = v]$, where for a $P_{0,a}$-measurable function $f$, we write the expectation of $f(O)$ as $E_{0,a}[f(O)] = \int f \, dP_{0,a}$. We note that our notation for the CATE explicitly makes level 0 of the treatment a referent level against which all other treatments are compared, and we define $\mathcal{A}_0 \equiv \mathcal{A} \setminus \{0\}$.

The CATE quantifies the expected difference in counterfactual outcomes under treatment $a \in \mathcal{A}_0$ versus treatment 0 for data units with $V = v$. Also of interest is the OTP, $d_0(v) \equiv \operatorname{argmax}_{a \in \mathcal{A}} E_{0,a}[Y(a) \mid V = v]$. Given an estimate $\hat{\psi}_a(v)$ of $\psi_{0,a}(v)$, we can construct an estimate of the OTP as

$$\hat{d}(v) \equiv \begin{cases} 0, & \text{if } \hat{\psi}_a(v) < 0 \ \forall \, a \in \mathcal{A}_0, \\ \operatorname{argmax}_{a \in \mathcal{A}_0} \hat{\psi}_a(v), & \text{otherwise.} \end{cases}$$
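As a concrete illustration, this rule, generalized to an arbitrary threshold $t$ (with $t = 0$ recovering the display above), can be sketched in Python. The function name `policy_from_cate` and the strict-inequality tie-breaking convention are our own illustrative choices:

```python
import numpy as np

def policy_from_cate(cate, threshold=0.0):
    """Map CATE estimates to a treatment recommendation.

    cate: (n, K) array; column k holds the estimated CATE for treatment
          a = k + 1 versus the reference treatment 0, evaluated at v_i.
    Returns an (n,) array of recommended treatments in {0, 1, ..., K}.
    """
    cate = np.asarray(cate)
    best_arm = np.argmax(cate, axis=1) + 1   # best non-reference arm
    best_gain = np.max(cate, axis=1)         # its estimated benefit over arm 0
    # Recommend the reference arm unless some arm's benefit clears the threshold.
    return np.where(best_gain > threshold, best_arm, 0)
```

For example, with two non-reference arms and `threshold=0.15`, a unit whose largest estimated benefit is 0.1 is assigned the reference treatment.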

Treatment policies could also be derived from estimators of the CATE by giving level 0 of the treatment unless the additional benefit exceeds a policy-relevant threshold $t$ (in the above, $t = 0$). In our real data analysis below, we consider the performance of estimators under various thresholds for defining treatment policies. Estimation of the CATE and OTP requires a set of causal assumptions that allow us to rewrite the parameter without reference to unobservable counterfactuals. One such set of assumptions is as follows.

  1. Positivity: $P_0\{0 < P_0(A = a \mid W) < 1\} = 1$ for all $a \in \mathcal{A}$,

  2. No unobserved confounding: $Y(a) \perp A \mid W$ for all $a \in \mathcal{A}$,

  3. Consistency: $Y_i = \sum_{a \in \mathcal{A}} 1_a(A_i) Y_i(a)$,

  4. No interference: the counterfactual outcome of each individual under each treatment does not depend on treatments assigned to other individuals in the population.

If these assumptions are satisfied, the CATE is identified by a parameter of the distribution of the observed data $P_0$. We define $\mu_{0,a}(w) \equiv E_0[Y \mid A = a, W = w]$, where $E_0$ denotes the expectation under $P_0$. The CATE is identified as $\psi_{0,a}(v) = E_0[\mu_{0,a}(W) - \mu_{0,0}(W) \mid V = v]$. The object $\mu_{0,a}$ is referred to as the outcome regression, and we sometimes use the notation $\mu_0 = (\mu_{0,a} : a \in \mathcal{A})$ to refer collectively to the outcome regressions under all treatments. In the following, we also discuss the propensity score for treatment $a$, $\pi_{0,a}(w) \equiv P_0(A = a \mid W = w)$.

2.2 Meta-learning with two treatment arms

This CATE estimation problem is commonly presented in the case of two treatment arms, $\mathcal{A} = \{0, 1\}$, in which case the target parameter for estimation is $\psi_{0,1}$, a function only of the covariates $V$. A common approach to estimation of $\psi_{0,1}$ is the S-learner [4], so-called because it involves a single outcome regression estimator. The estimate $\hat{\mu}$ of $\mu_0$ can be obtained using any regression technique. For example, many such techniques estimate $\mu_0$ by minimizing an empirical risk criterion based on a loss $(\mu, o) \mapsto L(\mu)(o)$. We focus throughout on results for squared-error loss, with the understanding that other loss functions could be used. For the purpose of learning $\mu_0$, the squared-error loss may be written as $L(\mu)(o) = \{y - \mu_a(w)\}^2$.

Regardless of the learning approach adopted to generate $\hat{\mu}$, the S-learner CATE estimate can be obtained post hoc. If $V = W$, then the estimate is $\hat{\psi}_{S,1}(w) \equiv \hat{\mu}_1(w) - \hat{\mu}_0(w)$. If instead $V \subset W$, then a second stage of learning is required. In particular, we define a pseudo-outcome $D_{\hat{\mu}}(O_i) \equiv \hat{\mu}_1(W_i) - \hat{\mu}_0(W_i)$ and regress this pseudo-outcome on the covariates $V$ to obtain the CATE estimate $\hat{\psi}_{S,1}(v) \equiv \hat{E}[D_{\hat{\mu}}(O) \mid V = v]$. Again, the regression learning technique is interchangeable. For example, we could minimize empirical risk based on the squared-error loss $L_S(\psi)(o) = \{D_\mu(o) - \psi_1(v)\}^2$.
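The two-stage S-learner procedure can be sketched in Python. This is a minimal illustration, not the authors' implementation: scikit-learn's gradient boosting stands in for HAL in both stages, and the helper name `s_learner_cate` is ours:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def s_learner_cate(W, A, Y, V=None):
    """Two-arm S-learner: one regression of Y on (A, W), then a contrast.

    If V (a column subset of W) is given, a second-stage regression of the
    pseudo-outcome on V is fit; otherwise the plug-in contrast is returned.
    """
    # Single outcome regression mu_hat(a, w), with A as just another feature.
    mu_hat = GradientBoostingRegressor().fit(np.column_stack([A, W]), Y)
    mu1 = mu_hat.predict(np.column_stack([np.ones(len(Y)), W]))
    mu0 = mu_hat.predict(np.column_stack([np.zeros(len(Y)), W]))
    D = mu1 - mu0                                  # pseudo-outcome
    if V is None:                                  # case V = W: plug-in contrast
        return lambda w: (
            mu_hat.predict(np.column_stack([np.ones(len(w)), w]))
            - mu_hat.predict(np.column_stack([np.zeros(len(w)), w])))
    # Case V subset of W: regress the pseudo-outcome on the coarser covariates.
    second = GradientBoostingRegressor().fit(V, D)
    return second.predict
```

Any regression learner with the same fit/predict interface could be substituted in either stage.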

An alternative strategy sometimes adopted is to split the data and regress $Y$ on $W$ separately within each treatment group to obtain estimates $\hat{\mu}_0(w)$ and $\hat{\mu}_1(w)$ of $\mu_{0,0}(w)$ and $\mu_{0,1}(w)$, respectively. This method is referred to as the T-learner [4]. Again, if $V = W$, we can estimate the CATE by subtracting the two regression estimates: $\hat{\psi}_{T,1}(w) = \hat{\mu}_1(w) - \hat{\mu}_0(w)$. If $V \subset W$, this difference constitutes a pseudo-outcome, which we further regress on $V$ to obtain our CATE estimate. The T-learner typically has lower performance due to the lack of information sharing across treatment arms when estimating the outcome regression [6]; thus, it is not considered further here.

Van der Laan [10] described a doubly robust (DR) learner, which involves estimation not only of the outcome regression but also of the propensity score. The two nuisance parameter estimates $\hat{\mu}$ and $\hat{\pi}$ are both used to construct the pseudo-outcome:

(1) $D_{\hat{\mu},\hat{\pi}}(O_i) = \dfrac{2A_i - 1}{\hat{\pi}_{A_i}(W_i)} \{Y_i - \hat{\mu}_{A_i}(W_i)\} + \hat{\mu}_1(W_i) - \hat{\mu}_0(W_i)$.

$D_{\hat{\mu},\hat{\pi}}(O)$ is then regressed on $V$ to obtain the CATE estimate $\hat{\psi}_{DR,1}(v) \equiv \hat{E}[D_{\hat{\mu},\hat{\pi}}(O) \mid V = v]$. We can obtain this estimate, for example, via empirical risk minimization based on a loss $L_{DR}(\psi)(o)$. We focus on the squared-error loss, $L_{DR}(\psi)(o) = \{D_{\mu,\pi}(o) - \psi_1(v)\}^2$. This estimator is referred to as doubly robust because $E_0[D_{\mu,\pi}(O) \mid V = v] = \psi_{0,1}(v)$ if either $\mu = \mu_0$ or $\pi = \pi_0$. Thus, we generally expect that $\hat{\psi}_{DR}$ will be consistent for $\psi_0$ if either $\hat{\mu}$ is consistent for $\mu_0$ or $\hat{\pi}$ is consistent for $\pi_0$.
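Equation (1) translates directly into array operations. A minimal sketch (the helper name `dr_pseudo_outcome` is ours), assuming predictions from previously fitted nuisance models:

```python
import numpy as np

def dr_pseudo_outcome(Y, A, mu0, mu1, pi1):
    """Doubly robust pseudo-outcome of equation (1), two treatment arms.

    mu0, mu1: outcome-regression predictions mu_hat_a(W_i) at a = 0, 1.
    pi1: propensity predictions pi_hat_1(W_i); pi_hat_0 = 1 - pi1.
    """
    mu_A = np.where(A == 1, mu1, mu0)        # mu_hat_{A_i}(W_i)
    pi_A = np.where(A == 1, pi1, 1.0 - pi1)  # pi_hat_{A_i}(W_i)
    return (2 * A - 1) / pi_A * (Y - mu_A) + mu1 - mu0
```

When the residual $Y_i - \hat{\mu}_{A_i}(W_i)$ is zero, the pseudo-outcome reduces to the plug-in contrast $\hat{\mu}_1(W_i) - \hat{\mu}_0(W_i)$.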

2.3 Meta-learning with multiple treatment arms

These methods readily extend to the case when $|\mathcal{A}| > 2$. We describe analogs of each of the meta-learners when there are multiple treatment arms. First, we must choose a reference treatment against which all others will be compared. This may be a control group, or it may be one arm chosen arbitrarily. Throughout this article, we let $A = 0$ be the reference treatment.

In the case when $V = W$, the S-learner procedure is largely unchanged. We regress $Y$ on $A$ and $W$ to obtain $\hat{\mu}_a(w)$, which can then be used to construct $\hat{\psi}_{S,a}(w) \equiv \hat{\mu}_a(w) - \hat{\mu}_0(w)$ for each $a \in \mathcal{A}_0$.

In the case when $V \subset W$, the S-learner requires estimation of $\mu_0(a, w)$ and a second-stage regression as described earlier. For a given $a$, we could learn $\hat{\psi}_{S,a}(v)$ by empirical risk minimization based on the loss $L_{S,\mu_a}(\psi)(o)$, such as the squared-error loss $L_{S,\mu_a}(\psi)(o) = \{D_{\mu_a}(o) - \psi(v)\}^2$.

Alternatively, we could learn all of the $\hat{\psi}_a(v)$ simultaneously by considering a summed loss function $L_{S,\mu}(\psi) \equiv \sum_{a \in \mathcal{A}_0} \omega(a) L_{S,\mu_a}(\psi)$. Here, $\omega(a)$ is any positive weight. We index the loss by $\mu$ to enable distinction between the regression using the estimated nuisance parameters and the regression using the true nuisance parameters. To minimize this summed loss, we create $|\mathcal{A}| - 1$ pseudo-outcomes for each individual, with the pseudo-outcome for the $i$th individual associated with $\psi_a(v)$ calculated as

(2) $D_{\hat{\mu}}^a(O_i) = \hat{\mu}_a(W_i) - \hat{\mu}_0(W_i)$.

This results in a modified dataset with $n(|\mathcal{A}| - 1)$ rows that can then be used with any common regression software to learn a regression of the stacked pseudo-outcome values onto the covariates plus one-hot-encoded $a$ values. This approach may provide some benefit when some levels of treatment are rarely observed.
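The construction of the stacked second-stage dataset can be sketched as follows (the helper name `stack_s_learner` is illustrative):

```python
import numpy as np

def stack_s_learner(mu_hat, V, arms):
    """Build the stacked second-stage dataset for the multi-arm S-learner.

    mu_hat: dict mapping arm a -> array of predictions mu_hat_a(W_i).
    V: (n, q) array of second-stage covariates.
    arms: non-reference treatment levels (the reference arm is 0).
    Returns (X, D): covariates with one-hot arm indicators appended, and the
    stacked pseudo-outcomes, each with n * len(arms) rows.
    """
    n = V.shape[0]
    X_blocks, D_blocks = [], []
    for a in arms:
        onehot = np.zeros((n, len(arms)))
        onehot[:, arms.index(a)] = 1.0            # one-hot encoding of arm a
        X_blocks.append(np.hstack([V, onehot]))
        D_blocks.append(mu_hat[a] - mu_hat[0])    # pseudo-outcome of eq. (2)
    return np.vstack(X_blocks), np.concatenate(D_blocks)
```

The resulting `(X, D)` pair can be passed to any regression routine to fit all contrasts jointly.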

To construct the DR-learner with multiple treatments, we similarly create $|\mathcal{A}| - 1$ pseudo-outcomes for each individual. In this case, the pseudo-outcome for the $i$th individual associated with $\psi_{0,a}(v)$ is

(3) $D_{\hat{\mu},\hat{\pi}}^a(O_i) = \left[ \dfrac{1_a(A_i)}{\hat{\pi}_a(W_i)} - \dfrac{1_0(A_i)}{\hat{\pi}_0(W_i)} \right] \{Y_i - \hat{\mu}_{A_i}(W_i)\} + \hat{\mu}_a(W_i) - \hat{\mu}_0(W_i)$.

Note that if an individual $i$ has $A_i = a \neq 0$, (3) reduces to

$D_{\hat{\mu},\hat{\pi}}^a(O_i) = \dfrac{Y_i - \hat{\mu}_a(W_i)}{\hat{\pi}_a(W_i)} + \hat{\mu}_a(W_i) - \hat{\mu}_0(W_i)$.

If instead $A_i = 0$, then (3) reduces to

$D_{\hat{\mu},\hat{\pi}}^a(O_i) = -\dfrac{Y_i - \hat{\mu}_0(W_i)}{\hat{\pi}_0(W_i)} + \hat{\mu}_a(W_i) - \hat{\mu}_0(W_i)$.

Finally, if individual $i$ has $A_i \notin \{0, a\}$, (3) reduces to

$D_{\hat{\mu},\hat{\pi}}^a(O_i) = \hat{\mu}_a(W_i) - \hat{\mu}_0(W_i)$.

This pseudo-outcome form is doubly robust as established in the following.

Theorem 2.1

For each $a \in \mathcal{A}_0$, $E_0[D_{\mu,\pi}^a(O) \mid V = v] = \psi_{0,a}(v)$ if either $\mu = \mu_0$ or $\pi = \pi_0$.

The proof of Theorem 2.1 can be found in Section S1.1 of the Supplementary Material. As with the S-learner, the CATE estimate is obtained by regressing the stacked pseudo-outcomes $D_{\hat{\mu},\hat{\pi}}^a(O)$ on the covariates $V$ and one-hot-encoded $a$ values.
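The multi-arm pseudo-outcome of equation (3) can be sketched in a few lines, with comments marking the three case reductions above (the helper name is ours):

```python
import numpy as np

def dr_pseudo_outcome_multi(Y, A, a, mu_hat, pi_hat):
    """Pseudo-outcome of equation (3) for contrast arm a versus arm 0.

    mu_hat, pi_hat: dicts mapping each arm to an array of predictions at W_i.
    """
    # mu_hat_{A_i}(W_i): prediction under the arm actually received.
    mu_A = np.array([mu_hat[k][i] for i, k in enumerate(A)])
    # Weight is 1/pi_a if A_i = a, -1/pi_0 if A_i = 0, and 0 otherwise,
    # matching the three reductions of equation (3).
    weight = (A == a) / pi_hat[a] - (A == 0) / pi_hat[0]
    return weight * (Y - mu_A) + mu_hat[a] - mu_hat[0]
```

Calling this once per $a \in \mathcal{A}_0$ and stacking the results yields the second-stage regression dataset.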

2.4 Causal trees and forests

Athey and Imbens [7] developed causal trees for estimating the CATE in a way that can be represented as a decision tree. The method involves recursively partitioning the feature space into regions with constant treatment effect. They employ what they call "honest estimation" of the CATE by using separate sets of data to partition the feature space and to estimate the treatment effects within the resulting partitions. Partitioning is done greedily to minimize an estimate of the expected mean squared error of the treatment effect. The optimal depth of the tree is selected via cross-validation, and the tree is then pruned to the selected depth. Finally, treatment effects are estimated within each region of the partition using a held-out dataset. An extension of this method, causal forests, applies bootstrap aggregation to causal trees in order to obtain smoother estimates of the CATE [8]. We compare the performance of both causal trees and causal forests to our method in Section 5.2.

3 HAL

3.1 Background

Our method centers on the use of HAL for regression in the final stage of the chosen meta-learning algorithm. We briefly introduce HAL and its properties in the context of a general supervised learning task. Later, we extend these properties to estimation of the CATE.

HAL is a general approach for the estimation of functional parameters using regularized empirical risk minimization. The theory of HAL is built around two assumptions about the underlying functional parameter: (i) that it is right-continuous with left-hand limits (càdlàg) and (ii) that it has a finite variation norm. Smoothness defined through a function's variation norm has long been studied in the statistical learning literature [11].

Benkeser and van der Laan [12] and van der Laan [13] discussed an approach for finding the minimum loss-based estimator (MLE) in the class of functions with variation norm smaller than a finite constant. The MLE is computed based on the fact that any càdlàg function with finite variation norm can be arbitrarily well approximated by a linear combination of tensor products of indicator basis functions. If learning, for example, the mean of a continuous-valued random variable $Z$ conditional on a univariate continuous-valued variable $C$, a HAL estimate is created by first mapping $C$ into $n$ basis functions of the form $1_{[C_i, \infty)}(C)$ for $i = 1, \ldots, n$. Thus, the functional form of the model can be expressed as $E[Z \mid C] = \beta_0 + \beta_{1,C} 1_{[C_1, \infty)}(C) + \beta_{2,C} 1_{[C_2, \infty)}(C) + \cdots + \beta_{n,C} 1_{[C_n, \infty)}(C)$ for some $(\beta_0, \beta_{1,C}, \ldots, \beta_{n,C}) \in \mathbb{R}^{n+1}$. In this way, the HAL model parameterizes a step function that steps at each observed value of $C$. To fit the estimator, a relevant empirical risk criterion (e.g., mean squared error) is optimized over the model coefficients while penalizing the $L_1$ norm of the $(n+1)$-dimensional coefficient vector [14]. The coefficient for the penalty is selected using cross-validation. As the dimension of the covariates increases, HAL is extended by including tensor-product basis functions. For example, in the bivariate case $C = (C_1, C_2)$, the functional form of the model is extended to $E[Z \mid C] = \beta_0 + \beta_{11,C} 1_{[C_{11}, \infty)}(C_1) + \cdots + \beta_{1n,C} 1_{[C_{1n}, \infty)}(C_1) + \beta_{21,C} 1_{[C_{21}, \infty)}(C_2) + \cdots + \beta_{2n,C} 1_{[C_{2n}, \infty)}(C_2) + \beta_{121,C} 1_{[C_{11}, \infty) \times [C_{21}, \infty)}(C_1, C_2) + \cdots + \beta_{12n,C} 1_{[C_{1n}, \infty) \times [C_{2n}, \infty)}(C_1, C_2)$, where $C_{ji}$ is the observed value of $C_j$ for the $i$th observation in the data. In this way, the HAL model parameterizes a histogram regression that jumps at observed values of $(C_1, C_2)$. The idea generalizes to arbitrary covariate dimensions. At covariate dimension $p$, the HAL model includes at most $n(2^p - 1)$ basis functions (see Benkeser and van der Laan [12] for additional examples and details).
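A zero-order HAL fit can be sketched by materializing the indicator design matrix and solving an $L_1$-penalized least squares problem. This is an illustration only (dedicated implementations exist, e.g., the hal9001 R package); scikit-learn's `LassoCV` stands in for the cross-validated penalty selection, and the helper names are ours:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LassoCV

def hal_design(C, knots=None):
    """Tensor-product indicator basis: one column per covariate subset S and
    knot j, equal to 1{C_s >= knot_{j,s} for all s in S}. With knots taken at
    the n observed points, this yields at most n * (2^p - 1) columns."""
    knots = C if knots is None else knots
    p = C.shape[1]
    cols = []
    for k in range(1, p + 1):
        for S in combinations(range(p), k):      # all subsets of covariates
            for j in range(knots.shape[0]):      # knots at observed values
                cols.append(np.all(C[:, S] >= knots[j, S], axis=1).astype(float))
    return np.column_stack(cols)

def hal_fit(C, Z):
    """L1-penalized least squares over the HAL basis, with the penalty
    chosen by cross-validation (mimicking the HAL fitting strategy)."""
    model = LassoCV(cv=5).fit(hal_design(C), Z)
    return lambda Cnew: model.predict(hal_design(Cnew, knots=C))
```

The fitted function is a histogram regression that jumps only at observed covariate values, which is what makes the exact tree representation of the next section possible.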

Due to the construction of the basis functions, the $L_1$ norm of the model coefficients equals the variation norm of the estimate. Benkeser and van der Laan [12] showed that HAL estimates converge in terms of regret at a rate of $O_p(n^{-[1/4 + 1/\{8p+1\}]})$, where $p$ is the dimension of the feature vector. The authors also demonstrated competitive performance of HAL against other commonly used machine learning frameworks across a variety of empirical experiments. This bound was later refined to include sieves that allow the variation norm of the function to grow with sample size while achieving a rate of $O_p(a_n n^{-1/3} \log(n)^{2(p-1)/3})$, where $a_n$ is a factor that controls the rate of growth of the variation norm [15].

3.2 Highly adaptive regression trees

Previous work has shown that regression fits generated by the HAL algorithm have an exact representation as trees [9], referred to as HART. This result is not entirely surprising given HAL's representation as a histogram regression. Any histogram regression can be represented as a tree, where each leaf corresponds to a single "bar" in the histogram (i.e., a region of the covariate space over which the model returns a constant prediction). The challenge in representing general histogram regressions as trees is that the partitioning of the covariate space implied by an arbitrary histogram regression is not necessarily recursive. This is in contrast to the typical approach for building decision trees in regression problems, which explicitly recursively partitions the covariate space until a goodness-of-fit criterion is reached. For general histogram regressions, achieving a tree representation is more challenging because there may be many non-rectangular "bars" in the histogram regression that must be represented by multiple leaves in the tree. Nizam and Benkeser [9] solved this problem specifically for HAL by introducing an algorithm that maps a fitted HAL model into a decision tree representation. The algorithm involves generating a set of candidate split values implied by the non-zero coefficients in a HAL fit and using those values to recursively partition the feature space. The algorithm is structured such that, in each sub-region of the feature space, the set of candidate values is appropriately pruned to minimize redundancy in the final tree representation; that is, the algorithm yields the fewest tree leaves possible for representing non-rectangular regions of the covariate space.

These tree representations make HAL an ideal candidate for the estimation of CATEs and, by extension, treatment policies.

4 CATE estimation with HAL

4.1 Theoretical guarantees for S and DR learners

Let $\hat{\mu}$ and $\hat{\pi}$ be given estimates of $\mu_0$ and $\pi_0$. Furthermore, let $\hat{R}_p^* \equiv O_p(n^{-[1/4 + 1/\{8p+1\}]})$, which is the regret rate of the HAL estimator when estimating a function of bounded variation. As discussed in Section 3.1, we could equivalently set $\hat{R}_p^* \equiv O_p(a_n n^{-1/3} \log(n)^{2(p-1)/3})$ as in Bibaut and van der Laan [15]. Finally, for a general $P_0$-measurable function $f$ of $O$, we define the $L_m(P_0)$ norm of $f$ to be $\|f\|_{m,P_0} \equiv \left( \int |f|^m \, dP_0 \right)^{1/m}$.

Theorem 4.1

Let $\hat{\psi}_S(v)$ be an estimate of $\psi_0(v)$ constructed with the S-learner algorithm. If $V = W$ and HAL is used to construct the estimator, then

$\int [L_S(\hat{\psi}_S) - L_S(\psi_0)] \, dP_0 = \hat{R}_p^{*2}$.

If $V \subset W$ and HAL is used to perform the second-stage regression, then

$\int [L_{S,\mu_0}(\hat{\psi}_S) - L_{S,\mu_0}(\psi_0)] \, dP_0 = \sum_{a \in \mathcal{A}_0} O_p(\|\hat{\mu}_a - \mu_{0,a}\|_{2,P_0}) \, \hat{R}_q^* + \hat{R}_q^*$.

A proof of Theorem 4.1 can be found in Section S1.2. Recall that when $V = W$, the S-learner only involves estimation of the outcome regression $\mu_0(a, v)$. Intuitively then, when HAL is used to estimate $\mu_0(a, v)$, the S-learner CATE estimate converges in terms of regret to the true CATE at the rate $\hat{R}_p^*$. When $V \subset W$, the S-learner requires estimation of $\mu_0$ as a nuisance parameter. We consider the rate at which the S-learner CATE estimate constructed with the true outcome regression converges in terms of regret to the true CATE. We see that when HAL is used for the second-stage regression, this rate depends both on the regret rate of the outcome regression estimator and on that of the HAL pseudo-outcome regression estimator.

We have the following result for the DR learner.

Theorem 4.2

Let $\hat{\psi}_{DR}(v)$ be an estimate of $\psi_{0,a}(v)$ constructed with the DR-learner algorithm. If HAL is used for the final-stage regression, then

$\int [L_{DR,\mu_0,\pi_0}(\hat{\psi}_{DR}) - L_{DR,\mu_0,\pi_0}(\psi_0)] \, dP_0 = \hat{R}_q^* + \sum_{a \in \mathcal{A}_0} O_p(\|\hat{\pi} - \pi_0\|_{4,P_0}) \, O_p(\|\hat{\mu}_a - \mu_{0,a}\|_{4,P_0}) \, \hat{R}_q^*$.

Theorem 4.2 gives a result for the DR-learner similar to that for the S-learner, but where the rate now depends on the regret rates of the estimates of $\pi_{0,a}(w)$ and $\mu_{0,a}(w)$. A proof can be found in Section S1.3. The theorem indicates that the DR-learner may enjoy advantages over the S-learner. Consider a scenario in which the CATEs and the propensity score are easier to learn than the outcome regression. When using the S-learner, the overall regret rate may be slowed by poor estimation of the outcome regression. However, the rate of the DR-learner involves the product of the regret rates of the propensity score and the outcome regression. Thus, the ability to quickly learn the propensity score may mitigate the impact of poor outcome regression estimation, thereby achieving an overall rate closer or equal to $\hat{R}_q^*$. This scenario is plausible in several settings, including when the data are generated from randomized experiments.

Moreover, the pseudo-outcome regression may be easier to learn than the outcome regression itself simply because the contrast function $\mu_{0,a} - \mu_{0,0}$ may be a simpler or smoother function than $\mu_0$ itself. For HAL, the most relevant measure of smoothness is the variation norm of the respective functions. In the case when $V \subset W$, we may also expect that $\psi_0$ will have a lower variation norm than $\mu_{0,a} - \mu_{0,0}$, as it conditions on a subset of variables, thereby averaging over covariates across which there may be substantial variation in $\mu_0$.

4.2 Tree representations

When HAL is used as the final-stage regression estimator for either the S- or DR-learning procedure, the resulting CATE and OTP estimates can be represented using trees. Both representations rely on extensions of the original HART algorithm [9]. In the case of two treatments, we can simply apply the original HART algorithm to build a CATE tree and apply a threshold of 0 to the terminal nodes of that tree to obtain an OTP estimate representation. In the case of more than two treatments, we must account for the treatment indicator variables in the tree. The HART algorithm allows the user to select the order in which variables appear in the tree. Typically, we aim to find an ordering that minimizes the size of the tree. Here, however, it is convenient to enforce that the treatment indicator variables appear first in the tree so that the effects for each treatment arm relative to the reference treatment are represented by distinct regions of the tree. Full details for representing the CATE estimate as a tree are given in Algorithm 1.

Algorithm 1: CATE tree construction
Input: CATE estimates $\hat{\psi}_a(v)$, $a \in \mathcal{A}_0$
if $|\mathcal{A}| = 2$
    Apply HART to $\hat{\psi}_1(v)$
    Collapse regions with identical terminal nodes
else
    Apply HART to the $\hat{\psi}_a(v)$, enforcing that the treatment indicators appear first
    Collapse regions with identical terminal nodes
end if

When representing the OTP in the case of more than two treatments, we can instead enforce that the treatment indicator variables appear last in the tree, just before the terminal nodes, in the regions where splitting on them is necessary. We can then replace the treatment indicator nodes and the terminal nodes with terminal nodes displaying the treatment with the highest expected counterfactual outcome. Full details for representing the OTP estimate can be found in Algorithm 2.

Algorithm 2: Treatment policy tree construction
Input: CATE estimates $\hat{\psi}_a(v)$, $a \in \mathcal{A}_0$
if $|\mathcal{A}| = 2$
    Apply HART to $\hat{\psi}_1(v)$
    Set each terminal node value to 1 if $\hat{\psi}_1(v) > 0$ in that region, and to 0 otherwise
    Collapse regions with identical terminal nodes
else
    Apply HART to the $\hat{\psi}_a(v)$, enforcing that the treatment indicators appear last
    Replace treatment nodes and terminal nodes with $\operatorname{argmax}_{a \in \mathcal{A}} \hat{\psi}_a(v)$ in each region
    Collapse regions with identical terminal nodes
end if

5 Data analysis

5.1 Motivating study

Antibiotics for Children with severe Diarrhea (ABCD) was a randomized clinical trial designed to evaluate the efficacy of azithromycin, an antibiotic commonly used against diarrheal pathogens, on mortality and linear growth in dehydrated or undernourished children presenting with acute non-bloody diarrhea [16]. While the trial did not show a survival benefit, there was a small benefit in linear growth. Further analysis identified benefits of azithromycin among children with likely bacterial etiologies of diarrhea and suggested that a broader range of factors beyond diarrhea etiology could help determine treatment benefit [17]. Interpretable OTPs could help incorporate these additional factors into treatment guidance, presenting an ideal use case for an interpretable method for determining treatment.

We applied both the S- and DR-learner versions of HATT to a dataset simulated to closely follow the true ABCD data for linear growth, and we compared the performance of the S- and DR-learners to that of causal trees. For all estimators, we used fivefold cross-validation to select any required tuning parameters. We then used repeated training/test splits of the ABCD trial data to compare the generalization performance of the treatment rules on test data. Finally, we generate a treatment recommendation based on HATT.

Additional simulations included in the supplementary material examine the asymptotic properties of the estimator and compare it to a wider variety of competitor approaches.

5.2 Simulations

5.2.1 Data-generating process

We used the real ABCD data to inform our simulation design. Each of the 28 covariates W had marginal distributions that closely followed the observed covariate distribution. The majority of the covariates were simulated independently for simplicity, with the exception of the malnutrition characteristics, which had dependencies based on epidemiologically plausible relationships. Details regarding the functional forms of each covariate can be found in the supplementary materials.

Binary treatment assignment of azithromycin $A \in \{0, 1\}$ followed a Bernoulli distribution with $p = 0.5$ to maintain a randomized controlled trial structure in line with the real dataset. Accordingly, the true propensity score is constant: $\pi_a(w) = 0.5$ for all $w$.

We simulated the continuous outcome length-for-age z-score at day 90 post-enrollment (day 90 LAZ) by fitting a model in the placebo arm of the real dataset and using LASSO for variable selection. LASSO with tenfold cross-validated shrinkage selected number of days with diarrhea prior to enrollment, LAZ, study site, and socioeconomic status (SES) quintile as covariates in the model. We also added terms for age and a binary indicator for Shigella, a common bacterial pathogen causing diarrhea, to make the model more epidemiologically plausible. Next, interaction terms were added with azithromycin. The interaction terms included (1) an interaction between Shigella and azithromycin; (2) an interaction between Shigella, LAZ, and azithromycin; and (3) an interaction between Shigella, LAZ, age, and azithromycin. We refit the final model to the real data to obtain values for the coefficients, modifying these slightly to increase the magnitude of the CATE. The final form of the outcome regression is included in the supplemental material. It includes main effects for number of days with diarrhea prior to enrollment, study site, age, baseline LAZ, in addition to the interactions described earlier. The interactions were generated such that the CATE was 0 if Shigella was not present, with strong antibiotic effects for children with lower LAZ. However, the inclusion of the three-way interaction with age was calibrated such that the increased magnitude of the CATE associated with decreased LAZ was mitigated by older age.

We simulated data at sample size n = 6,692 to match the number of participants in the real ABCD trial with valid quantitative polymerase chain reaction results. Missingness at random was added to 4% of the simulated outcomes to reflect the random missingness of day 90 LAZ in the real data.

5.2.2 Estimation and evaluation

Because the CATE is zero for a large portion of the population, the optimal treatment rule is non-unique. Rather than considering the trivial treatment rule that treats all children, we created treatment rules by thresholding the CATE at 0.15. This choice is scientifically supported, as specificity in antibiotic treatment may be important for reducing the emergence of antibiotic resistance.

We generated 500 simulated datasets that were analyzed using the HAL DR-learner, the HAL S-learner, and causal trees. To evaluate each estimation method, we generated independent test datasets of size $N = 10^4$. These datasets were first used to approximate the squared $L_2(P_0)$ distance of each estimated CATE from the truth, $E_0[\{\hat{\psi}_1(V) - \psi_{0,1}(V)\}^2]$, a measure of the global fit of $\hat{\psi}_1$ to $\psi_{0,1}$. Next, we used the test data to estimate the expected day 90 LAZ if all children were treated according to the rule that treats child $i$ when $\hat{\psi}_1(V_i) \ge 0.15$. We also computed the sensitivity, specificity, and positive predictive value of each estimation method.
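The test-set metrics can be sketched as follows; `evaluate_rule` is a hypothetical helper that scores an estimated single-contrast CATE against the truth, defining the "true" rule by applying the same threshold to the true CATE:

```python
import numpy as np

def evaluate_rule(cate_hat, cate_true, threshold=0.15):
    """Compare an estimated CATE to the truth on test data.

    Returns the squared L2 error plus the sensitivity, specificity, and
    positive predictive value of the rule d(v) = 1{cate_hat(v) >= threshold},
    treating 1{cate_true(v) >= threshold} as the target classification.
    """
    l2_sq = np.mean((cate_hat - cate_true) ** 2)
    d_hat = cate_hat >= threshold   # estimated rule
    d_opt = cate_true >= threshold  # rule under the true CATE
    sens = np.mean(d_hat[d_opt]) if d_opt.any() else np.nan
    spec = np.mean(~d_hat[~d_opt]) if (~d_opt).any() else np.nan
    ppv = np.mean(d_opt[d_hat]) if d_hat.any() else np.nan
    return {"l2_sq": l2_sq, "sens": sens, "spec": spec, "ppv": ppv}
```

The same comparisons, averaged over simulated replicates, underlie the rows of Table 1.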

5.2.3 Results

DR-learning of the CATE with HAL as the second-stage algorithm and S-learning with HAL both yielded smaller $L_2$ norms than causal trees (Table 1). The expected day 90 LAZ outcomes were similar comparing DR-learning with HAL to causal trees, with both marginally better than the HAL S-learner. On the other hand, we found that the HAL-based learners recommended treatment more judiciously than causal trees, attaining similar expected day 90 LAZ while treating fewer children, behavior that would be desirable in real-world settings. This is also reflected in the improved sensitivity, specificity, and positive predictive value, particularly of the HAL DR-learner.

Table 1

Simulation results for threshold = 0.15

Metric      HAL (DR-learner)        HAL (S-learner)         Causal tree
L2-norm     0.0018 (0.001, 0.003)   0.0028 (0.002, 0.003)   0.0169 (0.014, 0.022)
E[Y(d)]     -1.68 (-1.68, -1.68)    -1.69 (-1.71, -1.68)    -1.68 (-1.70, -1.68)
P(d = 1)    0.18 (0.18, 0.18)       0.12 (0.07, 0.16)       0.25 (0.18, 0.32)
Sens.       1.00 (0.99, 1.00)       0.67 (0.36, 0.87)       0.91 (0.51, 1.00)
Spec.       0.99 (0.99, 0.99)       0.99 (0.99, 1.00)       0.87 (0.80, 0.93)
PPV         0.99 (0.99, 0.99)       0.96 (0.93, 0.97)       0.54 (0.41, 0.70)

Values are median (interquartile range, IQR).

5.3 Real data

5.3.1 Training/testing details

We analyzed the ABCD data to develop treatment rules for the day 90 LAZ outcome (as in the simulations), as well as the additional binary outcome of diarrhea at day 3 post-enrollment. As in the simulation, we considered treatment rules that were a function of LAZ, Shigella presence, and age. Because it is unclear which threshold might be most relevant clinically, we evaluated performance under several thresholds: t = 0.000, 0.025, 0.050, 0.075, 0.100. We repeatedly sampled 70% of the data into a training set, where we learned treatment rules using the HAL DR-learner, HAL S-learner, and causal trees. The remaining 30% of the sample was used to estimate the expected outcome (mean LAZ or proportion with diarrhea at day 3 post-enrollment) under each rule. Because the ABCD trial was randomized, these values can be reasonably estimated using inverse probability-weighted estimators that account for missingness of the outcome. We applied such estimators to each test sample and additionally computed the proportion of the test sample recommended for treatment. We repeated the training/test split 500 times and aggregated results over all replicates.
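As a simplified sketch of the inverse probability-weighted evaluation, consider the case of a known 1:1 randomization probability and a constant outcome-observation probability; the function and variable names below are illustrative, and the estimators used in the article may model missingness more flexibly.

```python
import numpy as np

def ipw_value(Y, A, observed, d, p_treat=0.5, p_obs=0.96):
    """Hajek-style inverse probability-weighted estimate of E[Y(d)]:
    the mean outcome if everyone were treated according to rule d.

    Y        : outcomes (ignored where unobserved).
    A        : randomized treatment assignment (0/1) with P(A = 1) = p_treat.
    observed : indicator (0/1) that the outcome Y was observed.
    d        : rule recommendations (0/1), one per individual.
    p_treat, p_obs : (assumed known) randomization and observation probabilities.
    """
    # Keep individuals whose assigned arm matches the rule's recommendation
    # and whose outcome was observed; weight by inverse probabilities.
    match = (A == d) & (observed == 1)
    pA = np.where(d == 1, p_treat, 1 - p_treat)
    w = match / (pA * p_obs)
    # Normalizing by the summed weights (Hajek) rather than the sample size
    # tends to stabilize the estimate.
    return np.sum(w * np.where(match, Y, 0.0)) / np.sum(w)
```

In a 1:1 randomized trial with complete outcomes, this reduces to the mean outcome among individuals whose assigned arm agrees with the rule.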

5.3.2 Results

We found that, across a range of thresholds, treatment rules based on HAL tended to yield better expected outcomes than those generated by the causal tree, although the differences were relatively modest. The HAL-based learners also tended to be more liberal in recommending treatment. Overall, the results provide some evidence of modestly improved performance of the HAL-based estimators over causal trees.

We applied the HATT algorithm to the HAL DR-learner for the lowest non-trivial threshold (0.025) for the diarrhea outcome to assess the scientific plausibility of the resulting tree. The tree is displayed in Figure 1. It recommended treatment for all children with baseline LAZ < −1.7, indicating that the most malnourished children may stand to benefit the most from antibiotic therapy. Among the remaining children with higher LAZ, treatment was recommended for children under 6.6 months of age and for children with Shigella detected in their stool. Overall, the rule would have recommended that 88% of children in the ABCD trial receive antibiotics (Tables 2 and 3).

Figure 1

Treatment recommendation tree for the ABCD data using the HAL DR-learner.
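Read as code, the tree in Figure 1 amounts to a short nested rule. The sketch below simply transcribes the splits described above (cutoffs of −1.7 for baseline LAZ and 6.6 months for age); it illustrates how easily such a policy can be audited and is not intended as a clinical tool.

```python
def recommend_antibiotics(laz, age_months, shigella_detected):
    """Illustrative encoding of the Figure 1 treatment tree for the diarrhea
    outcome (threshold 0.025): treat the most malnourished children, and
    among the rest, the youngest or Shigella-positive children."""
    if laz < -1.7:            # most malnourished children: treat
        return True
    if age_months < 6.6:      # young children: treat
        return True
    return bool(shigella_detected)  # otherwise treat only if Shigella detected
```

Each leaf of the tree corresponds to one return path, so the full policy can be inspected branch by branch.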

Table 2

Estimated expected day 90 LAZ under treatment rule accounting for baseline LAZ, Shigella, and age

Method             Threshold   Expected day 90 LAZ        Proportion treated
HAL (DR-learner)   0.000       −1.602 (−1.642, −1.573)    1.000 (1.000, 1.000)
                   0.025       −1.605 (−1.646, −1.580)    1.000 (1.000, 1.000)
                   0.050       −1.641 (−1.677, −1.599)    0.596 (0.000, 1.000)
                   0.075       −1.669 (−1.700, −1.639)    0.000 (0.000, 0.029)
                   0.100       −1.671 (−1.700, −1.638)    0.000 (0.000, 0.000)
HAL (S-learner)    0.000       −1.602 (−1.642, −1.577)    0.987 (0.933, 1.000)
                   0.025       −1.626 (−1.659, −1.584)    0.800 (0.731, 0.930)
                   0.050       −1.643 (−1.670, −1.604)    0.511 (0.353, 0.635)
                   0.075       −1.657 (−1.686, −1.630)    0.127 (0.023, 0.274)
                   0.100       −1.667 (−1.697, −1.637)    0.023 (0.000, 0.087)
Causal tree        0.000       −1.610 (−1.644, −1.579)    0.680 (0.643, 0.723)
                   0.025       −1.614 (−1.655, −1.575)    0.658 (0.619, 0.706)
                   0.050       −1.613 (−1.647, −1.581)    0.635 (0.576, 0.676)
                   0.075       −1.622 (−1.652, −1.583)    0.587 (0.487, 0.658)
                   0.100       −1.633 (−1.662, −1.591)    0.524 (0.462, 0.617)

Median (IQR) over 500 splits of data into 70% training and 30% test.

Table 3

Estimated proportion of children with diarrhea at day 3 post-enrollment under treatment rule accounting for baseline LAZ, Shigella, and age

Method             Threshold   Proportion with diarrhea at day 3   Proportion treated
HAL (DR-learner)   0.000       0.129 (0.120, 0.141)                0.973 (0.929, 1.000)
                   0.025       0.137 (0.125, 0.148)                0.710 (0.603, 0.906)
                   0.050       0.148 (0.137, 0.161)                0.453 (0.382, 0.509)
                   0.075       0.173 (0.162, 0.185)                0.102 (0.071, 0.147)
                   0.100       0.183 (0.172, 0.195)                0.036 (0.000, 0.065)
HAL (S-learner)    0.000       0.129 (0.121, 0.139)                0.995 (0.992, 0.997)
                   0.025       0.136 (0.126, 0.148)                0.900 (0.793, 0.955)
                   0.050       0.170 (0.154, 0.184)                0.167 (0.068, 0.235)
                   0.075       0.185 (0.176, 0.197)                0.000 (0.000, 0.016)
                   0.100       0.185 (0.177, 0.199)                0.000 (0.000, 0.000)
Causal tree        0.000       0.133 (0.123, 0.145)                0.722 (0.663, 0.753)
                   0.025       0.135 (0.125, 0.147)                0.617 (0.542, 0.702)
                   0.050       0.139 (0.131, 0.153)                0.523 (0.452, 0.569)
                   0.075       0.142 (0.133, 0.156)                0.396 (0.327, 0.480)
                   0.100       0.153 (0.139, 0.161)                0.296 (0.193, 0.353)

Median (IQR) over 500 splits of data into 70% training and 30% test.

6 Discussion

In this article, we have introduced a method for CATE and OTP estimation that performs strongly in our experiments and has desirable interpretability properties. The empirical results align with the regret rate theory that we establish. When paired with our adaptations of the tree representation algorithms in Nizam and Benkeser [9], the method provides an appealing framework for developing data-driven, interpretable decision-making systems. Nevertheless, a potential downside of this work is that, while learned OTPs may be interpretable, they may not satisfy fundamental notions of fairness. In particular, the interpretability of an OTP may provide a false sense of security regarding fairness: it is not clear that human observers can adequately judge the fairness of an OTP simply by studying the policy alone, particularly if the tree representing the OTP is complex. In the future, we would like to enhance our work by drawing on the literature on algorithmic fairness, such as Nabi et al. [18]. Unifying such work with our own could lead to policies that are interpretable, optimal, and fair. Moreover, recent modifications to HAL could also be considered to further improve the performance of the estimator in the multi-treatment setting [19].

An additional feature of our approach relative to existing CATE and treatment rule approaches is that we explicitly distinguish between variables that must be adjusted for to appropriately control for confounding (W) and those on which we wish to base the treatment rule (Z). While much of the existing literature focuses on rules based on the full set of confounders, we find that treating these two sets of covariates separately may be important in many applications. In particular, there may be settings in which we wish to use an existing data source to develop a treatment rule. These data may include a rich set of potential confounders W; however, when the rule is deployed in the real world, we may have access to much less information. Therefore, continuing to develop methods that distinguish between confounders and the variables on which treatment decisions are based is important. It would also be interesting to consider additional variable selection techniques for identifying the most important subset of variables on which to base a treatment rule, although we view this as a distinct scientific endeavor.

Acknowledgments

We would like to thank the members of the AntiBiotics for Children with severe Diarrhea (ABCD) study team: Dilruba Ahmed, Tahmeed Ahmed, Tahmina Alam, Hannah E. Atlas, Per Ashorn, Henry Badji, Mohamed Bakari, Naor Bar-Zeev, Rajiv Bahl, Aishwarya Chauhan, Fadima Cheick Haidara, Mohammod Jobayer Chisti, Jen Cornick, Flanon Coulibaly, Ayesha De Costa, Nigel Cunliffe, Saikat Deb, Emily L. Deichsel, Pratibha Dhingra, Usha Dhingra, Christopher P. Duggan, Queen Dube, Arup Dutta, Bridget Freyne, Wilson Gumbi, Jean Gratz, Eric R. Houpt, Aneeta Hotwani, Ohedul Islam, Vijay Kumar Jaiswal, Furqan Kabir, Mamun Kabir, Kevin Mwangi Kariuki, Irene Kasumba, Shaila S. Khan, Upendo Kibwana, Rodrick Kisenge, Karen L. Kotloff, Jie Liu, Victor Maiden, Dramane Malle, Karim Manji, Christine McGrath, Ashka Mehta, Cecylia Msemwa, Chifundo Ndamala, Latif Ndeketa, Churchil Nyabinda, Darwin Operario, Irin Parvin, Patricia B. Pavlinac, Jasnehta Permala-Booth, James A. Platts-Mills, Ira Praharaj, Farah Naz Qamar, Shahida Qureshi, Muhammad Waliur Rahman, Doreen Rwigi, Abraham Samma, Sunil Sazawal, Sadia Shakoor, Anil Kumar Sharma, Jonathon Simon, Benson O. Singa, Sarah Somji, Samba O. Sow, Christopher R. Sudfeld, Milagritos D. Tapia, Rozina Thobani, Caroline Tigoi, Stephanie N. Tornberg-Belanger, Aliou Toure, Judd L. Walson, Desiree Witte, and Mohammad Tahir Yousafzai.

  1. Funding information: This study was funded by NSF Division of Mathematical Sciences Award 2015540 (DB) and R01AI185140 (ERM). The ABCD trial (Grant Numbers OPP 1126331 and OPP 1179069) and subsequent analyses (INV-044317 to ERM) were funded by the Bill & Melinda Gates Foundation.

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: Prof. David Benkeser is a member of the Editorial Board of the Journal of Causal Inference, but was not involved in the review process of this article.

  4. Ethical approval: The research related to human use complied with all relevant national regulations and institutional policies, was conducted in accordance with the tenets of the Helsinki Declaration, and was approved by the authors' institutional review board or equivalent committee.

  5. Data availability statement: Code needed to reproduce the simulation studies is available at https://github.com/allicodi/hatt_sims/tree/main. The data from the ABCD trial are not publicly available. Data may be available from the authors upon reasonable request and with permission of relevant stakeholders.

References

[1] Jacob D. CATE meets ML: Conditional average treatment effect and machine learning. Digital Finance. 2021;3(2):99–148. doi:10.1007/s42521-021-00033-7.

[2] Hu A. Heterogeneous treatment effects analysis for social scientists: A review. Soc Sci Res. 2022;109:102810. doi:10.1016/j.ssresearch.2022.102810.

[3] Rekkas A, Paulus JK, Raman G, Wong JB, Steyerberg EW, Rijnbeek PR, et al. Predictive approaches to heterogeneous treatment effects: a scoping review. BMC Med Res Methodol. 2020;20(1):1–12. doi:10.1186/s12874-020-01145-1.

[4] Künzel SR, Sekhon JS, Bickel PJ, Yu B. Metalearners for estimating heterogeneous treatment effects using machine learning. Proc Natl Acad Sci. 2019;116(10):4156–65. doi:10.1073/pnas.1804597116.

[5] Nie X, Wager S. Learning objectives for treatment effect estimation. 2017. arXiv:1712.04912.

[6] Kennedy EH. Towards optimal doubly robust estimation of heterogeneous causal effects. 2020. arXiv:2004.14497.

[7] Athey S, Imbens G. Recursive partitioning for heterogeneous causal effects. Proc Natl Acad Sci. 2016;113(27):7353–60. doi:10.1073/pnas.1510489113.

[8] Wager S, Athey S. Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc. 2018;113(523):1228–42. doi:10.1080/01621459.2017.1319839.

[9] Nizam S, Benkeser D. Highly adaptive regression trees. Evol Intell. 2024;17:535–47. doi:10.1007/s12065-023-00836-0.

[10] van der Laan MJ. Statistical inference for variable importance. Int J Biostat. 2006;2(1). doi:10.2202/1557-4679.1008.

[11] Donoho DL. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture. 2000;1(2000):32.

[12] Benkeser D, van der Laan M. The highly adaptive lasso estimator. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2016. p. 689–96. doi:10.1109/DSAA.2016.93.

[13] van der Laan M. A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. Int J Biostat. 2017;13(2):20150097. doi:10.1515/ijb-2015-0097.

[14] Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological). 1996;58(1):267–88. doi:10.1111/j.2517-6161.1996.tb02080.x.

[15] Bibaut AF, van der Laan MJ. Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. 2019. arXiv:1907.09244.

[16] Ahmed T, Chisti MJ, Rahman MW, Alam T, Ahmed D, Parvin I, et al. Effect of 3 days of oral azithromycin on young children with acute diarrhea in low-resource settings: a randomized clinical trial. JAMA Netw Open. 2021;4(12):e2136726. doi:10.1001/jamanetworkopen.2021.36726.

[17] Pavlinac PB, Platts-Mills JA, Liu J, Atlas HE, Gratz J, Operario D, et al. Azithromycin for bacterial watery diarrhea: a reanalysis of the AntiBiotics for Children with severe Diarrhea (ABCD) trial incorporating molecular diagnostics. J Infect Dis. 2024;229(4):988–98. doi:10.1093/infdis/jiad252.

[18] Nabi R, Malinsky D, Shpitser I. Learning optimal fair policies. In: International Conference on Machine Learning. PMLR; 2019. p. 4674–82.

[19] Malenica I, Phillips RV, Lazzareschi D, Coyle JR, Pirracchio R, van der Laan MJ. Multi-task highly adaptive Lasso. 2023. arXiv:2301.12029.

Received: 2023-12-28
Revised: 2024-09-30
Accepted: 2024-11-01
Published Online: 2025-08-20

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
