
Regression calibration for time-to-event outcomes: mitigating bias due to measurement error in real-world endpoints

  • Benjamin Ackerman, Ryan W. Gan, Youyi Zhang, Juned Siddique, James Roose, Jennifer L. Lund, Janick Weberpals, Jocelyn R. Wang, Craig S. Meyer, Jennifer Hayden, Khaled Sarsour and Ashita S. Batavia
Published/Copyright: September 26, 2025

Abstract

Objectives

In drug development, there is increasing interest in leveraging real-world data (RWD) to augment trial data and generate evidence about treatment efficacy. However, comparing patient outcomes across trial and routine clinical care settings can be susceptible to bias, namely due to differences in how and when disease assessments occur. This can introduce measurement error in RWD relative to trial standards and lead to bias when comparing endpoints. We develop a novel statistical method, survival regression calibration (SRC), to mitigate measurement error bias in time-to-event RWD outcomes and improve inferences when combining trials with RWD in oncology.

Methods

SRC extends upon existing regression calibration methods to address measurement error in time-to-event RWD outcomes. The method entails fitting separate Weibull regression models using trial-like (‘true’) and real-world-like (‘mismeasured’) outcome measures in a validation sample, and then calibrating parameter estimates in the full study according to the estimated bias in Weibull parameters. We evaluate performance of SRC under varying degrees of existing measurement error bias via simulation, and then illustrate how SRC can address measurement error when estimating median progression-free survival (mPFS) in newly diagnosed multiple myeloma RWD.

Results

When measurement error exists between trial and real-world mPFS, SRC can effectively account for its resulting bias. SRC yields greater reduction in measurement error bias than standard regression calibration methods, due to its suitability for time-to-event outcomes.

Conclusions

Outcome measurement error is important to address when combining trials and RWD, as it may lead to biased results. Our SRC method helps mitigate such bias, improving comparability between real-world and trial endpoints and strengthening evidence of treatment efficacy.

Introduction

When evaluating the efficacy of new therapies, it is important to collect measurements of the outcome under study that are reliable, accurate, and reflective of the underlying true disease state. In clinical trials, endpoints are constructed from outcomes that are measured per strict protocol definitions, often entailing independent review, confirmation, and/or complete biomarker data collection. Increasingly, research in ‘real-world’ settings is being conducted across the drug development landscape, from discovery and regulatory submission to post-approval and post-marketing. Such research includes prospective observational studies, as well as retrospectively collected electronic health records and administrative healthcare claims, commonly referred to as ‘real-world data’ (RWD) [1]. In RWD, outcome collection may be less regimented or complete when compared to a clinical trial. For instance, in RWD, endpoints may be derived by combining specific data elements abstracted from unstructured clinical notes, lab reports, and/or patient-reported data. Outcomes may also be heterogeneous, and dependent on local or regional variations in clinical practice. Furthermore, missingness in documents (e.g., pathology reports, genomic data, image files) or structured data fields could be present. These factors could all contribute to differences in how and when outcomes are captured between RWD and trials, resulting in measurement error [2]. Our work is motivated by the need to understand and address differences in how outcomes are measured between trial settings and real-world settings, namely when constructing external comparator arms in oncology.

Statistical methods exist to address mismeasured variables, like those that may appear in RWD [3], [4], [5], [6], [7], [8]. Broadly, these methods entail obtaining a ‘validation sample,’ in which both the true and mismeasured variables are collected for a group of patients, and a model is fit to estimate their relationship. Validation samples can either be internal, meaning that true variables are collected on a sub-population of the main study of interest, or external, meaning that true variables (and mismeasured variables) are collected for a completely separate set of patients from the main study of interest. The models are then used to adjust (or calibrate) mismeasured variables in the full sample of interest. While much of the existing literature has focused on understanding and reducing bias due to misclassified exposure status or mismeasured covariates, less attention has been paid to mismeasured outcomes [9], [10], [11]. Nevertheless, outcome measurement error remains a critical concern for estimating treatment effectiveness [12].

Within the literature on outcome measurement error, there is emphasis on outcomes that are normally distributed or binary [13], [14], while methods suitable for time-to-event outcomes are less studied. Time-to-event endpoints, like overall survival (OS), and the composite endpoint of progression-free survival (PFS), are commonly used in oncology studies. Additionally, much of the methodological work to date has focused on estimating causal contrasts between two (or more) groups, reflecting a treatment effect [15], [16], [17], rather than a summary measure for a single group. However, the latter is particularly important when considering single-arm clinical trials that use external controls, in which RWD may be used to construct a ‘standard of care’ benchmark to contextualize the trial results. In this case, outcomes are assumed to be measured without error in the single-arm trial, and potentially measured with error in the comparator arm, relative to trial standards.

Few methods for time-to-event outcome measurement error correction exist, and their relevance and applicability to oncology RWD endpoints are limited. While Giganti et al. (2020) proposed a multiple imputation approach to correct for misclassified event status over time when estimating cumulative incidence of AIDS-defining events among patients with HIV, the method is highly susceptible to model misspecification and reliant on the existence of large validation samples [18]. Edwards et al. (2019) and Bakoyannis and Yiannoutsos (2015) derived a cumulative incidence estimator that accounts for time-varying rates of misclassified false positive and false negative events [19], [20]; however, this approach accounts for misclassified event status, not mismeasured event times. Oh et al. (2021) assumed that error in the time-to-event outcome is additive, and used linear regression to estimate bias between the true and mismeasured outcomes [21], then calibrated the mismeasured outcomes in the full sample accordingly. However, modeling a time-to-event outcome in this way may perform sub-optimally when data are right-censored, as often occurs in the trial and real-world settings of interest. Nevertheless, regression calibration is an appealing and established approach for handling other types of mismeasured variables [7], and further refinement to address time-to-event outcome measurement error is warranted. Accordingly, the aim of this work is to develop regression calibration methods to be more suitable for time-to-event outcome data by reframing measurement error in terms of a Weibull model parameterization.

The remainder of the paper is structured as follows. First, we summarize available regression calibration methods for mitigating outcome measurement error in continuous outcomes and highlight the limitations when applied to time-to-event outcomes. We then introduce refined framing and parameterization of the measurement error problem and extend regression calibration methods accordingly. Through simulation, we evaluate our proposed method and compare its performance to standard regression calibration approaches. Next, using the control arm of a historical trial in newly diagnosed multiple myeloma (NDMM), we demonstrate the application of our method, comparing estimates of median PFS under trial and RWD-like conditions. Lastly, we discuss the advantages and limitations of our method, highlighting the importance of novel methods for mitigating measurement error in real-world time-to-event oncology endpoints.

Materials and methods

Throughout this paper, we will consider a motivating example in NDMM, with a primary goal of estimating median PFS in a real-world dataset. We will consider that, for each patient, the outcome Y (time to a progression event or death), along with relevant baseline covariates X, are collected in the RWD. Furthermore, we will consider that all RWD patients have the potentially mismeasured version of Y collected (i.e., criteria for determining the event may be applied differently and assessments may differ), while a small subset of patients is sampled into an internal validation sample, and therefore also has the ground truth Y collected (i.e., these patients are also assessed per randomized trial ‘gold standard’ criteria).

Let Yᵢ be the event time measured without error for patient i, i=1, …, n, and Yᵢ* be the event time measured with error. A standard way to define outcome measurement error is by assuming an additive error structure:

$$Y_i^* = Y_i + \gamma_1 + \gamma_X X + \varepsilon = Y_i + \omega$$

where X is a baseline covariate, γ1 and γX are parameters describing the measurement error magnitude, and ε is a random variable with mean 0. For simplicity, we will assume a nondifferential error structure such that ω=γ1 (i.e., an ‘intercept-only’ model). With this formulation, the standard regression calibration (RC) approach broadly entails estimating ω in a validation sample (a necessary source of data where both Y and Y* are collected), calibrating Yᵢ* in the full sample by subtracting the estimate of E[ω], and using the calibrated outcome to estimate the quantity of interest. However, with time-to-event outcomes where the quantity of interest may be a marginal median failure time (i.e., using a Kaplan–Meier estimator), assuming an additive linear error structure may not be appropriate, and can be viewed as a form of model misspecification. In practice, using an additive linear model to calibrate time-to-event outcome measurement error may lead to negative times for some patients. Consider a simple example where we collect true and mismeasured outcomes for a subset of 10 patients in a broader study (see Table 1). On average, regression calibration would reduce error between true and mismeasured times, but for some patients with shorter observed times, this would yield negative calibrated times, which is not possible to observe. This scenario could occur in practice if measurement error leads to large biases in observed outcomes and if there is heterogeneity in event times among patients, which may be plausible in RWD sources like electronic health records [2]. Furthermore, when event rates are low and censoring rates are high, there may only be a small subset of patients for whom the outcome is measured with error. In such cases, typical regression calibration could entail fitting linear models to zero-inflated data. Lastly, note that while this approach calibrates event times, it fails to account for mismeasurement in the event status, which could further lead to erroneous survival curve estimation.

Table 1:

Illustrative regression calibration example, in which negative calibrated event times are produced.

Y     Y*    ω = Y* − Y    E[ω]    Ŷ = Y* − E[ω]
10    10     0            3.5      6.5
11    15     4                    11.5
3     12     9                     8.5
2      2     0                    −1.5
4      3    −1                    −0.5
6      7     1                     3.5
3      9     6                     5.5
3      5     2                     1.5
6     20    14                    16.5
8      8     0                     4.5
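
To make the failure mode concrete, the following minimal Python sketch (our illustration, not code from the paper) reproduces the intercept-only calibration applied in Table 1 and shows how negative calibrated times arise:

```python
import numpy as np

# True (Y) and mismeasured (Y*) event times from Table 1
y_true = np.array([10, 11, 3, 2, 4, 6, 3, 3, 6, 8])
y_mis = np.array([10, 15, 12, 2, 3, 7, 9, 5, 20, 8])

# Intercept-only regression calibration: estimate E[omega] as the mean error
omega_hat = np.mean(y_mis - y_true)  # 3.5

# Calibrate the mismeasured times by subtracting the estimated mean error
y_cal = y_mis - omega_hat
print(y_cal)  # [ 6.5 11.5  8.5 -1.5 -0.5  3.5  5.5  1.5 16.5  4.5] -- negative times appear
```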

Weibull parameterization to frame time-to-event outcome measurement error

As an alternative, we propose framing time-to-event outcome measurement error according to the Weibull distribution. Consider the following models of the true and mismeasured event times:

$$\log Y = \alpha_0 + \sigma \varepsilon$$

$$\log Y^* = \alpha_0^* + \sigma^* \varepsilon$$

where α0 is the Weibull regression intercept (i.e., on the log-time scale), 1/σ is the Weibull shape, ε follows an extreme value distribution, and asterisks indicate mismeasured versions of these parameters [22]. The difference between the true and mismeasured event times can be described as:

$$\delta = \log Y^* - \log Y = (\alpha_0^* - \alpha_0) + (\sigma^* - \sigma)\varepsilon$$

Here, the bias due to measurement error, δ, is attributed to differences in the intercept (log-scale) and shape parameters, which describe the failure event rate and its constancy (or lack thereof) over time, respectively. Note that assuming measurement error on the log-additive scale for time-to-event outcomes has similarities to previously established literature on multiplicative measurement error models for covariates, where mismeasured covariates may be log-Normally transformed, though the correction methods do not naturally extend as the target parameter of interest differs [6].

If measurement error exists between Y and Y* such that it affects the failure event rate, using Y* to estimate the shape and intercept (and consequently, the median time to event or other percentiles of the CDF) will yield biased estimates of the true parameters. This in turn translates to biased estimates of summary measures of interest like the median time to event. To illustrate, we can derive the bias in median times as the ratio of medians between the true and mismeasured models, denoted θ and θ*, respectively:

$$\frac{\theta}{\theta^*} = \frac{\exp(\alpha_0)\,(\log 2)^{\sigma}}{\exp(\alpha_0^*)\,(\log 2)^{\sigma^*}} = \exp(\alpha_0 - \alpha_0^*)\,(\log 2)^{\sigma - \sigma^*} = \exp(\text{bias}_{\text{intercept}})\,(\log 2)^{\text{bias}_{\text{shape}}}$$

where

$$\text{bias}_{\text{intercept}} = \alpha_0 - \alpha_0^*, \qquad \text{bias}_{\text{shape}} = \sigma - \sigma^*$$

When both the shape and intercept parameters are under-estimated in the mismeasured outcome data (such that their biases are both negative), θ* will be biased towards a shorter median time. Conversely, when both parameters are over-estimated (such that their biases are both positive), θ* will be biased towards a longer median time. However, when the parameter biases are in opposing directions, θ* may appear closer to θ due to a potential ‘canceling out’ effect; it is therefore important to consider the bias in the parameter estimates in order to contextualize biases in median times.
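
As a quick numeric check of this ratio, the sketch below evaluates the median formula implied by the model above, using illustrative parameter values in the range of the later simulation study (the specific numbers are our own, chosen for illustration):

```python
import numpy as np

def weibull_median(alpha0, sigma):
    # Median of the AFT model log Y = alpha0 + sigma * eps: exp(alpha0) * log(2)^sigma
    return np.exp(alpha0) * np.log(2) ** sigma

theta = weibull_median(3.9, 1.0)       # true median, ~34.2
theta_star = weibull_median(3.6, 1.0)  # mismeasured median with intercept bias of 0.3, ~25.4

# Both evaluate to ~1.35: the ratio equals exp(bias_intercept) * log(2)^bias_shape
print(theta / theta_star, np.exp(3.9 - 3.6) * np.log(2) ** (1.0 - 1.0))
```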

In routine clinical practice, it is indeed plausible for time-to-event outcomes to be mismeasured, resulting in biased Weibull regression parameter estimates. Consider the composite endpoint of PFS, defined by the earliest occurrence of progression or death, censored by loss to follow-up, end of study, or data cut-off. If progression events are misclassified such that true events are not captured (i.e., false negatives), and PFS is instead determined by a later progression, death, or censoring, this can lead to a mismeasured Y* that appears to have a lower event rate than the true Y, potentially resulting in mismeasured shape and intercept parameters that are greater than the truth. Conversely, erroneous and premature capture of progression events (i.e., false positives) may lead to higher event rates at earlier times, biasing the mismeasured shape and intercept in the opposite direction. We will revisit this example in our illustrative data application.

Survival regression calibration (SRC)

As an extension to standard regression calibration, we describe an approach to model and mitigate measurement error bias in time-to-event outcomes by calibrating the Weibull model parameters. First, we fit a Weibull regression model using the mismeasured outcome in the full sample (denoted with subscript ‘f’) to obtain initial estimates of the mismeasured intercept and shape parameters, α̂0,f* and σ̂f*:

$$\log Y_f^* = \hat{\alpha}_{0,f}^* + \hat{\sigma}_f^* \varepsilon$$

Next, using a validation sample (denoted with subscript ‘v’), in which the true and mismeasured outcomes are both collected, we fit separate Weibull models to the two outcome versions and estimate their respective intercept and shape parameters:

$$\log Y_v^* = \hat{\alpha}_{0,v}^* + \hat{\sigma}_v^* \varepsilon$$

$$\log Y_v = \hat{\alpha}_{0,v} + \hat{\sigma}_v \varepsilon$$

Using the parameter estimates of the true and mismeasured intercepts and shapes, we then estimate the bias of each respective parameter:

$$\widehat{\text{bias}}_{\text{intercept}} = \hat{\alpha}_{0,v} - \hat{\alpha}_{0,v}^*$$

$$\widehat{\text{bias}}_{\text{shape}} = \hat{\sigma}_v - \hat{\sigma}_v^*$$

We then calibrate the mismeasured intercept and shape in the full sample, α̂0,f* and σ̂f*, and use the calibrated Weibull parameters to construct an adjusted survival curve, which can be used to estimate the survival probability at time t:

$$\hat{S}(t) = \exp\left\{-\left(\frac{t}{\exp\left(\hat{\alpha}_{0,f}^* + \widehat{\text{bias}}_{\text{intercept}}\right)}\right)^{1/\left(\hat{\sigma}_f^* + \widehat{\text{bias}}_{\text{shape}}\right)}\right\}$$

Finally, since the median time is of interest here, we estimate the corrected median time to event, θ̂src, as follows:

$$\hat{\theta}_{\text{src}} = \exp\left(\hat{\alpha}_{0,f}^* + \widehat{\text{bias}}_{\text{intercept}}\right) \times (\log 2)^{\hat{\sigma}_f^* + \widehat{\text{bias}}_{\text{shape}}}$$
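
The steps above can be sketched in a few lines of Python. This is our own minimal illustration, not the authors' implementation; it assumes the lifelines package, whose WeibullFitter parameterizes S(t) = exp(−(t/λ)^ρ), so that α0 = log λ and σ = 1/ρ, and the function names are hypothetical:

```python
import numpy as np
from lifelines import WeibullFitter

def weibull_intercept_sigma(durations, events):
    # Fit a Weibull model to right-censored data and return (alpha0, sigma)
    # on the log-time scale: alpha0 = log(lambda_), sigma = 1 / rho_.
    wf = WeibullFitter().fit(durations, event_observed=events)
    return np.log(wf.lambda_), 1.0 / wf.rho_

def src_median(t_full_star, e_full_star, t_val, e_val, t_val_star, e_val_star):
    # Step 1: mismeasured Weibull parameters in the full sample
    a_f_star, s_f_star = weibull_intercept_sigma(t_full_star, e_full_star)
    # Step 2: true and mismeasured parameters in the validation sample
    a_v, s_v = weibull_intercept_sigma(t_val, e_val)
    a_v_star, s_v_star = weibull_intercept_sigma(t_val_star, e_val_star)
    # Step 3: estimate the bias in each parameter
    bias_intercept = a_v - a_v_star
    bias_shape = s_v - s_v_star
    # Step 4: calibrate and return the SRC-adjusted median time to event
    return np.exp(a_f_star + bias_intercept) * np.log(2) ** (s_f_star + bias_shape)
```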

Model assumptions and transportability

There are several assumptions related to the parametric modeling and validation sampling that are important to highlight. First, when fitting a Weibull accelerated failure time model, a typical assumption is that censoring is either not present or is non-informative. Next, we assume that the Weibull form holds throughout follow-up, such that the σ parameter does not vary over time. These assumptions are required when fitting the parametric models to the true and mismeasured outcomes in the approach outlined above.

Additionally, recall that in order to implement these methods to correct for outcome measurement error, a validation sample, defined as a study (or subgroup of a study) with the true and mismeasured outcomes both collected, must exist. When estimating the bias in the parameters from a validation sample, we assume that the bias estimates are transportable from the validation sample to the full study [23]. In other words, we assume that α0,v − α0,v* = α0,f − α0,f* and σv − σv* = σf − σf*, such that estimates of the differences between the true and mismeasured Weibull parameters in the validation sample are unbiased estimates of their respective differences in the full study population. When the validation sample is internal, and more specifically, when it is either a simple random sample or a probabilistic sample of the full study, this assumption is plausible to make. However, when the validation sample is external, or an internal validation sample is identified via non-probability sampling (e.g., patients are included in a validation sample based on convenience factors like geographic proximity, participation willingness, or ease of data availability to the researcher), it is possible that transportability could be violated. More specifically, if the amount of measurement error bias is differential with respect to baseline characteristics, and if the validation sample demographics differ from the main study with respect to those characteristics, then additional bias may be present, and additional adjustments may be required [24]. In the remainder of the manuscript, we will assume a representative internal validation sample is available for use such that transportability holds from the internal validation sample to the broader full sample, and we will discuss this potential challenge in the conclusion.

Simulation study

We now present a simulation study to evaluate our proposed survival regression calibration methodology. This simulation is motivated by a case study in NDMM (described further in the Data Application), where the primary (and potentially mismeasured) endpoint of interest in a study is PFS.

Simulation design

We simulate n=365 patients and generate true and mismeasured PFS times according to Weibull distributions. We assume a true median PFS (mPFS) of 34.2 months and a study duration of 60 months (see Supplementary Materials for simulation results with shorter and longer true mPFS times). Patients who do not have a simulated PFS event by study completion are censored, and the censoring rate is approximately 30 %. We vary the true Weibull shapes to be either 0.8, 1, or 1.2. The true Weibull regression intercepts are optimized accordingly to preserve the desired true mPFS time and shape. To assess how the method performs under differing amounts of measurement error bias, we vary the mismeasured shape and intercept parameters by 0, ±0.1, or ±0.3 relative to the true values. In other words, we simulate scenarios where (1) no measurement error is present, (2) measurement error changes the shape only, (3) measurement error changes the intercept only, and (4) measurement error changes both parameters, either in similar or opposing directions.

For each simulated study sample, we randomly allocate 40 % of subjects (n≈150) to belong to an internal validation sample, in which we collect both the true and mismeasured outcomes. Note that for evaluation purposes, we retain the true outcomes in the full simulated sample; however, we fit the models to estimate the bias calibration parameters in the validation sample only. Additional simulations in which the size of the validation sample is varied can be found in the Supplementary Materials.

We compare three different estimates of mPFS in the study: (1) unadjusted, (2) adjusted via regression calibration (RC), where we assume an additive error structure and fit a linear model to estimate the bias, and (3) adjusted via survival regression calibration (SRC), where we fit Weibull regression models to estimate the bias in the shape and intercept. To assess additional sensitivity to model misspecification, we explore a fourth model in which we estimate the measurement error bias parameters using a log-logistic model rather than a Weibull model (further details available in the Supplementary Materials). We report bias in mPFS as the difference between the mismeasured/calibrated mPFS and the true mPFS. We run 1,000 simulation iterations, reporting mPFS estimates as the mean across iterations and bootstrapped confidence intervals as the 2.5th and 97.5th percentiles of the distribution of simulated estimates [25]. Coverage of the 95 % confidence interval, defined as the proportion of simulation iterations in which the mPFS confidence interval contains the true mPFS value, is also estimated via the bootstrap approach, and detailed results are reported in the Supplementary Materials.
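
For concreteness, below is a minimal sketch of generating one simulated dataset under this design (our illustration; the paper does not publish code, and coupling the true and mismeasured times through a shared uniform draw is our simplifying assumption):

```python
import numpy as np

rng = np.random.default_rng(2025)
n, study_months = 365, 60

def weibull_time(u, intercept, shape):
    # Inverse-CDF draw from S(t) = exp(-(t / exp(intercept))^shape)
    return np.exp(intercept) * (-np.log(u)) ** (1.0 / shape)

u = rng.uniform(size=n)              # shared draw so true and mismeasured times are correlated
y_true = weibull_time(u, 3.9, 1.0)   # true mPFS = exp(3.9) * log(2) ~ 34.2 months
y_mis = weibull_time(u, 3.8, 0.9)    # mismeasured: intercept and shape each biased by -0.1

# Administrative censoring at study end (~30 % censored under these settings)
t_true, e_true = np.minimum(y_true, study_months), y_true <= study_months
t_mis, e_mis = np.minimum(y_mis, study_months), y_mis <= study_months

# 40 % internal validation sample in which both outcome versions are retained
val_idx = rng.choice(n, size=int(0.4 * n), replace=False)
```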

Simulation results

We now present results from the simulation study. For simplicity, we focus on results of bias where the true Weibull shape is 1 (corresponding to an exponential distribution); additional simulation results, with similar findings, can be found in the Supplementary Materials.

No measurement error

When the shape and intercept parameters are equal between the true and mismeasured PFS, the bias in the unadjusted estimate is negligible on average, as expected, though in rare cases non-negligible bias remains (mPFS bias=0.14 months, 95 % CI=−6.2 to 6.27 months). Both the RC and SRC-adjusted estimates have slightly less bias, with comparable amounts of variability (RC-adjusted mPFS bias=0.08 months, 95 % CI=−5.43 to 5.45 months; SRC-adjusted mPFS bias=0.2 months, 95 % CI=−4.98 to 6.1 months).

Measurement error in the Weibull intercept parameter

When measurement error leads to differences in the intercept parameter only, with equivalent shape parameters (Figure 1, left panel, turquoise line), such differences in the intercept can lead to an unadjusted mPFS anywhere from roughly a year longer to ∼9 months shorter than the true mPFS. More specifically, unadjusted mPFS is longer than the true mPFS when measurement error leads to larger intercepts (e.g., unadjusted mPFS bias when the true intercept is 3.9 and the mismeasured intercept is 4.2: 11.88 months, 95 % CI=4.86 to 19.59 months). Unadjusted mPFS is shorter than the true mPFS when measurement error leads to smaller intercepts (e.g., unadjusted mPFS bias when the true intercept is 3.9 and the mismeasured intercept is 3.6: −8.94 months, 95 % CI=−14.25 to −3.66 months). Linear regression calibration (RC) can reduce this bias, albeit not completely (RC-adjusted mPFS bias ranges from −3.7 to 7 months, Figure 1, middle panel, turquoise line). Survival regression calibration (SRC) reduces the bias to a greater extent than RC, such that estimates of SRC-adjusted mPFS bias under intercept error range from −0.05 to 0.06 months (Figure 1, right panel, turquoise line).

Figure 1: Simulation results when true Weibull shape=1. The left panel shows the observed mPFS bias when varying bias in the intercept (x axis) and bias in the shape (color). The middle panel shows bias after applying standard regression calibration methods. The right panel shows bias after applying the proposed survival regression calibration method.

Measurement error in the Weibull shape parameter

When measurement error leads to differences in the shape parameter only (i.e., the true and mismeasured intercepts both equal 3.9; Figure 1 left panel, x axis=0), such differences can lead to unadjusted mPFS bias that ranges from −5 to 3 months. Unadjusted mPFS is biased towards earlier times when the mismeasured shape is less than the true shape (e.g., unadjusted mPFS bias when the true shape is 1 and the mismeasured shape is 0.7: −4.99 months, 95 % CI=−11.19 to 1.51 months), and biased towards later times when the mismeasured shape is greater than the true shape (e.g., unadjusted mPFS bias when the true shape is 1 and the mismeasured shape is 1.3: 3.01 months, 95 % CI=−2.77 to 8.55 months). Here, RC again reduces the bias partially (RC-adjusted mPFS bias ranges from −2.26 to 1.04 months, Figure 1 middle panel, x axis=0), but not to the same magnitude as SRC (SRC-adjusted mPFS bias ranges from 0.1 to 0.2 months, Figure 1 right panel, x axis=0).

Measurement error in both parameters

When measurement error leads to differences in both Weibull parameters in similar directions (i.e., both parameters are greater than, or less than, the true respective parameters), it can amplify the amount of resulting bias in a nonlinear fashion (See Table 2). In such cases, SRC continues to outperform RC in bias reduction. For instance, when mismeasured shape and intercept are both 0.3 less than the true values, RC-adjusted mPFS bias is still −5.94 months (95 % CI=−11.07 to −0.74 months) while SRC-adjusted mPFS bias is 0.08 months (95 % CI: −4.97 to 5.49 months). When both mismeasured parameters are 0.3 greater than the true values, RC-adjusted mPFS bias is 8.26 months (95 % CI=2.45 to 14.28 months) while SRC-adjusted mPFS bias is 0.11 months (95 % CI: −4.44 to 5.38 months).

Table 2:

Simulation results when true shape=1 (i.e., true data come from exponential distribution).

True shape Mismeasured shape True intercept Mismeasured intercept Observed mPFS bias (95 % CI) RC-adjusted mPFS bias (95 % CI) SRC-adjusted mPFS bias (95 % CI)
No measurement error
1 1.0 3.90 3.90 0.14 (−6.2, 6.27) 0.08 (−5.43, 5.45) 0.2 (−4.98, 6.1)
Measurement error in intercept only
1 1 3.90 3.60 −8.94 (−14.25, −3.66) −3.68 (−8.69, 1.28) 0.06 (−4.74, 5.44)
3.80 −3.59 (−9.42, 2.13) −1.81 (−7.14, 3.3) 0.06 (−4.85, 5.16)
4.00 3.6 (−2.31, 10.21) 1.84 (−3.85, 7.44) −0.05 (−5.29, 5.74)
4.20 11.88 (4.86, 19.59) 7.04 (1.11, 13.91) 0.03 (−4.69, 6.21)
Measurement error in shape only
1 0.7 3.90 3.90 −4.99 (−11.19, 1.51) −2.26 (−7.99, 4.07) 0.2 (−5.02, 5.76)
0.9 −1.05 (−6.87, 5.02) −0.4 (−5.89, 4.89) 0.11 (−4.73, 5.86)
1.1 1.14 (−4.87, 7.3) 0.49 (−5.22, 5.7) 0.1 (−4.8, 5.66)
1.3 3.01 (−2.77, 8.55) 1.04 (−3.85, 6.38) 0.16 (−4.95, 5.89)
Measurement error in both parameters: both smaller than truth
1 0.7 3.90 3.60 −12.58 (−18.16, −6.84) −5.94 (−11.07, −0.74) 0.08 (−4.97, 5.49)
3.80 −7.87 (−14.11, −1.56) −3.84 (−10.08, 1.96) 0.17 (−5.11, 5.6)
0.9 3.60 −9.94 (−15.41, −4.77) −4.19 (−9.01, 0.61) 0.38 (−4.69, 6.07)
3.80 −4.52 (−10.82, 1.45) −2.24 (−7.59, 2.75) −0.04 (−4.78, 5.21)
Measurement error in both parameters: both larger than truth
1 1.1 3.90 4.00 4.97 (−0.81, 10.8) 2.49 (−2.83, 8.04) 0.27 (−4.59, 6)
4.20 13.63 (6.52, 21.54) 7.7 (1.57, 13.86) 0.11 (−4.71, 5.52)
1.3 4.00 6.96 (1.36, 13.25) 2.91 (−2.69, 8.31) 0.27 (−5.03, 5.98)
4.20 15.85 (9.66, 22.46) 8.26 (2.45, 14.28) 0.11 (−4.44, 5.38)
Measurement error in both parameters: shape smaller, intercept larger than truth
1 0.7 3.90 4.00 −1.9 (−8.79, 5.41) −0.48 (−6.91, 6.1) 0.05 (−5.11, 5.75)
4.20 5.53 (−2.51, 14.12) 4.26 (−3.48, 12.33) 0.11 (−5.23, 5.74)
0.9 4.00 2.03 (−4.2, 8.26) 1.27 (−4.78, 7.9) −0.01 (−4.86, 5.53)
4.20 10.16 (2.56, 18.94) 6.47 (0.17, 14.43) 0.16 (−4.83, 5.5)
Measurement error in both parameters: shape larger, intercept smaller than truth
1 1.1 3.90 3.60 −8.01 (−13.13, −3.23) −3.17 (−7.83, 1.5) 0.13 (−4.92, 5.68)
3.80 −2.08 (−7.5, 3.33) −0.97 (−6.2, 3.97) 0.12 (−5.09, 5.57)
1.3 3.60 −6.77 (−11.47, −2.32) −2.38 (−7.53, 2.01) 0.22 (−4.67, 6.1)
3.80 −0.84 (−6.37, 4.35) −0.67 (−5.43, 3.96) 0.28 (−5.05, 6.11)

Conversely, measurement error that leads to an increase in one parameter but a decrease in the other could result in a scenario with no observed bias. For example, when the mismeasured shape is 1.3 (greater than the true shape of 1.0) and the mismeasured intercept is 3.8 (smaller than the true intercept of 3.9), the unadjusted mPFS bias is −0.84 months (95 % CI: −6.37 to 4.35 months). In other words, biases due to differences in these Weibull parameters may counteract one another if they appear in opposing directions, leading to an observed, unadjusted mPFS that is unbiased. In the left-hand panel of Figure 1, this can be further observed by noting where the different-colored lines (representing different amounts of shape bias) cross the dotted line of 0 at different points along the x axis (representing different amounts of intercept bias). While adjustment with SRC still yields unbiased mPFS estimates (mPFS bias=0.28 months, 95 % CI: −5.05 to 6.11 months), this is an important cautionary finding: these biases should be studied carefully, and an unadjusted mPFS that appears ‘unbiased’ does not necessarily mean the outcome is error-free.

95 % confidence interval coverage

Broadly, measurement error leads to poor coverage for the unadjusted mismeasured mPFS estimate, RC demonstrates improved coverage under minimal amounts of measurement error, and SRC yields consistently high confidence interval coverage. This is, in part, due to greater variability of mPFS estimates when applying SRC, and is also associated with the size of the simulated internal validation sample. Smaller validation samples yield greater variability, and confidence interval coverage closer to 100 %. More detailed findings on 95 % confidence interval coverage can be found in the Supplementary Materials.

Data application

We next apply the developed SRC method to an illustrative example in NDMM. Disease assessments in multiple myeloma (MM) are conducted according to standards set by the International Myeloma Working Group (IMWG) Uniform Response Criteria, based on imaging and biomarkers from blood, urine, and bone marrow biopsy testing [26]. In MM clinical trials, assessments are conducted rigorously, per strict protocol. In routine clinical practice, however, adherence to practice guidelines like IMWG may be less strict than in a trial, and as a result, MM endpoints in RWD may be derived with error relative to trial standards. For example, MM trials typically require collection of serum and urine labs every 28 days, while in routine clinical practice, it is more common for only serum to be collected, and on a more variable assessment frequency. This variability in measurement can introduce bias when comparing outcomes between RWD and RCTs. With this motivation in mind, we illustrate the application of our methodology to estimate mPFS in the control arm of the MAIA trial, in which patients with NDMM were treated with lenalidomide and dexamethasone (Rd) [27]. Note that, by using data from a completed trial and constructing an endpoint analogous to those collected in routine practice, we are able to illustrate how this methodology mitigates bias due to measurement error without other biases common in studies using RWD (e.g., selection bias, surveillance bias) being present.

For this exercise, we assume the trial-collected outcomes are measured without error, and will constitute our ‘true’ outcomes. We derive ‘mismeasured’ outcomes by applying a subset of IMWG criteria to the trial data per real-world standards [28]. While these criteria are similar to the full IMWG criteria, it is possible that a less strict approach to progression capture in clinical practice could yield either false positive progression events, false negatives, or both.

To emulate the availability of an internal validation sample, we randomly sample 40 % of the trial control arm, for whom we ‘collect’ both the true and mismeasured progression events. In practice, the existence of such validation samples may not be feasible for studies utilizing RWD in certain therapeutic areas, hence the illustrative nature of this example. We perform a double-bootstrap procedure to estimate the variance of the mPFS, sampling the control arm population as well as the internal validation sample with replacement, over 1,000 iterations. Confidence intervals are constructed by reporting the 2.5th and 97.5th percentiles of the mPFS estimate distribution across the 1,000 re-samplings. We compare the true trial estimates, the mismeasured ‘real-world’ estimates, and the survival regression-calibrated estimates of mPFS to one another.
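
A sketch of this resampling scheme is below (again our own illustration; `src_median_fn` is a hypothetical helper that computes the SRC-adjusted mPFS from a full sample and its validation subset, and the exact nesting of the two resampling steps is our reading of the description above):

```python
import numpy as np

def double_bootstrap_ci(records, src_median_fn, n_iter=1000, val_frac=0.4, seed=7):
    # Outer step: resample the control arm with replacement.
    # Inner step: resample the internal validation subset with replacement.
    rng = np.random.default_rng(seed)
    n = len(records)
    estimates = np.empty(n_iter)
    for b in range(n_iter):
        boot = records[rng.integers(0, n, size=n)]   # bootstrap the full cohort
        m = int(val_frac * n)
        val = boot[rng.integers(0, n, size=m)]       # bootstrap the validation sample
        estimates[b] = src_median_fn(boot, val)
    # Percentile confidence interval from the bootstrap distribution
    return np.percentile(estimates, [2.5, 97.5])
```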

Results

First, note that bias exists due to measurement error between the trial and ‘real-world’ endpoint versions in the full study sample. The ‘real-world’ (rw) mPFS appears to be biased towards earlier times (mPFS bias=−2.6 months), likely due to high false positive rates in classifying progression events per the flexible criteria. When comparing the Weibull model parameters between the trial and ‘real-world’ outcomes, the ‘real-world’ shape and intercept are both smaller than the trial parameter estimates (Table 3). This further highlights that both parameters, in practice, may be mismeasured, as opposed to just one or the other, and it is therefore important to calibrate both parameters in the model.

Table 3:

True (trial) and mismeasured rwPFS and Weibull parameter estimates in full study sample.

Outcome version        mPFS, months   Weibull shape   Weibull intercept (log-time scale)
‘True’ trial PFS       34.8           0.97            3.92
‘Mismeasured’ rwPFS    32.2           0.92            3.86

Figure 2a shows the estimates of the trial and ‘real-world’ Weibull parameters across the bootstrapped validation samples, which are used to adjust for this bias in the ‘real-world’ PFS among all patients in the full sample. Consistent with Table 3, these bootstrap estimates of the shape and intercept parameters in the ‘real-world’ outcome model are smaller on average than the ‘truth’ in the trial outcome model. Figure 2b compares the true and mismeasured PFS curves to the SRC-adjusted curve. Again, in a typical study we would observe only the mismeasured (red) curve, not the true (black) curve. By retaining the data on true trial outcomes, we see that the SRC method is able to reduce the bias and recover the true mPFS (SRC-adjusted mPFS=34.5 months, 95 % CI: 28.5–41.6 months). Note that the confidence interval width reflects the validation sample size, as well as the overall size of the control arm.

Figure 2: Results from the illustrative data application. A) Weibull parameter estimates by endpoint version, B) survival curves (true and mismeasured), along with survival regression calibration-adjusted estimates.

Discussion

Standard regression calibration methodology, which assumes a linear additive error structure on a Normally-distributed outcome, may not be suitable to mitigate measurement error for time-to-event outcomes due to fundamental model misspecification and risk of generating implausible (i.e., negative) event times. In this paper, we proposed a survival regression calibration (SRC) method that frames time-to-event outcome measurement error as the difference in Weibull parameters and accordingly calibrates bias in each parameter. Whereas previous regression calibration methods could yield negative event times and fail to account for event misclassification, our SRC method is better suited for time-to-event outcomes. Through simulation, we demonstrated that SRC not only outperforms regression calibration across a range of existing measurement error biases, but also successfully mitigates measurement error bias when present. We then illustrated the use of SRC in a data example to correct for error in real-world PFS in a cohort with NDMM.

There are several takeaways, limitations, and opportunities for future work to highlight. First, our developed approach can potentially address measurement error bias when estimating a marginal summary quantity, like mPFS, within a real-world study population. Furthermore, we explore a case where any amount of outcome measurement error is consistent across patient subgroups. This may be particularly relevant when constructing an external benchmark using RWD to contextualize single-arm trial results, in which indirect comparisons are typically drawn between real-world and trial outcomes. When treatment effect estimates are relevant to the research question, whether marginal or conditional on baseline covariates, additional research is needed to extend the calibration of outcomes when estimating hazard ratios and other relative effect measures. When measurement error in outcomes is differential by baseline characteristics, future work should also evaluate the plausibility of utilizing conditional models to estimate the shape and intercept bias parameters. Such differential error may also contribute to informative censoring, and should thus be studied further. Nevertheless, the direction and impact of these biases at the study population level, whether estimating mPFS or a hazard ratio, are related, and thus our methods can still be impactful for broader inferences. For example, when rwPFS is biased towards longer times, assuming no other bias or confounding, a hazard ratio comparing such real-world and trial PFS would bias towards the null, as it would make an experimental trial treatment appear less efficacious by comparison.

Similarly, we have developed and evaluated this method to address measurement error when it is present, on average, for an entire real-world comparator cohort. While it is possible that error may be more pronounced for certain patient subgroups defined by baseline characteristics, our work is easily extendable to stratified analyses, such that bias can be addressed within each patient stratum of interest. Future work may also consider extending to more complex error structures, such as those that depend on covariates, as well as to other parametric survival models [29].

Next, we demonstrated via simulation that our method can mitigate measurement error bias by estimating both the true and mismeasured Weibull regression parameters in a validation sample. In practice, validation samples can be challenging to conceptualize or curate for progression and response-based endpoints, particularly when using retrospective RWD where outcomes may be abstracted from unstructured clinical documents or derived from diagnostic codes. For solid tumor diseases in oncology, where imaging-based endpoints are commonly used in clinical trials, there may be opportunity to curate validation samples among real-world patients who have imaging data available, which could be processed and segmented independently and compared to abstracted ‘mismeasured’ outcomes. For retrospective MM RWD, however, validation samples may be more challenging to design in practice for these outcomes. Notably, patterns of IMWG lab observability and the use of flexible IMWG algorithms to derive response outcomes may make it infeasible to design or identify an internal validation sample where ‘trial-like’ outcomes are measurable according to full IMWG algorithmic criteria. For endpoints like overall survival, where RWD capture of death events may be imprecise, data linkages to registries and other reliable external data sources can enable modeling and mitigation of measurement error bias [30]. Hybrid control arm designs, where a randomized trial’s comparator arm is augmented by RWD, may also produce valuable validation data required for outcome calibration [31]. Thoughtful study planning and prospective data capture from routine practice settings can further help address these challenges, and additional research on innovative study design is critical alongside the development of these methods.

We focused on the case where an internal validation sample is available, such that the necessary modeling to mitigate measurement error bias can be performed in a direct (and randomly or probabilistically sampled) subset of the RWD cohort of interest. By doing so, we have assumed that bias parameter estimates are transportable from a validation sample to the full study sample. When using an external validation sample for method application (i.e., using a separate patient set from the cohort of interest), transportability becomes a larger concern, namely when baseline differences exist between the primary study and external validation sample populations. It is important, when such concerns arise, to compare the populations closely and ensure similarities whenever possible, particularly on factors associated with the measurement error patterns. Future work should also explore extensions of our methods to account for transportability violations, such as using weighted estimators, or through sensitivity analyses for the calibrated Weibull parameters. Given the context-dependent challenges highlighted above when curating validation samples, it is also plausible that a validation sample may not be available at all for a given application. In the absence of appropriate validation samples, sensitivity analyses or quantitative bias analyses may be important tools to evaluate the robustness of study results to outcome measurement error, and such approaches should be further developed as well.

Lastly, when addressing outcome measurement error in non-experimental studies (such as those generating real-world evidence), there are often other contributing sources of bias (e.g., selection bias, immortal time bias) that must be addressed to draw rigorous and causal inferences. By assessing how bias in studies using RWD may be attributed to measurement error, we have developed an approach to address such concerns and fill a critical gap in the literature. It is imperative for future research to consider how to integrate this methodology with other methods and tools to account for other types of bias. Nevertheless, we have demonstrated that measurement error in time-to-event outcomes can cause substantial bias on its own, and we have developed improved methodology that can appropriately calibrate such mismeasured outcomes. Further development and adoption of these methods may enable more efficient and robust evidence generation in studies using RWD in drug development and ultimately improve comparability between trial and real-world findings.


Corresponding author: Benjamin Ackerman, Johnson & Johnson, 920 US Highway 202 S, Raritan, NJ, 08869, USA, E-mail: 

Benjamin Ackerman and Ryan W. Gan contributed equally to this work and share first authorship.

Ashita S. Batavia and Khaled Sarsour share senior authorship.


Award Identifier / Grant number: 1U01FD007937-01

Acknowledgments

The authors acknowledge the following consortium whose thought leadership and input aided authors’ efforts: Sebastian Schneeweiss (Harvard), Charles Poole (UNC at Chapel Hill), Issa Dahabreh (Harvard), Til Stürmer (UNC at Chapel Hill), Noopur Raje (Harvard), Smith Giri (UAB), Omar Nadeem (Harvard), Sikander Ailawadhi (Mayo Clinic), Joel Greshock (Johnson & Johnson), Yiyang Yue (Johnson & Johnson), Laura Hester (Johnson & Johnson), Kelly Johnson-Reid (Johnson & Johnson), Jason Brayer (Johnson & Johnson), and Robin Carson (Johnson & Johnson).

  1. Research ethics: The local Institutional Review Board deemed the study exempt from review.

  2. Informed consent: Not applicable.

  3. Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission. BA, RWG and YZ conceptualized the method. JS, JR, JLL and JW provided early feedback on the method and evaluation approach. BA and RWG developed and performed simulation-based analysis. BA, RWG and JRW conducted the illustrative data analysis. BA and RWG wrote the initial manuscript draft, which was reviewed and revised by YZ, JS, JR, JLL, JW, JRW, CSM, JH, KS and ASB.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: BA is an employee of Johnson and Johnson and owns stock in the company. RWG is an employee of Johnson and Johnson and owns stock in the company. YZ is an employee of Johnson and Johnson and owns stock in the company. JS has no conflicts to report. JR is an employee of Flatiron Health, an independent subsidiary of the Roche group, and owns stock in Roche. JLL has no conflicts to report. JW is now an employee of AstraZeneca and owns stock in AstraZeneca. JRW is an employee of Johnson and Johnson and owns stock in the company. CSM is an employee of Johnson and Johnson and owns stock in the company. JH is an employee of Johnson and Johnson and owns stock in the company. KS is an employee of Johnson and Johnson and owns stock in the company. ASB is an employee of Johnson and Johnson and owns stock in the company.

  6. Research funding: This project is supported by the Food and Drug Administration (FDA) of the U.S. Department of Health and Human Services (HHS) as part of a financial assistance award [FAIN] totaling $468,058 (32 percent) funded by FDA/HHS and approximately $1,001,454 (68 percent) funded by non-government source(s). The contents are those of the author(s) and do not necessarily represent the official views of, nor an endorsement, by FDA/HHS, or the U.S. Government. Grant number: 1U01FD007937-01.

  7. Data availability: Data available on request from the authors.

References

1. U.S. Department of Health and Human Services, Food and Drug Administration. Real-world evidence; 2024a. Available at: https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence.

2. Ackerman, B, Gan, RW, Meyer, CS, Wang, JR, Zhang, Y, Hayden, J, et al. Measurement error and bias in real-world oncology endpoints when constructing external control arms. Front Drug Saf Regul 2024;4:1423493. https://doi.org/10.3389/fdsfr.2024.1423493.

3. Grace, YY, Delaigle, A, Gustafson, P, editors. Handbook of measurement error models. Boca Raton, FL, USA: CRC Press; 2021.

4. Lenis, D, Ebnesajjad, CF, Stuart, EA. A doubly robust estimator for the average treatment effect in the context of a mean-reverting measurement error. Biostatistics 2017;18:325–37. https://doi.org/10.1093/biostatistics/kxw046.

5. Webb-Vargas, Y, Rudolph, KE, Lenis, D, Murakami, P, Stuart, EA. An imputation-based solution to using mismeasured covariates in propensity score analysis. Stat Methods Med Res 2017;26:1824–37. https://doi.org/10.1177/0962280215588771.

6. Keogh, RH, White, IR. A toolkit for measurement error correction, with a focus on nutritional epidemiology. Statistics Med 2014;33:2137–55. https://doi.org/10.1002/sim.6095.

7. Carroll, RJ, Ruppert, D, Stefanski, LA. Measurement error in nonlinear models: a modern perspective. London, UK: Chapman and Hall/CRC; 2006. https://doi.org/10.1201/9781420010138.

8. Bound, J, Brown, C, Mathiowetz, N. Measurement error in survey data. In: Handbook of econometrics. Amsterdam, the Netherlands: Elsevier; 2001;5:3705–843. https://doi.org/10.1016/s1573-4412(01)05012-7.

9. Innes, GK, Bhondoekhan, F, Lau, B, Gross, AL, Ng, DK, Abraham, AG. The measurement error elephant in the room: challenges and solutions to measurement error in epidemiology. Epidemiol Rev 2021;43:94–105. https://doi.org/10.1093/epirev/mxab011.

10. Gravel, CA, Platt, RW. Weighted estimation for confounded binary outcomes subject to misclassification. Statistics Med 2018;37:425–36. https://doi.org/10.1002/sim.7522.

11. Keogh, RH, Carroll, RJ, Tooze, JA, Kirkpatrick, SI, Freedman, LS. Statistical issues related to dietary intake as the response variable in intervention trials. Statistics Med 2016;35:4493–508. https://doi.org/10.1002/sim.7011.

12. U.S. Department of Health and Human Services, Food and Drug Administration. Real-world data: assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products; 2024b. Available at: https://www.fda.gov/regulatoryinformation/search-fda-guidance-documents/real-world-data-assessing-electronic-health-records-and-medical-claims-data-support-regulatory.

13. Edwards, JK, Cole, SR, Troester, MA, Richardson, DB. Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data. Am J Epidemiol 2013;177:904–12. https://doi.org/10.1093/aje/kws340.

14. Siddique, J, Daniels, MJ, Carroll, RJ, Raghunathan, TE, Stuart, EA, Freedman, LS. Measurement error correction and sensitivity analysis in longitudinal dietary intervention studies using an external validation study. Biometrics 2019;75:927–37. https://doi.org/10.1111/biom.13044.

15. Oh, EJ, Shepherd, BE, Lumley, T, Shaw, PA. Considerations for analysis of time-to-event outcomes measured with error: bias and correction with SIMEX. Statistics Med 2018;37:1276–89. https://doi.org/10.1002/sim.7554.

16. Lyles, RH, Tang, L, Superak, HM, King, CC, Celentano, DD, Lo, Y, et al. Validation data-based adjustments for outcome misclassification in logistic regression: an illustration. Epidemiology 2011;22:589–97. https://doi.org/10.1097/EDE.0b013e3182117c85.

17. Beesley, LJ, Mukherjee, B. Statistical inference for association studies using electronic health records: handling both selection bias and outcome misclassification. Biometrics 2022;78:214–26. https://doi.org/10.1111/biom.13400.

18. Giganti, MJ, Shaw, PA, Chen, G, Bebawy, SS, Turner, MM, Sterling, TR, et al. Accounting for dependent errors in predictors and time-to-event outcomes using electronic health records, validation samples, and multiple imputation. Ann Appl Stat 2020;14:1045–61. https://doi.org/10.1214/20-aoas1343.

19. Edwards, JK, Bakoyannis, G, Yiannoutsos, CT, Mburu, MW, Cole, SR. Nonparametric estimation of the cumulative incidence function under outcome misclassification using external validation data. Statistics Med 2019;38:5512–27. https://doi.org/10.1002/sim.8380.

20. Bakoyannis, G, Yiannoutsos, CT. Impact of and correction for outcome misclassification in cumulative incidence estimation. PLoS One 2015;10:e0137454. https://doi.org/10.1371/journal.pone.0137454.

21. Oh, EJ, Shepherd, BE, Lumley, T, Shaw, PA. Raking and regression calibration: methods to address bias from correlated covariate and time-to-event error. Statistics Med 2021;40:631–49. https://doi.org/10.1002/sim.8793.

22. Zhang, Z. Parametric regression model for survival data: Weibull regression model as an example. Ann Transl Med 2016;4:484. https://doi.org/10.21037/atm.2016.08.45.

23. Ross, RK, Cole, SR, Edwards, JK, Zivich, PN, Westreich, D, Daniels, JL, et al. Leveraging external validation data: the challenges of transporting measurement error parameters. Epidemiology 2024;35:196–207. https://doi.org/10.1097/ede.0000000000001701.

24. Ackerman, B, Siddique, J, Stuart, EA. Calibrating validation samples when accounting for measurement error in intervention studies. Stat Methods Med Res 2021;30:1235–48. https://doi.org/10.1177/0962280220988574.

25. Efron, B, Tibshirani, R. An introduction to the bootstrap. New York: Chapman & Hall; 1993. https://doi.org/10.1007/978-1-4899-4541-9.

26. Kumar, S, Paiva, B, Anderson, KC, Durie, B, Landgren, O, Moreau, P, et al. International Myeloma Working Group consensus criteria for response and minimal residual disease assessment in multiple myeloma. Lancet Oncol 2016;17:e328–46. https://doi.org/10.1016/S1470-2045(16)30206-6.

27. Facon, T, Kumar, SK, Plesner, T, Orlowski, RZ, Moreau, P, Bahlis, N, et al. Daratumumab, lenalidomide, and dexamethasone versus lenalidomide and dexamethasone alone in newly diagnosed multiple myeloma (MAIA): overall survival results from a randomised, open-label, phase 3 trial. Lancet Oncol 2021;22:1582–96. https://doi.org/10.1016/S1470-2045(21)00466-6.

28. Wang, JR, Hayden, J, Yue, Y, Meyer, CS, Gan, R, Zhang, Y, et al. Design and methodological considerations for real world data-derived progression-free survival in multiple myeloma. Forthcoming, ASH 2024. https://doi.org/10.1182/blood-2024-199224.

29. Rodriguez, G. Parametric survival models. Princeton: Princeton University, technical report; 2010. Available from: https://grodri.github.io/survival/ParametricSurvival.pdf.

30. Zhang, Q, Gossai, A, Monroe, S, Nussbaum, NC, Parrinello, CM. Validation analysis of a composite real-world mortality endpoint for patients with cancer in the United States. Health Serv Res 2021:1–7. https://doi.org/10.1111/1475-6773.13669.

31. Schneeweiss, S. Enhancing external control arm analyses through data calibration and hybrid designs. Clin Pharmacol Ther 2024;116:1168–73. https://doi.org/10.1002/cpt.3364.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/em-2025-0009).


Received: 2025-02-25
Accepted: 2025-09-07
Published Online: 2025-09-26

© 2025 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
