
Nearest-Neighbor Estimation for ROC Analysis under Verification Bias

  • Gianfranco Adimari and Monica Chiogna
Published/Copyright: March 13, 2015

Abstract

For a continuous-scale diagnostic test, the receiver operating characteristic (ROC) curve is a popular tool for displaying the ability of the test to discriminate between healthy and diseased subjects. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the test result and other characteristics of the subjects. Estimators of the ROC curve based only on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct for verification bias, in particular under the assumption that the true disease status, if missing, is missing at random (MAR). The MAR assumption means that the probability of missingness depends on the true disease status only through the test result and observed covariate information. However, the existing methods require parametric models for the (conditional) probability of disease and/or the (conditional) probability of verification, and hence are subject to model misspecification: a wrong specification of such parametric models can affect the behavior of the estimators, which can be inconsistent. To avoid misspecification problems, in this paper we propose a fully nonparametric method for the estimation of the ROC curve of a continuous test under verification bias. The method is based on nearest-neighbor imputation and adopts generic smooth regression models for both the probability that a subject is diseased and the probability that a subject is verified. Simulation experiments and an illustrative example show the usefulness of the new method. Variance estimation is also discussed.

1 Introduction

The evaluation of the ability of a diagnostic or a screening test to separate diseased from non-diseased subjects is a crucial issue in modern medicine. In fact, before applying a test in a clinical setting, rigorous statistical assessment of its performance in discriminating the disease status from the non-disease status is required.

Typically, in evaluating a diagnostic test’s discriminatory ability, the available data come from medical records of patients who undergo the test. The accuracy of the test under study is ideally evaluated by comparison with a perfect gold standard test, which assesses disease status with certainty. In practice, however, a gold standard may be too expensive, or too invasive or both for regular use. Hence, only a subset of patients undergoes disease verification, and the decision to send a patient to verification is often based on the test result and other patient characteristics. As noted by many authors (see Begg and Greenes [1], Begg [2] and Zhou [3], among others), summary measures of test performance based on data from patients with verified disease status only may be badly biased. This bias is usually referred to as verification bias.

For a diagnostic test that yields a continuous test result, the receiver operating characteristic (ROC) curve is a popular tool for displaying the ability of the test to discriminate between healthy and diseased subjects. The continuous test result can be dichotomized at a specified cutpoint. Given the cutpoint c, the sensitivity Se(c) is the probability of a true positive, i.e., the probability that the test correctly identifies a diseased subject. The specificity Sp(c) is the probability of a true negative, i.e. the probability that the test correctly identifies a non-diseased subject. When one varies the cutpoint throughout the entire real line, the resulting pairs (1-specificity, sensitivity) form the ROC curve. A commonly used summary measure that aggregates performance information of the test is the area under the ROC curve (AUC). See, for example, Zhou et al. [4] as a general reference.

In the presence of verification bias, under the assumption that the true disease status, if missing, is missing at random (MAR), estimation of the ROC curve of a continuous test, i.e., estimation of sensitivity and specificity, has been discussed in Alonzo and Pepe [5], where alternative estimators are reviewed and compared. The MAR assumption states that the probability of a subject having the disease status verified is determined purely by the test result and the subject's observed characteristics, and is conditionally independent of the unknown true disease status. This corresponds to a so-called ignorable missingness, which is often assumed in practice. Estimation of the ROC curve when the true disease status is subject to non-ignorable missingness is tackled in Fluss et al. [6] and Liu and Zhou [7]. In all these cases, however, inference on the ROC curve requires the specification of a parametric regression model for the probability of a subject being diseased and/or verified. A wrong specification of these parametric models affects the behavior of the estimators.

To reduce the effects of model misspecification, He and McDermott [8] propose a method that stratifies the verified sample into several subsamples that have homogeneous propensity scores (the conditional probabilities of verification) and allows correction for verification bias within each subsample. Parametric models are still used to estimate the propensity scores, but since the estimated propensity scores are only used for the purpose of stratification, the estimators of sensitivity and specificity are less sensitive to model misspecification. The method applies to binary tests under the MAR assumption.

In this paper, we propose a fully nonparametric method for the estimation of the ROC curve of a continuous test under verification bias. The proposed method is based on nearest-neighbor imputation and adopts generic smooth regression models for both the probability that a subject is diseased and the probability that it is verified. Our choice is motivated by the results in Ning and Cheng [9], according to which the nearest-neighbor imputation method favorably compares with other nonparametric imputation methods in estimating a population mean.

The estimators of the sensitivity and the specificity obtained by the new approach are shown to be consistent and asymptotically normal under the MAR assumption. Estimation of their variance is also discussed. Simulation results and an illustrative example show the usefulness of our proposal and its advantages over known estimators.

The paper is organized as follows. In Section 2, we give a brief review of existing methods for estimating the ROC curve under verification bias. Section 3 describes the proposed approach, giving theoretical justification. Section 4 presents some results of a simulation study carried out to compare the new method with the existing methods. In Section 5, we illustrate the method with an example, and Section 6 contains details about variance estimation. A concluding discussion is given in Section 7.

2 Background

In this section, we review current bias-correction methods in the presence of verification bias, as presented in Alonzo and Pepe [5]. Let $T_i$ denote the continuous result of a diagnostic test, and let $D_i$ denote the binary disease status, $i=1,\dots,n$, where $D_i=1$ indicates that the $i$th patient is diseased and $D_i=0$ indicates that the $i$th patient is free of disease. Let $V_i$ denote the binary verification status of the $i$th patient, with $V_i=1$ if the $i$th patient has the true disease status verified, and $V_i=0$ otherwise. In practice, some information other than the test result can be obtained for each patient. Let $X_i$ be a vector of observed covariates for the $i$th patient that may be associated with both $D_i$ and $V_i$.

When all patients are verified, i.e., $V_i=1$, $i=1,\dots,n$, a complete data set is obtained. In this case, for any cutpoint $c$, the sensitivity $Se(c)$ and the specificity $Sp(c)$ can easily be estimated by

\[
\widehat{Se}(c)=\frac{\sum_{i=1}^n I(T_i\ge c)\,D_i}{\sum_{i=1}^n D_i},\qquad
\widehat{Sp}(c)=\frac{\sum_{i=1}^n I(T_i< c)\,(1-D_i)}{\sum_{i=1}^n (1-D_i)},
\]

where $I(\cdot)$ is the indicator function. $\widehat{Se}(c)$ and $\widehat{Sp}(c)$ are unbiased estimators for $Se(c)$ and $Sp(c)$, respectively.
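For concreteness, the complete-data estimators can be written in a few lines (the function name and data layout are ours, not the paper's), with a subject counted as test-positive when $T_i \ge c$:

```python
import numpy as np

def se_sp_complete(T, D, c):
    """Complete-data estimators of sensitivity and specificity at cutpoint c.

    T: continuous test results; D: 0/1 disease status (all subjects verified).
    A subject is test-positive when T >= c.
    """
    T = np.asarray(T, dtype=float)
    D = np.asarray(D, dtype=int)
    se = np.sum((T >= c) & (D == 1)) / np.sum(D == 1)
    sp = np.sum((T < c) & (D == 0)) / np.sum(D == 0)
    return se, sp
```

Sweeping $c$ over the observed test values and collecting the pairs (1 - specificity, sensitivity) traces the empirical ROC curve.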

If not all patients have their disease status verified, several estimators based on the MAR assumption have been proposed. The MAR assumption states that the binary responses $D$ and $V$ are mutually independent given the test result $T$ and the covariates $X$, i.e.,

\[
\Pr(V=1\mid D,T,X)=\Pr(V=1\mid T,X). \tag{1}
\]

The so-called full imputation (FI) estimators of Se(c) and Sp(c) are

\[
\widehat{Se}_{FI}(c)=\frac{\sum_{i=1}^n I(T_i\ge c)\,\hat\rho_i}{\sum_{i=1}^n \hat\rho_i},\qquad
\widehat{Sp}_{FI}(c)=\frac{\sum_{i=1}^n I(T_i< c)\,(1-\hat\rho_i)}{\sum_{i=1}^n (1-\hat\rho_i)}.
\]

Parametric models, such as logistic regression models, have to be used to obtain the estimate $\hat\rho_i$ of $\rho_i=\Pr(D_i=1\mid T_i,X_i)$ using only data from verified subjects. Mean score imputation (MSI) is another possible approach, which imputes the disease status only for subjects who are not in the verification sample. In this case,

\[
\widehat{Se}_{MSI}(c)=\frac{\sum_{i=1}^n I(T_i\ge c)\{V_iD_i+(1-V_i)\hat\rho_i\}}{\sum_{i=1}^n \{V_iD_i+(1-V_i)\hat\rho_i\}},\qquad
\widehat{Sp}_{MSI}(c)=\frac{\sum_{i=1}^n I(T_i< c)\{V_i(1-D_i)+(1-V_i)(1-\hat\rho_i)\}}{\sum_{i=1}^n \{V_i(1-D_i)+(1-V_i)(1-\hat\rho_i)\}}.
\]

The inverse probability weighting (IPW) estimator weights each verified subject by the inverse of the probability that the subject is selected for verification. Therefore, the estimators of Se(c) and Sp(c) are

\[
\widehat{Se}_{IPW}(c)=\frac{\sum_{i=1}^n I(T_i\ge c)\,V_iD_i\hat\pi_i^{-1}}{\sum_{i=1}^n V_iD_i\hat\pi_i^{-1}},\qquad
\widehat{Sp}_{IPW}(c)=\frac{\sum_{i=1}^n I(T_i< c)\,V_i(1-D_i)\hat\pi_i^{-1}}{\sum_{i=1}^n V_i(1-D_i)\hat\pi_i^{-1}},
\]

where $\hat\pi_i$ is an estimate of $\pi_i=\Pr(V_i=1\mid T_i,X_i)$. Finally, the semiparametric efficient (SPE) estimators are

\[
\widehat{Se}_{SPE}(c)=\frac{\sum_{i=1}^n I(T_i\ge c)\{V_iD_i+(\hat\pi_i-V_i)\hat\rho_i\}\hat\pi_i^{-1}}{\sum_{i=1}^n \{V_iD_i+(\hat\pi_i-V_i)\hat\rho_i\}\hat\pi_i^{-1}},\qquad
\widehat{Sp}_{SPE}(c)=\frac{\sum_{i=1}^n I(T_i< c)\{V_i(1-D_i)+(\hat\pi_i-V_i)(1-\hat\rho_i)\}\hat\pi_i^{-1}}{\sum_{i=1}^n \{V_i(1-D_i)+(\hat\pi_i-V_i)(1-\hat\rho_i)\}\hat\pi_i^{-1}}.
\]

Alonzo and Pepe [5] find that the SPE estimators are doubly robust, in the sense that they are consistent if either the $\pi_i$'s or the $\rho_i$'s are estimated consistently.
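Given fitted probabilities, the four reviewed estimators of $Se(c)$ differ only in the weight attached to each subject. A minimal sketch, assuming the $\hat\rho_i$'s and $\hat\pi_i$'s have already been estimated (function and argument names are ours; the specificity estimators are analogous, with $I(T_i<c)$, $1-D_i$ and $1-\hat\rho_i$):

```python
import numpy as np

def corrected_se(T, D, V, rho, pi, c, method="SPE"):
    """Verification bias-corrected sensitivity estimators (sketch).

    rho[i]: estimate of Pr(D_i=1 | T_i, X_i); pi[i]: estimate of
    Pr(V_i=1 | T_i, X_i). D is used only where V == 1; it is zeroed
    for unverified subjects so the products below are well defined.
    """
    T, D, V = map(np.asarray, (T, D, V))
    rho, pi = np.asarray(rho), np.asarray(pi)
    D = np.where(V == 1, D, 0)
    pos = (T >= c).astype(float)
    if method == "FI":
        w = rho
    elif method == "MSI":
        w = V * D + (1 - V) * rho
    elif method == "IPW":
        w = V * D / pi
    elif method == "SPE":
        w = (V * D + (pi - V) * rho) / pi
    else:
        raise ValueError(method)
    return np.sum(pos * w) / np.sum(w)
```

When every subject is verified and $\hat\pi_i=1$, the MSI, IPW and SPE weights all reduce to $D_i$, recovering the complete-data estimator.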

3 The proposal

All the verification bias-corrected estimators of $Se(c)$ and $Sp(c)$ reviewed in the previous section require a regression model to be fitted for a binary response, $D$ or $V$. The FI and MSI approaches require estimates of the $\rho_i$'s, whereas the IPW approach requires estimates of the $\pi_i$'s. The SPE approach requires estimates of both the $\rho_i$'s and the $\pi_i$'s, although only one of the two sets of probabilities needs to be estimated consistently. Typically, suitable generalized linear regression models are employed to this end. However, a wrong specification of such parametric models might strongly affect the behavior of the estimators.

To avoid misspecification problems, in what follows we propose a fully nonparametric approach to the estimation of Se(c) and Sp(c). Our approach is based on the K-nearest-neighbor (KNN) imputation estimator of the mean of a response variable as discussed in Ning and Cheng [9].

Hereafter, we will assume $Y=(T,X)^T$ to be a continuous-valued random vector. Let $\theta_1$ be the disease prevalence, i.e., $\theta_1=E(D)=\Pr(D=1)$. As $\theta_1$ is a mean, following Ning and Cheng [9], for a finite positive integer $K$ and a suitable distance measure, a nearest-neighbor imputation estimator of $\theta_1$, based on the sample $(Y_i,D_i,V_i)$, $i=1,\dots,n$, may be defined as

\[
\hat\theta_1=\frac{1}{n}\sum_{i=1}^n\{V_iD_i+(1-V_i)\hat\rho_{Ki}\}, \tag{2}
\]

where $\hat\rho_{Ki}=\frac{1}{K}\sum_{j=1}^K D_{i(j)}$, $\{(Y_{i(j)},D_{i(j)})\colon V_{i(j)}=1,\ j=1,\dots,K\}$ is a set of $K$ observed data pairs, and $Y_{i(j)}$ denotes the $j$th nearest neighbor to $Y_i=(T_i,X_i)^T$ among all the $Y$'s corresponding to the verified patients, i.e., to those $D_h$'s with $V_h=1$.

Let $\theta_2=\Pr(T\ge c, D=1)$ and $\theta_3=\Pr(T\ge c, D=0)$. Then $Se(c)=\theta_2/\theta_1$ and $Sp(c)=(1-\theta_1-\theta_3)/(1-\theta_1)$. Similarly to $\hat\theta_1$, KNN estimators for $\theta_2$ and $\theta_3$ can be defined as:

\[
\hat\theta_2=\frac{1}{n}\sum_{i=1}^n I(T_i\ge c)\{V_iD_i+(1-V_i)\hat\rho_{Ki}\},\qquad
\hat\theta_3=\frac{1}{n}\sum_{i=1}^n I(T_i\ge c)\{V_i(1-D_i)+(1-V_i)(1-\hat\rho_{Ki})\}.
\]

Therefore

\[
\widehat{Se}_{KNN}(c)=\frac{\hat\theta_2}{\hat\theta_1}\qquad\text{and}\qquad
\widehat{Sp}_{KNN}(c)=\frac{1-\hat\theta_1-\hat\theta_3}{1-\hat\theta_1},
\]

are KNN imputation estimators for the sensitivity $Se(c)$ and the specificity $Sp(c)$, respectively. The following theorem gives the asymptotic normality of $\widehat{Se}_{KNN}(c)$ and $\widehat{Sp}_{KNN}(c)$.

Let ρ(y)=Pr(D=1|Y=y) and π(y)=Pr(V=1|Y=y).

Theorem 1. Assume (1) and first-order differentiability of the functions $\rho(y)$ and $\pi(y)$. Moreover, assume that $E\{1/\pi(Y)\}<\infty$. Then, for a fixed cutpoint $c$, the KNN imputation estimators $\widehat{Se}_{KNN}(c)$ and $\widehat{Sp}_{KNN}(c)$, based on the sample $(Y_i,D_i,V_i)$, $i=1,\dots,n$, are consistent and asymptotically normally distributed.

Proof 1. Since $E(D^2)<\infty$, $\mathrm{Var}(D\mid Y=y)=\rho(y)(1-\rho(y))<\infty$, and $\rho(y)$ and $\pi(y)$ are finite and first-order differentiable, by Theorem 1 in Ning and Cheng [9] the KNN imputation estimator $\hat\theta_1$ is consistent and asymptotically normally distributed, that is,
\[
\sqrt{n}(\hat\theta_1-\theta_1)\to N(0,\sigma_1^2), \tag{3}
\]
as $n$ goes to infinity, where
\[
\sigma_1^2=\theta_1(1-\theta_1)+E\{\rho(Y)(1-\rho(Y))(1-\pi(Y))\}\left(1+\frac{1}{K}\right)+E\left\{\frac{\rho(Y)(1-\rho(Y))(1-\pi(Y))^2}{\pi(Y)}\right\}.
\]
Moreover, one can write
\[
\hat\theta_2=\frac{1}{n}\sum_{i=1}^n I(T_i\ge c)\{V_iD_i+(1-V_i)\rho_i\}+\frac{1}{n}\sum_{i=1}^n I(T_i\ge c)(1-V_i)(\hat\rho_{Ki}-\rho_i), \tag{4}
\]
and
\[
\hat\theta_3=\frac{1}{n}\sum_{i=1}^n I(T_i\ge c)\{V_i(1-D_i)+(1-V_i)(1-\rho_i)\}-\frac{1}{n}\sum_{i=1}^n I(T_i\ge c)(1-V_i)(\hat\rho_{Ki}-\rho_i). \tag{5}
\]
Hence, the conditions stated in the theorem allow us to apply the arguments given in the proof of Theorem 1 in Ning and Cheng [9], showing, in particular, that
\[
\frac{1}{n}\sum_{i=1}^n I(T_i\ge c)(1-V_i)(\hat\rho_{Ki}-\rho_i)=W+o_p(n^{-1/2}),
\]
where
\[
W=\frac{1}{n}\sum_{i=1}^n I(T_i\ge c)(1-V_i)\frac{1}{K}\sum_{j=1}^K V_{i(j)}\left(D_{i(j)}-\rho_{i(j)}\right),
\]
and
\[
\sqrt{n}\,W\to N\!\left(0,\ \frac{1}{K}E\{(1-\pi(Y))\sigma^2(Y)\}+E\left\{\frac{(1-\pi(Y))^2\sigma^2(Y)}{\pi(Y)}\right\}\right)
\]
in distribution (here, $\sigma^2(Y)$ denotes the conditional variance of $I(T\ge c,D=1)$ given $Y$). Furthermore, $\sqrt{n}\,W$ behaves asymptotically as a sample mean. This, together with an application of the standard central limit theorem to the first term on the right-hand side of equations (4) and (5), leads to asymptotic results for $\hat\theta_2$ and $\hat\theta_3$ similar to that in (3). That is,
\[
\sqrt{n}(\hat\theta_2-\theta_2)\to N(0,\sigma_2^2),\qquad \sqrt{n}(\hat\theta_3-\theta_3)\to N(0,\sigma_3^2),
\]
as $n$ goes to infinity, for suitable values $\sigma_2^2$ and $\sigma_3^2$ (see also Section 6). Finally, $\hat\theta_1$, $\hat\theta_2$ and $\hat\theta_3$ are jointly asymptotically normal. Thus, by a standard application of the delta method, $\widehat{Se}_{KNN}(c)=\hat\theta_2/\hat\theta_1$ and $\widehat{Sp}_{KNN}(c)=(1-\hat\theta_1-\hat\theta_3)/(1-\hat\theta_1)$ are consistent and asymptotically normal estimators of $Se(c)$ and $Sp(c)$, respectively.

It is straightforward to show that the estimators $\widehat{Se}_{KNN}(c)$ and $\widehat{Sp}_{KNN}(c)$ are nonparametric versions of the MSI estimators, i.e.,

\[
\widehat{Se}_{KNN}(c)=\frac{\sum_{i=1}^n I(T_i\ge c)\{V_iD_i+(1-V_i)\hat\rho_{Ki}\}}{\sum_{i=1}^n \{V_iD_i+(1-V_i)\hat\rho_{Ki}\}},\qquad
\widehat{Sp}_{KNN}(c)=\frac{\sum_{i=1}^n I(T_i< c)\{V_i(1-D_i)+(1-V_i)(1-\hat\rho_{Ki})\}}{\sum_{i=1}^n \{V_i(1-D_i)+(1-V_i)(1-\hat\rho_{Ki})\}}. \tag{6}
\]

Clearly, by varying $c$, the pairs $(1-\widehat{Sp}_{KNN}(c),\widehat{Se}_{KNN}(c))$ give rise to a nonparametric verification bias-corrected estimate of the ROC curve. Moreover, it is worth noting that (2) gives a fully nonparametric estimator of the disease prevalence, an alternative to the estimators obtained by the FI, MSI, IPW and SPE methods discussed in Alonzo and Pepe [10].
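A minimal implementation of the estimators in eq. (6), with Euclidean distance on $Y_i=(T_i,X_i)$, can be sketched as follows (function names and data layout are ours):

```python
import numpy as np

def knn_se_sp(T, X, D, V, c, K=3):
    """KNN imputation estimators of Se(c) and Sp(c), as in eq. (6) (sketch).

    The disease status of each unverified subject is imputed as the
    average D over its K nearest verified neighbors in Y = (T, X).
    """
    T, X, D, V = (np.asarray(a, dtype=float) for a in (T, X, D, V))
    Y = np.column_stack([T, X])
    ver = np.flatnonzero(V == 1)
    w_dis = np.where(V == 1, D, 0.0)        # V_i D_i  (diseased weight)
    w_hea = np.where(V == 1, 1.0 - D, 0.0)  # V_i (1 - D_i)
    for i in np.flatnonzero(V == 0):
        dist = np.linalg.norm(Y[ver] - Y[i], axis=1)
        rho_ki = D[ver[np.argsort(dist)[:K]]].mean()  # \hat\rho_{Ki}
        w_dis[i] = rho_ki                   # (1 - V_i) \hat\rho_{Ki}
        w_hea[i] = 1.0 - rho_ki
    se = np.sum((T >= c) * w_dis) / np.sum(w_dis)
    sp = np.sum((T < c) * w_hea) / np.sum(w_hea)
    return se, sp
```

Sweeping $c$ and plotting the pairs $(1-\widehat{Sp}_{KNN}(c),\widehat{Se}_{KNN}(c))$ traces the bias-corrected ROC estimate.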

In practice, the use of our estimators requires selecting the neighborhood size $K$ and a suitable distance measure. These aspects, touched upon in the following section, are discussed in Section S1 of the Supplementary Material.

4 Simulation study

In this section, Monte Carlo experiments are used to compare the new method with existing approaches with respect to bias and standard deviation. In particular, we compare the ability of the MSI, IPW, SPE and KNN methods to estimate the sensitivity and the specificity of a test. We do not consider the FI method because of its similarities with the MSI method. As for the KNN method, we give the results for the estimators based on the quite commonly used Euclidean distance and on values of K equal to 1 and 3. This choice is supported by the results of a preliminary simulation study, in which KNN estimators based on various distance measures (Manhattan, Euclidean, Lagrange and Mahalanobis) and on different neighborhood sizes (K=1,3,5,10,20) have been compared (see Supplementary Material, Section S1).

From Section 2, the MSI method requires a parametric model for $\rho(y)$, the IPW method requires a parametric model for $\pi(y)$, and the SPE method requires both models. A wrong specification of such models may affect the estimation. Hence, in the simulation study we consider two scenarios: (i) the models for $\rho(y)$ and $\pi(y)$ are both correctly specified; (ii) the models for $\rho(y)$ and $\pi(y)$ are both misspecified. Scenario (i) allows us to evaluate the behavior of the proposed estimators in samples of moderate size, where the MSI, IPW and SPE estimators are expected to behave well. On the other hand, scenario (ii) allows us to look for weaknesses of the existing methods and to highlight the potential advantages of the new proposal.

Simulation settings are similar to those in Alonzo and Pepe [5] and He and McDermott [8]. Starting from two independent random variables $Z_1\sim N(0,0.5)$ and $Z_2\sim N(0,0.5)$, the disease indicator $D$ is specified as $D=I[g(Z_1,Z_2)>r_1]$. The threshold $r_1$ determines the disease prevalence (in what follows, we choose $r_1$ to make the disease prevalence 0.25), and different specifications of the function $g(Z_1,Z_2)$ give rise to different disease processes. The diagnostic test result $T$ and an auxiliary covariate $X$ are generated to be related to $D$ through $Z_1$ and $Z_2$. More precisely, $T=h(Z_1,Z_2)+\varepsilon_1$ and $X=f(Z_1,Z_2)+\varepsilon_2$, for suitable functions $h(\cdot,\cdot)$ and $f(\cdot,\cdot)$, where $\varepsilon_1$ and $\varepsilon_2$ are independent $N(0,0.25)$ random variables, also independent of $Z_1$ and $Z_2$. Finally, the verification probability $\pi$ is set to be a suitable function of $T$ and $X$, in accordance with the MAR assumption. The number of replicates in each simulation experiment is 5,000.

  (i) Models for $\rho(y)$ and $\pi(y)$ both correctly specified.

We set $g(Z_1,Z_2)=f(Z_1,Z_2)=Z_1+Z_2$, $h(Z_1,Z_2)=\alpha(Z_1+Z_2)$, and
\[
\pi(T,X)=\frac{e^{\delta_0+\delta_1 T+\delta_2 X}}{1+e^{\delta_0+\delta_1 T+\delta_2 X}}.
\]
We fix $\delta_0=0.05$, $\delta_1=0.9$, $\delta_2=0.7$. This choice corresponds to a verification rate of about 0.51. As for $\alpha$, we choose three different values, namely 0.5, 1 and 1.5, which give rise to different variances of $T$, as well as to different correlations between $T$ and $X$, with larger values of $\alpha$ yielding higher variances and correlations. In particular, in going from $\alpha=0.5$ to $\alpha=1.5$ the variance of $T$ becomes five times greater. Moreover, we consider four values for the cutpoint $c$, namely 0.2, 0.5, 0.8 and 1.2. Each combination $(\alpha,c)$ determines a different true value for the pair (sensitivity, specificity), given by

\[
Se(c)=\frac{\int_{r_1}^{+\infty}\left\{1-\Phi\!\left(\frac{c-\alpha z}{\sqrt{0.25}}\right)\right\}\varphi(z)\,dz}{1-\Phi(r_1)},\qquad
Sp(c)=\frac{\int_{-\infty}^{r_1}\Phi\!\left(\frac{c-\alpha z}{\sqrt{0.25}}\right)\varphi(z)\,dz}{\Phi(r_1)},
\]

where φ() and Φ() are the density function and the cumulative distribution function of the standard normal random variable, respectively.
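The integrals above have no closed form but are easy to evaluate numerically. The sketch below (our code, not the authors') approximates them on a grid, with $r_1=\Phi^{-1}(0.75)\approx 0.6745$ hard-coded so that the prevalence is 0.25:

```python
import math
import numpy as np

def true_se_sp(alpha, c, r1=0.6744898, sd_eps=0.5, n_grid=200001, lim=8.0):
    """Numerical evaluation of the true Se(c) and Sp(c) in scenario (i).

    Z = Z1 + Z2 ~ N(0, 1); D = I[Z > r1]; T = alpha * Z + eps with
    eps ~ N(0, 0.25), so sd_eps = 0.5.  r1 = Phi^{-1}(0.75) makes the
    disease prevalence 1 - Phi(r1) = 0.25.
    """
    z = np.linspace(-lim, lim, n_grid)
    dz = z[1] - z[0]
    phi = np.exp(-z ** 2 / 2.0) / math.sqrt(2.0 * math.pi)   # standard normal density
    erf = np.vectorize(math.erf)
    Phi_inner = 0.5 * (1.0 + erf((c - alpha * z) / (sd_eps * math.sqrt(2.0))))
    Phi_r1 = 0.5 * (1.0 + math.erf(r1 / math.sqrt(2.0)))
    se = np.sum((z > r1) * (1.0 - Phi_inner) * phi) * dz / (1.0 - Phi_r1)
    sp = np.sum((z <= r1) * Phi_inner * phi) * dz / Phi_r1
    return se, sp
```

For instance, $\alpha=1$, $c=0.5$ should reproduce the "True" row of Table 1 (0.874, 0.855) up to rounding.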

According to the aim of the study in this scenario, we fix two sample sizes: a relatively small one, $n=50$, and a moderate one, $n=100$. This allows us to evaluate the behavior of the proposed estimators in settings where the MSI, IPW and SPE estimators are expected to behave well.

To estimate the conditional disease probabilities, we use a generalized linear model for $D$ given $T$ and $X$ with probit link; this model is correctly specified (see Alonzo and Pepe [5]). The conditional verification probabilities are estimated from a logistic regression model with $V$ as the response and $T$ and $X$ as predictors; this model is also correctly specified.
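In practice these working models are fitted with standard GLM software; purely as an illustration, a logistic regression fit via Newton-Raphson can be sketched as follows (function names are ours):

```python
import numpy as np

def fit_logistic(Z, y, n_iter=25, ridge=1e-8):
    """Newton-Raphson fit of a logistic regression with intercept (sketch).

    Z: (n, p) matrix of predictors; y: 0/1 responses.
    Returns the coefficient vector (intercept first).
    """
    Z = np.column_stack([np.ones(len(Z)), np.asarray(Z, dtype=float)])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        w = p * (1.0 - p)                                  # IRLS weights
        hess = (Z * w[:, None]).T @ Z + ridge * np.eye(Z.shape[1])
        beta = beta + np.linalg.solve(hess, Z.T @ (y - p))  # Newton step
    return beta

def predict_prob(beta, Znew):
    """Fitted probabilities at new predictor values."""
    Znew = np.column_stack([np.ones(len(Znew)), np.asarray(Znew, dtype=float)])
    return 1.0 / (1.0 + np.exp(-Znew @ beta))
```

With data generated under the verification model of scenario (i), the fitted coefficients should be close to the true $(\delta_0,\delta_1,\delta_2)$ for large $n$.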

Tables 1 and 2 show Monte Carlo means and standard deviations (in brackets) of the estimators of the sensitivity and the specificity. Results concern the IPW, MSI and SPE estimators and the new proposals 1NN and 3NN, i.e., the KNN estimator with $K=1$ and $K=3$, respectively, computed using the Euclidean distance. The simulation results make clear that all of the methods behave well when both parametric models for $\rho(y)$ and $\pi(y)$ are correctly specified, with the IPW method showing slightly poorer performance in some circumstances. In terms of bias and standard deviation, the new proposals compare very well with the existing estimators. Moreover, the estimators 1NN and 3NN seem to achieve similar performance, making the choice of the number $K$ of nearest neighbors not particularly crucial (within the range 1 to 3).

Table 1:

Monte Carlo means and standard deviations (in brackets) of the estimators for the sensitivity and the specificity, when the models for ρ(y) and π(y) are correctly specified. “True” denotes the true parameter value. Sample size = 50.

             c = 0.2        c = 0.5        c = 0.8        c = 1.2

α = 0.5
Sensitivity
True         0.782          0.590          0.377          0.154
IPW          0.787 (0.146)  0.598 (0.169)  0.384 (0.157)  0.163 (0.116)
MSI          0.783 (0.133)  0.595 (0.158)  0.383 (0.147)  0.161 (0.109)
SPE          0.783 (0.140)  0.595 (0.161)  0.382 (0.149)  0.162 (0.110)
1NN          0.783 (0.143)  0.596 (0.166)  0.382 (0.153)  0.162 (0.112)
3NN          0.775 (0.136)  0.587 (0.159)  0.376 (0.147)  0.159 (0.109)
Specificity
True         0.742          0.877          0.953          0.992
IPW          0.735 (0.100)  0.873 (0.070)  0.953 (0.043)  0.992 (0.017)
MSI          0.742 (0.074)  0.877 (0.057)  0.955 (0.035)  0.993 (0.014)
SPE          0.742 (0.075)  0.876 (0.058)  0.954 (0.036)  0.993 (0.015)
1NN          0.742 (0.076)  0.877 (0.058)  0.954 (0.036)  0.992 (0.015)
3NN          0.741 (0.074)  0.875 (0.057)  0.954 (0.035)  0.992 (0.015)

α = 1
Sensitivity
True         0.951          0.874          0.742          0.513
IPW          0.954 (0.077)  0.877 (0.117)  0.746 (0.148)  0.520 (0.159)
MSI          0.951 (0.067)  0.875 (0.110)  0.747 (0.140)  0.516 (0.152)
SPE          0.953 (0.076)  0.876 (0.118)  0.746 (0.142)  0.516 (0.153)
1NN          0.950 (0.079)  0.872 (0.117)  0.744 (0.142)  0.515 (0.156)
3NN          0.944 (0.077)  0.863 (0.117)  0.735 (0.142)  0.509 (0.153)
Specificity
True         0.745          0.855          0.931          0.982
IPW          0.729 (0.105)  0.846 (0.078)  0.926 (0.052)  0.980 (0.027)
MSI          0.745 (0.074)  0.856 (0.060)  0.931 (0.043)  0.982 (0.022)
SPE          0.745 (0.074)  0.856 (0.061)  0.931 (0.044)  0.982 (0.023)
1NN          0.745 (0.075)  0.856 (0.062)  0.931 (0.044)  0.981 (0.024)
3NN          0.744 (0.074)  0.855 (0.060)  0.930 (0.043)  0.981 (0.023)

α = 1.5
Sensitivity
True         0.991          0.969          0.918          0.784
IPW          0.992 (0.031)  0.973 (0.058)  0.920 (0.092)  0.787 (0.133)
MSI          0.990 (0.028)  0.970 (0.053)  0.920 (0.086)  0.786 (0.130)
SPE          0.991 (0.032)  0.971 (0.057)  0.920 (0.089)  0.785 (0.131)
1NN          0.990 (0.033)  0.970 (0.060)  0.918 (0.092)  0.783 (0.134)
3NN          0.986 (0.038)  0.965 (0.061)  0.912 (0.091)  0.776 (0.133)
Specificity
True         0.731          0.822          0.897          0.963
IPW          0.700 (0.121)  0.803 (0.092)  0.886 (0.069)  0.958 (0.040)
MSI          0.730 (0.076)  0.824 (0.065)  0.898 (0.053)  0.963 (0.032)
SPE          0.731 (0.076)  0.824 (0.066)  0.898 (0.053)  0.963 (0.033)
1NN          0.731 (0.077)  0.824 (0.067)  0.898 (0.054)  0.962 (0.033)
3NN          0.731 (0.076)  0.824 (0.065)  0.897 (0.053)  0.962 (0.032)
Table 2:

Monte Carlo means and standard deviations (in brackets) of the estimators for the sensitivity and the specificity, when the models for ρ(y) and π(y) are correctly specified. “True” denotes the true parameter value. Sample size = 100.

             c = 0.2        c = 0.5        c = 0.8        c = 1.2

α = 0.5
Sensitivity
True         0.782          0.590          0.377          0.154
IPW          0.785 (0.102)  0.595 (0.116)  0.379 (0.109)  0.159 (0.079)
MSI          0.785 (0.091)  0.595 (0.107)  0.380 (0.103)  0.159 (0.076)
SPE          0.785 (0.097)  0.594 (0.110)  0.379 (0.104)  0.159 (0.076)
1NN          0.783 (0.101)  0.594 (0.115)  0.378 (0.106)  0.159 (0.077)
3NN          0.780 (0.096)  0.590 (0.110)  0.376 (0.104)  0.158 (0.075)
Specificity
True         0.742          0.877          0.953          0.992
IPW          0.738 (0.068)  0.877 (0.047)  0.954 (0.029)  0.992 (0.012)
MSI          0.742 (0.052)  0.878 (0.038)  0.955 (0.025)  0.992 (0.010)
SPE          0.742 (0.053)  0.878 (0.039)  0.954 (0.025)  0.992 (0.011)
1NN          0.742 (0.054)  0.878 (0.040)  0.954 (0.026)  0.992 (0.011)
3NN          0.741 (0.053)  0.877 (0.039)  0.954 (0.025)  0.992 (0.011)

α = 1
Sensitivity
True         0.951          0.874          0.742          0.513
IPW          0.952 (0.054)  0.875 (0.081)  0.746 (0.102)  0.517 (0.112)
MSI          0.950 (0.046)  0.875 (0.074)  0.746 (0.096)  0.516 (0.108)
SPE          0.951 (0.053)  0.875 (0.078)  0.746 (0.099)  0.516 (0.109)
1NN          0.950 (0.056)  0.873 (0.083)  0.744 (0.103)  0.515 (0.111)
3NN          0.947 (0.053)  0.870 (0.079)  0.741 (0.099)  0.512 (0.109)
Specificity
True         0.745          0.855          0.931          0.982
IPW          0.738 (0.073)  0.851 (0.052)  0.929 (0.036)  0.982 (0.018)
MSI          0.745 (0.052)  0.855 (0.042)  0.931 (0.030)  0.982 (0.015)
SPE          0.746 (0.053)  0.855 (0.043)  0.931 (0.031)  0.982 (0.016)
1NN          0.745 (0.054)  0.855 (0.044)  0.931 (0.032)  0.982 (0.016)
3NN          0.745 (0.053)  0.855 (0.043)  0.931 (0.031)  0.982 (0.016)

α = 1.5
Sensitivity
True         0.991          0.969          0.918          0.784
IPW          0.992 (0.022)  0.970 (0.041)  0.918 (0.064)  0.785 (0.093)
MSI          0.991 (0.018)  0.969 (0.036)  0.918 (0.059)  0.785 (0.090)
SPE          0.992 (0.022)  0.970 (0.040)  0.918 (0.063)  0.784 (0.091)
1NN          0.991 (0.024)  0.969 (0.043)  0.917 (0.066)  0.783 (0.094)
3NN          0.990 (0.022)  0.967 (0.041)  0.914 (0.063)  0.780 (0.091)
Specificity
True         0.731          0.822          0.897          0.963
IPW          0.713 (0.082)  0.812 (0.062)  0.893 (0.046)  0.961 (0.026)
MSI          0.731 (0.052)  0.823 (0.045)  0.898 (0.036)  0.963 (0.022)
SPE          0.731 (0.052)  0.823 (0.046)  0.898 (0.037)  0.963 (0.023)
1NN          0.731 (0.053)  0.824 (0.047)  0.898 (0.037)  0.963 (0.023)
3NN          0.731 (0.053)  0.823 (0.046)  0.898 (0.037)  0.963 (0.022)

Tables 1 and 2 also allow us to gain insight into the effect on the results of the different variances of $T$ and the different correlations between $T$ and $X$. By crossing values of $\alpha$ and $c$ that give rise to comparable values of sensitivity or specificity, one can note that, for all considered estimators, the obtained Monte Carlo means and standard deviations are essentially unaffected by the different values of variance and correlation. As far as sensitivity is concerned, for example, one can compare the results obtained for $\alpha=0.5$ and $c=0.2$ (true sensitivity 0.782) with those obtained for $\alpha=1.5$ and $c=1.2$ (true sensitivity 0.784). As for specificity, one can compare the results obtained for $\alpha=0.5$ and $c=0.2$ (true specificity 0.742) with those obtained for $\alpha=1$ and $c=0.2$ (true specificity 0.745) or $\alpha=1.5$ and $c=0.2$ (true specificity 0.731).

As pointed out by a Referee, the values chosen for $\alpha$ in Tables 1 and 2 refer to situations where the diagnostic test performs well, i.e., where the true AUC value ranges from 0.85 to 0.97. The performance of the estimators in situations where the true AUC value of the test is relatively small (0.59 and 0.71) is given in Section S2 of the Supplementary Material. Those results show the same behavior as that in Tables 1 and 2.

Simulation results allowing to explore the effect of a multidimensional vector of auxiliary covariates are given in Section S4, Supplementary Material. A vector X of dimension 3 is employed. Compared with results in Tables 1 and 2, results in Tables 10 and 11, Supplementary Material, show some loss of efficiency of the KNN estimators with respect to the parametric competitors.

  (ii) Models for $\rho(y)$ and $\pi(y)$ both misspecified.

We set $g(Z_1,Z_2)=\exp\{2(Z_1-Z_2)^2\}$, $h(Z_1,Z_2)=2(Z_1-Z_2)^2$, $f(Z_1,Z_2)=2(Z_1^2+Z_2^2)$, and $\pi(T,X)=0.05+\delta\, I[T>1.2]+(1-0.05-\delta)\, I[X>1.95]$. In this case, the verification probabilities are: 1 for subjects with $T>1.2$ and $X>1.95$; $1-\delta$ for subjects with $T\le 1.2$ and $X>1.95$; $0.05+\delta$ for subjects with $T>1.2$ and $X\le 1.95$; and 0.05 otherwise. The values 1.2 and 1.95 correspond roughly to the 92nd and 86th percentiles of the distributions of $T$ and $X$, respectively. The value of $\delta$ is allowed to range from 0.1 to 0.9 in steps of 0.2. By varying $\delta$, one can vary the strength of the dependence among $V$, $T$ and $X$: small values of $\delta$ indicate a strong dependence of $V$ on $X$, whereas high values indicate a strong dependence of $V$ on $T$. Finally, for the cutpoint $c$ we choose three different values, namely 0.2, 0.4 and 0.6, which give rise to three different values for the target pair (sensitivity, specificity). The aim in this scenario is to compare the estimators when the complete data set provides a great amount of information, in order to highlight possible weaknesses of the competitors of our KNN estimators, in particular their possible inconsistency. Therefore, the sample size should be high enough to guarantee both reliable estimates from the complete data set and a sufficiently large number of verified healthy and diseased subjects. In the setting of scenario (ii), for $\delta$ going from 0.1 to 0.9, the verification rate ranges roughly from 0.29 to 0.18 overall and, within healthy subjects, from 0.11 to 0.05. This has led us to the choice of $n=1000$.
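The verification rule of this scenario is easy to encode and check against the four probabilities listed above (the function name is ours):

```python
def pi_verify(t, x, delta):
    """Scenario (ii) verification probability (sketch of the formula above)."""
    # 0.05 + delta * I[T > 1.2] + (1 - 0.05 - delta) * I[X > 1.95]
    return 0.05 + delta * (t > 1.2) + (1 - 0.05 - delta) * (x > 1.95)
```

For example, with $\delta=0.3$ the four cases evaluate to 1, 0.7, 0.35 and 0.05, matching the list above.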

To estimate the conditional disease probabilities, we use a generalized linear model for $D$ given $T$ and $X$ with logit link; this model is misspecified. The conditional verification probabilities are estimated from a logistic regression model with $V$ as the response and $T$ as the only predictor; clearly, this model is also misspecified.

Table 3 presents Monte Carlo means and standard deviations (across 5,000 replications) for the estimators of the sensitivity and the specificity. Results concern the IPW, MSI, SPE, 1NN and 3NN estimators. Moreover, results for the estimators based on complete data (denoted by "Full" in the table), that is, with all cases verified, are also presented. Given the large sample size used in this setting, we expect the Monte Carlo means of the Full estimators to be a good approximation of the true values of the sensitivity and the specificity; they are therefore used as benchmark values.

Table 3:

Mean estimated sensitivity, mean estimated specificity and standard deviation (in brackets) from 5,000 replications when both models for ρ(y) and π(y) are misspecified and the cutpoint c is set equal to 0.2, 0.4, 0.6. “Full” indicates the estimator based on complete data, which does not change with δ. Sample size = 1,000.

δ      IPW            MSI            SPE            1NN            3NN            Full

Sensitivity, c = 0.2
0.1    0.778 (0.052)  0.868 (0.029)  0.846 (0.051)  0.888 (0.055)  0.885 (0.047)
0.3    0.767 (0.060)  0.877 (0.029)  0.858 (0.061)  0.889 (0.054)  0.885 (0.048)
0.5    0.752 (0.077)  0.887 (0.030)  0.870 (0.083)  0.886 (0.057)  0.882 (0.049)  0.888 (0.020)
0.7    0.736 (0.108)  0.898 (0.032)  0.893 (0.158)  0.886 (0.060)  0.880 (0.052)
0.9    0.744 (0.169)  0.903 (0.040)  0.929 (0.731)  0.879 (0.067)  0.871 (0.058)
Sensitivity, c = 0.4
0.1    0.685 (0.051)  0.778 (0.039)  0.759 (0.050)  0.818 (0.064)  0.814 (0.056)
0.3    0.679 (0.060)  0.794 (0.040)  0.778 (0.060)  0.821 (0.065)  0.816 (0.056)
0.5    0.671 (0.074)  0.810 (0.041)  0.796 (0.077)  0.821 (0.066)  0.816 (0.058)  0.820 (0.024)
0.7    0.663 (0.100)  0.828 (0.043)  0.819 (0.250)  0.820 (0.068)  0.813 (0.060)
0.9    0.684 (0.164)  0.839 (0.056)  0.908 (3.428)  0.813 (0.080)  0.803 (0.070)
Sensitivity, c = 0.6
0.1    0.593 (0.049)  0.672 (0.046)  0.658 (0.051)  0.738 (0.069)  0.734 (0.061)
0.3    0.590 (0.056)  0.691 (0.047)  0.678 (0.058)  0.739 (0.069)  0.734 (0.061)
0.5    0.588 (0.070)  0.713 (0.049)  0.701 (0.072)  0.737 (0.071)  0.732 (0.063)  0.737 (0.028)
0.7    0.593 (0.093)  0.739 (0.053)  0.733 (0.103)  0.738 (0.074)  0.731 (0.066)
0.9    0.638 (0.153)  0.759 (0.066)  0.867 (3.715)  0.732 (0.084)  0.724 (0.074)
Specificity, c = 0.2
0.1    0.776 (0.031)  0.649 (0.024)  0.641 (0.028)  0.603 (0.026)  0.603 (0.024)
0.3    0.799 (0.032)  0.637 (0.023)  0.630 (0.028)  0.603 (0.026)  0.603 (0.024)
0.5    0.827 (0.032)  0.626 (0.023)  0.620 (0.031)  0.603 (0.027)  0.603 (0.024)  0.602 (0.018)
0.7    0.865 (0.033)  0.614 (0.021)  0.607 (0.096)  0.603 (0.028)  0.603 (0.025)
0.9    0.914 (0.030)  0.604 (0.021)  0.613 (0.661)  0.602 (0.029)  0.602 (0.026)
Specificity, c = 0.4
0.1    0.870 (0.022)  0.788 (0.021)  0.782 (0.024)  0.742 (0.025)  0.742 (0.022)
0.3    0.885 (0.022)  0.776 (0.021)  0.772 (0.024)  0.743 (0.025)  0.743 (0.022)
0.5    0.903 (0.021)  0.765 (0.020)  0.762 (0.024)  0.743 (0.025)  0.743 (0.022)  0.743 (0.016)
0.7    0.926 (0.019)  0.755 (0.020)  0.751 (0.028)  0.743 (0.025)  0.743 (0.022)
0.9    0.956 (0.016)  0.745 (0.019)  0.739 (0.221)  0.743 (0.027)  0.744 (0.023)
Specificity, c = 0.6
0.1    0.932 (0.015)  0.887 (0.016)  0.885 (0.018)  0.852 (0.022)  0.852 (0.018)
0.3    0.939 (0.014)  0.879 (0.016)  0.876 (0.018)  0.852 (0.021)  0.852 (0.018)
0.5    0.948 (0.013)  0.870 (0.016)  0.868 (0.018)  0.852 (0.020)  0.853 (0.018)  0.852 (0.013)
0.7    0.961 (0.011)  0.862 (0.016)  0.860 (0.019)  0.852 (0.021)  0.853 (0.018)
0.9    0.976 (0.009)  0.854 (0.015)  0.851 (0.055)  0.852 (0.022)  0.854 (0.018)

Table 3 clearly shows the limitations of the parametric estimators when the models for $\rho(y)$ and $\pi(y)$ are misspecified. In particular, in terms of bias, the IPW, MSI and SPE methods almost always perform poorly, with high distortion in some cases. Moreover, the Monte Carlo standard deviations shown in the table indicate that the SPE method (and sometimes also the IPW method) can yield very unstable estimates. In fact, the SPE estimates may even fall outside the interval $(0,1)$; in our simulations, in the worst case this happened about 20 times across 5,000 replications.

Overall, the new estimators 1NN and 3NN perform well in terms of both bias and standard deviation. In particular, they yield estimates that are, in all cases, close to the full-data estimates (see also the results in Section S3 of the Supplementary Material, where some simulations were produced for a smaller sample size). The estimator 3NN appears to be slightly more biased than 1NN but, on the other hand, slightly less variable. Note that in this setting the function $\pi(y)$ governing the verification process is not smooth, so the KNN estimators also seem to show some degree of robustness against violations of the smoothness assumptions. This is not surprising because, as stated in Section 2 of Ning and Cheng [9], "the NN rule is basically unaffected by discontinuity of $\pi(y)$, sparse data or multi-dimensional covariate".

5 An illustration

To illustrate the application of the method developed in the previous sections, we use the Wisconsin Breast Cancer Data, publicly available at the UCI Machine Learning Repository [11]. The construction of the dataset was motivated by the need to accurately diagnose breast masses solely on the basis of a fine needle aspiration (FNA). The dataset collects various features computed from a digitized image of an FNA of a breast mass, describing characteristics of the cell nuclei present in the image. A total of 30 nuclear features are computed for each of 569 samples, of which 357 are benign and 212 malignant. The dataset has been extensively used in the literature. The interested reader can refer to the UCI Machine Learning Repository documentation for information about the dataset's creation, the description of its attributes, and a list of relevant papers using or citing this dataset.

Here, we use one of the features, the worst radius (WR), as the test to diagnose malignant breast masses, and one of the remaining features, the worst concave point (WCP), as a covariate providing auxiliary information. Our aim is to estimate the ROC curve of the test WR. To mimic verification bias, a subset of the complete dataset is constructed. In this subset, the test WR and the covariate WCP are known for all samples, but the true status (benign or malignant) is available only for some samples, which we select according to the following mechanism: we select all samples having values of both WR and WCP above their respective medians; we do not select samples having values of both WR and WCP below their respective medians; we select each remaining sample with probability 0.95.
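As a rough sketch, the selection mechanism above can be coded as follows (a hypothetical helper, assuming WR and WCP are stored as NumPy arrays; samples tied with a median fall into the randomized branch):

```python
import numpy as np

def select_for_verification(wr, wcp, p_keep=0.95, seed=0):
    """Verification indicator V mimicking the selection mechanism:
    always verify units with both WR and WCP above their medians,
    never verify units with both below, and verify each remaining
    unit independently with probability p_keep."""
    rng = np.random.default_rng(seed)
    wr, wcp = np.asarray(wr, float), np.asarray(wcp, float)
    both_above = (wr > np.median(wr)) & (wcp > np.median(wcp))
    both_below = (wr < np.median(wr)) & (wcp < np.median(wcp))
    v = np.where(both_above, True,
                 np.where(both_below, False, rng.random(wr.shape) < p_keep))
    return v.astype(bool)
```

Applied to the full dataset, this reproduces the roughly 58% verification rate reported below.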

Figure 1: Illustrative example: estimated ROC curves of the test WR.

The obtained dataset shows a percentage of samples with true status known (verified) of about 58%. The percentage of benign samples is about 36% among verified samples and 99% among non-verified samples.

As the traditional methods (MSI, IPW, SPE) require parametric regression models for the conditional probability of a sample being malignant and/or being selected (i.e., having the true status known), we use a generalized linear model with probit link for the status given WR and WCP to estimate the conditional disease probabilities, and a logistic regression model with WR and WCP as predictors to estimate the conditional selection probabilities. Clearly, this last model is misspecified.

Figure 1 shows the estimated ROC curves of the test WR obtained with the IPW, MSI, SPE, 1NN and 3NN methods. Such curves are benchmarked against the estimated ROC curve obtained from the complete dataset by using the Full estimator of sensitivity and specificity. The plot shows that the estimators MSI, 1NN and 3NN behave well, whereas the estimators IPW and SPE are highly biased. This could imply that, in our data, the probit model is a good approximation of the disease process, whereas misspecification of the selection model strongly affects the estimators IPW and SPE. This is somewhat surprising as far as the doubly robust SPE estimator is concerned, especially taking into account the good behavior of the MSI estimator. It is worth noting, however, that the SPE estimator produces estimates that fall outside the interval (0, 1), and that its estimates of the specificity around 0.82 are not monotonically increasing in the cutpoint c, as shown in Table 4, which reports the estimates of sensitivity and specificity obtained at some values of c.

Table 4:

Illustrative example: estimates for the pair (sensitivity, specificity) of the test WR obtained by the various methods at different values of the cutpoint c.

| c | Full | MSI | IPW | SPE | 1NN | 3NN |
|---|------|-----|-----|-----|-----|-----|
| 14.569 | (0.976, 0.728) | (0.984, 0.730) | (0.987, 0.558) | (0.986, 0.781) | (0.986, 0.732) | (0.966, 0.728) |
| 14.710 | (0.976, 0.739) | (0.984, 0.741) | (0.987, 0.561) | (0.986, 0.793) | (0.986, 0.743) | (0.966, 0.739) |
| 14.851 | (0.976, 0.765) | (0.984, 0.766) | (0.987, 0.569) | (0.986, 0.819) | (0.986, 0.768) | (0.966, 0.765) |
| 14.993 | (0.967, 0.790) | (0.974, 0.791) | (0.877, 0.690) | (0.875, 0.775) | (0.976, 0.793) | (0.955, 0.789) |
| 15.134 | (0.958, 0.812) | (0.964, 0.813) | (0.868, 0.724) | (0.867, 0.799) | (0.962, 0.813) | (0.944, 0.811) |
| 15.275 | (0.953, 0.826) | (0.960, 0.827) | (0.864, 0.777) | (0.863, 0.814) | (0.957, 0.827) | (0.940, 0.825) |
| 15.417 | (0.948, 0.849) | (0.955, 0.849) | (0.860, 0.813) | (0.859, 0.838) | (0.953, 0.849) | (0.935, 0.847) |
| 15.558 | (0.934, 0.868) | (0.941, 0.869) | (0.847, 0.860) | (0.846, 0.859) | (0.938, 0.869) | (0.921, 0.867) |

6 Variance estimation

In this section, we describe an approach to obtain estimates of the variances of the estimators proposed in Section 3. Such estimates could be used to build confidence intervals and perform hypothesis testing.

Recall that $\hat\theta_1$ in (2) is the KNN imputation estimator of the disease prevalence $\theta_1=\Pr\{D=1\}$, and that $\hat\theta_2$ and $\hat\theta_3$ are the KNN imputation estimators of $\theta_2=\Pr\{T\geq c, D=1\}$ and $\theta_3=\Pr\{T\geq c, D=0\}$, respectively. Moreover, recall that $Y=(T,X)$.

From Section 3, $\hat\theta_1$ has asymptotic variance $\sigma_1^2=\theta_1(1-\theta_1)+\omega_1^2$, where

$$\omega_1^2 = E\!\left[\rho(Y)(1-\rho(Y))(1-\pi(Y))\right]\left(1+\frac{1}{K}\right) + E\!\left[\frac{\rho(Y)(1-\rho(Y))(1-\pi(Y))^2}{\pi(Y)}\right].$$

This result follows by an application of Theorem 1 in Ning and Cheng [9]. Note that, in the expression of $\sigma_1^2$, $\theta_1(1-\theta_1)$ is the variance of $D$, and the term $\rho(Y)(1-\rho(Y))$ is the conditional variance of $D$ given $Y$. Therefore, taking into account that $I(T\geq c)\rho(Y)(1-\rho(Y))$ is the conditional variance of $I(T\geq c, D=1)$ given $Y$, the asymptotic variance of $\hat\theta_2$ is given by $\sigma_2^2=\theta_2(1-\theta_2)+\omega_2^2$, where

$$\omega_2^2 = E\!\left[I(T\geq c)\rho(Y)(1-\rho(Y))(1-\pi(Y))\right]\left(1+\frac{1}{K}\right) + E\!\left[\frac{I(T\geq c)\rho(Y)(1-\rho(Y))(1-\pi(Y))^2}{\pi(Y)}\right].$$

Similarly, for the asymptotic variance of $\hat\theta_3$ one obtains $\sigma_3^2=\theta_3(1-\theta_3)+\omega_2^2$.

Define $\gamma_1=\Pr\{T<c, D=1\}$ and $\gamma_0=\Pr\{T<c, D=0\}$. Then, $\gamma_1=\theta_1-\theta_2$ and $\gamma_0=1-\theta_1-\theta_3$. Let

$$\hat\gamma_1=\frac{1}{n}\sum_{i=1}^{n} I(T_i<c)\left\{V_iD_i+(1-V_i)\hat\rho_{K,i}\right\}$$

and

$$\hat\gamma_0=\frac{1}{n}\sum_{i=1}^{n} I(T_i<c)\left\{V_i(1-D_i)+(1-V_i)(1-\hat\rho_{K,i})\right\}$$

be the KNN imputation estimators of $\gamma_1$ and $\gamma_0$, respectively. Let $\zeta_1^2$ and $\zeta_0^2$ denote the asymptotic variances of $\hat\gamma_1$ and $\hat\gamma_0$, respectively. The arguments given above still hold, leading to the expressions $\zeta_1^2=\gamma_1(1-\gamma_1)+\omega_3^2$ and $\zeta_0^2=\gamma_0(1-\gamma_0)+\omega_3^2$, where

$$\omega_3^2 = E\!\left[I(T<c)\rho(Y)(1-\rho(Y))(1-\pi(Y))\right]\left(1+\frac{1}{K}\right) + E\!\left[\frac{I(T<c)\rho(Y)(1-\rho(Y))(1-\pi(Y))^2}{\pi(Y)}\right].$$

It is easy to see that $\hat\gamma_1=\hat\theta_1-\hat\theta_2$ and $\hat\gamma_0=1-\hat\theta_1-\hat\theta_3$. As a consequence, the asymptotic covariances between $\hat\theta_1$ and $\hat\theta_2$ – say $\sigma_{12}$ – and between $\hat\theta_1$ and $\hat\theta_3$ – say $\sigma_{13}$ – may be obtained as $\sigma_{12}=(1/2)(\sigma_1^2+\sigma_2^2-\zeta_1^2)$ and $\sigma_{13}=(1/2)(\sigma_1^2+\sigma_3^2-\zeta_0^2)$.
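A minimal sketch of the imputation estimators $\hat\gamma_1$ and $\hat\gamma_0$ (assuming the imputed probabilities $\hat\rho_{K,i}$ have already been computed; variable names are illustrative):

```python
import numpy as np

def gamma_hats(t, v, d, rho_hat, c):
    """KNN imputation estimators of gamma_1 = Pr{T < c, D = 1} and
    gamma_0 = Pr{T < c, D = 0}: verified units contribute their observed
    D_i, non-verified units their imputed probability rho_hat_{K,i}.
    d may hold any placeholder value where v == 0."""
    t, v, d, rho_hat = (np.asarray(a, float) for a in (t, v, d, rho_hat))
    below = (t < c).astype(float)
    g1 = np.mean(below * (v * d + (1 - v) * rho_hat))
    g0 = np.mean(below * (v * (1 - d) + (1 - v) * (1 - rho_hat)))
    return g1, g0
```

By construction, $\hat\gamma_1+\hat\gamma_0$ equals the empirical frequency of the event $\{T_i<c\}$.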

Finally, recall that $\widehat{Se}_{KNN}(c)=\hat\theta_2/\hat\theta_1$ and $\widehat{Sp}_{KNN}(c)=(1-\hat\theta_1-\hat\theta_3)/(1-\hat\theta_1)$. Therefore, by applying the delta method, one obtains

$$\mathrm{asVar}(\widehat{Se}_{KNN}(c)) = \frac{\theta_2^2}{\theta_1^4}\,\sigma_1^2 + \frac{\sigma_2^2}{\theta_1^2} - \frac{2\theta_2}{\theta_1^3}\,\sigma_{12}$$

and

$$\mathrm{asVar}(\widehat{Sp}_{KNN}(c)) = \frac{\theta_3^2}{(1-\theta_1)^4}\,\sigma_1^2 + \frac{\sigma_3^2}{(1-\theta_1)^2} - \frac{2\theta_3}{(1-\theta_1)^3}\,\sigma_{13}.$$

To obtain consistent estimates of the asymptotic variances given above, we may replace the unknown quantities in their expressions by the corresponding estimates. In particular, to estimate $\mathrm{asVar}(\widehat{Se}_{KNN}(c))$ and $\mathrm{asVar}(\widehat{Sp}_{KNN}(c))$, we ultimately need the estimates $\hat\theta_1$, $\hat\theta_2$, $\hat\theta_3$, $\hat\omega_1^2$, $\hat\omega_2^2$ and $\hat\omega_3^2$.
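As an illustration, the delta-method formulas can be assembled from plug-in estimates as follows (a sketch; all arguments are the scalar estimates defined above, and the covariance terms are computed from the identities for $\sigma_{12}$ and $\sigma_{13}$):

```python
def asvar_sensitivity(th1, th2, s1, s2, z1):
    """Plug-in delta-method estimate of the asymptotic variance of
    Se_hat = th2 / th1, where s1, s2 estimate sigma_1^2, sigma_2^2 and
    z1 estimates zeta_1^2, the asymptotic variance of
    gamma_hat_1 = theta_hat_1 - theta_hat_2."""
    s12 = 0.5 * (s1 + s2 - z1)                    # covariance term sigma_12
    return (th2**2 / th1**4) * s1 + s2 / th1**2 - (2 * th2 / th1**3) * s12

def asvar_specificity(th1, th3, s1, s3, z0):
    """Analogous estimate for Sp_hat = (1 - th1 - th3) / (1 - th1)."""
    s13 = 0.5 * (s1 + s3 - z0)                    # covariance term sigma_13
    q = 1 - th1
    return (th3**2 / q**4) * s1 + s3 / q**2 - (2 * th3 / q**3) * s13
```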

In a nonparametric regression imputation framework, quantities such as $\omega_1^2$, $\omega_2^2$ and $\omega_3^2$ are typically estimated by their empirical counterparts. The propensity score $\pi(y)$ is generally estimated by some kernel regression method (see Cheng [12]). In our context, however, we propose an approach that uses a nearest-neighbor rule to estimate both the functions $\rho(y)$ and $\pi(y)$ in $\omega_1^2$, $\omega_2^2$ and $\omega_3^2$. In particular, for the conditional probabilities of disease we can use the estimates $\tilde\rho_i=\hat\rho_{\bar K,i}$, for some suitable positive integer $\bar K$. For the conditional probabilities of verification, instead, we choose the estimates $\tilde\pi_i=\frac{1}{K_i}\sum_{j=1}^{K_i}V_{i(j)}$, where $\{(Y_{i(j)},V_{i(j)}),\,j=1,\ldots,K_i\}$ is a set of $K_i$ observed pairs, and $Y_{i(j)}$ denotes the $j$th nearest neighbor to $Y_i=(T_i,X_i)$ among all $Y$'s. When $\tilde\pi_i$ is computed for a non-verified sample unit $i$, $K_i$ is set equal to the rank of the first verified nearest neighbor of unit $i$, i.e., $K_i$ is such that $V_i=V_{i(1)}=V_{i(2)}=\cdots=V_{i(K_i-1)}=0$ and $V_{i(K_i)}=1$. When $\tilde\pi_i$ is computed for a verified sample unit $i$, $K_i$ is set equal to the rank of the first non-verified nearest neighbor of unit $i$, i.e., $K_i$ is such that $V_i=V_{i(1)}=V_{i(2)}=\cdots=V_{i(K_i-1)}=1$ and $V_{i(K_i)}=0$. Observe that such a procedure automatically avoids zero values for the $\tilde\pi_i$'s. Then, based on the $\tilde\rho_i$'s and the $\tilde\pi_i$'s, we obtain the estimates

$$\hat\omega_1^2=\frac{K+1}{nK}\sum_{i=1}^{n}\tilde\rho_i(1-\tilde\rho_i)(1-\tilde\pi_i)+\frac{1}{n}\sum_{i=1}^{n}\frac{\tilde\rho_i(1-\tilde\rho_i)(1-\tilde\pi_i)^2}{\tilde\pi_i},$$

$$\hat\omega_2^2=\frac{K+1}{nK}\sum_{i=1}^{n}I(T_i\geq c)\,\tilde\rho_i(1-\tilde\rho_i)(1-\tilde\pi_i)+\frac{1}{n}\sum_{i=1}^{n}\frac{I(T_i\geq c)\,\tilde\rho_i(1-\tilde\rho_i)(1-\tilde\pi_i)^2}{\tilde\pi_i}$$

and

$$\hat\omega_3^2=\frac{K+1}{nK}\sum_{i=1}^{n}I(T_i<c)\,\tilde\rho_i(1-\tilde\rho_i)(1-\tilde\pi_i)+\frac{1}{n}\sum_{i=1}^{n}\frac{I(T_i<c)\,\tilde\rho_i(1-\tilde\rho_i)(1-\tilde\pi_i)^2}{\tilde\pi_i},$$

from which, together with $\hat\theta_1$, $\hat\theta_2$ and $\hat\theta_3$, one derives the estimates of the variances of the KNN imputation estimators proposed in the paper. Clearly, to prevent $\hat\omega_1^2$, $\hat\omega_2^2$ and $\hat\omega_3^2$ from being equal to zero, we need to choose $\bar K>1$ in estimating the conditional probabilities of disease.
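The nearest-neighbor rule for the $\tilde\pi_i$'s and the resulting estimate $\hat\omega_1^2$ might be sketched as follows (an illustrative implementation, assuming Euclidean distance on $Y=(T,X)$ and that both verified and non-verified units are present in the sample):

```python
import numpy as np

def pi_tilde(y, v):
    """Nearest-neighbor estimates of the verification probabilities pi(Y_i):
    for each unit i, K_i is the rank of the first neighbor whose
    verification status differs from V_i, and pi_tilde_i is the fraction
    of verified units among the K_i nearest neighbors (never exactly zero
    for non-verified units)."""
    y, v = np.asarray(y, float), np.asarray(v)
    dist = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # a unit is not its own neighbor
    pt = np.empty(len(v))
    for i in range(len(v)):
        order = np.argsort(dist[i])                # neighbors by increasing distance
        k = int(np.argmax(v[order] != v[i])) + 1   # rank of first opposite status
        pt[i] = v[order[:k]].mean()                # fraction verified among K_i nearest
    return pt

def omega1_hat_sq(rho_t, pi_t, K):
    """Plug-in estimate of omega_1^2 based on rho_tilde and pi_tilde."""
    a = rho_t * (1 - rho_t)
    return ((K + 1) / K) * np.mean(a * (1 - pi_t)) \
        + np.mean(a * (1 - pi_t) ** 2 / pi_t)
```

The analogues $\hat\omega_2^2$ and $\hat\omega_3^2$ follow by multiplying each summand by the indicator $I(T_i\geq c)$ or $I(T_i<c)$.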

Table 5:

For each parameter: relative biases, computed as (MCV − MCM)/MCV, of the estimators of the asymptotic variance of the KNN estimators.

| α | c | Estimator | θ1 | θ2 | θ3 | γ1 | γ0 | Se(c) | Sp(c) |
|---|---|-----------|----|----|----|----|----|-------|-------|
| 0.5 | 0.2 | 1NN | 0.092 | 0.027 | –0.011 | 0.126 | 0.049 | 0.125 | –0.011 |
| | | 3NN | 0.081 | 0.000 | –0.030 | 0.077 | 0.036 | 0.067 | –0.034 |
| | 0.5 | 1NN | 0.075 | –0.038 | 0.030 | 0.131 | 0.069 | 0.071 | 0.036 |
| | | 3NN | 0.066 | –0.055 | 0.000 | 0.104 | 0.053 | 0.046 | 0.006 |
| | 0.8 | 1NN | 0.126 | 0.022 | –0.026 | 0.134 | 0.125 | 0.067 | –0.030 |
| | | 3NN | 0.118 | 0.000 | –0.086 | 0.112 | 0.116 | 0.047 | –0.063 |
| 1 | 0.2 | 1NN | 0.048 | 0.024 | 0.027 | 0.087 | 0.034 | 0.098 | 0.031 |
| | | 3NN | 0.045 | 0.020 | 0.006 | 0.000 | 0.027 | 0.022 | 0.011 |
| | 0.5 | 1NN | 0.022 | –0.011 | –0.056 | 0.060 | 0.024 | 0.082 | –0.050 |
| | | 3NN | 0.032 | –0.017 | –0.078 | 0.022 | 0.020 | 0.040 | –0.070 |
| | 0.8 | 1NN | 0.022 | –0.013 | –0.017 | 0.067 | 0.004 | 0.073 | –0.020 |
| | | 3NN | 0.023 | –0.019 | –0.056 | 0.048 | 0.004 | 0.049 | –0.043 |
| 1.5 | 0.2 | 1NN | 0.023 | 0.014 | 0.022 | –0.250 | 0.016 | –0.067 | 0.021 |
| | | 3NN | 0.028 | 0.015 | 0.006 | –0.250 | 0.020 | –0.170 | 0.007 |
| | 0.5 | 1NN | 0.023 | 0.025 | –0.008 | 0.000 | –0.013 | 0.031 | –0.014 |
| | | 3NN | 0.024 | 0.020 | –0.016 | –0.083 | –0.013 | –0.034 | –0.024 |
| | 0.8 | 1NN | 0.050 | 0.021 | –0.012 | 0.000 | 0.021 | 0.014 | –0.014 |
| | | 3NN | 0.047 | 0.016 | –0.025 | –0.036 | 0.017 | –0.010 | –0.030 |

To assess the behavior of the discussed variance estimators, we performed some simulation experiments. The results are given in Table 5. For the parameters θ1, θ2, θ3, γ1, γ0, Se(c) and Sp(c), the table shows the relative biases, computed as (MCV − MCM)/MCV, where MCV is the Monte Carlo variance (multiplied by the sample size n, so as to obtain the asymptotic Monte Carlo variance) of the 1NN and 3NN estimators, and MCM is the Monte Carlo mean of the corresponding estimators of the asymptotic variances. The considered variance estimators are those discussed in this section. For each variance estimator, the involved estimates of parameters such as θ1, θ2 and θ3 are based on the same nearest-neighbor rule (1NN or 3NN) used to estimate the parameter of interest. For the estimates of the probabilities of disease in $\omega_1^2$, $\omega_2^2$ and $\omega_3^2$, we chose $\bar K=2$. The simulation setting is the same as in scenario (i) in Section 4, with some values for the pair (α, c) and sample size n = 100. The number of replicates in each simulation experiment is 5,000. Some other simulation results, referring to scenario (ii), can be found in Section S3, Supplementary Material.

In summary, the results in Table 5 (and in Section S3, Supplementary Material) seem to indicate that the proposed variance estimators behave satisfactorily. Of course, other variance estimators could be devised; for example, one could, at least in principle, resort to resampling strategies. Naturally, this requires further investigation.

7 Discussion

This paper considers the estimation of the ROC curve of a continuous test under verification bias. Existing methods for correcting verification bias require estimation of ρ(y) or π(y), or both, and parametric models are commonly used to this end. However, as shown also by the simulation results presented in Section 4, a wrong specification of these models can have an adverse impact on the performance of the estimators, resulting in high bias and/or unstable behavior.

The new estimators of sensitivity and specificity (6) are fully nonparametric. Their use reduces the effects of possible misspecification on the inferential results. The loss of efficiency with respect to parametric competitors (when these can be reasonably employed) can range from minimal to appreciable, depending on the nature of the problem at hand, as the simulation results in the main paper and in Supplementary Material, Section S4 show. This is somewhat intrinsic in the nonparametric nature of the proposed estimators.

The new approach is based on K-nearest-neighbor imputation, which requires the choice of a value for K. Our simulation results (see also the Supplementary Material) seem to confirm the findings in Ning and Cheng [9], according to which a small value of K, within the range 1–3, may be a good choice. It is worth noting, however, that the choice of K might depend on the dimension of the feature space. In our study, the feature space includes the diagnostic test result T and an auxiliary covariate X of dimension 1 (and 3; see Section S4, Supplementary Material). A small number of features is quite common in the context of the evaluation of diagnostic tests. However, if the number of features increases, it could be convenient to consider higher values of K. Of course, the nonparametric nature of the approach requires taking into account the number of verified units available in the sample, both in the healthy and in the diseased group. In particular, this means that K should not be too large compared to the number of verified subjects, $n_{ver}$ say. Generally speaking, a possible strategy for choosing a suitable value of K in practice is cross-validation, based on applying the KNN estimators to the verified subjects only. Each verified subject is treated in turn as if it were not verified; for fixed K, the estimate $\hat\rho_{K,i}$ of its conditional disease probability is computed by KNN imputation and compared to the truth to produce a measure of discrepancy. This is done for a number of values of K, and the K with the smallest discrepancy is retained for use in the original sample. A possible choice for the discrepancy in this context is $\frac{1}{n_{ver}}\sum_{i=1}^{n_{ver}}\left|D_i-\hat\rho_{K,i}\right|$.
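The cross-validation strategy just described might be sketched as follows (a hypothetical helper, assuming Euclidean distance; `y_ver` and `d_ver` collect the features and disease status of the verified subjects only):

```python
import numpy as np

def choose_K(y_ver, d_ver, candidates=(1, 2, 3, 4, 5)):
    """Leave-one-out cross-validation over the verified units: each unit
    is treated as non-verified in turn, its disease probability is imputed
    as the mean D of its K nearest verified neighbors, and the K with the
    smallest mean absolute discrepancy |D_i - rho_hat_{K,i}| is returned."""
    y_ver, d_ver = np.asarray(y_ver, float), np.asarray(d_ver, float)
    dist = np.linalg.norm(y_ver[:, None, :] - y_ver[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # leave the unit itself out
    order = np.argsort(dist, axis=1)
    best_K, best_err = None, np.inf
    for K in candidates:
        rho_hat = d_ver[order[:, :K]].mean(axis=1)   # KNN imputation
        err = np.abs(d_ver - rho_hat).mean()         # discrepancy measure
        if err < best_err:
            best_K, best_err = K, err
    return best_K
```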

The choice of the distance measure is of a more general nature. Our simulation results (see Supplementary Material) seem to indicate that the standard Euclidean distance may be a good choice. However, it is clear that an adequate choice ultimately depends on several aspects, such as the features of the data to be analyzed, as well as computational concerns.

Estimators (6) modify in an obvious way when no covariates are measured, i.e., when Y = T. Moreover, a simple extension is possible when categorical variables are also observed for each patient. Consider, for example, the problem of estimating the sensitivity. Without loss of generality, suppose that a single factor U, with u levels, is observed together with Y. We also assume that U may be associated with both D and V. Then, if Theorem 1 holds in each stratum, i.e., in each group of units with the same level of U, a consistent and asymptotically normally distributed estimator of the sensitivity Se(c) is

$$\frac{1}{n}\sum_{j=1}^{u} n_j\,\widehat{Se}^{\,cond}_{KNN,j}(c),$$

where $n_j$ denotes the size of the $j$th sample stratum and $\widehat{Se}^{\,cond}_{KNN,j}(c)$ is the KNN estimator of the conditional sensitivity, i.e., $\widehat{Se}_{KNN}(c)$ obtained from the patients in the $j$th stratum. Clearly, the use of such an estimator relies on the availability of sufficient information in each stratum.
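For instance, with hypothetical per-stratum estimates already in hand, the weighted combination above reads:

```python
def se_stratified(se_cond, n_strata):
    """Combine conditional KNN sensitivity estimates across the u strata,
    weighting each by its stratum size n_j (se_cond and n_strata are
    illustrative per-stratum inputs)."""
    n = sum(n_strata)
    return sum(s * nj for s, nj in zip(se_cond, n_strata)) / n
```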

As suggested by a Referee, one could think of possible devices aimed at enhancing the performance of the estimators. One possibility could be to assign unequal weights to the K nearest neighbors entering the estimates $\hat\rho_{K,i}$. This might reduce the mean square error of the KNN estimators of sensitivity and specificity. Alternatively, “hybrid” estimators for the sensitivity and specificity could be obtained by combining the SPE estimator with the KNN strategy. This could lead, at least in principle, to partially parametric estimators that are robust with respect to possible weaknesses of the (nonparametric) model chosen for ρ(y) when, for example, the disease process also depends on unobserved auxiliary variables. These are interesting and intriguing topics whose development, however, requires non-trivial treatment, both from a theoretical and an empirical perspective.

Acknowledgement

The contribution of Stefano Mussi in producing some simulation results in Supplementary Material is gratefully acknowledged.

References

1. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983;39:207–15. doi:10.2307/2530820

2. Begg CB. Biases in the assessment of diagnostic tests. Statist Med 1987;6:411–23. doi:10.1002/sim.4780060402

3. Zhou X-H. Correcting for verification bias in studies of a diagnostic test's accuracy. Stat Meth Med Res 1998;7:337–53. doi:10.1191/096228098676485370

4. Zhou X-H, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York: Wiley-Interscience, 2002.

5. Alonzo TA, Pepe MS. Assessing accuracy of a continuous screening test in the presence of verification bias. J R Stat Soc Ser C 2005;54:173–90. doi:10.1111/j.1467-9876.2005.00477.x

6. Fluss R, Reiser B, Faraggi D, Rotnitzky A. Estimation of the ROC curve under verification bias. Biometrical J 2009;51:475–90. doi:10.1002/bimj.200800128

7. Liu D, Zhou XH. A model for adjusting for nonignorable verification bias in estimation of the ROC curve and its area with likelihood-based approach. Biometrics 2010;66:1119–28. doi:10.1111/j.1541-0420.2010.01397.x

8. He H, McDermott MP. A robust method using propensity score stratification for correcting verification bias for binary tests. Biostatistics 2012;13:32–47. doi:10.1093/biostatistics/kxr020

9. Ning J, Cheng PE. A comparison study of nonparametric imputation methods. Statist Comput 2012;22:273–85. doi:10.1007/s11222-010-9223-y

10. Alonzo TA, Pepe MS. Estimating disease prevalence in two-phase studies. Biostatistics 2003;4:313–23. doi:10.1093/biostatistics/4.2.313

11. Asuncion A, Newman DJ. UCI machine learning repository. Irvine, CA: School of Information and Computer Science, University of California, 2007. Available at: http://www.ics.uci.edu/mlearn/MLRepository.html

12. Cheng PE. Nonparametric estimation of mean functionals with data missing at random. J Am Stat Assoc 1994;89:81–7. doi:10.1080/01621459.1994.10476448


Supplemental Material

The online version of this article (DOI: 10.1515/ijb-2014-0014) offers supplementary material, available to authorized users.


Published Online: 2015-3-13
Published in Print: 2015-5-1

© 2015 Walter de Gruyter GmbH, Berlin/Boston
