
The Orthogonally Partitioned EM Algorithm: Extending the EM Algorithm for Algorithmic Stability and Bias Correction Due to Imperfect Data

Michael D. Regier and Erica E. M. Moodie
Published/Copyright: May 26, 2016

Abstract

We propose an extension of the EM algorithm that exploits the common assumption of unique parameterization, corrects for biases due to missing data and measurement error, converges for the specified model when a standard implementation of the EM algorithm has a low probability of convergence, and reduces a potentially complex algorithm into a sequence of smaller, simpler, self-contained EM algorithms. We use the theory surrounding the EM algorithm to derive the theoretical results of our proposal, showing that an optimal solution over the parameter space is obtained. A simulation study is used to explore the finite sample properties of the proposed extension when there are missing data and measurement error. We observe that partitioning the EM algorithm into simpler steps may provide better bias reduction in the estimation of model parameters. The ability to break down a complicated problem into a series of simpler, more accessible problems will permit a broader implementation of the EM algorithm, permit the use of software packages that now implement and/or automate the EM algorithm, and make the EM algorithm more accessible to a wider and more general audience.

1 Introduction

The Expectation-Maximization (EM) algorithm, introduced by Dempster, Laird, and Rubin [1], is a well-known and flexible method for obtaining a maximum likelihood estimate (MLE) when data are incomplete. It is a two-step iterative algorithm: the first step estimates the expectation of the complete data log-likelihood given the observed data and the current parameter estimates; the second step maximizes this expected log-likelihood, giving new values of the parameter estimates. The algorithm iterates through the Expectation step (E-step) and the Maximization step (M-step) until the sequence of parameter estimates converges to a stationary point [1, 2].

For any situation in which maximum likelihood estimation is desirable and feasible for fully observed data, estimation in the analogous incomplete data situation can be accomplished with the EM algorithm. It is this ability to associate an incomplete data problem with a complete data problem that makes it an attractive approach for missing data analysis. Unfortunately the EM algorithm has fundamental drawbacks. For example, convergence may be slow and the covariance matrix is not an instant algorithmic by-product [2]. Fortunately, there are proposals for speeding up EM convergence, such as the Parameter-Expanded EM algorithm [3], and for obtaining the covariance matrix [4, 5].

The EM algorithm was originally formulated as an approach to obtain MLEs when there are missing data, but it has also been applied to analyses where there are mismeasured variables. Dawid and Skene [6] used the EM algorithm to correct misclassification when the true value was never observed, but multiple measures for each subject were obtained. In this situation, the multiple measurements permit the estimation of the misclassification rates, allowing for the estimation of the underlying true population proportion. Drews et al. [7] consider the situation where the exposure is misclassified not by one classification scheme, but by two schemes (patient interviews and medical records). They use the EM algorithm to obtain corrected odds ratio estimates in this situation. Ganguli et al. [8] proposed a variation of the Monte Carlo EM (MCEM) algorithm specific to structural nonparametric regression problems with measurement error in covariates for additive models; it is referred to as the Ganguli, Staudenmayer, Wand EM.

Despite the transportation of the EM approach from missing data problems to the problem of measurement error, and the wealth of literature about the EM algorithm for missing data problems, there has been limited application of the EM algorithm to measurement error problems. Gustafson [9] comments that the lack of a closed form solution of the E-step for many problems, the requirement of an approximate solution, and the additional work required for covariance estimation has contributed to the scarceness of EM approaches as a proposed solution when mismeasured variables are present. This is echoed by Buonaccorsi [10] who comments on the unwieldy nature of the EM algorithm for the calibration adjustment of measurement error in the response of a linear model.

In many situations, there are alternate approaches that can be used. For missing data, weighting [11] or multiple imputation methods [12–14] often are viable options. For measurement error, alternative methods include simulation-extrapolation [15–17], instrumental variables [18], calibration [19], and imputation [20, 21]. Our interest in the likelihood approach is that it is a method of correction for both missing data and measurement error problems. Thus, there is considerable appeal to this approach, which can handle both aspects of “imperfect data” in the same model, within a unified methodological framework.

We consider the situation where the EM algorithm is a viable and potentially attractive approach for obtaining corrected parameter estimates for imperfect data. Of particular interest (Section 2.2) is the situation where there are missing responses, mismeasured continuous explanatory variables, no closed form solution for the E-step, and the algorithm has an unacceptably low probability of convergence. We propose a new variation of the EM algorithm, inspired by the expectation-conditional maximization (ECM) of Meng and Rubin [22] and the use of tangent spaces by Tsiatis [23], that maintains the ability to correct for both missing data and measurement error while overcoming non-convergence. This makes it a practical option for applied statistical research. With the emergence of software packages that now implement and/or automate the EM algorithm such as EMCluster [24], our proposal may permit the direct use of these new packages for increasingly complex problems as well as making the algorithm accessible to a wider audience.

In Section 2, we present the EM algorithm, motivating problem and imperfect data. In Section 3, we present the proposed variation of the EM algorithm and the general theory for maximization using a sequence of self-contained, smaller and simpler EM algorithms. We use a simulation study to consider the finite sample performance of the proposed variation of the EM algorithm in Section 4. Section 5 discusses possible future applications of the proposed method.

2 The EM algorithm, imperfect data and the motivating problem

2.1 The EM algorithm

When there are imperfect data (i.e. missing and mismeasured data), we can obtain MLEs by maximizing the log-likelihood of the observed portion of the data. The log-likelihood associated with the observed data, ℓo, is

$\ell_o(\theta_Y \mid Y_o = y_o) = \log \int_{\mathcal{Y}_m} \prod_{i=1}^{n} f(y_i \mid \theta_Y)\, dy_m,$

where Yi denotes the vector of variables intended for collection from the ith observational unit, f(Yi|θY) is the pdf of Yi parameterized by θY, and the variable Yi={Yi,o,Yi,m} such that Yi,o refers to the elements of Yi that are observable and Yi,m refers to the elements of Yi that are missing. We let yi={yi,o,yi,m} denote the realization of Yi={Yi,o,Yi,m}, where yi,o denotes the subset of observed variables and yi,m denotes the subset of variables for which realizations are not observable [25, 26]. Finally, 𝒴m is the space associated with ym. Without a closed form solution, the maximization of ℓo(θY|Yo=yo) can be cumbersome.

The EM algorithm provides an indirect approach for obtaining the MLE for θY. The process begins by obtaining an estimate of θY using only the completely observed portion of the data, denoted θY(0). Given this initial estimate of the parameters, the E-step finds the expectation of ℓ(θY|y) given the observed data, yo, and the current estimate of θY, θY(0). The M-step maximizes E[ℓ(θY|Y)|yo;θY(0)] from the E-step, yielding an updated estimate of the parameter, θY(1). The E- and M-steps are repeated with θY(1) used in place of θY(0).

The algorithm iterates through E- and M-steps; the E-step for the (t+1)th iteration is defined as

(1) $Q\big(\theta_Y \mid \theta_Y^{(t)}\big) = E\big[\ell(\theta_Y \mid Y) \mid y_o; \theta_Y^{(t)}\big],$

where θY(t) is the current parameter estimate obtained from the tth iteration. The M-step requires that we select the updated parameter, θY(t+1), that maximizes the Q-function (Equation (1)) over the parameter space, ΘY, such that

(2) $Q\big(\theta_Y^{(t+1)} \mid \theta_Y^{(t)}\big) \geq Q\big(\theta_Y \mid \theta_Y^{(t)}\big).$

Under the regularity conditions of Dempster et al. [1] and Wu [27], the likelihood, L(·), increases with each iteration

$L\big(\theta_Y^{(t+1)}\big) > L\big(\theta_Y^{(t)}\big),$

requiring only that Equation (1) be continuous in θY. This is a weak condition that holds in many practical situations, including for the exponential family [2].

The E- and M-steps are repeated until the likelihood changes between sequential steps by less than an arbitrarily small amount, δ,

$\big|L\big(\theta_Y^{(t+1)}\big) - L\big(\theta_Y^{(t)}\big)\big| < \delta.$

Pragmatically, the stopping criterion

(3) $\big\|\theta_Y^{(t+1)} - \theta_Y^{(t)}\big\| < \delta$

is used, probably motivated by Wu [27] who showed

$\big\|\theta_Y^{(t+1)} - \theta_Y^{(t)}\big\| \to 0$

as t → ∞, if the limit points are in a compact, connected subset of all stationary points.
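
To make the iteration concrete, the following sketch (in R, the language also used for the simulations in Section 4) runs the E- and M-steps for a two-component Gaussian mixture with known unit variances and applies a stopping rule of the form of Equation (3). This toy model and all variable names are ours, for illustration only; it is not one of the models considered later in the paper.

```r
## Illustration only: EM for a two-component Gaussian mixture with known unit
## variances, showing the E-step, M-step, and a stopping rule as in Equation (3).
set.seed(1)
y <- c(rnorm(150, mean = 0), rnorm(100, mean = 3))   # "observed" data
theta <- c(pi2 = 0.5, mu1 = -1, mu2 = 1)             # initial estimates, theta^(0)
delta <- 1e-6

for (t in 1:500) {
  ## E-step: expected component membership given y and the current estimates
  d1 <- (1 - theta["pi2"]) * dnorm(y, theta["mu1"], 1)
  d2 <- theta["pi2"]       * dnorm(y, theta["mu2"], 1)
  w  <- d2 / (d1 + d2)

  ## M-step: maximize the expected complete-data log-likelihood
  theta_new <- c(pi2 = mean(w),
                 mu1 = sum((1 - w) * y) / sum(1 - w),
                 mu2 = sum(w * y) / sum(w))

  ## Stopping rule: ||theta^(t+1) - theta^(t)|| < delta
  if (sqrt(sum((theta_new - theta)^2)) < delta) { theta <- theta_new; break }
  theta <- theta_new
}
theta
```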

The generalized EM algorithm (GEM) relaxes the M-step condition. It requires that Equation (2) holds, permitting the selection of θY(t+1) such that the Q-function increases. This avoids maximizing over all θYΘY, while ensuring that the likelihood does not decrease after each iteration [1]. The limit point of the sequence of estimated parameters is a stationary point of the likelihood, thus providing a local maximum for θY.

Three relevant variations of the EM algorithm are the Monte Carlo EM (MCEM) [28], expectation-conditional maximization (ECM) of Meng and Rubin [22], and the multi-cycle ECM algorithms [22]. The MCEM algorithm replaces a complicated and analytically intractable expectation in the E-step by approximating the integral with a finite sum,

(4) $Q_m\big(\theta_Y \mid \theta_Y^{(t)}\big) = m^{-1}\sum_{r=1}^{m} \ell\big(\theta_Y \mid y_o, y_{m,r}\big),$

where m denotes the Monte Carlo sample size. In the (t+1)th E-step, the missing data, Ym, is simulated from the current conditional distribution f(Ym|yo;θY(t)) [28]. For each iteration, the ECM has a single E-step followed by a sequence of conditional maximization steps, CM-steps. The constraint function for the CM-steps is defined such that maximization occurs over the entire parameter space [22]. The multi-cycle ECM extends the ECM algorithm. For each iteration, the multi-cycle ECM permits an E-step before a few select CM-steps. The extra computational effort within each iteration results in larger increases in the likelihood, but does introduce the potential for the algorithm to not be a GEM [22].
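
As a hedged illustration of Equation (4), the sketch below applies an MCEM update to a deliberately simple problem, estimating a normal mean when some observations are missing completely at random, so that the Monte Carlo E-step and the resulting M-step maximizer are easy to verify by hand. The model and all names are ours and are not taken from the paper.

```r
## Illustration only: MCEM for the mean of N(mu, 1) data with MCAR missingness.
## The E-step integral is replaced by the Monte Carlo average of Equation (4),
## using m draws of the missing values from f(Y_m | y_o; mu^(t)).
set.seed(2)
y      <- rnorm(200, mean = 2, sd = 1)
miss   <- rbinom(200, 1, 0.3) == 1          # MCAR missingness indicator
y_obs  <- y[!miss]
n_mis  <- sum(miss)

mu <- mean(y_obs) - 1                        # a deliberately poor starting value
m  <- 2000                                   # Monte Carlo sample size
for (t in 1:100) {
  ## Monte Carlo E-step: simulate the missing data from the current conditional
  y_mis_draws <- matrix(rnorm(m * n_mis, mean = mu, sd = 1), nrow = m)

  ## M-step: maximize Q_m; for this toy model the maximizer averages the observed
  ## values together with the Monte Carlo means of the simulated missing values
  mu_new <- (sum(y_obs) + sum(colMeans(y_mis_draws))) / (length(y_obs) + n_mis)

  ## Loose tolerance because Monte Carlo noise keeps successive values from
  ## matching exactly (the non-monotonicity issue noted for the MCEM)
  if (abs(mu_new - mu) < 5e-3) { mu <- mu_new; break }
  mu <- mu_new
}
mu
```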

2.2 Imperfect data and the motivating problem

For subjects i=1,…,n, let (Xi, Xi*) denote the imperfect data variables, where Xi* denotes the error-prone measurement of Xi. If there is only missing data then (Xi, Xi*) = ({Xi,o, Xi,m}, ∅), so that Xi* = ∅. On the other hand, if there is only measurement error then (Xi, Xi*) = (Xi,o, Xi,o*), so that Xi,m = Xi,m* = ∅. Finally, if there is both missing data and measurement error, then (Xi, Xi*) = (Xi,o, Xi,m, Xi,o*, Xi,m*). Realizations of measurement error and missing data have similarities; for example, both have at least one variable for which direct observation may not be possible. For measurement error, this is the true value, xi, and for missing data, this is xi,m.

In what follows, we assume that Xi is only measured with error, (Xi, Xi*) = (Xi,o, Xi,o*), and that the measurement errors of all subjects follow the same distribution under a classical, unbiased, additive, nondifferential measurement error model. The classical, unbiased, additive, nondifferential measurement error model for the ith subject and a random variable, X, is defined as Xi,o* = Xi,o + εi, where εi ∼ N(0, θM) for i=1,…,n and where θM parameterizes the distribution of εi [29]. We observe xi,o* and not xi,o for each of the n units of interest.

The measurement error is unbiased, E(X*|x) = x, and nondifferential (surrogate), defined by the assumption that X*|x,y = X*|x or, equivalently, that X* and Y are independent given x [10, 29]. This suggests that a surrogate is the measured substitute for the desired variable, but it gives no information beyond that which would have been given by the true variable. When there is more than one mismeasured variable in the model, we add the further assumption that the associated errors are independent; that is, for mismeasured variables Xj,o* and Xk,o* we assume that εij ⊥ εik. A measurement error indicator, CiM ∈ {0,1} with CiM = 0 denoting observation of the true value, may be defined, with the probability of measurement error specified in a manner analogous to the following definition of the probability of missingness.
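
The following short sketch generates data from the classical unbiased, additive, nondifferential measurement error model just described; the sample size, error variance, and variable names are placeholders of ours.

```r
## Illustration only: classical unbiased additive measurement error,
## X* = X + e with e ~ N(0, theta_M), generated independently of the outcome
## given the true X (nondifferential), so X* carries no extra information.
set.seed(3)
n       <- 500
theta_M <- 0.3                                        # measurement error variance
x       <- rnorm(n, mean = 0, sd = 1)                 # true, unobserved exposure
x_star  <- x + rnorm(n, mean = 0, sd = sqrt(theta_M)) # observed surrogate

## Unbiasedness, E(X* | x) = x: regressing x_star on x should give slope near 1
coef(lm(x_star ~ x))
```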

We let Yi denote the imperfect response and assume that the response is subject to missing data, Yi=(Yi,o,Yi,m). Rubin [30] identified three basic types of missing data mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). For the ith subject, define the missing data indicator as Ci ∈ {0,1} with Ci=0 defining observation. MCAR occurs when the probability of not observing a response depends on neither the observed nor the unobserved data, f(Ci|Yi=yi,θC)=f(Ci|θC), where θC parameterizes the missing data mechanism. MAR occurs when the probability of an observation being missing is independent of the unobserved values given the observed values, f(Ci|Yi=yi,θC)=f(Ci|Yi,o=yi,o,θC), where yi,o is the observed value of yi. An MNAR missing data mechanism occurs when the probability of an observation being missing depends on both the observed and unobserved values, f(Ci|Yi=yi,θC).
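
The distinction between the three mechanisms lies in what the probability of missingness is allowed to depend on. The small sketch below writes each mechanism as a logistic model for P(Ci = 1); the coefficients are arbitrary values of ours, used only to illustrate the definitions.

```r
## Illustration only: MCAR, MAR, and MNAR mechanisms for a response y that may
## be missing, with a fully observed covariate z. C = 1 flags a missing response.
p_mcar <- function(y, z) plogis(-1)                     # depends on nothing
p_mar  <- function(y, z) plogis(-1 + 0.8 * z)           # depends on observed z only
p_mnar <- function(y, z) plogis(-1 + 0.8 * z + 1.2 * y) # depends on the response itself

set.seed(4)
z <- rnorm(200)
y <- rnorm(200, mean = z)
C <- rbinom(200, 1, p_mar(y, z))        # generate MAR missingness
y_observed <- ifelse(C == 1, NA, y)     # what the analyst actually sees
```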

Assuming unique parameterization of the conditional distributions [9, 26] while restricting our exposition to classical unbiased nondifferential measurement error, MCAR and MAR missing data mechanisms, the joint distribution may be expressed as

(5) $f(Y_i, C_i, X_i^{*}, X_i, Z_i \mid \theta) = f(Y_i \mid x_i, z_i; \theta_Y) \times f(C_i \mid x_i, z_i; \theta_C) \times f(X_i^{*} \mid x_i; \theta_M) \times f(X_i \mid z_i; \theta_X) \times f(Z_i \mid \theta_Z),$

where Zi denotes the complete variables (i.e. no missing data and no measurement error), and θ={θY,θC,θM,θX,θZ}.

The joint distribution (Equation (5)) has been decomposed into the product of four conditional distributions and the marginal distribution of the perfect covariates Z. The outcome model, f(Yi|xi,zi;θY), describes the relationship between the variables of interest, (X,Z), and the outcome. The missing data mechanism, f(Ci|xi,zi;θC), is parameterized by θC. The measurement model, f(Xi*|xi;θM), is the conditional density of X* given the unobserved true value, X=x. The final conditional and marginal distributions comprise the joint distribution of the covariates, sometimes referred to as the exposure model in epidemiological applications.

The outcome model f(Yi|xi,zi;θY) is of primary interest as is an estimate of θY. Typically, the remaining parameters, {θC,θM,θX,θZ}, are treated as nuisance parameters. If the nuisance parameter space is orthogonal to the parameter space of interest, then an unbiased estimator for θY can be obtained provided the nuisance parameter estimator is root-n consistent [23].

We have a primary interest in the outcome model, f(Yi|xi,zi;θY), and a secondary interest in the missing data mechanism, f(Ci|xi,zi;θC). As such, we are interested in the unbiased estimation of {θY,θC}. Under the specification of unique parameterization, the assumption is that {θY,θC} is orthogonal to {θM,θX,θZ}. The challenge is that we are interested in a parameter that is commonly relegated to the nuisance space. An example of possible application (Section 5) is the use of inverse probability weights (IPW) when there is mismeasurement in at least one variable used to model the weights and the response, and the response may be missing [31]. In this scenario, we wish to obtain “good” estimates of the parameters in the weighting models to ensure that the inverse probability weighted estimators exhibit reduced or no bias.

Motivated by this example, we used a simulation study to examine the finite-sample bias reduction for both {θY,θC} using known implementations of the EM algorithm. Due to lack of convergence of the algorithm for a standard implementation, we implemented the ECM and the multi-cycle ECM algorithm with Monte Carlo integration. When a single EM algorithm was used to obtain estimates, θY and θC had problems converging; the unified approach resulted in a lack of convergence for the single EM algorithm. Convergence rates across the three implementations of the EM algorithm ranged from 0% to 60%, suggesting unstable algorithms. To overcome this, we sought to partition the parameter space orthogonally and implement separate EM algorithms for each partition. By having root-n consistent estimators in a particular subspace, we would have unbiased estimators that converge to a stationary point in the complementary subspace.

3 The orthogonally partitioned EM algorithm

Let X be a vector-valued random variable with a joint distribution f(X|θ), where θ ∈ Θ and Θ denotes the associated parameter space. The likelihood, L(θ|x), is said to be uniquely parameterized if it can be written as a product of conditional distributions,

(6) $L(\theta \mid x) = \prod_{j=1}^{p} f\big(X_{p-j+1} \mid x_1, \ldots, x_{p-j}; \theta_{p-j+1}\big),$

such that θi ∈ Θi, θj ∈ Θj, Θi ⊂ Θ, Θj ⊂ Θ, and Θi ⊥ Θj for all i ≠ j, i = 1,…,p, j = 1,…,p, with x0 ≡ ∅.

Unique parameterization has been used, without a formal definition, as a means to specify complex joint distributions through a product of conditional ones [32–36]. Proposition 1 formalizes unique parameterization, ensuring that the parameter space is composed of orthogonal subspaces.

Proposition 1

A likelihood may be uniquely parameterized if and only if

$\Theta = \bigoplus_{j \in \{1,\ldots,p\}} \Theta_j,$

for p ≥ 1, where ⊕ denotes the direct sum of the Θj ⊂ Θ.

Proof

See Appendix.

The proposed extension of the EM algorithm begins with an initial estimate of θ, denoted θ(0)={θY(0),θC(0),θM(0),θX(0),θZ(0)}. Proposition 1 permits the construction of an orthogonal partitioning set of subspaces, {Θk: k=1,…,K}, using direct-sum decomposition [49]. The ability to partition the parameter space under unique parameterization allows a general formulation for the E-step of the kth stage of the orthogonally partitioned EM (OPEM) algorithm,

(7) $\tilde{Q}\big(\theta^{(\tau+1)} \mid \theta^{(\tau)}\big) = \sum_{k=1}^{K} Q\big(\theta_k^{(t_k+1)} \mid \theta_k^{(t_k)}, \hat{\theta}_1^{(t_1)}, \ldots, \hat{\theta}_{k-1}^{(t_{k-1})}, \theta_{k+1}^{(0)}, \ldots, \theta_K^{(0)}\big),$

where τ+1 denotes the (tk+1)th step for the kth partition, given the tkth step of the kth partition and the maximized values θˆ1(t1),…,θˆk−1(tk−1) obtained at steps t1,…,tk−1 of the respective k−1 preceding EM algorithms; that is, τ = (tk, t1,…,tk−1) for the kth stage. For the kth stage, partitions k+1 to K have their parameter estimates fixed at the initialized values {θk+1(0),…,θK(0)}, and the parameter estimates from stages 1 to (k−1) are fixed at their stage estimate values {θˆ1(t1),…,θˆk−1(tk−1)}.
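
The structure of Equation (7), a sequence of self-contained EM algorithms run over an orthogonal partition of the parameter space, can be sketched generically as below. This is our own schematic, not the authors' implementation: each element of `stages` stands in for a full EM or MCEM routine for its block, and the toy stage functions at the end exist only so the skeleton runs.

```r
## Schematic sketch of a K-stage OPEM run. stages[[k]] is assumed to perform one
## EM/MCEM iteration for parameter block k, holding blocks 1..(k-1) at their
## stage estimates and blocks (k+1)..K at their initial values, as in Equation (7).
opem <- function(theta0, stages, tol = 1e-4, max_iter = 100) {
  theta <- theta0                          # list of K parameter blocks
  for (k in seq_along(stages)) {           # stage k: run EM for block k to convergence
    for (t in seq_len(max_iter)) {
      theta_k_new <- stages[[k]](theta)    # one E-step + M-step for block k
      done <- max(abs(theta_k_new - theta[[k]])) < tol
      theta[[k]] <- theta_k_new
      if (done) break
    }
  }
  theta
}

## Toy usage with two scalar blocks and closed-form "EM" updates (illustration only)
set.seed(5)
u <- rnorm(100, mean = -2); v <- rnorm(100, mean = 5)
stages <- list(
  function(th) mean(u),                    # stage 1: update theta_1
  function(th) mean(v) + 0 * th[[1]]       # stage 2: update theta_2 given theta_1
)
opem(theta0 = list(0, 0), stages = stages)
```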

Equation (5) is used to illustrate Equation (7) and motivate the simulation study in Section 4. For the ith subject, the (t+1)th E-step associated with Equation (5) is

(8) $\begin{aligned}
E\big[\ell(\theta \mid y_i, c_i, x_i, x_i^{*}, z_i) \mid y_{i,o}, c_i, x_i^{*}, z_i; \theta^{(t)}\big]
&= \int_{\mathcal{Y}_m}\!\int_{\mathcal{X}} \log f(Y_i \mid x_i, z_i; \theta_Y)\, f(Y_{i,m}, X_i \mid y_{i,o}, c_i, x_i^{*}, z_i; \theta^{(t)})\, dy_{i,m}\, dx_i \\
&\quad + \int_{\mathcal{Y}_m}\!\int_{\mathcal{X}} \log f(C_i \mid x_i, z_i; \theta_C)\, f(Y_{i,m}, X_i \mid y_{i,o}, c_i, x_i^{*}, z_i; \theta^{(t)})\, dy_{i,m}\, dx_i \\
&\quad + \int_{\mathcal{Y}_m}\!\int_{\mathcal{X}} \log f(X_i^{*} \mid x_i; \theta_M)\, f(Y_{i,m}, X_i \mid y_{i,o}, c_i, x_i^{*}, z_i; \theta^{(t)})\, dy_{i,m}\, dx_i \\
&\quad + \int_{\mathcal{Y}_m}\!\int_{\mathcal{X}} \log f(X_i \mid z_i; \theta_X)\, f(Y_{i,m}, X_i \mid y_{i,o}, c_i, x_i^{*}, z_i; \theta^{(t)})\, dy_{i,m}\, dx_i \\
&\quad + \int_{\mathcal{Y}_m}\!\int_{\mathcal{X}} \log f(Z_i \mid \theta_Z)\, f(Y_{i,m}, X_i \mid y_{i,o}, c_i, x_i^{*}, z_i; \theta^{(t)})\, dy_{i,m}\, dx_i
\end{aligned}$

where 𝒳 is the support of X. For simplicity, let K=2 and partition Θ such that Θ = Θ1 ⊕ Θ2, where Θ1 = ⊕j∈{C,M,X,Z} Θj and Θ2 = ΘY. The subspaces for each of the k stages may be determined by the context of the research or to provide algorithmic stability.

For the first stage, θ1 ∈ Θ1, we have

$f\big(Y_{i,m}, X_i \mid y_{i,o}, c_i, x_i^{*}, z_i; \theta^{(t)}\big) = \frac{f(C_i \mid x_i, z_i; \theta_C)\, f(X_i^{*} \mid x_i; \theta_M)\, f(X_i \mid z_i; \theta_X)}{\int_{\mathcal{X}} f(C_i \mid x_i, z_i; \theta_C)\, f(X_i^{*} \mid x_i; \theta_M)\, f(X_i \mid z_i; \theta_X)\, dx_i}\; w_i^{(1)}$

where

$w_i^{(1)} = \frac{\int_{\mathcal{X}} f(C_i \mid x_i, z_i; \theta_C)\, f(X_i^{*} \mid x_i; \theta_M)\, f(X_i \mid z_i; \theta_X)\, dx_i}{\int_{\mathcal{X}} \big[\int_{\mathcal{Y}_m} f(Y_i \mid x_i, z_i; \theta_Y)\, dy_{i,m}\big]\, f(C_i \mid x_i, z_i; \theta_C)\, f(X_i^{*} \mid x_i; \theta_M)\, f(X_i \mid z_i; \theta_X)\, dx_i}.$

The first stage Q-function at the (t1+1)th step, Q(θ1(t1+1)|θ1(t1)), does not have a closed form; it is evaluated using Monte Carlo integration with a Gibbs sampler [33, 28]. Maximum likelihood estimates for the parameters are obtained using a Gibbs sampler and the adaptive rejection algorithm [37]. Sampling is from f(Yi,m, Xi | yi,o, ci, xi*, zi; θ(t)), where

$f\big(Y_{i,m}, X_i \mid y_{i,o}, c_i, x_i^{*}, z_i; \theta^{(t)}\big) \propto f(C_i \mid x_i, z_i; \theta_C)\, f(X_i^{*} \mid x_i; \theta_M)\, f(X_i \mid z_i; \theta_X)$

for θ1. The first stage Q-function for the ith subject at the (t1+1)th iteration is written as

$Q\big(\theta_1^{(t_1+1)} \mid \theta_1^{(t_1)}\big) \approx \frac{1}{m_1}\sum_{r=1}^{m_1}\big[\log f(C_i \mid x_{ir}, z_i; \theta_C) + \log f(X_i^{*} \mid x_{ir}; \theta_M) + \log f(X_{ir} \mid z_i; \theta_X)\big].$

The argument for the second stage Q-function at the (t2+1)th step,

$Q\big(\theta_2^{(t_2+1)} \mid \hat{\theta}_1^{(t_1)}, \theta_2^{(t_2)}\big),$

parallels that of the first stage.
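
As a hedged sketch of how the Monte Carlo approximation above might be evaluated for one subject, the function below averages the three log-densities over a set of draws of the unobserved true exposure, which are assumed to have already been produced by a sampler targeting the conditional distribution given earlier. The density forms mirror the models of Section 4; the draws, parameter values, and function name are placeholders of ours.

```r
## Sketch only: Monte Carlo evaluation of the subject-i contribution to the
## first-stage Q-function, given m_1 draws x_draws of the true exposure x_i.
Q1_i <- function(x_draws, c_i, x_star_i, z_i, theta_C, theta_M, theta_X) {
  ## log f(C_i | x, z; theta_C): logistic missing data mechanism
  log_fC <- dbinom(c_i, size = 1,
                   prob = plogis(theta_C[1] + theta_C[2] * x_draws + theta_C[3] * z_i),
                   log = TRUE)
  ## log f(X*_i | x; theta_M): classical additive measurement error
  log_fM <- dnorm(x_star_i, mean = x_draws, sd = sqrt(theta_M), log = TRUE)
  ## log f(X_i | z; theta_X): normal exposure model given z
  log_fX <- dnorm(x_draws, mean = theta_X[1] + theta_X[2] * z_i,
                  sd = sqrt(theta_X[3]), log = TRUE)
  mean(log_fC + log_fM + log_fX)           # Monte Carlo average over the m_1 draws
}

## Placeholder usage with hypothetical draws and parameter values
set.seed(6)
x_draws <- rnorm(2500, mean = 0.1, sd = 0.5)
Q1_i(x_draws, c_i = 0, x_star_i = 0.2, z_i = -0.4,
     theta_C = c(0.03, 0.38, -1.88), theta_M = 0.3, theta_X = c(0, 0.2, 0.96))
```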

Each stage can be thought of as a constrained maximization, where the constraint restricts the maximization to an orthogonal subspace of Θ. An EM algorithm or a generalized EM algorithm requires the optimizing constraint to be space-filling [22] and to have the appropriate convergence properties. If the space-filling criterion holds, then we obtain an unconstrained maximum. If the algorithm is a generalized EM algorithm, then convergence to a stationary point is quickly obtained.

Proposition 2

If the E-step can be written in the form of Equation (7) and $\tilde{Q}(\theta^{(\tau+1)} \mid \theta^{(\tau)})$ is maximized subject to the constraint function G = {gk(θ) = θ/θk | k = 1,…,K}, where gk(θ) is the vector of all parameter subvectors in θ except θk, using K Monte Carlo EM stages, then we call the algorithm an orthogonally partitioned EM (OPEM) algorithm.

Beginning with the space-filling property, the gradient of G, ∇gk(θ), is full rank at θk(t) ∈ Θ for all k and t. It consists of a set of dk elementary column vectors and a set of dΘ − dk zero column vectors, where dΘ = dim(Θ) and dk = dim(θk). Taking η ∈ Rdk, the column space of the gradient is ck(θ) = {η∇gk(θ) | η ∈ Rdk}. The intersection of the column spaces, using Proposition 1, is

$c(\theta) = \bigcap_{k} c_k(\theta) = \emptyset,$

with the complement

$c(\theta)^{c} = \bigcup_{k} \{\eta \nabla g_k(\theta) \mid \eta \in \mathbb{R}^{d_k}\} = \mathbb{R}^{d_\Theta}.$

Meng and Rubin [22] showed that the convex hull of all feasible directions determined by the constraint spaces, {θ ∈ Θ | gk(θ) = gk(θ(t))}, k = 1,…,K, is equivalent to RdΘ, the Euclidean space of dimension dΘ. This is equivalent to the space-filling criterion as defined using tangent cones.

If each of the K Q-functions, k = 1,…,K, is a generalized EM algorithm, then Proposition 3 states that the entire algorithm is a generalized EM algorithm. To address the problem of non-monotonicity of the MCEM algorithm [38], we introduce the requirement that there exists an iteration, tk ≥ 1, after which the kth Q-function increases with each iteration.

Proposition 3

If we have an OPEM (Proposition 2) and there exists a tk ≥ 1 for which $Q\big(\theta_k^{(t_k+1)} \mid \theta_k^{(t_k)}\big) \geq Q\big(\theta_k^{(t_k)} \mid \theta_k^{(t_k)}\big)$ for all subsequent steps, then the algorithm is a generalized EM algorithm.

Proof

See Appendix.

Dempster, Laird, and Rubin [1] and Wu [27] provide two methods to show convergence to stationary points. The simplest is to view the algorithm as a single process. By the Inverse Mapping Theorem and by part (i) of the proof of Theorem 2 in Meng and Rubin [22], it is a closed mapping. This directly implies that the limit points are stationary points (Wu [27], Theorem 1).

An alternate approach views the algorithm as the sum of K generalized EM algorithms. Each converges to a limiting point which is a stationary point [27]. In each stage, Monte Carlo integration is used with the relevant EM variant to obtain the maximizer of Q(θk(tk+1)|θk(tk)). With space-filling G, we have the following variation of Theorem 3 in Meng and Rubin [22].

Proposition 4

If all K maximizations of Equation (7) are unique, then the limit points of the K OPEM stages, $\{\hat{\theta}_k^{(t_k)};\, t_k \geq 0\}$, k = 1,…,K, are stationary points of $L_o(\theta \mid x_o)$ if G is space filling.

By extending Wu’s Theorem 6 [27] and replacing the uniqueness condition with the continuity of $\partial Q(\theta_k \mid \theta_k^{\prime})/\partial \theta_k$ at $\theta_k = \theta_k^{\prime}$ in both $\theta_k$ and $\theta_k^{\prime}$, and of ∇gk(θ) for all k, k = 1,…,K, we can relax the assumption that all conditional maximizations are unique.

4 Numerical example

4.1 Simulation specification

We consider a finite-sample simulation study using the general model given in Equation (5). As noted in Section 2.2, we are interested in the unbiased estimation of {θY,θC}. For the ith subject, i=1,…,n, we specify the exposure model for (Xi,Zi) to be jointly normally distributed with mean zero, σX,X=σZ,Z=1, and σX,Z=0.2, independently across subjects, where {θX,θZ}={μX,μZ,σX,X,σZ,Z,σX,Z}. The response Yi is binary with Yi=1 denoting the occurrence of the event of interest. The missing data indicator for the response, Ci, denotes missing responses as Ci=1.

The response is modeled as a logistic regression,

$\mathrm{logit}\{P(Y_i = 1 \mid x_i, z_i; \theta_Y)\} = \theta_{0Y} + \theta_{1Y} x_i + \theta_{2Y} z_i,$

where θY=(θ0Y,θ1Y,θ2Y). The missing data mechanism is modeled as

$\mathrm{logit}\{P(C_i = 1 \mid x_i, z_i; \theta_C)\} = \theta_{0C} + \theta_{1C} x_i + \theta_{2C} z_i,$

where θC=(θ0C,θ1C,θ2C). We set θ0Y=0.40 and θ0C=0.03, and the remaining parameters were randomly generated and restricted to lie in the range (−3, 3) for each dimension (Table 1).

Table 1:

Simulation parameterization for the response and missing data mechanism models.

Model   θ0C    θ1C     θ2C     θ0Y    θ1Y    θ2Y
1       0.03   0.38    –1.88   0.40   2.63   2.31
2       0.03   –0.18   –2.82   0.40   2.45   0.86
3       0.03   1.04    0.89    0.40   1.86   –0.68

Imperfect data are introduced in the following manner. Let X be measured with error using a classical unbiased nondifferential measurement error model, X* = X + ε, where ε ∼ N(0, θM). Bias reduction as a function of θM is explored to consider the effect of imprecision. We let θM take values {0.1, 0.3, 0.5, 0.7}. With σX,X = 1, θM is directly interpreted as the proportion of imprecision in the measurement of X [9].
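
To make the data-generating specification concrete, the sketch below draws one simulated data set under Model 1 of Table 1 with θM = 0.3. The mechanisms follow the description above; the variable names and the complete-case fit at the end are our own reading of a naive-style comparison analysis, included only as a sketch.

```r
## Sketch of one simulated data set: Model 1 of Table 1, theta_M = 0.3.
set.seed(7)
n       <- 500
theta_M <- 0.3
z <- rnorm(n)
x <- 0.2 * z + sqrt(1 - 0.2^2) * rnorm(n)   # unit variances, Cov(X, Z) = 0.2
y     <- rbinom(n, 1, plogis(0.40 + 2.63 * x + 2.31 * z))   # outcome model
c_mis <- rbinom(n, 1, plogis(0.03 + 0.38 * x - 1.88 * z))   # missing data mechanism
x_star <- x + rnorm(n, sd = sqrt(theta_M))  # error-prone measurement of x
y[c_mis == 1] <- NA                         # C_i = 1: the response is missing

## One plausible "naive" comparison: complete cases with x* in place of x
naive_fit <- glm(y ~ x_star + z, family = binomial)
coef(naive_fit)
```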

In applied settings, θM may be estimated through the collection of auxiliary information or by inferring a measurement error model from prior studies. This approach prevents the problem of having a nonidentifiable model ([9], section 4.1). Given that we are not generating auxiliary information from which estimates of θM may be derived, we specify θM for all simulations.

We let K=2 for our simulation study, giving a two-stage OPEM. The first stage targets θC, with {θM,θX,θZ} as nuisance parameters and θY fixed at its initial value, θY(0). This stage is an implementation of the MCEM algorithm given θY(0). The second stage estimates θY given the stage one estimates.

For each parameterization, we implemented S=100 simulations and each simulation had n=500 observations. We replicated this at each of the four specifications of measurement error noise, θM. For the Monte Carlo integration, we used a burn-in of 1,000 and a Monte Carlo sample of m=2500 as determined by a small Gibbs sampler convergence study and relevant recommendations [33, 39]. We used an MCEM algorithm for each stage of the 2-stage OPEM, and a dissimilarity metric based on a one step lag (e.g. Equation (3)) with measure less than 0.0025 to terminate the algorithm.
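
For completeness, the stopping rule used at each stage can be written as a one-line check; the sketch below assumes the one-step-lag dissimilarity is the maximum absolute change in the parameter vector, which is one common choice consistent with Equation (3).

```r
## Sketch: one-step-lag stopping rule with the threshold used in the simulations
converged <- function(theta_new, theta_old, delta = 0.0025) {
  max(abs(theta_new - theta_old)) < delta
}
converged(c(0.401, 2.631), c(0.400, 2.630))   # TRUE: change below 0.0025
```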

Due to the low probability of convergence using the MCEM, ECM and multi-cycle ECM algorithm with Monte Carlo integration (Section 2.2), we compare the OPEM with the naive analysis that assumes complete data. This is a reasonable contrast since the OPEM is not a direct extension of the ECM or multi-cycle ECM, but a new option for implementing an EM algorithm when convergence is problematic. The proposed OPEM algorithm has the advantage of convergence when alternate implementations of the EM algorithm have a low probability of convergence. All simulations were implemented in R [50].

4.2 Simulation results

For the estimated coefficients θˆ1,naiveC and θˆ1,naiveY of the naive analysis, we expect either attenuation (estimates that are smaller in magnitude than the true value) or augmentation (estimates that are larger in magnitude than the true value) ([9], Section 2.6). Furthermore, we expect an enhanced effect due to the modest correlation between X and Z ([9], Section 2.2) and induced bias for θˆ2,naiveC and θˆ2,naiveY ([10], Section 5.3). If the OPEM algorithm performs as the theory in Section 3 suggests, we should observe bias reduction for the affected estimators without inducing bias in those that were largely unaffected by measurement error. Finally, we note that all of the simulations converged, which is a substantial improvement over the convergence rates for standard implementations of the EM algorithm applied to this problem (Section 2.2).

Although our two-stage OPEM algorithm first estimated {θˆC,θˆM,θˆX} and then estimated θˆY given {θˆC,θˆM,θˆX}, we first present the results for the response model since it is of primary interest. We follow the response model results with those of secondary interest, the parameter estimation results for the missing data mechanism.

4.2.1 Response mechanism model results, θY

The estimates of the outcome model were obtained using an MCEM algorithm, conditional on the parameter estimates from the first stage, {θˆC,θˆM,θˆX} (Section 4.2.2). Table 2 shows that the OPEM provides bias reduction for θ1Y, across the levels of θM, when compared to the bias of the naive estimator. This ranges from a 33% reduction to a 98% reduction in the bias. There is one situation where a slight inflation of the OPEM bias occurs, model 3 when θM=0.1. The proportion of imprecision in x is small and the associated confidence intervals suggest that naive and OPEM biases are similar. The same conclusion can be drawn for model 2 when θM=0.1, even though the OPEM does reduce the bias in this situation.

Table 2:

Bias assessment for simulated example: θnaiveY denotes the use of complete data and θOPEMY denotes the application of the OPEM to adjust for incomplete information.

Model  θM    θnaiveY                                       θOPEMY
             θ0Y (SE)       θ1Y (SE)       θ2Y (SE)        θ0Y (SE)       θ1Y (SE)      θ2Y (SE)
1      0.1   0.01 (0.01)    −0.06 (0.02)   −0.01 (0.02)    −0.01 (0.01)   0.01 (0.2)    0.02 (0.02)
       0.3   0.02 (0.01)    −0.39 (0.02)   −0.09 (0.02)    −0.01 (0.02)   0.06 (0.03)   0.08 (0.03)
       0.5   0.06 (0.02)    −0.90 (0.02)   −0.28 (0.02)    0.01 (0.02)    0.07 (0.04)   0.05 (0.03)
       0.7   0.10 (0.01)    −1.30 (0.01)   −0.41 (0.02)    −0.01 (0.02)   0.20 (0.06)   0.12 (0.05)
2      0.1   0.02 (0.01)    −0.03 (0.02)   0.01 (0.02)     0.02 (0.01)    0.02 (0.02)   0.01 (0.02)
       0.3   0.04 (0.01)    −0.36 (0.02)   −0.03 (0.01)    0.02 (0.01)    0.02 (0.03)   −0.02 (0.02)
       0.5   0.07 (0.01)    −0.80 (0.02)   −0.07 (0.01)    −0.01 (0.02)   0.07 (0.04)   0.01 (0.02)
       0.7   0.09 (0.01)    −1.19 (0.01)   −0.07 (0.01)    0.01 (0.02)    0.10 (0.05)   0.02 (0.03)
3      0.1   −0.01 (0.01)   0.01 (0.01)    0.01 (0.01)     −0.01 (0.01)   0.03 (0.02)   0.01 (0.01)
       0.3   0.02 (0.01)    −0.20 (0.01)   0.04 (0.01)     0.01 (0.01)    0.07 (0.02)   −0.03 (0.01)
       0.5   0.05 (0.01)    −0.55 (0.01)   0.13 (0.01)     0.01 (0.01)    0.01 (0.02)   −0.02 (0.01)
       0.7   0.06 (0.01)    −0.82 (0.01)   0.21 (0.01)     −0.01 (0.01)   0.06 (0.03)   −0.02 (0.02)

Bias reduction for θ0Y and θ2Y is observed. In particular, the reduction is more notable than those observed for θ0C and θ2C (Section 4.2.2). Part of this is related to the magnitude of the naive estimator’s bias (e.g. model 3, θ2Y when θM=0.7), but it is also noticeable where the naive estimator’s bias was modest (e.g. model 1, θ0Y when θM=0.7). As before, there is a slight increase in the standard errors of the OPEM estimates. When compared against the bias reduction, we have overall gains in terms of the MSE and bias-variance trade-off.

4.2.2 Missing data mechanism results, θC

We observe the expected attenuation and augmentation in θ1C for the naive estimate. As expected, we observe bias reduction for the OPEM implementation since the first stage in any K-stage OPEM is a simple EM algorithm.

We observe modest bias reduction for model 1 across all levels of θM (Table 3). For models 2 and 3, we observe much larger bias reductions across the four levels of θM, ranging from a 33% reduction in the bias of θ1C to a 95% reduction. The reduction is greatest when the proportion of imprecision is large, θM=0.5,0.7.

We observe a nominal inflation of the associated standard errors with the OPEM. In terms of mean squared error and bias-variance trade-off, the OPEM performs well when the proportion of imprecision in the measurement of X is large, θM=0.5,0.7. The OPEM provides nominal gains when the proportion of imprecision in the measurement of X is small, θM=0.1,0.3. This is not unreasonable as the observed values, x*, are relatively similar to the unobserved values, x, when the imprecision is slight.

The first stage of the OPEM permits us to compare the effect of estimation using a two-stage OPEM for bias reduction versus a one-stage approach (standard implementation of the EM algorithm), provided that the one-stage EM algorithm converges. We observe minimal reduction of bias in θ0C and θ2C, but noticeable bias reduction in θ0Y and θ2Y, which are the stage two analogues of θ0C and θ2C. This finite sample simulation result suggests that partitioning the EM algorithm into simpler steps may provide better bias reduction in the estimation of parameters of the outcome model.

Table 3:

Bias assessment for simulated example: θnaiveC denotes the use of complete data and θOPEMC denotes the application of the OPEM to adjust for incomplete information.

Model  θM    θnaiveC                                        θOPEMC
             θ0C (SE)       θ1C (SE)       θ2C (SE)         θ0C (SE)       θ1C (SE)       θ2C (SE)
1      0.1   −0.11 (0.04)   0.05 (0.02)    −0.08 (0.03)     −0.11 (0.04)   0.05 (0.02)    −0.08 (0.03)
       0.3   −0.10 (0.04)   −0.01 (0.02)   −0.06 (0.03)     −0.11 (0.04)   0.03 (0.02)    −0.07 (0.03)
       0.5   −0.11 (0.04)   0.05 (0.02)    −0.08 (0.03)     −0.11 (0.04)   0.05 (0.02)    −0.08 (0.03)
       0.7   −0.06 (0.04)   −0.10 (0.02)   0.01 (0.03)      −0.10 (0.04)   0.05 (0.03)    −0.05 (0.03)
2      0.1   −0.03 (0.03)   −0.02 (0.02)   −0.05 (0.03)     −0.03 (0.03)   −0.02 (0.02)   −0.05 (0.03)
       0.3   −0.08 (0.04)   0.01 (0.02)    −0.06 (0.04)     −0.08 (0.04)   −0.02 (0.02)   −0.06 (0.04)
       0.5   −0.13 (0.04)   0.04 (0.02)    −0.13 (0.04)     −0.13 (0.04)   0.01 (0.02)    −0.13 (0.04)
       0.7   −0.02 (0.04)   0.06 (0.01)    −0.04 (0.03)     −0.04 (0.04)   −0.01 (0.02)   −0.04 (0.04)
3      0.1   −0.05 (0.04)   0.01 (0.03)    0.01 (0.02)      −0.05 (0.04)   0.02 (0.03)    0.01 (0.02)
       0.3   −0.01 (0.04)   −0.06 (0.02)   0.02 (0.03)      −0.06 (0.04)   0.04 (0.03)    0.01 (0.03)
       0.5   0.03 (0.03)    −0.23 (0.03)   0.06 (0.02)      −0.09 (0.04)   0.01 (0.04)    0.04 (0.02)
       0.7   0.08 (0.04)    −0.37 (0.02)   0.06 (0.03)      −0.11 (0.04)   0.03 (0.03)    0.03 (0.03)

5 Discussion

The K-stage orthogonally partitioned EM (OPEM) algorithm provides an implementation that exploits the common assumption of unique parameterization, corrects for biases due to imperfect data, converges for the specified model when a standard implementation of the EM algorithm has a low probability of convergence, and reduces a potentially complex algorithm into a sequence of smaller, simpler, self-contained EM algorithms. We showed, using the theoretical framework of the EM algorithm, that the OPEM can be applied to any situation where the joint probability distribution can be written as a product of conditional probability distributions. As well, it yields an optimal solution over the parameter space, and known extensions of the EM algorithm for accelerating convergence may be used for each stage of the OPEM [40]. The computation of standard errors may be performed stage-wise with an appropriate adaptation of Louis’ method for MCEM approaches [4, 38, 28] or through Casella’s adaptation of Oakes’ formula for standard errors from an MCEM ([41], Appendix A.3). The finite sample simulation study revealed that partitioning the EM algorithm into simpler steps may provide better bias reduction in the estimation of parameters of the outcome model. This suggests that positioning the primary model at or near the end of the OPEM sequence is a strategy for better bias reduction in the estimators of the parameters of interest.

A potential application of this method is for the correction of bias due to mismeasured variables that are used to construct a Horvitz-Thompson [42] type weighting mechanism or propensity score for model correction [43, 44]. Additionally, this methodology could be adapted for related approaches, such as the inverse probability of treatment weighted marginal structural model estimator [45, 46, 47, 48]. Given that a likelihood approach could provide a unified method to correct for imperfect data, the EM algorithm is a natural tool to explore.

Beyond the potential application of the K-stage OPEM to a wide variety of Horvitz-Thompson type estimators, Section 3 strongly suggests broad applicability to complex, multi-model settings. The OPEM can be directly used for any situation with missing or mismeasured data where the joint distribution is expressed as a product of conditional, uniquely parameterized distributions. The ability to break down a complicated problem into a series of simpler, more accessible problems will permit a broader implementation of the EM algorithm, permit the use of software packages that now implement and/or automate the EM algorithm for complex situations of imperfect data, and make the EM algorithm more accessible to a wider and more general audience.

A Appendix

A.1 Proof of Proposition 1

By the definition of unique parameterization, the direct sum θ1 + ⋯ + θp = 0 only when θk = 0 for all k, θk ∈ Θk, k = 1,…,p. Thus Θ can be written as a direct sum of the orthogonal subspaces.

Now, if

$\Theta = \bigoplus_{j \in \{1,\ldots,p\}} \Theta_j$

then there exist p linear operators H1,…,Hp on Θ such that each Hk, k = 1,…,p, is a projection of Θ. Since each Hk is a projection of Θ, a direct-sum decomposition yields HkHl = 0 if k ≠ l, where k, l ∈ {1,…,p} ([49], Section 6.6). Any vector θl ∈ R(Hl), the range of Hl, is in the orthogonal complement of R(Hk). □

A.2 Proof of Proposition 3

The algorithm begins with an initial value θ(0) = {θ1(0),…,θK(0)}. Applying the MCEM algorithm to the kth Q-function, under the assumption that there exists a tk ≥ 1 such that $Q\big(\theta_k^{(t_k+1)} \mid \theta_k^{(t_k)}\big) \geq Q\big(\theta_k^{(t_k)} \mid \theta_k^{(t_k)}\big)$, for all θ ∈ Θk ⊂ Θ where

$\Theta_k \cap \bigoplus_{j \in \{1,\ldots,K\},\, j \neq k} \Theta_j = \emptyset,$

results in the kth Q-function being a Generalized EM algorithm. There are K MCEM algorithms for which maximizers are obtained. By induction, the entire optimization procedure is itself a Generalized EM algorithm.□

References

1. Dempster A, Laird N, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 1977;39:1–38.

2. McLachlan GJ, Krishnan T. The EM algorithm and extensions. Hoboken, NJ: Wiley, 2008. doi:10.1002/9780470191613

3. Liu C, Rubin DB, Yu YN. Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 1998;85:755–70. doi:10.1093/biomet/85.4.755

4. Louis TA. Finding the observed information matrix when using the EM algorithm. J R Stat Soc Ser B 1982;44:226–33.

5. Oakes D. Direct calculation of the information matrix via the EM algorithm. J R Stat Soc Ser B 1999;61:479–82. doi:10.1111/1467-9868.00188

6. Dawid AP, Skene AM. Maximum likelihood estimation of observer error-rates using the EM algorithm. Appl Stat 1979;28:20–8. doi:10.2307/2346806

7. Drews CD, Flanders WD, Kosinski AS. Use of two data sources to estimate odds-ratios in case-control studies. Epidemiology 1993;4:327–35. doi:10.1097/00001648-199307000-00008

8. Ganguli B, Staudenmayer J, Wand MP. Additive models with predictors subject to measurement error. Aust N Z J Stat 2005;47:193–202. doi:10.1111/j.1467-842X.2005.00383.x

9. Gustafson P. Measurement error and misclassification in statistics and epidemiology: impacts and Bayesian adjustments. New York: Chapman & Hall/CRC, 2004.

10. Buonaccorsi JP. Measurement error: models, methods, and applications. Boca Raton, FL: Chapman & Hall, 2010. doi:10.1201/9781420066586

11. Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 2013;22:278–95. doi:10.1177/0962280210395740

12. Carpenter JR, Kenward MG, Vansteelandt S. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. J R Stat Soc Ser A 2006;169:571–84. doi:10.1111/j.1467-985X.2006.00407.x

13. Kenward MG, Carpenter J. Multiple imputation: current perspectives. Stat Methods Med Res 2007;16:199–218. doi:10.1177/0962280206075304

14. White IR, Carlin JB. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med 2010;29:2920–31. doi:10.1002/sim.3944

15. Guolo A. The SIMEX approach to measurement error correction in meta-analysis with baseline risk as covariate. Stat Med 2014;33:2062–76. doi:10.1002/sim.6076

16. He W, Xiong J, Yi GY. SIMEX R package for accelerated failure time. J Stat Software 2012. doi:10.18637/jss.v046.c01

17. Shang Y. Measurement error adjustment using the SIMEX method: an application to student growth percentiles. J Educ Meas 2012;49:446–65. doi:10.1111/j.1745-3984.2012.00186.x

18. Buzas JS, Stefanski LA. Instrumental variable estimation in generalized linear measurement error models. J Am Stat Assoc 1996;91:999–1006. doi:10.1080/01621459.1996.10476970

19. Strand M, Sillau S, Grunwald GK, Rabinovitch N. Regression calibration for models with two predictor variables measured with error and their interaction, using instrumental variables and longitudinal data. Stat Med 2014;33:470–87. doi:10.1002/sim.5904

20. Cole SR, Chu H, Greenland S. Multiple-imputation for measurement-error correction. Int J Epidemiol 2006;35:1074–81. doi:10.1093/ije/dyl097

21. Edwards JK, Cole SR, Troester MA, Richardson DB. Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data. Am J Epidemiol 2013;177:904–12. doi:10.1093/aje/kws340

22. Meng X-L, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 1993;80:267–78. doi:10.1093/biomet/80.2.267

23. Tsiatis AA. Semiparametric theory and missing data. New York: Springer, 2006.

24. Chen W, Maitra R, Melnykov V. EMCluster: EM algorithm for model-based clustering of finite mixture Gaussian distribution, Version 0.2-4, 2013.

25. Carpenter J, Kenward M. Multiple imputation and its application. Chichester, West Sussex: John Wiley & Sons, 2013. doi:10.1002/9781119942283

26. Little RJA, Rubin DB. Statistical analysis with missing data, 2nd ed. Hoboken, NJ: John Wiley & Sons, 2002. doi:10.1002/9781119013563

27. Wu CFJ. On the convergence properties of the EM algorithm. Ann Stat 1983;11:95–103. doi:10.1214/aos/1176346060

28. Wei GCG, Tanner MA. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J Am Stat Assoc 1990;85:699–704. doi:10.1080/01621459.1990.10474930

29. Carroll R, Ruppert D, Stefanski L, Crainiceanu C. Measurement error in non-linear models: a modern perspective, 2nd ed. Boca Raton, FL: Chapman & Hall/CRC, 2006. doi:10.1201/9781420010138

30. Rubin D. Inference and missing data. Biometrika 1976;63:581–92. doi:10.1093/biomet/63.3.581

31. Regier MD, Moodie EEM, Platt RW. The effect of error-in-confounders on the estimation of the causal parameter when using marginal structural models and inverse probability-of-treatment weights: a simulation study. Int J Biostat 2014;10:1–15. doi:10.1515/ijb-2012-0039

32. Chen J, Zhang D, Davidian M. A Monte Carlo EM algorithm for generalized linear mixed models with flexible random effects distribution. Biostatistics 2002;3:347–60. doi:10.1093/biostatistics/3.3.347

33. Ibrahim JG, Chen MH, Lipsitz SR. Monte Carlo EM for missing covariates in parametric regression models. Biometrics 1999;55:591–6. doi:10.1111/j.0006-341X.1999.00591.x

34. Lipsitz SR, Ibrahim JG. A conditional model for incomplete covariates in parametric regression models. Biometrika 1996;83:916–22. doi:10.1093/biomet/83.4.916

35. Liu W, Wu L. Simultaneous inference for semiparametric nonlinear mixed-effects models with covariate measurement errors and missing responses. Biometrics 2007;63:342–50. doi:10.1111/j.1541-0420.2006.00687.x

36. Stubbendick A, Ibrahim JG. Maximum likelihood methods for nonignorable missing responses and covariates in random effects models. Biometrics 2003;59:1140–50. doi:10.1111/j.0006-341X.2003.00131.x

37. Gilks W, Wild P. Adaptive rejection sampling for Gibbs sampling. Appl Stat 1992;41:337–48. doi:10.2307/2347565

38. Robert CP, Casella G. Monte Carlo statistical methods, 2nd ed. New York: Springer, 2004. doi:10.1007/978-1-4757-4145-2

39. Ibrahim JG, Chen MH, Lipsitz SR. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika 2001;88:551–64. doi:10.1093/biomet/88.2.551

40. Daigle B, Roh M, Petzold L, Niemi J. Accelerated maximum likelihood parameter estimation for stochastic biochemical systems. BMC Bioinformatics 2012;13:1–18. doi:10.1186/1471-2105-13-68

41. Casella G. Empirical Bayes Gibbs sampling. Biostatistics 2001;2:485–500. doi:10.1093/biostatistics/2.4.485

42. Horvitz D, Thompson D. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 1952;47:663–85. doi:10.1080/01621459.1952.10483446

43. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med 2004;23:2937–60. doi:10.1002/sim.7231

44. Wooldridge JM. Inverse probability weighted estimation for general missing data problems. J Econometrics 2007;141:1281–301. doi:10.1920/wp.cem.2004.0504

45. Cole SR, Hudgens MG, Tien PC, Anastos K, Kingsley L, Chmiel JS, et al. Marginal structural models for case-cohort study designs to estimate the association of antiretroviral therapy initiation with incident AIDS or death. Am J Epidemiol 2012;175:381–90. doi:10.1093/aje/kwr346

46. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology 2004;15:615–25. doi:10.1097/01.ede.0000135174.63482.43

47. Moodie EEM, Stephens D. Marginal structural models: unbiased estimation for longitudinal studies. Int J Public Health 2011;56:117–19. doi:10.1007/s00038-010-0198-4

48. Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000;11:550–60. doi:10.1097/00001648-200009000-00011

49. Hoffman K, Kunze R. Linear algebra, 2nd ed. Upper Saddle River, NJ: Prentice Hall, 1971.

50. R Development Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2014. http://www.R-project.org, ISBN 3-900051-07-0.

Published Online: 2016-5-26
Published in Print: 2016-5-1

©2016 by Regier et al., published by De Gruyter
