
Cauchy difference priors for edge-preserving Bayesian inversion

Markku Markkanen, Lassi Roininen, Janne M. J. Huttunen and Sari Lasanen
Published online: January 30, 2019

Abstract

We consider inverse problems in which the unknown target includes sharp edges, for example interfaces between different materials. Such problems are typical in image reconstruction, tomography and other inverse problems. A common solution for edge-preserving inversion is to use total variation (TV) priors. However, as shown by Lassas and Siltanen (2004), the TV prior is not discretization-invariant: the edge-preserving property is lost when the computational mesh is made denser and denser. In this paper we propose another class of priors for edge-preserving Bayesian inversion, the Cauchy difference priors. We construct Cauchy priors starting from continuous one-dimensional Cauchy motion, and show that its discretized version, the Cauchy random walk, can be used as a non-Gaussian prior for edge-preserving Bayesian inversion. We generalize the methodology to two-dimensional Cauchy fields, and briefly consider a generalization of the Cauchy priors to Lévy α-stable random field priors. We develop a suitable posterior distribution sampling algorithm for conditional mean estimates with single-component Metropolis–Hastings. We apply the methodology to one-dimensional deconvolution and two-dimensional X-ray tomography problems.

1 Introduction

We propose to use non-Gaussian random field priors that are constructed from Cauchy walks for edge-preserving Bayesian statistical inverse problems [18, 4]. Edge-preserving inversion is needed, for example, in seismic and medical imaging [11, 39]. Our basic building block for edge-preserving Bayesian inversion is continuous Cauchy motion, and its discretized version, the Cauchy random walk, whose increments are independent and identically distributed Cauchy noise. Lévy α-stable random walks are generalizations of Cauchy (α=1) and Gaussian (α=2) random walks [37]. Hence, we will expand the one-dimensional Lévy α-stable random walk priors to two-dimensional random field priors. We shall concentrate mostly on the numerics of Cauchy priors, and we leave, for example, a mathematically rigorous discussion of the interplay between finite-dimensional and infinite-dimensional posterior distributions, i.e. discretization-invariance, to future papers. For discussion on Cauchy noise priors in the framework of infinite-dimensional Bayesian inversion, see Sullivan [40]. A similar approach was also considered by Hosseini in [17], where Lévy process priors of bounded variation are studied. In this paper, we focus on rougher sample paths that have unbounded variation. We note that Gaussian difference priors are known to promote smoothness, i.e. sharp features are smoothed in the inversion, and hence they cannot be used for edge-preserving inversion.

An example where Cauchy priors and edge-preserving Bayesian inversion are needed is X-ray tomography, also known as computed tomography. In medical imaging, the unknown, such as organs, teeth and the surrounding tissues, has interfaces between tissues which can be seen as sharp edges. X-ray tomography within the Bayesian framework has been reported e.g. in [39, 30, 41]. A standard way of dealing with X-ray tomography is to use filtered back-projection [6, 7, 38]. When we have enough projection data, solving the tomography problem is numerically stable [29], and we can reconstruct smooth parts as well as edges. These methods, however, have limitations when the number of measurements becomes smaller, the measurement angle is limited, or the measurement noise increases [10, 27, 32]. Bayesian inversion can absorb such uncertainties and restrictions, presenting a powerful alternative to filtered back-projection.

A common choice for edge-preserving Bayesian inversion is to use total variation (TV) priors. However, finite-dimensional TV priors converge to continuous-parameter Gaussian priors in the discretization limit (Lassas and Siltanen 2004 [25]). This means that, when we make the computational mesh denser and denser, the solution of a Bayesian statistical inverse problem, i.e. the posterior distribution, is not invariant with respect to the discretization of the unknown [21, 22]. In low-dimensional inverse problems, TV priors may exhibit edge-preserving inversion properties. However, high-dimensional TV priors start to lose their edge-preserving property when the discretization is made denser, and hence the sharp features are smoothed. Based on Donsker's central limit theorem [19] and domains of attraction of stable random variables [12], we can see that the lack of discretization-invariance originates from the distributions of the increments of the random walk. Namely, the increments have finite second moments, which always leads to normally distributed limits. When one allows other distributions for the increments, it may become possible to maintain the edge-preserving properties for high-dimensional priors. By choosing Cauchy distributions for the increments, we can show that the discrete Cauchy walk X converges to continuous Cauchy motion 𝒳 when the discretization step h → 0⁺, i.e. it has a non-Gaussian discretization limit. This is related to the fact that a Cauchy random variable does not have finite moments of order one or higher.

Lassas, Saksman and Siltanen 2009 [24] proposed to use non-Gaussian Besov space priors, which are constructed on a wavelet basis. These can be shown to promote discretization-invariant inversion, and the method has been applied, e.g., to X-ray tomography by Vänskä, Lassas and Siltanen 2009 [41]. Besov priors are typically defined via wavelet expansions, and for edge-preserving inversion the Haar wavelet basis is often used. However, due to the structure of the Haar basis, discontinuities are preferred on an underlying dyadic grid given by the discontinuities of the basis functions. For example, on the domain (0,1), a discontinuity is vastly preferred at x = 1/4 over x = 1/3. This is, in most practical cases, both a strong and unrealistic assumption. The Cauchy priors do not have this restriction. Helin and Lassas 2011 [16] considered hierarchical Mumford–Shah priors. Other methods for tackling edges include, for example, level set Bayesian inversion, where, in practice, one reconstructs the unknown object, for example with Gaussian Markov random field priors [35], and then thresholds the reconstruction to predefined level sets [11, 20, 28].

Gaussian priors can also be expanded to non-Gaussian priors by applying hyperparametric models [3, 5]. For discussion on hierarchical regularization for edge-preserving reconstructions, see [1]. For discussion on discretization-invariant hypermodels with Matérn fields, see [33], where the hyperpriors are modelled with continuous-parameter Gaussian and Cauchy random fields. These models are based on constructing the inverse covariance matrix via discretization of certain stochastic partial differential equations, and on modelling the length-scale as a random field in the hyperprior. The idea is similar to the length-scale modelling in probabilistic deep learning algorithms [9, 31]. In conjunction with the results of Lasanen 2012 [21, 22], these hypermodels can be shown to be discretization-invariant. For general references on Gaussian Markov random fields in Bayesian inversion, see [36, 34], and, in spatial interpolation, see [26]. Construction of a Matérn-style random field with non-Gaussian noise has been studied by Bolin 2014 [2].

For numerical verification of the proposed method, we apply the methodology to one-dimensional deconvolution and two-dimensional X-ray tomography problems. For drawing estimates from posterior distributions with non-Gaussian priors, we use Markov chain Monte Carlo (MCMC) methods, as MCMC provides the conditional mean estimate, which is our primary objective. We apply the so-called single-component Metropolis–Hastings sampling scheme, which is also known as Metropolis-within-Gibbs [11]. However, MCMC methods are known for their slowness, especially in high-dimensional problems with multimodal posterior distributions. For the tomography problem, we also discuss maximum a posteriori estimates with Gauss–Newton-type optimization, and compare tomography reconstructions with Cauchy priors to the filtered back-projection estimate and to conditional mean estimates with Gaussian and total variation priors. However, we wish to emphasize that the applications of the Cauchy priors can include any inverse problem with sharp features.

The rest of this paper is organized as follows: In Section 2, we first introduce the Bayesian statistical inversion framework, and then construct Cauchy priors by starting from Lévy α-stable processes. We consider why Cauchy random walk priors promote edge-preserving inversion, and finally we generalize the priors to two dimensions. In Section 3, we review the single-component Metropolis–Hastings algorithm. Numerical one-dimensional deconvolution and two-dimensional tomography examples are considered in Section 4. We compare the developed tomography algorithm against conditional mean estimates with Gaussian and TV priors as well as filtered back-projection reconstructions. In Section 5, we conclude the study and make some notes on future research.

2 Cauchy priors for Bayesian inversion

In Bayesian statistical inversion, our objective is to estimate the posterior distribution of an unknown continuous-parameter object 𝒳(x), x ∈ ℝᵈ, d = 1, 2, …, given noise-perturbed indirect measurements. The measurements are formally given with an operator equation

(2.1)  m = 𝒜𝒳 + e,

where m ∈ ℝⁿ is a known measurement vector, and 𝒜 is a known linear operator from some function space, such as a separable Banach space, to a finite-dimensional vector space. We assume Gaussian-distributed measurement noise e ∼ 𝒩(0, Σ), where Σ ∈ ℝⁿˣⁿ is a known covariance matrix. We furthermore assume that e is statistically independent of 𝒳.

For computational purposes, we discretize the continuous observation model in equation (2.1) and denote the discretized observation model as

(2.2)  m = AX + e,

where A ∈ ℝⁿˣᵏ and X ∈ ℝᵏ. The solution of a finite-dimensional Bayesian statistical inverse problem is an a posteriori probability distribution. Via Bayes' formula, we give the posterior as a probability density

(2.3)  D(X|m) = D(X) D(m|X) / D(m) ∝ D(X) D(m|X),

where D(m|X) is the likelihood density constructed based on the discretized observation model (2.2), and D(X) is the a priori probability density which reflects our information about the unknown before any actual measurement is done. Furthermore, D(m) is simply a normalization constant, and hence we may drop it and use the unnormalized density. The prior is, in practice, the only tunable parameter in the estimation algorithm.
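As a concrete starting point for the computations that follow, the unnormalized posterior of (2.3) can be evaluated by adding the log-prior to the Gaussian log-likelihood implied by (2.2). The following is a minimal sketch with illustrative names; it assumes the noise covariance Σ is supplied through its inverse.

```python
import numpy as np

def log_likelihood(X, m, A, Sigma_inv):
    # Gaussian log-likelihood of model (2.2), up to an additive constant
    r = m - A @ X
    return -0.5 * r @ (Sigma_inv @ r)

def log_posterior(X, m, A, Sigma_inv, log_prior):
    # Unnormalized log-posterior of (2.3): log D(X) + log D(m | X)
    return log_prior(X) + log_likelihood(X, m, A, Sigma_inv)
```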

In the next subsections, we will consider Lévy α-stable processes from which Cauchy and Gaussian random walks can be obtained as special cases. We study the edge-promoting inversion idea by showing that the conditional Cauchy prior density can be either unimodal or bimodal; the latter promotes edge-preserving inversion. We also carry out a comparison with other common priors, and then generalize the priors to ℝ².

2.1 Lévy α-stable, Cauchy and Gaussian random walk priors

Our main aim in this paper is numerical; however, here we introduce the basic, more formal concepts of the processes we are interested in, as we feel that this is important background information for the benefit of the reader. More importantly, this sets the direction of future studies, in which more technical results will be needed in order to prove the discretization-invariance of the posterior distribution.

Let {𝒳(t), t ∈ [0,T]} be a stochastic process for some 0 < T < ∞. Furthermore, let S_α(τ, β, μ) be the Lévy α-stable distribution with parameters α, τ, β and μ, for some 0 < α ≤ 2, −1 ≤ β ≤ 1. We call 𝒳(t) a continuous Lévy α-stable process starting from zero, if 𝒳(0) = 0, 𝒳 has independent increments and

𝒳(t) − 𝒳(s) ∼ S_α((t − s)^{1/α}, β, 0)

for any 0 ≤ s < t ≤ T. For more details on Lévy α-stable processes, see [37]. In order to illustrate the choices of the parameters, in Figure 1 we have plotted certain α-stable probability density functions with varying α and β. The case α = 2 is the Gaussian case, a special case and the high end of stable distributions. The Cauchy case is obtained with α = 1. The parameter β models the skewness of the probability density, a useful feature in many applications, but here we mostly concentrate on symmetric densities, hence β = 0.

Figure 1 Probability density functions with varying α and β parameters.

For example, α-stable processes can be constructed by using independently scattered measures. A random measure M on [0,T] maps (Lebesgue) measurable sets A ⊂ [0,T] to random variables M(A) in such a way that M(∪_k A_k) = Σ_k M(A_k) almost surely for disjoint measurable sets A_k, k ∈ ℕ. A random measure M is called independently scattered if M(A_1), …, M(A_n) are independent random variables whenever the measurable sets A_i ⊂ [0,T], i = 1, …, n, n ∈ ℕ, are pairwise disjoint. We require that

M(A) ∼ S_α(m(A)^{1/α}, β, 0)

with constant skewness β. In our case, the measure m(·) on [0,T] is the Lebesgue measure |·| multiplied by a constant λ > 0. Then the α-stable Lévy motion starting from zero is

𝒳(t) = M(χ_{[0,t]}),

where χ_A is the characteristic function of the set A.

For the continuous limit of the Lévy α-stable random walk, we apply independently scattered measures. Let us denote the discrete random walk at t = jh by X_j, where j ∈ ℤ₊ and h > 0 is the discretization step. We obtain a random walk approximation (see [37, Section 3.3])

(2.4)  X_j − X_{j−1} ∼ S_α((λh)^{1/α}, β, 0).

It is easy to see that such a random walk approximation converges, as h → 0, to the continuous Lévy α-stable motion in distribution in the Skorokhod space of functions that are right-continuous and have left limits.
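For illustration, the random walk approximation (2.4) can be sampled with a standard α-stable sampler. The sketch below assumes SciPy's levy_stable distribution, whose scale parameter for β = 0 matches the scale τ of S_α(τ, β, 0) used here; all function and parameter names are illustrative.

```python
import numpy as np
from scipy.stats import levy_stable

def stable_random_walk(J, h, alpha=1.0, beta=0.0, lam=1.0, seed=None):
    # Random walk approximation (2.4): increments ~ S_alpha((lam*h)^(1/alpha), beta, 0)
    rng = np.random.default_rng(seed)
    scale = (lam * h) ** (1.0 / alpha)
    increments = levy_stable.rvs(alpha, beta, loc=0.0, scale=scale,
                                 size=J, random_state=rng)
    return np.concatenate(([0.0], np.cumsum(increments)))  # X_0 = 0

# Cauchy (alpha = 1) and Gaussian (alpha = 2) walks on [0, 1] with J = 500 steps
X_cauchy = stable_random_walk(J=500, h=1.0 / 500, alpha=1.0, seed=0)
X_gauss = stable_random_walk(J=500, h=1.0 / 500, alpha=2.0, seed=0)
```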

Our special interest is the Cauchy random walk. Given equation (2.4) and choosing α=1, β=0, we can write the joint distribution of the Cauchy random walk as a probability density

(2.5)  D(X) = C ∏_{j=1}^{J} λh / ((λh)² + (X_j − X_{j−1})²),

where λ is the regularizing constant and C is a normalization constant. We can apply this probability density directly as an a priori probability density in Bayes’ formula in equation (2.3).
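In computations only the logarithm of (2.5) is needed, and the normalization constant C can be dropped. A minimal sketch, with illustrative names:

```python
import numpy as np

def cauchy_walk_log_prior(X, lam, h):
    # Log of the Cauchy random walk density (2.5), up to the constant log C
    d = np.diff(X)       # increments X_j - X_{j-1}
    gamma = lam * h      # Cauchy scale of each increment
    return np.sum(np.log(gamma) - np.log(gamma**2 + d**2))
```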

In Figure 2, we have plotted realizations of Cauchy and Gaussian random walks. The latter is the Lévy α-stable random walk with α = 2. The Cauchy random walk realizations have either small perturbations or big jumps. Hence, intuitively it is plausible that using Cauchy distributions for the differences leads to edge-preserving inversion. Gaussian random walk realizations do not prefer jumps but continuous paths, hence the tendency toward smoothing in Bayesian inversion.

Figure 2 Realizations of Cauchy and Gaussian random walks: (a) Cauchy random walk, (b) Gaussian random walk.

2.2 Edge-preserving inversion

To the best of our knowledge, for difference priors there has been no simple criterion which would indicate whether the prior favors edge-preserving solutions instead of smooth ones. The intuitive idea is to construct a prior which promotes discontinuities, i.e. jumps. Hence, via a negation, we may say that in edge-preserving inversion the priors do not promote smoothness everywhere.

Here, for simplicity, we will first consider a one-dimensional case and take h = λ = 1. We will generalize the construction to the two-dimensional case in the next sections. Hence, let us look at three consecutive coordinates X_{j−1}, X_j, X_{j+1} of a random vector X. We could then say that X prefers smoothness at point j if X_j is (most probably) in the middle of X_{j−1} and X_{j+1}, and that there is an edge at point j if X_j is (most probably) either at X_{j−1} or at X_{j+1}. The above-mentioned idea means that we are interested in the conditional prior distribution of X_j on the condition that X_{j−1} and X_{j+1} are known. Without loss of generality we can then assume that X_{j−1} = −a and X_{j+1} = a. The conditional density of the Cauchy difference prior for X_j can then be written as a probability density

D(X_j | X_{j−1} = −a, X_{j+1} = a) ∝ 1/(1 + (X_j − a)²) · 1/(1 + (X_j + a)²).

This is simply a product of two Cauchy probability density functions, one from each neighbor of Xj. In order to see the properties of these functions, consider the second derivative of the probability density function with respect to Xj at zero:

D″(0) < 0 when |a| < 1: maximum at 0, unimodal,
D″(0) = 0 when |a| = 1: as flat as possible at 0,
D″(0) > 0 when |a| > 1: minimum at 0, bimodal.

In Figure 3, we have plotted these probability density functions. With |a| < 1 we have a unimodal function, hence the regularization promotes a smooth solution around X_j = 0. With |a| > 1 we have a bimodal function, hence the regularization promotes solutions around X_j = ±a with smooth variations around these two values. This means that this kind of prior promotes jumps or small local oscillations.
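The modality switch can also be verified numerically from the unnormalized conditional density; the sketch below uses a finite-difference approximation of D″(0) and is purely illustrative.

```python
import numpy as np

def conditional_density(xj, a):
    # Unnormalized conditional Cauchy prior density of X_j given X_{j-1} = -a, X_{j+1} = a
    return 1.0 / (1.0 + (xj - a) ** 2) / (1.0 + (xj + a) ** 2)

def second_derivative_at_zero(a, eps=1e-4):
    # Finite-difference estimate of D''(0): negative -> unimodal, positive -> bimodal
    f = conditional_density
    return (f(eps, a) - 2.0 * f(0.0, a) + f(-eps, a)) / eps ** 2

for a in (0.5, 1.0, 2.0):
    print(a, second_derivative_at_zero(a))  # < 0, ~ 0, > 0, respectively
```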

Figure 3 Upper panels: conditional Cauchy probability density function for X_2, given fixed X_1 = −a and X_3 = a: (a) unimodal Cauchy, a < 1; (b) flat Cauchy, |a| = 1; (c) bimodal Cauchy, a > 1. Bottom panels: the same case for (d) the Gaussian difference prior and (e) the total variation prior.

We can make similar plots for the Gaussian difference prior and the total variation prior. The Gaussian difference prior is simply a Gaussian-shaped function with its maximum at X_j = 0, hence providing smoothing properties. The total variation prior is constant on [−a, a] and decays exponentially outside this region. Hence, total variation neither punishes discontinuities nor promotes edge-preserving inversion, as it promotes all values between these two points equally. This is quite close to the Cauchy prior with |a| = 1, i.e. in a sense the shape of the density function is similar to the total variation prior density function. The effect on the reconstruction for both α → 0 and α = 2 − ε can be understood with similar arguments. For α → 0, the modes become spikes, while for α = 2 − ε the density is bimodal, and becomes trimodal and finally unimodal for very small ε. The case α = 2 can be considered as a special case.

2.3 Two-dimensional difference priors

Let us consider a two-dimensional lattice (hj, h′j′), where (j, j′) ∈ 𝕀² ⊂ ℤ² and the discretization steps in the two coordinate directions are h, h′ > 0. A Gaussian difference prior can be constructed via difference equations with i.i.d. increments

(2.6)  X_{j,j′} − X_{j−1,j′} ∼ 𝒩(0, σ²h/h′),   X_{j,j′} − X_{j,j′−1} ∼ 𝒩(0, σ²h′/h).

The term σ2>0 is the regularization parameter. The prior is then constructed through products of conditional normal distributions

D_pr(X) ∝ exp(−(Xᵀ L₁ᵀ C₁⁻¹ L₁ X + Xᵀ L₂ᵀ C₂⁻¹ L₂ X)),

where L₁ and L₂ are difference matrices in the two coordinate directions, and C₁ = σ²(h/h′) I and C₂ = σ²(h′/h) I, where I is an identity matrix. These constructions are rather well known, see for example [23]. We emphasize that equation (2.6) is not explicitly solvable, as only the inverse covariance matrix exists. For uniqueness of the least squares solution, we would need to impose e.g. boundary conditions.
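The difference matrices L₁ and L₂ are conveniently assembled from Kronecker products of one-dimensional difference operators. The sketch below is one possible construction (row-major vectorization, zero boundary conditions for uniqueness) with illustrative names; the log-prior follows the displayed quadratic form with C₁ = σ²(h/h′)I and C₂ = σ²(h′/h)I.

```python
import scipy.sparse as sp

def difference_matrices(n1, n2):
    # First-order differences along the two coordinate directions of an
    # n1 x n2 lattice, vectorized in row-major order, with zero boundary
    D1 = sp.diags([1.0, -1.0], [0, -1], shape=(n1, n1))
    D2 = sp.diags([1.0, -1.0], [0, -1], shape=(n2, n2))
    L1 = sp.kron(D1, sp.identity(n2)).tocsr()   # X_{j,j'} - X_{j-1,j'}
    L2 = sp.kron(sp.identity(n1), D2).tocsr()   # X_{j,j'} - X_{j,j'-1}
    return L1, L2

def gaussian_diff_log_prior(X, L1, L2, sigma2, h, hp):
    # Log of the Gaussian difference prior above, up to an additive constant
    d1, d2 = L1 @ X, L2 @ X
    return -(d1 @ d1 / (sigma2 * h / hp) + d2 @ d2 / (sigma2 * hp / h))
```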

Let us consider, with a simple example, why we scale the variances by h/h′ and h′/h. It is enough to consider the first difference equation in (2.6), as the other one is obtained through the same arguments. We consider the grid densification in two steps: (1) denser discretization along the j-direction, which we refer to as the regularization coordinate direction, and (2) denser discretization along the j′-direction. Let X_{j,j′} − X_{j−1,j′} ∼ 𝒩(0, σ²). If we densify the lattice along the regularization coordinate direction j, this corresponds to a one-dimensional Gaussian random walk and hence we need to add the discretization step to the variance, hence X_{j,j′} − X_{j−1,j′} ∼ 𝒩(0, σ²h).

The second coordinate direction is a bit trickier. Let us consider a simplified situation as depicted in Figure 4. Thus, we define the pixelwise means

X_1 = (X_11 + X_12)/2   and   X_2 = (X_21 + X_22)/2.

The reason for dividing by 2 is that this is the standard pixelwise arithmetic mean, i.e. we want to have a natural scaling of the pixels. Now the increment is

X_2 − X_1 = ((X_21 + X_22) − (X_11 + X_12))/2 = ((X_21 − X_11) + (X_22 − X_12))/2.

If we choose X_2 − X_1 ∼ 𝒩(0, σ²), then we must choose X_21 − X_11 ∼ X_22 − X_12 ∼ 𝒩(0, 2σ²). Hence follows the scaling by 1/h′ in equation (2.6).
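This scaling argument can be checked with a quick Monte Carlo simulation: with sub-pixel increments of variance 2σ², the pixel-mean increment indeed has variance σ². A minimal illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n = 1.0, 200_000

# sub-pixel increments with doubled variance, as required when h' is halved
d1 = rng.normal(0.0, np.sqrt(2.0 * sigma2), n)   # X_21 - X_11
d2 = rng.normal(0.0, np.sqrt(2.0 * sigma2), n)   # X_22 - X_12

pixel_mean_increment = 0.5 * (d1 + d2)           # X_2 - X_1
print(pixel_mean_increment.var())                # ~ sigma2 = 1.0
```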

As an alternative, it could be appealing to use the stochastic partial differential equation Δ^{1/2}𝒳 = 𝒲, where Δ is the Laplacian. However, equivalently in the sense of probability distributions, one may consider a system of equations written as

(2.7)  ∂₁𝒳 = 𝒲₁,   ∂₂𝒳 = 𝒲₂,

whose inverse covariance operator in the least squares sense is the Laplacian Δ. We prefer to use the formulation in equation (2.7) instead of Δ^{1/2}𝒳 = 𝒲, as working with the fractional order Laplacian Δ^{1/2} is quite tedious numerically.

Figure 4

Grid densification to the second coordinate direction.

Similarly for the Cauchy differences we may write with i.i.d. increments

(2.8)  X_{j,j′} − X_{j−1,j′} ∼ Cauchy(λh),   X_{j,j′} − X_{j,j′−1} ∼ Cauchy(λh′).

The result is obtained through similar argumentation as in the Gaussian case. If we make the grid denser along the regularization direction, this corresponds again to one-dimensional Cauchy random walk, i.e. we simply scale with discretization step h.

In the second coordinate direction we take again the same division as in Figure 4. If X_2 − X_1 ∼ Cauchy(λ), then X_21 − X_11 ∼ X_22 − X_12 ∼ Cauchy(λ); note that the mean of two independent Cauchy(λ) random variables is again Cauchy(λ), so no additional scaling is needed.

Through similar arguments, we can see that the general Lévy α-stable difference prior has the i.i.d. increments

(2.9)  X_{j,j′} − X_{j−1,j′} ∼ S_α(h′^{(1−α)/α} h^{1/α}, β, 0),   X_{j,j′} − X_{j,j′−1} ∼ S_α(h^{(1−α)/α} h′^{1/α}, β, 0).

This, naturally, implies similar constructions in the higher dimensions.
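With the difference operators L₁ and L₂ sketched earlier, the log-density of the two-dimensional Cauchy prior (2.8) can be evaluated analogously to the one-dimensional case (2.5). A minimal sketch with illustrative names:

```python
import numpy as np

def cauchy_2d_log_prior(X, L1, L2, lam, h, hp):
    # Log of the 2D Cauchy difference prior (2.8), up to an additive constant
    g1, g2 = lam * h, lam * hp    # Cauchy scales of the two difference types
    d1, d2 = L1 @ X, L2 @ X
    return (np.sum(np.log(g1) - np.log(g1**2 + d1**2))
            + np.sum(np.log(g2) - np.log(g2**2 + d2**2)))
```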

Remark 2.1.

The solvability of equations (2.9) is guaranteed when the right-hand sides of (2.9) are suitably conditioned. If we formally denote the (unsolvable) prior defining equations by LX = Z, where the components of Z ∈ ℝᵏ are independent stable random variables, then Z is conditioned to be in the range of L. The solvable equation is then LX = Z|R(L), where the conditioned random variable Z|R(L) ∈ R(L) with probability 1. Let us denote the orthogonal projection onto R(L) by Q. The conditional probability P(Z ∈ A | R(L)) = ∫_A c D_Z(z) dz for a Borel set A ⊂ R(L) can be derived (after a coordinate transform) with the usual limiting procedure

P(Z ∈ A | R(L)) = lim_{k→∞} P(Z ∈ A × B_k) / P(Z ∈ R(L) × B_k)

by considering sets of the type A × B_k on R(L) × R(L)^⊥, where the decreasing hypercubes B_k of positive measure have intersection {0}. Our prior density is then c D_Z(QLx) = c D_Z(Lx), where the additional norming constant c reminds us of the performed conditioning. The additional prior norming constant cancels out in the expression of the posterior distribution. The prior density D_pr(x) is formed by multiplying the probability density functions

D_{S_α(h′^{(1−α)/α} h^{1/α})}(X_{j,j′} − X_{j−1,j′}) · D_{S_α(h^{(1−α)/α} h′^{1/α})}(X_{j,j′} − X_{j,j′−1})

for all j, j′.

Figure 5 Realizations of two-dimensional Cauchy, Gaussian and TV priors exhibiting blocky and smooth structures: (a) realization of a 2D Cauchy prior, (b) realization of a 2D Gaussian prior, (c) realization of a 2D TV prior.

We note that the prior density is not exponentially bounded. Therefore, special care is needed in estimation whenever the direct theory A in (2.2) has a nontrivial null space.

In Figure 5, we have plotted realizations of the two-dimensional Gaussian and Cauchy priors, defined in equations (2.6) and (2.8), respectively. We have also plotted a realization of a total variation prior. We have used zero-boundary conditions, but we have cropped the images in order to demonstrate the blocky or smooth structures of the realizations. A notable feature in the Cauchy prior realization is that it looks blocky, with distinct sharp edges. For drawing the realizations, we use single-component Metropolis–Hastings, which is explained in the next section, i.e. we generate a chain of realizations of the Cauchy prior.

We note that Cauchy priors are not rotation-invariant, and this can be seen clearly from the realization. This can be explained as a packing problem, i.e. in order to have rotation-invariant Cauchy priors, the edges should be spherical. But we cannot efficiently pack balls, hence the box-shaped realizations. In the current construction, the box shapes are restricted to the coordinate directions. This can be changed by making a coordinate transformation and hence using a directed nabla operator, but this would give us more complicated formulas. It would indeed be an interesting idea to use directed derivatives and let the direction be modeled in the hyperprior. Similarly, the TV prior is not rotation-invariant in its standard formulation either, but it can be made rotation-invariant by a reformulation as in [15]. The anisotropic TV prior, in a sense, has coordinate-wise edgy features like the Cauchy prior.

If we consider the edge-preserving features of the two-dimensional Cauchy prior, we can study processes along either of the coordinate directions. These one-dimensional processes look practically similar to the one-dimensional Cauchy random walks. For comparison purposes, we have also plotted realizations of a Gaussian difference prior and a TV prior obtained in a similar manner. As we can see, the Gaussian distribution promotes smoothness, not jumps. The realization of the TV prior is not as blocky as that of the Cauchy prior, but resembles more the Gaussian prior.

3 Single-component Metropolis–Hastings

In order to obtain practical solutions for Bayesian inverse problems, we need to compute point estimates from the posterior distribution. The most commonly used point estimates are the maximum a posteriori (MAP) estimate and the conditional mean (CM) estimate. For the MAP estimation, we use Gauss–Newton-type optimization methods. However, our interest lies mostly in the CM estimate, which is the mean of the posterior density (see equation (2.3)). In higher-dimensional problems, the explicit computation of the CM estimate, i.e. the integration over the posterior density, is rarely possible, and numerical integration of the posterior density typically leads to an infeasible approach. Therefore, CM estimates are usually computed using Markov chain Monte Carlo (MCMC) sampling methods, such as the Metropolis–Hastings (MH) algorithm or the Gibbs sampler (see e.g. [14, 13]). The idea of the MCMC methods is to generate samples {X^(s)} from the posterior distribution and approximate the CM estimate using the ensemble mean. This same algorithm can also be used for drawing realizations from the prior distributions, which was done in Section 2.

In this paper, we apply the proposed Cauchy priors to the one-dimensional deconvolution problem and the X-ray tomography problem. Both of these problems have the specific property that, when evaluating the posterior density, an update of a single component in the parameter vector is computationally very cheap. This is because the forward theory matrix is sparse, and similarly the prior is changed component-wise, i.e. we do not need to compute the full posterior when updating one component, but only the part which is changed. Commonly known algorithms that take advantage of this property are the single-component Gibbs sampler and the single-component Metropolis–Hastings (MH) algorithm; see for example [14].
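This cheap update can be made concrete as follows: when only component i of X changes, the residual m − AX changes only on the rows where column i of A is non-zero. The sketch below assumes a sparse forward matrix in CSC format and noise covariance Σ = σ²I; all names are illustrative.

```python
import numpy as np

def delta_log_likelihood(r, A_csc, i, delta, noise_var):
    # Change in the Gaussian log-likelihood when X_i changes by delta.
    # Only rows where column i of A is non-zero are touched, so the cost
    # is O(nnz of column i) instead of a full matrix-vector product.
    start, end = A_csc.indptr[i], A_csc.indptr[i + 1]
    rows, vals = A_csc.indices[start:end], A_csc.data[start:end]
    r_old = r[rows]
    r_new = r_old - delta * vals      # entries of m - A X affected by the update
    return -0.5 * (np.sum(r_new**2) - np.sum(r_old**2)) / noise_var
```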

The idea of these algorithms is to sample from component-wise conditional distributions. We start with some initial value

X^(0) = (X_1^(0), X_2^(0), …, X_n^(0))

and set s = 0. First, we consider the one-dimensional conditional distribution of X_1 given the data m and all other parameters:

(3.1)  D(X_1 | m, X_2^(0), X_3^(0), …, X_n^(0)) = C_1 D(X_1, X_2^(0), X_3^(0), …, X_n^(0) | m),

where C_1 is a normalization constant. We generate a sample X_1^(1) from this distribution. Then we continue to the next parameter and draw a sample X_2^(1) from the distribution

D(X_2 | m, X_1^(1), X_3^(0), …, X_n^(0)) = C_2 D(X_1^(1), X_2, X_3^(0), …, X_n^(0) | m),

where C_2 is a normalization constant. The procedure is continued until all parameters have been updated one at a time and we have X^(s+1). We increase s → s + 1 and repeat the sample generation N_s times to obtain a set of samples {X^(s)}_{s=1}^{N_s}.

The difference between the Gibbs sampler and the single-component MH lies in the way the samples from the one-dimensional distributions are drawn. In the Gibbs sampler, the samples are drawn directly from the distribution. However, a disadvantage of the Gibbs sampler is that the related one-dimensional conditional distributions are rarely known in such a form that fast sample generation is possible, and numerical (computationally inefficient) sample generation has to be used. See for example [14] for more details.

In the single-component MH, the direct sampling in the Gibbs sampler is replaced with Metropolis–Hastings-type sampling. For example, to draw a sample X_1^(1) from the distribution (3.1), we first draw a candidate sample X̃_1 from a proposal distribution q_1(· | X_1^(0)), which is accepted with probability

α_1 = min(1, [D(X̃_1 | m, X_2^(0), …, X_n^(0)) q_1(X_1^(0) | X̃_1)] / [D(X_1^(0) | m, X_2^(0), …, X_n^(0)) q_1(X̃_1 | X_1^(0))]).

If accepted, we set X1(1)=X~1. Otherwise we set X1(1)=X1(0). The procedure is repeated with the other components.

In principle, the choice of the proposal distribution is arbitrary, but the choice has a large effect on the convergence of the algorithm, and a badly chosen proposal distribution can require very large sample sizes for sufficient convergence; see e.g. [14, 11]. We note that this effect could be mitigated by using proposal distributions based on meshfree sampling algorithms [8], so that the convergence on different meshes is similar. In this paper the proposal distribution is chosen to be Gaussian: q_i(· | X_i^(0)) = 𝒩(X_i^(0), σ²). The variance σ² is chosen such that 25%–50% of the samples are accepted. It is common to ignore a number of samples at the beginning (the burn-in period), due to the fact that it takes time to converge to the stationary distribution D(X|m). In this paper the burn-in period is chosen to be half of the sample size.
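A minimal sketch of the resulting loop with a symmetric Gaussian proposal is given below (the q-ratio then cancels in the acceptance probability). The names are illustrative, and this is not the optimized implementation used for the numerical results; in practice the conditional log-density would be evaluated with the cheap component-wise updates described above.

```python
import numpy as np

def single_component_mh(log_cond, X0, n_samples, prop_std, seed=None):
    # log_cond(X, i): unnormalized log of the conditional posterior of X_i
    # given the data and the other components (constants may be dropped,
    # since they cancel in the acceptance ratio)
    rng = np.random.default_rng(seed)
    X = np.array(X0, dtype=float)
    samples = np.empty((n_samples, X.size))
    for s in range(n_samples):
        for i in range(X.size):
            x_old = X[i]
            lp_old = log_cond(X, i)
            X[i] = x_old + prop_std * rng.standard_normal()  # Gaussian proposal
            lp_new = log_cond(X, i)
            if np.log(rng.random()) >= lp_new - lp_old:
                X[i] = x_old                                 # reject the candidate
        samples[s] = X
    return samples
```

The CM estimate is then the mean of the samples after discarding the burn-in period, e.g. samples[n_samples // 2:].mean(axis=0).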

4 Numerical examples

In order to demonstrate the edge-preserving and discretization-invariance properties of Bayesian inversion with Cauchy difference priors, we consider a one-dimensional deconvolution problem, and a two-dimensional X-ray tomography example.

4.1 One-dimensional deconvolution

In Figure 6, we have a synthetic deconvolution example with Cauchy and total variation priors on different grids. The original unknown is a piecewise constant function, and we observe convolved noisy measurements. The constant convolution kernel is cosine-shaped. To the measurements, we have added white noise with standard deviation σ = 0.3, which corresponds to a 10% relative noise level.
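The setup can be reproduced with a simple forward matrix built from a cosine-shaped kernel. The sketch below is illustrative only; in particular, the kernel half-width, the grid and the heights of the piecewise constant target are assumptions, as the exact values are not stated above.

```python
import numpy as np

def cosine_kernel_matrix(N, width=0.04):
    # Convolution matrix on [0, 1] with a cosine-shaped kernel of assumed half-width
    t = np.linspace(0.0, 1.0, N)
    A = np.zeros((N, N))
    for i in range(N):
        d = np.abs(t - t[i])
        support = d < width
        A[i, support] = np.cos(0.5 * np.pi * d[support] / width)
    return A / A.sum(axis=1, keepdims=True)   # normalize rows so blurring preserves scale

N = 131
A = cosine_kernel_matrix(N)
x_true = np.zeros(N)
x_true[40:70] = 3.0                            # an assumed piecewise constant target
rng = np.random.default_rng(1)
m = A @ x_true + 0.3 * rng.standard_normal(N)  # white noise with sigma = 0.3
```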

Figure 6 The effect of the discretization level on the CM estimate with Cauchy and total variation priors. Top panel: noisy convolved observations (solid line). Middle panels: Cauchy prior CM estimates on different lattices (solid lines). Bottom panels: the same as the middle panels, but with total variation priors. The dashed line in all figures corresponds to the true target and the gray area to the 2σ credibility intervals. (a) Measurements. (b) Cauchy N = 131. (c) Cauchy N = 261. (d) Cauchy N = 521. (e) Total variation N = 131. (f) Total variation N = 261. (g) Total variation N = 521.

We have made the reconstructions with the numbers of unknowns N = 131, 261, 521. The Cauchy regularization parameter λ in equation (2.5) was chosen in such a way that the reconstructions look edge-preserving. Some care needs to be taken in choosing λ, similarly to using any Gaussian prior, in order not to use either too strong or too loose a prior – this demands some tuning, and in future studies this parameter should be modeled as a hyperparameter similarly to e.g. [11], where the Gaussian prior parameters were modeled in the hyperprior. On denser meshes, we then simply change the regularization parameter by scaling with the discretization step h accordingly. The reconstructions on different grids with Cauchy priors clearly provide edge-preserving features. We have also plotted the 2σ credibility intervals, and these also look rather similar for different N, except at the jumps, where the intervals do not seem to converge but rather grow. We suspect that this is because the length of the chain is too short. The posterior is discretization-invariant in the sense of weak convergence of measures. A simple example of the posterior of the random variable X_2 given Y_1 = X_1 + ε and Y_2 = X_3 + ε, when Y_1 = Y_2 = X_0 = 0, shows that (writing x = X_1, z = X_2, v = X_3)

D_post(x, z, v) = c exp(−x²/2 − v²/2) · 1 / ((1 + x²)(1 + (x − z)²)(1 + (z − v)²)).

Thus the posterior variance of X_2 is bounded even though we only have observations of the neighboring points. Similarly we scale the TV prior regularization parameter. We note that the discretization-invariant behavior of the estimators cannot be obtained with total variation priors, as can be seen from the smoothing behavior when we increase the number of unknowns. For further discussion of the total variation prior examples, see Lassas and Siltanen 2004 [25].

In the single-component Metropolis–Hastings, we use a chain of length 1,000,000 in order to guarantee convergence. We need long chains in the single-component algorithm, as we are dealing with multimodal distributions, and the single-component algorithms are known to get stuck in one mode, so that it takes a long run to get to the correct mode. The burn-in period is chosen to be half of the chain length for both the total variation and Cauchy cases. The MCMC chain could be optimized, but as our goal in this paper is simply to demonstrate the Cauchy prior, we postpone the MCMC optimization studies to subsequent papers.

4.2 Two-dimensional X-ray tomography

Now we will apply the developed method to X-ray tomography. We take the Shepp–Logan phantom to be the original unknown. The measurements correspond to a two-dimensional fan-beam geometry (see Figure 7 for the measurement geometry). The distances from the source and the detector to the rotation axis are 4 and 2, respectively. The width of the detector is 3 and the detector contains 200 pixels. We make measurements at 41 different angles (from −10° to 190° with 5° steps). In addition, we add white measurement noise with an approximately 0.1% noise level.

Figure 7 Measurement geometry for the synthetic two-dimensional tomography example.

We compute CM estimates using single-component Metropolis–Hastings as described in Section 3. The chain is initiated from the back-projection estimate (Aᵀm scaled properly). In order to obtain convergence of the MCMC chain, we use a chain length of 5 million, from which 2.5 million samples were discarded as the burn-in period. In order to save memory, we store only every tenth sample for the computation of the ensemble means. Here, the chain length refers to the number of single-component updates per unknown, hence at the highest resolution in the 2D simulations we have 800 × 800 × 5 × 10⁶ = 3.2 × 10¹² single-component samples. However, we do not need to evaluate the full likelihood, as we can simply use the sparsity of the forward theory matrix and the i.i.d. property of the difference priors – this allows realistic computation times. The computation time for a reconstruction with 200×200 resolution is approximately 24 hours and, for a 400×400 resolution, the time is approximately 96 hours. This indicates that the algorithm's computational cost is (nearly) proportional to the number of unknowns, as expected. The computations are carried out on a PC workstation with dual Xeon E5-2680 CPUs and 128 GB of memory. The core parts of the algorithm are implemented using the MEX/C interface.

For comparison, we also compute MAP estimates using the Cauchy prior, which is computationally less expensive. The MAP estimate is computed by minimizing the loss function

L_μ(X) = −log D(X|m) − μ Σ_i log(X_i + ε),

where the additional penalty term is added to ensure non-negativity of the solution and to improve the stability of the optimization. The minimization is carried out by employing Gauss–Newton optimization such that μ and ε are decreased during the iteration. We note that, since the problem is a non-convex optimization problem, the Gauss–Newton iteration has a tendency to converge to a (suboptimal) local minimum, which also seems to be the case here (see below).
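A simplified version of this penalized objective can be written down directly. In the sketch below, a generic quasi-Newton optimizer (L-BFGS-B with a non-negativity bound) stands in for the Gauss–Newton iteration, and the continuation schedule over μ and ε is an assumption made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def map_loss(X, neg_log_posterior, mu, eps):
    # L_mu(X) = -log D(X|m) - mu * sum_i log(X_i + eps)
    return neg_log_posterior(X) - mu * np.sum(np.log(X + eps))

def map_estimate(neg_log_posterior, X0, mus=(1e-1, 1e-2, 1e-3), eps0=1e-2):
    # Crude continuation: decrease mu and eps, warm-starting each solve
    X = np.array(X0, dtype=float)
    bounds = [(0.0, None)] * X.size       # keep log(X + eps) well defined
    for k, mu in enumerate(mus):
        eps = eps0 * 10.0 ** (-k)
        res = minimize(map_loss, X, args=(neg_log_posterior, mu, eps),
                       method="L-BFGS-B", bounds=bounds)
        X = res.x
    return X
```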

Figure 8 Comparison of different estimates on a lattice of 400×400 pixels: (a) unknown: Shepp–Logan phantom, (b) filtered back-projection estimate, (c) MAP estimate with Cauchy prior, (d) CM estimate with Cauchy prior, (e) CM estimate with TV prior, (f) CM estimate with Gaussian prior.

In Figure 8, we have plotted the original unknown and reconstructions with five different methods. The filtered back-projection estimate recovers most of the features, but it has severe artifacts. The CM estimate with the Cauchy prior captures the features well without significant artifacts. The CM estimate with the TV prior also captures the features well, but it is not as sharp as the Cauchy prior estimates. Also, the Cauchy estimate seems to be more robust when we decrease the number of measurements or increase the noise. The estimate with the Gaussian prior recovers the main structures, but it has both severe artifacts and overly smooth structures.

Figure 8 also shows the MAP estimate computed using the Cauchy prior. This estimate also captures the features well, while the CM estimate has fewer artifacts in the reconstructions. However, as noted above, the Gauss–Newton iteration has a tendency to converge to (suboptimal) local minima, and that also seems to be the case here: starting the iteration from the CM estimate gives a MAP estimate with a lower cost L_0 (i.e. higher posterior probability), and this estimate is visually indistinguishable from the CM estimate. Of course, starting from the CM estimate is not a practically very meaningful approach. However, finding the global minimum of a non-convex cost function is commonly known to be a difficult task and, since our main motivation is in MCMC estimates, finding optimal MAP estimates is left to future studies.

Figure 9 Conditional mean estimates of the X-ray tomography problem on three different lattices: (a) 200×200, (b) 400×400, (c) 800×800.

Figure 10 Cross-sections of the conditional mean estimates from the middle of the computational domain: (a) horizontal, (b) horizontal zoom-in, (c) vertical, (d) vertical zoom-in.

In Figures 9 and 10, we show CM estimates on lattices of different sizes. The discretization has been taken into account as described earlier in this section, and hence the reconstructions look essentially the same.

5 Conclusion and discussion

Total variation priors have traditionally been common choices as non-Gaussian priors for edge-preserving Bayesian inversion. However, as demonstrated in this paper, and e.g. in [25], total variation priors do not promote discretization-invariant inversion. Previously, Besov-space priors have been proposed to tackle this problem. The Cauchy priors, and more generally Lévy α-stable priors, provide a class of non-Gaussian priors for discretization-invariant edge-preserving Bayesian inversion.

The attractiveness of the Cauchy priors comes from the fact that we can relate the priors to Cauchy walks, which have clear and well-established statistical properties. We constructed Cauchy priors with the help of independent identically distributed increments, which gives us an intuitive way for understanding the edge-preserving properties via the conditional distributions.

For numerical verification, we applied the developed Cauchy priors to one-dimensional deconvolution problems and two-dimensional X-ray tomography, in which we computed conditional mean estimates with the single-component Metropolis–Hastings algorithm (maximum a posteriori estimates were also included in the X-ray tomography). When compared to Bayesian inversion with a total variation prior, the proposed methodology promotes edge-preserving Bayesian inversion and gives good reconstructions for the tomography problem also when we have a low number of measurements, when the measurement noise is increased, or when the computational mesh is very dense.

In future studies, we should consider the discretization-invariance with mathematical rigor. Also, an exhaustive numerical study of Lévy α-stable random walks should be made. While the Gaussian random walk with α = 2 is obviously a special case and the high end of stable distributions, the Cauchy random variable with α = 1 is in the middle of the range 0 < α ≤ 2. For 0 < α < 1 we probably have even more constant behavior between jumps than we have for the Cauchy prior. Similarly, for the cases 1 < α < 2 between Cauchy and Gaussian, we may have less constant behavior, but the smoothing behavior apparently occurs only in the special case α = 2. Also, we could try more general prior models through stochastic partial differential equations of the form 𝒫𝒳 = 𝒲, where 𝒫 is some linear differential operator, 𝒳 is the unknown and 𝒲 is Lévy noise. This is similar to, e.g., the Matérn field, i.e. a Gaussian Markov random field construction, except that we would have general Lévy noise instead of the special case of α = 2 Gaussian noise. Also, replacing the Cauchy noise with a Student's t distribution would be an interesting development, as Cauchy is a special case of Student's t. Further development of the MCMC sampling schemes can also be done by using parallelization and a blocked Metropolis-within-Gibbs algorithm. To speed up MCMC convergence, multimodal proposal distributions should also be studied. These will also help to enhance our single-component samplers for multimodal distributions, as sometimes we get stuck in one mode and cannot get out of it.

Award Identifier / Grant number: EP/K034154/1

Award Identifier / Grant number: 250215

Funding statement: This work has been funded by Engineering and Physical Sciences Research Council, United Kingdom (EPSRC Reference: EP/K034154/1 – Enabling Quantification of Uncertainty for Large-Scale Inverse Problems) and Academy of Finland (application numbers 250215, 307741 and 313709).

References

[1] J. M. Bardsley, D. Calvetti and E. Somersalo, Hierarchical regularization for edge-preserving reconstruction of PET images, Inverse Problems 26 (2010), no. 3, Article ID 035010. DOI: 10.1088/0266-5611/26/3/035010.

[2] D. Bolin, Spatial Matérn fields driven by non-Gaussian noise, Scand. J. Stat. 41 (2014), no. 3, 557–579. DOI: 10.1111/sjos.12046.

[3] D. Calvetti and E. Somersalo, A Gaussian hypermodel to recover blocky objects, Inverse Problems 23 (2007), no. 2, 733–754. DOI: 10.1088/0266-5611/23/2/016.

[4] D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, Surv. Tutor. Appl. Math. Sci. 2, Springer, New York, 2007.

[5] D. Calvetti and E. Somersalo, Hypermodels in the Bayesian imaging framework, Inverse Problems 24 (2008), no. 3, Article ID 034013. DOI: 10.1088/0266-5611/24/3/034013.

[6] A. M. Cormack, Representation of a function by its line integrals, with some radiological applications. I, J. Appl. Phys. 34 (1963), Article ID 2722. DOI: 10.1063/1.1729798.

[7] A. M. Cormack, Representation of a function by its line integrals, with some radiological applications. II, J. Appl. Phys. 35 (1964), 195–207. DOI: 10.1063/1.1713127.

[8] S. L. Cotter, G. O. Roberts, A. M. Stuart and D. White, MCMC methods for functions: modifying old algorithms to make them faster, Statist. Sci. 28 (2013), no. 3, 424–446. DOI: 10.1214/13-STS421.

[9] A. C. Damianou and N. D. Lawrence, Deep Gaussian processes, Proceedings of the 16th International Conference on Artificial Intelligence and Statistics—AISTATS, PMLR, Scottsdale (2013), 207–215.

[10] M. E. Davison, The ill-conditioned nature of the limited angle tomography problem, SIAM J. Appl. Math. 43 (1983), no. 2, 428–448. DOI: 10.1137/0143028.

[11] M. M. Dunlop, M. A. Iglesias and A. M. Stuart, Hierarchical Bayesian level set inversion, Stat. Comput. 27 (2017), no. 6, 1555–1584. DOI: 10.1007/s11222-016-9704-8.

[12] W. Feller, An Introduction to Probability Theory and its Applications. Vol. II, John Wiley & Sons, New York, 1966.

[13] A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, 2nd ed., Texts Statist. Sci. Ser., Chapman & Hall/CRC, Boca Raton, 2004. DOI: 10.1201/9780429258480.

[14] W. R. Gilks, S. Richardson and D. J. Spiegelhalter, Markov Chain Monte Carlo in Practice, Interdiscip. Statist., Chapman & Hall, London, 1996. DOI: 10.1201/b14835.

[15] G. González, V. Kolehmainen and A. Seppänen, Isotropic and anisotropic total variation regularization in electrical impedance tomography, Comput. Math. Appl. 74 (2017), 564–576. DOI: 10.1016/j.camwa.2017.05.004.

[16] T. Helin and M. Lassas, Hierarchical models in statistical inverse problems and the Mumford–Shah functional, Inverse Problems 27 (2011), no. 1, Article ID 015008. DOI: 10.1088/0266-5611/27/1/015008.

[17] B. Hosseini, Well-posed Bayesian inverse problems with infinitely divisible and heavy-tailed prior measures, SIAM/ASA J. Uncertain. Quantif. 5 (2017), no. 1, 1024–1060. DOI: 10.1137/16M1096372.

[18] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, Appl. Math. Sci. 160, Springer, New York, 2005. DOI: 10.1007/b138659.

[19] O. Kallenberg, Foundations of Modern Probability, 2nd ed., Probab. Appl. (N. Y.), Springer, New York, 2002. DOI: 10.1007/978-1-4757-4015-8.

[20] E. Klann, R. Ramlau and W. Ring, A Mumford–Shah level-set approach for the inversion and segmentation of SPECT/CT data, Inverse Probl. Imaging 5 (2011), no. 1, 137–166. DOI: 10.3934/ipi.2011.5.137.

[21] S. Lasanen, Non-Gaussian statistical inverse problems. Part I: Posterior distributions, Inverse Probl. Imaging 6 (2012), no. 2, 215–266. DOI: 10.3934/ipi.2012.6.215.

[22] S. Lasanen, Non-Gaussian statistical inverse problems. Part II: Posterior convergence for approximated unknowns, Inverse Probl. Imaging 6 (2012), no. 2, 267–287. DOI: 10.3934/ipi.2012.6.267.

[23] S. Lasanen and L. Roininen, Statistical inversion with Green's priors, 5th International Conference on Inverse Problems in Engineering: Theory and Practice, Taylor & Francis, London (2005), 3–32.

[24] M. Lassas, E. Saksman and S. Siltanen, Discretization-invariant Bayesian inversion and Besov space priors, Inverse Probl. Imaging 3 (2009), no. 1, 87–122. DOI: 10.3934/ipi.2009.3.87.

[25] M. Lassas and S. Siltanen, Can one use total variation prior for edge-preserving Bayesian inversion?, Inverse Problems 20 (2004), no. 5, 1537–1563. DOI: 10.1088/0266-5611/20/5/013.

[26] F. Lindgren, H. Rue and J. Lindström, An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach, J. R. Stat. Soc. Ser. B Stat. Methodol. 73 (2011), no. 4, 423–498. DOI: 10.1111/j.1467-9868.2011.00777.x.

[27] A. K. Louis, Incomplete data problems in x-ray computerized tomography. I. Singular value decomposition of the limited angle transform, Numer. Math. 48 (1986), no. 3, 251–262. DOI: 10.1007/BF01389474.

[28] D. Mumford and B. Gidas, Stochastic models for generic images, Quart. Appl. Math. 59 (2001), no. 1, 85–111. DOI: 10.1090/qam/1811096.

[29] F. Natterer, The Mathematics of Computerized Tomography, B. G. Teubner, Stuttgart, 1986. DOI: 10.1007/978-3-663-01409-6.

[30] E. Niemi, M. Lassas, A. Kallonen, L. Harhanen, K. Hämäläinen and S. Siltanen, Dynamic multi-source X-ray tomography using a spacetime level set method, J. Comput. Phys. 291 (2015), 218–237. DOI: 10.1016/j.jcp.2015.03.016.

[31] C. J. Paciorek and M. J. Schervish, Spatial modelling using a new class of nonstationary covariance functions, Environmetrics 17 (2006), no. 5, 483–506. DOI: 10.1002/env.785.

[32] E. T. Quinto, Singularities of the X-ray transform and limited data tomography in ℝ² and ℝ³, SIAM J. Math. Anal. 24 (1993), no. 5, 1215–1225. DOI: 10.1137/0524069.

[33] L. Roininen, M. Girolami, S. Lasanen and M. Markkanen, Hyperpriors for Matérn fields with applications in Bayesian inversion, Inverse Probl. Imaging 13 (2019), no. 1, 1–29. DOI: 10.3934/ipi.2019001.

[34] L. Roininen, J. M. J. Huttunen and S. Lasanen, Whittle-Matérn priors for Bayesian statistical inversion with applications in electrical impedance tomography, Inverse Probl. Imaging 8 (2014), no. 2, 561–586. DOI: 10.3934/ipi.2014.8.561.

[35] L. Roininen, M. S. Lehtinen, S. Lasanen, M. Orispää and M. Markkanen, Correlation priors, Inverse Probl. Imaging 5 (2011), no. 1, 167–184. DOI: 10.3934/ipi.2011.5.167.

[36] L. Roininen, P. Piiroinen and M. Lehtinen, Constructing continuous stationary covariances as limits of the second-order stochastic difference equations, Inverse Probl. Imaging 7 (2013), no. 2, 611–647. DOI: 10.3934/ipi.2013.7.611.

[37] G. Samorodnitsky and M. S. Taqqu, Stable Non-Gaussian Random Processes, Stoch. Model., Chapman & Hall, New York, 1994.

[38] L. A. Shepp and J. B. Kruskal, Computerized Tomography: The New Medical X-Ray Technology, Amer. Math. Monthly 85 (1978), no. 6, 420–439. DOI: 10.1080/00029890.1978.11994611.

[39] S. Siltanen, V. Kolehmainen, S. Järvenpää, J. P. Kaipio, P. Koistinen, M. Lassas, J. Pirttilä and E. Somersalo, Statistical inversion for medical x-ray tomography with few radiographs: I. General theory, Phys. Med. Biol. 48 (2003), 1437–1463. DOI: 10.1088/0031-9155/48/10/314.

[40] T. J. Sullivan, Well-posed Bayesian inverse problems and heavy-tailed stable quasi-Banach space priors, Inverse Probl. Imaging 11 (2017), no. 5, 857–874. DOI: 10.3934/ipi.2017040.

[41] S. Vänskä, M. Lassas and S. Siltanen, Statistical X-ray tomography using empirical Besov priors, Int. J. Tomogr. Stat. 11 (2009), no. S09, 3–32.

Received: 2017-05-05
Revised: 2018-05-19
Accepted: 2018-12-25
Published Online: 2019-01-30
Published in Print: 2019-04-01

© 2019 Walter de Gruyter GmbH, Berlin/Boston

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
