Home Medicine Robust background normalization method for one-channel microarrays
Article Publicly Available

Robust background normalization method for one-channel microarrays

  • Tülay Akal , Vilda Purutçuoğlu EMAIL logo and Gerhard-Wilhelm Weber
Published/Copyright: January 20, 2017

Abstract

Background

Microarray technology, aims to measure the amount of changes in transcripted messages for each gene by RNA via quantifying the colour intensity on the arrays. But due to the different experimental conditions, these measurements can include both systematic and random erroneous signals. For this reason, we present a novel gene expression index, called multi-RGX (Multiple-probe Robust Gene Expression Index) for one-channel microarrays.

Methods

Multi-RGX, different from other gene expression indices, considers the long-tailed symmetric (LTS) density, covering a wider range of distributions for modelling gene expressions on the log-scale, resulting in robust inference and it takes into account both probe and gene specific intensities. Furthermore, we derive the covariance-variance matrix of model parameters from the observed Fisher information matrix and test the performance of the multi-RGX method in three different datasets.

Results

Our method is found to be a promising method regarding its alternatives in terms of accuracy and computational time.

Conclusion

Multi-RGX gives accurate results with respect to its alternatives, with a reduction in computational cost.

Özet

Genel bilgi

Hesaplamalı biyoloji, biyoteknoloji, biyoenformatik ve tıp bilimlerindeki yeni ve gelişmiş araçlardan biri olan mikrodizin teknolojisi, RNA tarafından her bir gen için transkrip edilen mesajdaki değişim miktarını, dizinlerdeki renk yoğunluğunu bularak ölçmeyi amaçlamaktadır. Fakat farklı deneysel koşullardan dolayı, bu ölçümler hem sistematik hem de rassal hatalı sinyalleri de içerebilir. Bu amaçla, çoklu-RGX (Çoklu Prob-Sağlam Gen İfade İndeksi) adlı, tek-koşullu dizinler için geliştirilmiş, yeni bir gen ifade indeksi sunmaktayız.

Metod

Diğer indekslerden farklı olarak çoklu-RGX, logaritmik ölçeklerde gen ifadelerinin modellenmesinde daha geniş bir yelpazeyi kapsayan uzun kuyruklu simetrik (LTS) dağılımlı ifadeleri kullanarak dirençli çıkarım olanağı sunmakta ve ölçümlerin tanımlanmasında her bir gen ve her bir probdaki tüm gen ifadelerini göz önüne almaktadır. Ayrıca gözlemlenen Fisher bilgi matrisi yardımıyla, model parametrelerinin varyans-kovaryans matrisini çıkarmakta ve çoklu-RGX’in performansını ölçmek amacıyla, 59 dizin, 10.864 prob çifti ve 11 gen içeren Affymetrix bencmark ve iki tane simüle edilmiş olmak üzere üç farklı veri kümesi kullanmaktayız.

Bulgular

Metodumuz doğruluk ve hesaplama zamanı açısından alternatiflerine göre güvenilir bulunmuştur.

Sonuç

Çoklu-RGX alternatifleriyle karşılaştırıldığında hesaplama zamanını kısaltmakla beraber doğru sonuçlar da vermektedir.

Introduction

A microarray, also known as the DNA chip, gene chip or biochip, consists of thousands of DNA sequences, called probes, which are attached to a solid surface to measure the gene expressions. By collecting the amount of the mRNA transcripts measured, an approximation to the gene expression level, are turned into pictures [1].

The oligonucleotide is a type of single channel microarrays with a 25-base pair long and the Affymetrix GeneChip is an example of the popular oligonucleotides. In this array, we observe probes that are the known segments of particular gene sequences. Each probe contains two parts, namely, the perfect match (PM) and the mismatch (MM). The former stands for the perfect transcription of cRNA and the latter is aimed to measure the faulty signals on the arrays by changing the 13th base pair of PM.

In a microarray study, we can define three main sources of variations in signals, namely, the nonspecific hybridization, the background signal and the stray signal. They are also shortly called as the background errors. Accordingly, we can call these three types of variations as the background errors [2, 3].

Thereby, the term gene expression index stands for the mathematical method to estimate the true expression level in the oligonucleotides under such errorenous signals. There are a number of indices developed to infer the signals. In this study, we propose a novel gene expression index, called multiple-probe robust gene expression index, multi-RGX, for modelling the observed gene expressions in oligonucleotides.

Among many choices, one of the most common gene expression indices is the MBEI method [4], which cannot make inference for large numbers of arrays as the calculation is based on the least square approach. On the other hand, MAS5.0 [5], suggesting a smart solution to solve the less estimated intensities in PM regarding the estimated MM, may cause bias in inference of the true signal. Similarly, RMA [6] also fails under this condition. Similar to RMA, mgMOS [7] also accepts that the MM probe intensities are treated as only coming from the background signal. But different from RMA, this model makes use of latent variables, standing for different binding affinities of probes within a specified probe set in order to describe the correlation between the PM and MM intensities.

On the other side, GC-RMA [8] is the extended version of RMA by considering the GC content in the data and by adding, the first time, the presence of true signals in the MM probes. On the contrary, in practice, it equates the fraction of the true signal in MM to zero in order to simplify the calculation. But as the computation in the Bayesian framework is demanding, multi-mgMOS [9] considers performing it via a maximum likelihood approximation under the same model of BGX [10] so that the computational cost can reduce. However, being more efficient than BGX, the calculation is still time demanding for large datasets due to the application of the Bayesian method. Accordingly, FGX [2] is proposed to decrease the underlying cost by fully using the maximum likelihood approach in inference of model parameters. But, it is based on a strict normality assumption and the estimates are found for the probe mean of each gene.

Another recent method is called the Hook calibration method [11] which is a singlechip calibration alternative taking mismatched probes as the internal reference. Also, RMX [12] tries to find optimal robust estimators for multiple raw signals by making use of the Kolmogorov distance which is non-parametric. On the other hand, Lahti et al. [13] propose an online learning algorithm, called RPA which gives advantage to update probe-level parameters when the observations comes up. However, in modeling, it still uses the RMA method resulting in the same drawback in the description of the MM probes.

Hereby, in this study, we suggest a new gene expression index, which includes both gene and probe specific intensities, alike FGX and RGX, to model true signals in oligonucleotides. In this method, we again use the long-tailed symmetric family (LTS) like RGX in order to keep the robustness. The robust estimators are less sensitive to outliers in the data and tolerate departures from target distributions compared to RGX. In inference of the model parameters, we implement the modified maximum likelihood method, and get explicit and close form for each estimator. Moreover, we derive the variances and covariances of model parameters via the observed Fisher information matrix. From the analyses by different datasets, we conclude that the multi-RGX method is promising in terms of accuracy and computational demand and can be a helpful tool for the application in computational biology, bioinformatics and biomolecular engineering.

Materials and methods

Multiple-probe robust gene expression index: multi-RGX

As described in the previous section, the major aim of our study is to suggest a new gene expression index. Moreover, we aim that the new index can estimate the true signals more accurately, since we model not only gene specific intensities, but also both gene and probe intensities. Finally, multi-RGX covers a wider range of distributions for the signals, hence, does not ignore any information in the data. Accordingly, it suggests a robust method to model the true signals.

Hereby, in this novel model, called as multi-RGX (multiple-probe robust gene expression index), we assume that the intensities of the PM and MM probes on the logarithmic scale have the density from the LTS, similar to the idea of RGX. On the other hand, different from this, it does not follow an iterative process in the optimization of the likelihood function, rather, it works with the explicit estimates of the desired parameters.

In a standard Affymetrix chip, accepting that there are m probes, and in each probe, there exist n genes coming from the PM and MM probes, it is assumed that both PM and MM have true signals in such a way that PM covers the complete true signals, whereas, MM includes only a fraction of them. To define the amount of the underlying true signals in MM, a fraction term p is defined. Hereby, the intensities coming from the PM probes are distributed as LTS with a mean containing a constant term and another term, which stands for the true signal changing per gene. Additionally, its variance is constant over genes and probes. Indeed, this assumption may be considered strong in modelling. However, from the application of this assumption in FGX and RGX indices, it has been found that it is competitive in terms of the accuracies of the signals with respect to the gene specific variance as BGX implements. Moreover, from alternative modellings via the gene specific variance, it has been observed that none of the estimators can have explicit solutions under a gene specific variance [14], and the estimates can be obtained via optimization steps which cause an increase in the computational time. On the other hand, different from PM, MM has a mean containing the same constant term and the part of the true signal. Hence, in modelling of the PM and MM intensities for each probe j (j=1, …, m) and each gene i (i=1, …, n), we consider the following distributional assumption on the log-scale:

(1)PMijLTS(Si+μH,σ2) and MMijLTS(pSi+μH,σ2),

where Si shows the true signal for the gene i and μH denotes the constant background signal. Furthermore, σ2 indicates the constant variance and p stands for the fraction value within the range of zero to one that implies the ratio of the true signal in the MM probes. Thereby, the corresponding likelihood of the model in Equation (1) is found via

L(Si,μH,p|a,b)=i=1nj=1mf(PMij)f(MMij),L=i=1nj=1m12πσ2e(PMijSiμH)22σ212πσ2e(MMijpSiμH)22σ2,

which is proportional to

(2)L(1σ)i=1nj=1m(1+z PMij2k)v(1σ)i=1nj=1m(1+z MMij2k)v,

where v≥4 and k=2v−3.

Then, the first derivatives of the function lnL are taken with respect to each parameter as stated below:

(3)lnLμH=2vkσi=1nj=1m(PMijSiμH)σ(PMijSiμH)2kσ2+2vkσi=1nj=1m(MMijpSiμH)σ(MMijpSiμH)2kσ2,
(4)lnLp=2vkσi=1nj=1mSi(MMijpSiμH)σ(MMijpSiμH)2kσ2,
(5)lnLSi=2vkσj=1m(PMijSiμH)σ(PMijSiμH)2kσ2+2vkσj=1mp(MMijpSiμH)σ(MMijpSiμH)2kσ2,
(6)lnLσ=i=1nj=1m[1σ+(PMijSiμH)σ3(1+(PMijSiμH)2kσ2)2vk]+i=1nj=1m[1σ+(MMijpSiμH)σ3(1+(MMijpSiμH)2kσ2)2vk].

But it is observed that the underlying partial derivatives cannot produce unique solutions due to the repeated nonlinear functions in their formulas. Therefore, we suggest to use the modified maximum likelihood approach (MMLE) [15] that is based on approximating the nonlinear functions via the first order Taylor expansions around the ith population quantile t(i) of the student-t distribution with the (2v−1) degrees of freedom. In the calculation, we apply the order statistics and define the partial derivations of the objective log-likelihood function in Equation (2) in terms of small linear sub-pieces. Moreover, it is found that the estimators based on MMLE are asymptotically efficient and equivalent to MLE [15, 16].

In this study, we observe that the nonlinearity in the partial derivatives is caused by the following expressions such that

z PMij=(PMijSiμH)σ and zMMij=(MMijpSiμH)σ.

Then, by approximating the common nonlinear functions g(z)=z1+z2k for PM as

g(z PMij)=(PMijSiμH)σ1+(PMijSiμH)2kσ2

and for MM as

g(z MMij)=(MMijpSiμH)σ1+(MMijpSiμH)2kσ2;

their first-order Taylor expansions can be derived by

g(z PMi(j))=αj+βjzPMi(j) and g(zMMi(j))=αj+βjzMMi(j),

in which

αj=2tj3k(1+tj2k) and βj=(1tj2k)(1+tj2k)2.

In these expressions, zPMi(j) and zMMi(j) show the ordered probes in increasing magnitude for each gene i in the PM and MM standardized intensities, respectively. Accordingly, g(zPMi(j)) and g(zMMi(j)) are their associated linearized functions via the Taylor series. Thereby, the partial derivatives of MMLE are as follows:

(7)lnLμH=2vkσi=1nj=1mg(zPMi(j))+2vkσi=1nj=1mg(zMMi(j)),
(8)lnLp=2vkσi=1nj=1mSig(zMMi(j)),
(9)lnLSi=2vkσj=1mg(zPMi(j))+2vpkσj=1mg(zMMi(j)),
(10)lnLσ=2nmσ+2vkσj=1mg(zPMi(j))zPMi(j)+2vkσj=1mg(zMMi(j))zMMi(j).

Finally, by setting Equations (7)–(10) to zero, we get the optimal MML estimates of parameters.

As a result, μ^H is found as:

(11)μ^H=i=1nj=1mβj(PMi(j)+MMi(j))(p+1)i=1nj=1mβjSi2ni=1nj=1mβj.

On the other side, from Equation (9), we obtain the estimate of the true signal for each gene i as:

(12)S^i=j=1mβjPMi(j)+pj=1mβjMMi(j)(p+1)μHj=1mβj(p2+1)j=1mβj.     

However, by substituting Equation (12) into Equation (11), we can define μ^H in an alternative way as follows:

(14)μ^H=pi=1nj=1mβjPM(ij)i=1nj=1mβjMM(ij)n(p1)j=1mβj=pj=1mβjPM¯.jj=1mβjMM¯.j(p1)j=1mβj.

where PM¯.j=i=1nPMi(j)n and MM¯.j=i=1nMMi(j)n.

The estimate of the common fraction term p is computed by four roots and the optimal one which maximizes the log-likelihoods is found as the following equation:

p^=(SSMM SSPM )+(SSPM SSMM )2+4SSPM,MM 22SSPM,MM ,

where

SSPM =i=1nj=1mβj[j=1mβjPMi(j)j=1mβjPM ¯.j]2,SSMM =i=1nj=1mβj[j=1mβjMMi(j)j=1mβjMM ¯.j]2,SSPM,MM =i=1nj=1mβj[(j=1mβjPMi(j)j=1mβjPM ¯.j)×(j=1mβjMMi(j)j=1mβjMM ¯.j)]

Finally, from Equation (10), we can get MMLE of σ by:

σ^=B+B2+4nmCnm,

Adjusting the degree of freedom as 2nm−(n+2):

σ^=B+B2+4nmCnmn2,

in which

B=vk[i=1nj=1mαj(PMi(j)MMi(j))],C=vk[i=1nj=1mβj(PMi(j)S^iμ^H)2+i=1nj=1mβj(MMi(j)p^S^iμ^H)2].

Observed Fisher information matrix and estimators for variances and covariances

The variance of the asymptotic maximum likelihood estimator can be found from the Fisher information matrix [17] and as stated previously that MMLE is asymptotically equivalent to it, resulting in the same feature of estimators.

Accordingly, the Fisher information matrix I is generated from the second partial derivatives of the loglikelihood function with respect to each parameter as stated below:

I11=2vkσ2i=1nj=1m[1(PMi(j)SiμH)2kσ2(1+(PMi(j)SiμH)2kσ2)2+1(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)2],I22=2vkσ2i=1nj=1mSi21(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)2,Iii=2vkσ2j=1m1(PMi(j)SiμH)2kσ2(1+(PMi(j)SiμH)2kσ2)2+2vp2kσ2j=1m1(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)2,I12=I21=2lμHp=2vkσ2i=1nj=1mSi1(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)2,I1i=Ii1=2vkσ2j=1m1(PMi(j)SiμH)2kσ2(1+(PMi(j)SiμH)2kσ2)2+2vpkσ2j=1m1(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)2,I2i=Ii2=2vkσj=1m(MMi(j)pSiμH)σ(1+(PMi(j)SiμH)2kσ2)2+2vpkσ2j=1mSi1(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)2,Iik=Iki=2lSiSk=0,

where i, k=1, …, n, present the gene and j=1, …, m, shows the probe indicator, respectively. Moreover, l stands for the loglikelihood function lnL. Then, the I matrix is derived via

I=[2lμH22lμHp2lμHS12lμHSn2lpμH2lp22lpS12lpSn2lS1μH2lS1p2lS1202lSnμH2lSnp02lSn2].

In order to find the variance-covariance matrix, we partition the above matrix into the four sub-matrices. The details of this partition can be found in [18] and its application can be seen in [2, 3]. Once I is reorganized with respect to its sub matrices, the variances of the model parameters can be derived from simultaneous solutions of partial derivatives from the reorganized I. For an illustration, we show the variance of μ^H as follows:

V(μ^H)=1C0[2vkσ2i=1nj=1mSi21(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)2i=1n2vkσj=1m(MMi(j)pSiμH)σ(1+(PMi(j)SiμH)2kσ2)2+2vpkσ2j=1mSi1(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)22vkσ2j=1m1(PMi(j)SiμH)2kσ2(1+(PMi(j)SiμH)2kσ2)2+2vp2kσ2i=1n1(MMi(j)pSiμH)2kσ2(1+(MMi(j)pSiμH)2kσ2)2],

in which

C0=(2lμH2i=1n(2l/SiμH)22l/Si2)×(2lp2i=1n(2l/Sip)22l/Si2)(2lpμHi=1n(2l/SiμH)(2l/Sip)2l/Si2).

Results

Application via a real dataset

In the analysis, we use a benchmark Affymetrix spike-in dataset. This dataset consists of 11 spike-in genes coming from four bacterial ancestor genes and 59 arrays with 10,864 probe pairs.

For the analysis, the 16 spike-in probe pairs are taken. Here, the gene expression values of the individual cRNA fragments, which are hybridized to U95A GeneChip arrays at the same concentration are analyzed under 14 different concentration levels [0.0, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, and 1024.0 pM (picoMolar)]. Accordingly, every gene is described by 14 probes in each array, resulting in 224 observations per array.

Finally, for the analysis, we use the R programme language with our original codes for all calculations.

Hereby, in the estimation of the model parameters for each array, we check the possible shape parameters for the LTS distribution from 2 to 40 with the jump size 0.05. In the computation, we find that most of the array can be optimally defined under the shape parameter 40 or close to 40, which practically indicates the normal distribution.

As presented in Figure 1A, while the level of concentrations increases, a nonlinear pattern is seen in the structure of the estimated signals, like other methods. Figure 1B shows the associated graph via all other methods.

Figure 1: Average estimated spike-in signal versus concentrations on the nominal logarithmic scale (log2) for Affymetrix data via (A) multi-RGX and (B) MAS5.0, MBEI, RMA, mgMOS, GC-RMA, multi-mgMOS, FGX and RGX methods.
Figure 1:

Average estimated spike-in signal versus concentrations on the nominal logarithmic scale (log2) for Affymetrix data via (A) multi-RGX and (B) MAS5.0, MBEI, RMA, mgMOS, GC-RMA, multi-mgMOS, FGX and RGX methods.

Furthermore, the average estimated signals via multi-RGX versus nominal log-concentrations are plotted for each gene and the graph is reported in Figure 2.

Figure 2: Signal versus concentrations on log-scale for Affymetrix data.
Figure 2:

Signal versus concentrations on log-scale for Affymetrix data.

Then to evaluate the performance in accuracy for each level of concentrations, we regress the spike-in genes which have the same concentration according to the distinct log-concentrations and apply the simple linear regression method. The computed R2 is called as the signal detect R2 and its associated slope term is named as the signal detect slope. Then, we compute the R2 values and slope terms for each gene separately and later calculate the mean of both values individually. We call this measure as simply R2. Hereby as seen in Table 1, the signal detect R2, signal detect slope and R2 are found as 0.79, 1.04 and 0.92, respectively. From the results, it is seen that R2 is comparable with respect to its alternatives. Yet, under all concentrations simultaneously, the array indicates relatively more linear relation between estimated signals and concentrations on the nominal log-scale on the average, similar to multi-mgMOS, which has one of the perfection results for this type of modelling. But different from multi-mgMOS, our novel index is fully frequestist, i.e. based on the maximum likelihood approach. On the other hand, it indicates a stronger linearity between the relation of signals and concentrations with respect to the FGX and RGX outputs. This can be seen another advantage of multi-RGX regarding other choices in the analyses.

Table 1:

Signal detect R2, signal detect slope and average R2, of all methods for Affymetrix data.

MethodSignal detect R2Signal detect slopeR2
Perfection value1.001.001.00
MAS5.00.860.710.89
RMA0.800.630.99
MBEI0.850.530.99
GC-RMA0.840.970.99
mgMOS0.820.760.96
Multi-mgMOS0.801.030.96
FGX0.940.430.90
RGX0.960.440.92
Multi-RGX0.791.040.92

Besides, in order to assess the computational demand for alternative approaches, we compare the time for calculation via multi-RGX with other records in Table 2. It is seen that multi-RGX is significantly faster than BGX and multi-mgMOS. But with respect to the time for FGX, it is relatively slower. The reason is that we need to compute the optimal concomitant of probe sets, which converge to the true order in two or three iterations and this calculation causes extra time that is observed in tabulated values. Moreover, the calculation of FGX is based on the mean probe level, on the contrary, the computation of multi-RGX depends on both probe and gene specific values, which complicate the calculation.

Table 2:

Real computational time programme language of BGX, multi-mgMOS, FGX and multi-RGX, respectively, for Dataset Affymetrix data.

ModelProgramme languageComputational time
BGXC++32.5 h
Multi-mgMOSR and C50 min
FGXR4 s
Multi-RGXR34 s

Application via simulated dataset

To assess the quality of our novel gene expression index, when the data are far from normal or have outliers, we simulate two datasets from 10,000 Monte Carlo runs. Each Monte Carlo run is conducted for 10 genes under 20 probe pairs, resulting in 200 observations for every generated array. In the first data, we generate the PM and MM values under the normal distribution and in the second data, we produce the values under the mean shifted normal density. Then, we mix these two sets in order to get two location-mixture datasets with different ratios of outliers. Hereby, the mathematical description of the first set is presented as below:

(15)0.5N(Si+μH,σ2)+0.5N(Si+μH+δσ,σ2),

and the second location mixture can be described as the following structure:

(16)0.9N(Si+μH,σ2)+0.1N(Si+μH+δσ,σ2),

where N(·,·) indicates the univariate normal with the given parameters.

Here, the first data possess a large number of outliers. On the other hand, the second dataset owns a relatively moderate number of extreme observations. Moreover, in all simulations, we assume that every gene is measured under specific concentrations. Thereby, the data are simulated under Si setting to Si =2, …, 11, for the gene i=1, …, 10, respectively, on the original scale for each presumed concentration level. Furthermore, μH , p, σ are equated as 1, 0.7 and 1, in order. Finally, the shift in the location, δ, is set to δ=10 for both datasets.

In the assessment of accuracies for the estimated results, we select three criteria: the average error (AE), mean absolute error (MAE) and the root mean square error (RMSE) [19].

In Tables 3 and 4, we list the estimated values, i.e. Si , μH , p and σ, of FGX, RGX and multi-RGX for each gene and model parameters, in order, for the expression in Equations (15)–(16), respectively, by computing the associated absolute errors.

Table 3:

Estimated model parameters of the first simulated location-mixture dataset (Expression in 15) via FGX, RGX and multi-RGX with their absolute errors (AE) and true values.

ParameterTrue valueFGXAE of FGXRGXAE of RGXMulti-RGXAE of multi-RGX
p0.70.880.250.870.240.880.25
μH1−23.9124.91−23.9124.91−20.9921.99
σ10.440.560.270.730.220.79
S1226.5112.2526.5012.2523.5910.79
S2327.938.3127.938.3123.876.96
S3426.795.7026.785.7024.095.02
S4527.014.4027.004.4024.273.85
S5627.193.5327.193.5324.433.07
S6727.352.9127.352.9124.572.51
S7827.492.4427.482.4424.702.09
S8927.622.0727.612.0724.811.76
S91027.731.7727.721.7724.921.49
S101127.841.5327.831.5325.011.27
Table 4:

Estimated model parameters of the second simulated location-mixture dataset (Expression in 16) via FGX, RGX and multi-RGX with their absolute errors (AE) and true values.

ParameterTrue valueFGXAE of FGXRGXAE of RGXMulti-RGXAE of multi-RGX
p0.70.950.360.930.330.950.36
μH1−25.6326.63−25.6326.63−22.1823.18
σ10.610.390.440.560.320.68
S1227.3312.6727.3412.6723.9010.95
S2329.298.7729.308.7724.327.11
S3427.765.9427.775.9424.645.16
S4528.084.6228.084.6224.893.98
S5628.343.7228.343.7225.113.18
S6728.553.0828.553.0825.292.61
S7828.732.5928.742.5925.452.18
S8928.902.2128.902.2125.601.84
S91029.041.9029.051.9125.731.57
S101129.171.6529.181.6525.851.35

From Tables 3 and 4, we observe that there is a large amount of bias in the estimates whose cause can be explained as follows: The first dataset is generated for 10 genes on the original scale, representing very low concentrations. Then, we take the nominal log-scale of all these simulated measurements in order to calculate the estimated signals. By this way, indeed, we evaluate the performance of all three models under very low concentrations since the estimates of all indices are problematic under these levels. Furthermore, as observed in Figure 1B, FGX, RGX and multi-RGX are able to measure the amount of the noisy signals on the array when there is no effect on the gene. Accordingly, from the outcomes we can see that this amount is approximately 21 units as measured under μ^H, and if we subtract this shift from all estimated signals, we can find closer estimates of the signals with respect to their true values.

Additionally, in Table 5, we present the mean absolute error (MAE) and the root mean square error (RMSE) for the estimated signals via three selected models and in Table 6, we list the real and CPU time of each index in the calculation of the estimated model parameters for these two mixture data.

Table 5:

Mean absolute error (MAE) and root mean square error (RMSE) of FGX, RGX and multi-RGX in the calculation of the estimated signals in two simulated location-mixture datasets.

DataModelMAERMSE
First mixtureFGX20.8565.92
RGX20.8465.90
Multi-RGX17.9356.69
Second mixtureFGX22.0269.63
RGX22.0369.65
Multi-RGX18.5858.75
Table 6:

Real (in seconds) and central processing unit (CPU) time of FGX, RGX and multi-RGX in the calculation of the estimated model parameters in two simulated location-mixture datasets.

DataModelReal timeCPU time
First mixtureFGX46.440.55
RGX700.230.68
Multi-RGX2578.981.00
Second mixtureFGX45.150.56
RGX752.771.56
Multi-RGX2648.551.90

From all the tabulated values, it is seen that the estimates of multi-RGX are more accurate than the ones computed via FGX and RGX. But the computational demand of multi-RGX is more than other two alternatives in the real time. However, this difference is indeed can be tolerable if we compare the results of the CPU time. This output shows that the real time of multi-RGX can be even improved with the help of an effective programming since the calculation of the estimates is almost performed under the same computer time in the end.

Additionally, to evaluate the performance of each alternative in a large dataset, we extent the calculation via the simulated data by 10,000 and 20,000 genes. Finally, we compute AE, MAE and RMSE for estimated Si in which the first data have 10,000 genes (i=1, …, 10,000) and the second data have 20,000 genes (i=1, …, 20,000). In contrast, for the remaining model parameters p, μH and σ, we compute directly AE, MAE and RMSE, which are unique per study.

Hereby, in the simulation we set the true value of p, μH and σ to 0.7, 1 and 1, in order. For the true signals Si , we initially generate values according to the number of total genes ranging from 2 to 11 from uniform distributions, assuming that the genes are measured under 10 distinct concentrations similar to the previous dataset. Also, the intensities are measured on the original scale due to the fact that we aim to evaluate the performance of all three methods under the most problematic range of intensities, i.e. the worst scenario for the comparison. Then, alike other estimates, we calculate the average of all model selection criteria, i.e. AAE, MAE and RMSE, for all signals. Thereby, the average of these criteria for signals is found by using 10,000 and 20,000 values for the first and second dataset, respectively. Whereas, they are directly computed for estimates of p, μH and σ. The results are presented in Tables 79 for the first large dataset (with 10,000 genes) whose mixture proportions are arranged according to Equations (15)–(16), in order. Moreover, the outcomes of the second large dataset (with 20,000 genes) are listed in Tables 9 and 10 whose mixture model is generated with respect to Equations (15)–(16), respectively.

Table 7:

Absolute error (AE) of FGX, RGX and multi-RGX in the calculation of the estimated p, μH and σ in the large dataset with 10,000 genes according to two simulated location-mixture models.

True valueFGXAE of FGXRGXAE of RGXMulti-RGXAE of multi-RGX
First mixture
p0.71.140.631.000.431.140.63
μH15.654.655.654.655.664.66
σ10.790.210.240.760.230.76
Second mixture
p0.71.120.601.000.431.120.60
μH15.614.615.614.615.624.62
σ10.440.560.140.860.130.86
Table 8:

Average absolute error (AAE), mean absolute error (MAE) and root mean square error (RMSE) of FGX, RGX and multi-RGX in the calculation of the estimated signals Si (i=1, …, 10,000) in the large dataset with 10,000 genes according to the two simulated location-mixture models.

FGXRGXMulti-RGX
AAEMAERMSEAAEMAERMSEAAEMAERMSE
First mixture
Si1.418.73872.851.418.73872.851.458.73873.41
Second mixture
Si1.318.16816.431.318.16816.431.338.17817.08
Table 9:

Absolute error (AE) of FGX, RGX and multi-RGX in the calculation of the estimated p, μH and σ in the large dataset with 20,000 genes according to two simulated location-mixture models.

True valueFGXAE of FGXRGXAE of RGXMulti-RGXAE of multi-RGX
First mixture
p0.71.140.631.000.431.140.62
μH15.644.645.644.645.644.64
σ11.550.550.240.770.240.77
Second mixture
p0.71.110.591.000.431.110.59
μH15.674.675.674.675.684.68
σ10.780.220.140.860.140.86
Table 10:

Average absolute error (AAE), mean absolute error (MAE) and root mean square error (RMSE) of FGX, RGX and multi-RGX in the calculation of the estimated signals Si (i=1, …, 20,000) in the large dataset with 10,000 genes according to the two simulated location-mixture models.

FGXRGXMulti-RGX
First mixture
 AAE1.411.411.45
 MAE8.698.698.70
 RMSE1229.421129.421230.33
Second mixture
 AAE1.321.321.34
 MAE8.208.208.20
 RMSE1159.141159.141160.36

From all findings with large datasets, we observe that the performance of all three indices become very close to each other. In contrast, under small numbers of genes, indicating no unique best model, we find that FGX, RGX and multi-RGX estimates work well in the bench-mark dataset and for the simulated data with less numbers of genes, multi-RGX performs better than its strong alternatives. Hence, our suggested method can be seen as a promising approach to estimate the true signals for one-channel microarray datasets.

Conclusion

In this study, we have proposed a new gene expression index for one-channel microarrays, called multi-RGX, which uses the long-tailed symmetric distribution to describe the signals and takes into account both gene and probe specific measurements. Moreover, multi-RGX generates an explicit expression for each model parameter.

In the assessment of our proposal gene expression, we have compared the accuracy and the computational demand under certain criteria. According to the results, we have observed that multi-RGX can keep a high accuracy regarding alternative gene expression indices while preserving less computational cost under small datasets. But there is no significant difference among any model in large data from the simulation analyses.

As the extension of this study, we consider to implement the multiple adaptive regression splines (MARS) approach to model the observed nonlinear relationship between signals and concentrations, in particular, under low and high concentrations. This idea has been previously performed in the study of Kunjdaje et al. [20] to detect the function of histone modification levels. We believe that similar improvement in the extension and the proposed current model can open new avenues in the analyses of microarray for different biological researches.

  1. Conflict of interest: There are no conflicts of interest among the authors.

References

1. Sanchez A, Ruiz de Villa MC. A tutorial review of microarray data analysis. Technical report, Universitat de Barcelona, 2008.Search in Google Scholar

2. Purutçuoğlu V, Wit E. FGX: a frequentist gene expression index for Affymetrix arrays. Biostatistics 2007;8:433–7.10.1093/biostatistics/kxl020Search in Google Scholar PubMed

3. Purutçuoğlu V. Robust gene expression index. Math Probl Eng 2012;2012:1–17.10.1155/2012/182758Search in Google Scholar

4. Li C, Wong W. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 2001;98:31–6.10.1073/pnas.98.1.31Search in Google Scholar PubMed PubMed Central

5. Hubbell E, Liu W-M, Mui R. Robust estimators for expression analysis. Bioinformatics 2002;18:1585–92.10.1093/bioinformatics/18.12.1585Search in Google Scholar PubMed

6. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003;4:249–64.10.1093/biostatistics/4.2.249Search in Google Scholar PubMed

7. Milo M, Fazeli A, Niranjan M, Lawrence ND. A probabilistic model for the extraction of expression levels from oligonucleotide arrays. Biochem Soc Trans 2003;31:1510–2.10.1042/bst0311510Search in Google Scholar PubMed

8. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc 2004;99:909–17.10.1198/016214504000000683Search in Google Scholar

9. Liu X, Milo M, Lawrence N, Rattray M. A tractable probabilistic model for Affymetrix probe-level analysis across multiple chips. Bioinformatics 2005;21:3637–44.10.1093/bioinformatics/bti583Search in Google Scholar PubMed

10. Hein AM, Richardson S, Causton HC, Ambler GK, Green PJ. BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data. Biostatistics 2005;6:349–73.10.1093/biostatistics/kxi016Search in Google Scholar PubMed

11. Binder H, Preibisch S. “Hook”-calibration of GeneChip-microarrays: theory and algorithm. Algorithms Mol Biol 2008;3:12.10.1186/1748-7188-3-12Search in Google Scholar PubMed PubMed Central

12. Kohl M, Deigner HP. Preprocessing of gene expression data by optimally robust estimators. BMC Bioinformatics 2010;11:1–15.10.1186/1471-2105-11-583Search in Google Scholar PubMed PubMed Central

13. Lahti L, Torrente A, Elo LL, Brazma A, Rung J. A fully scalable online pre-processing algorithm for short oligonucleotide micoarray atlases. Nuc Acids Res 2013;41:e110.10.1093/nar/gkt229Search in Google Scholar PubMed PubMed Central

14. Akal T. Gene expression indices for single-channel microarrays. Master’s thesis, Middle East Technical University, 2013.Search in Google Scholar

15. Tiku ML, Tan WY, Balakrishnan N. Robust inference. New York: Marcel Dekker, 1986.Search in Google Scholar

16. Bhattacharyya GK. The asymptotics of maximum likelihood and related estimators based on type II censored data. J Am Stat Assoc 1985;80:398–404.10.1080/01621459.1985.10478130Search in Google Scholar

17. Bain LJ, Engelhardt M. Introduction to probability and mathematical statistics. CA: Duxbury Classic Series, 1991.Search in Google Scholar

18. Healy MJ. Matrices for statistics. New York, USA: Oxford University Press, 2000.10.1093/oso/9780198507031.001.0001Search in Google Scholar

19. Batmaz İ, Kartal-Koç E, Weber GW. Advances in Intelligent Modelling and Simulation: Studies in Computational Intelligence, chapter Robust Regression Metamodelling of Complex Systems: The Case of Solid Rocket Motor Performance Metamodelling. Berlin Heidelberg: Springer-Verlag, 2012:221–251.10.1007/978-3-642-28888-3_9Search in Google Scholar

20. Dong X, Greven MC, Kundaje A, Djebali S, Brown JB, Cheng C, et al. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol 2012;13:1–17.10.1186/gb-2012-13-9-r53Search in Google Scholar PubMed PubMed Central

Received: 2016-02-04
Accepted: 2016-06-23
Published Online: 2017-01-20
Published in Print: 2017-04-01

©2017 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 20.1.2026 from https://www.degruyterbrill.com/document/doi/10.1515/tjb-2016-0231/html
Scroll to top button