
Truncated rank correlation (TRC) as a robust measure of test-retest reliability in mass spectrometry data

  • Johan Lim, Donghyeon Yu, Hsun-chih Kuo, Hyungwon Choi and Scott Walmsley
Published/Copyright: May 30, 2019

Abstract

In mass spectrometry (MS) experiments, thousands of peaks are detected in the space of mass-to-charge ratio and chromatographic retention time, each associated with an abundance measurement. However, a large proportion of the peaks consists of experimental noise, and low-abundance compounds are typically masked by noise peaks, compromising the quality of the data. In this paper, we propose a new measure of similarity between a pair of MS experiments, called truncated rank correlation (TRC). To provide a robust metric of similarity in noisy high-dimensional data, TRC uses truncated top ranks (or top m-ranks) for calculating correlation. A comprehensive numerical study suggests that TRC outperforms traditional sample correlation and Kendall's τ. We apply TRC to measure test-retest reliability in two MS experiments, including biological replicate analysis of the metabolome in HEK293 cells and metabolomic profiling of benign prostate hyperplasia (BPH) patients. An R package trc implementing the proposed TRC and related functions is available at https://sites.google.com/site/dhyeonyu/software.

Award Identifier / Grant number: NRF-2018R1C1B6001108

Funding statement: National Research Foundation of Korea, Funder Id: http://dx.doi.org/10.13039/501100003725, Grant Number: NRF-2018R1C1B6001108.

Appendix A

Comparison of two MS data sets and IID bivariate samples

In the Introduction, we mention that two repeatedly measured MS samples are not independent and identically distributed (IID) bivariate random samples, as is assumed when Kendall's τ is used for truncated data. Here, we provide more details about the differences between the underlying models of these two types of data. For the sake of simplicity, we suppose that $U=(U_1,U_2,\ldots,U_p)$ and $V=(V_1,V_2,\ldots,V_p)$ are repeatedly measured log-transformed mass spectral data of p compounds. We further assume that $\mathcal{M}$ is the set of sites at which the intensity (abundance) of an analyte is measured. In this paper, we assume that the elements $U_k$ and $V_k$ are distributed as

$$U_k=\begin{cases} Z_k+\epsilon_{1k}, & k\in\mathcal{M}\\ \epsilon_{1k}, & k\notin\mathcal{M},\end{cases}$$

and

$$V_k=\begin{cases} Z_k+\epsilon_{2k}, & k\in\mathcal{M}\\ \epsilon_{2k}, & k\notin\mathcal{M},\end{cases}$$

where the $Z_k$'s denote the shared intensities between U and V, the $\epsilon_{ik}$'s are independent random noise from $(0,\sigma_\epsilon^2)$, and the $Z_k$'s and $\epsilon_{ik}$'s are independent of one another.

If either all shared signals are zero or $\mathcal{M}$ is empty (i.e., $Z_k=0$ for $k\in\mathcal{M}$ or $\mathcal{M}=\emptyset$), the underlying distribution of the two MS data sets is equivalent to an IID bivariate distribution of $(\epsilon_{1k},\epsilon_{2k})$ for $k=1,2,\ldots,p$, where the $(\epsilon_{1k},\epsilon_{2k})$ are IID from $(\mathbf{0},\sigma_\epsilon^2 I_2)$ and $I_2$ is the two-dimensional identity matrix. However, these conditions are unusual in real MS experiments, since in a well-designed MS experiment the analyte contains at least one chemical compound whose intensity is measured by the instrument.

The other condition under which the underlying models of the two MS data sets and of IID bivariate samples coincide is $\mathcal{M}=\{1,2,\ldots,p\}$ and $Z_k=\mu$ for $1\le k\le p$. Then the two MS data sets can be written as $U_k=\mu+\epsilon_{1k}$ and $V_k=\mu+\epsilon_{2k}$, and $(U_k,V_k)$ follows the IID bivariate distribution $(\mu\mathbf{1}_2,\sigma_\epsilon^2 I_2)$, where $\mathbf{1}_2$ is the two-dimensional vector of ones and $I_2$ is the two-dimensional identity matrix. This condition is also unusual in real MS experiments, since the intensities of different chemical compounds usually differ.

In summary, the major difference between the underlying models of the two MS data sets and of IID bivariate samples lies in the sampling unit: the sampling unit of the MS data is a p-dimensional vector, whereas that of an IID bivariate sample is a two-dimensional vector $(X_k, Y_k)$. Note that IID bivariate samples must have an identical mean, whereas the elements of the MS data are allowed to have different means.

From this observation, the de-centered MS data $(U_k-Z_k, V_k-Z_k)$ for $k=1,2,\ldots,p$ follow the IID bivariate distribution $(\mathbf{0},\sigma_\epsilon^2 I_2)$. Thus, as stated in Section 2.1, there is no reason to believe that the log-transformed intensities $(U_k, V_k)$ are correlated over k (m/z), in the sense that the $(U_k-Z_k, V_k-Z_k)$ are independent for $k=1,2,\ldots,p$. To investigate this, we show the sample autocorrelation function of $(U_k-\hat{\mu}_k)$ and plots of the two original MS data sets $X=\exp(U)$ and $Y=\exp(V)$ in Figure 6, where $U_k$ and $V_k$ are obtained from the first and second samples ($X_1$ and $X_2$) of the real data example in Section 5.1 and $\hat{\mu}_k=(U_k+V_k)/2$. To construct the one-dimensional data U and V from the two-dimensional observations $X_1$ and $X_2$ in Section 5.1, we summed the intensities of $X_1$ and $X_2$ along the retention time and log-transformed the results. As Figure 6 shows, the autocorrelations are not significantly different from 0, as explained above.
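As an illustration of this de-centering check, the following R sketch simulates two replicate log-intensity vectors from the model above and plots the sample ACF of the de-centered series, as in Figure 6A. The dimension, the signal set, and the noise level are assumed values for illustration only, not those of the real data.

## Minimal sketch of the de-centering check (illustrative parameter values).
set.seed(1)
p    <- 3000
in_M <- sample(c(TRUE, FALSE), p, replace = TRUE, prob = c(0.4, 0.6))  # assumed signal sites M
z    <- ifelse(in_M, rnorm(p, mean = 10, sd = 1.2), 0)                 # shared intensities Z_k
u    <- z + rnorm(p, sd = 0.45)                                        # U_k = Z_k + eps_1k
v    <- z + rnorm(p, sd = 0.45)                                        # V_k = Z_k + eps_2k

mu_hat <- (u + v) / 2            # estimated common center for each site k
acf(u - mu_hat, lag.max = 40)    # sample ACF of the de-centered series (cf. Figure 6A)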

Figure 6: Plots of the ACF of the de-centered X and of the two original MS data sets exp(X) and exp(Y). The blue dotted lines in (A) denote the 95% confidence interval of the ACF. (A) ACF of de-centered X, (B) original MS (exp(X)) and (C) original MS (exp(Y)).

Appendix B

Real data example of density estimation for generating synthetic data

In this section, we describe the distribution of MS data using the gas chromatography-mass spectrometry (GC-MS) data from Yu et al. (2015). The data consist of 30 Korean and 27 Chinese Schizandra chinensis samples, collected to classify the country of origin. In this experiment, the observed intensities are pre-processed to generate the chromatogram by summing intensities over pre-determined intervals provided by the instrument. After pre-processing, each MS data set has 3209 intensities observed at retention time points from 6:20 to 60:00. Figure 7A and B show the intensities and the log-transformed intensities along the retention times, respectively. More details on the data can be found in Yu et al. (2015). Note that we use this additional MS data set to describe the distribution of MS data instead of the data sets in Section 5, since the latter are pre-processed data sets in which noise intensities have been removed by peak detection methods.

Figure 7: Plots of the intensities (A) and log-transformed intensities (B) along the retention times of the Korean Schizandra chinensis GC-MS data. (A) Original MS and (B) Log-transformed MS.

To describe the distribution of the MS data, we show histograms of the intensities and of the log-transformed intensities in Figure 8A and B, respectively. As stated in Hastings, Norton, and Roy (2002), the log-transformed intensities in Figure 8B appear to follow a mixture of normal distributions. Thus, in order to generate synthetic data sets that resemble the real data sets, we fit a finite normal mixture model to each Korean Schizandra chinensis GC-MS data set using the R package mixtools (Benaglia et al. 2009). To be more specific, let $X_1,\ldots,X_p$ be random samples from a finite mixture F of k normal distributions with means $\mu_j$ and variances $\sigma_j^2$, $j=1,\ldots,k$. The density of this mixture for $X_i$ can be written as

$$f_\theta(x_i)=\sum_{j=1}^{k}\pi_j f_j(x_i;\mu_j,\sigma_j^2),\qquad (10)$$

where $\theta=(\pi_1,\ldots,\pi_k,\mu_1,\ldots,\mu_k,\sigma_1^2,\ldots,\sigma_k^2)$ denotes the parameters of the mixture distribution F and the $\pi_j$'s are mixture weights that sum to one. For a given number of components k, the parameters of the mixture distribution F are estimated by the EM algorithm (Dempster, Laird, and Rubin 1977), which is implemented in mixtools as the R function normalmixEM(). In practice, the number of components k is usually unknown, so it must be selected from the observed data. In this paper, the number of components k is chosen by the likelihood ratio test (LRT) based on the parametric bootstrap (McLachlan 1987), which is implemented in mixtools as the R function boot.comp().

Figure 8: Histograms of the intensities (A) and log-transformed intensities (B) of the Korean Schizandra chinensis GC-MS data. (A) Original MS and (B) Log-transformed MS.
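As a concrete illustration, the sketch below fits the finite normal mixture with mixtools. The input vector of log-intensities is simulated here as a stand-in for a real chromatogram; its length and parameter values are assumptions for illustration only.

## Minimal sketch of the mixture fitting described above (illustrative input).
library(mixtools)
set.seed(1)
log_intensity <- c(rnorm(1700, 8.0, 0.45),   # noise-like component
                   rnorm(1200, 9.7, 1.2),    # first signal component
                   rnorm(300, 14.2, 1.2))    # second signal component

## Select the number of components k by the parametric-bootstrap LRT (McLachlan 1987);
## B = 200 as in the paper, which may take a while to run.
lrt <- boot.comp(y = log_intensity, max.comp = 4, B = 200, mix.type = "normalmix")

## Fit the k-component normal mixture by the EM algorithm, e.g. for k = 3.
fit <- normalmixEM(log_intensity, k = 3)
fit$lambda   # estimated mixture weights (pi_j)
fit$mu       # estimated component means (mu_j)
fit$sigma    # estimated component standard deviations (sigma_j)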

From the results of the LRT using boot.comp() with 200 bootstrap realizations, three or four components are chosen for 76.7% of the samples (23 out of 30; $n_3=11$ and $n_4=12$), where $n_k$ denotes the number of data sets with k mixture components. Figure 9A and B show the estimated mixture densities with three and four components, respectively. Table 6 reports the averages of the estimated mixture weights, means, and standard deviations of the components for k = 3, 4. As reported in Table 6, the observed data sets do not share the same mixture components and can be grouped by the chosen number of components. This difference in mixture components might be caused by differences in the cultivation areas of the samples, which were collected from several local regions in Korea. Since this section aims at estimating the mixture density in order to generate realistic synthetic data for the numerical study, we focus only on the estimated parameters of the mixture of three normal distributions (i.e., k = 3).

Figure 9: Histograms and estimated mixture density functions for the 19th (k = 3) and 24th (k = 4) samples, where k denotes the number of components chosen by the LRT. (A) k = 3 and (B) k = 4.

Table 6:

The averages of the estimated mixture weights (π̂), means (μ̂), and standard deviations (σ̂) for k = 3, 4, where k denotes the number of components (Comp.) chosen by the likelihood ratio test based on the parametric bootstrap. The numbers in parentheses are the standard errors of the estimates with n3 = 11 and n4 = 12, where nk denotes the number of data sets having k mixture components.

Parameter   k   Comp. 1    Comp. 2    Comp. 3    Comp. 4
π̂           3   0.5231     0.3842     0.0927
                (0.0467)   (0.0148)   (0.0329)
            4   0.2886     0.3934     0.2871     0.0308
                (0.0412)   (0.0224)   (0.0231)   (0.0036)
μ̂           3   7.9761     9.7499     14.2144
                (0.0594)   (0.1917)   (0.5714)
            4   7.8072     8.6531     10.5837    15.4188
                (0.0856)   (0.1607)   (0.1738)   (0.1559)
σ̂           3   0.4348     1.2161     1.2271
                (0.0503)   (0.1091)   (0.2110)
            4   0.2464     0.6069     1.3738     0.7857
                (0.0398)   (0.0593)   (0.0453)   (0.1074)

To specify the parameters of the synthetic data, recall the underlying model of the TRC:

$$U_k=\begin{cases} Z_k+\epsilon_k, & k=1,2,\ldots,m_0\\ \epsilon_k, & k=m_0+1,m_0+2,\ldots,p,\end{cases}$$

where $Z_k \sim (\mu_{(k)},\sigma_z^2)$, the $\epsilon_k$'s are independent random noise from $(\mu_\epsilon,\sigma_\epsilon^2)$, and the $Z_k$'s and $\epsilon_k$'s are independent of one another. Motivated by the mixture density estimation for the real data, we assume that the random variable $U_k$ follows a finite mixture of three normal distributions with mixture weights $(\pi_1,\pi_2,\pi_3)$, means $(\mu_\epsilon,\mu_\epsilon+\mu_1,\mu_\epsilon+\mu_2)$, and variances $(\sigma_\epsilon^2,\sigma_\epsilon^2+\sigma_z^2,\sigma_\epsilon^2+\sigma_z^2)$. Based on the estimates in Table 6, we set

$$(\pi_1,\pi_2,\pi_3)=(0.52,0.38,0.1),\quad (\mu_\epsilon,\mu_1,\mu_2)=(8,1.7,6.2),\quad (\sigma_\epsilon,\sigma_z)=(0.45,1.15).$$

Then, we generate $U_1,\ldots,U_p$ from the underlying model of the TRC with the above parameters, $m_0 = m_{01}+m_{02} = p\pi_2 + p\pi_3$, $\mu_{(k)}=\mu_1$ for $k=1,2,\ldots,p\pi_2$ and $\mu_{(k)}=\mu_2$ for $k=p\pi_2+1,\ldots,m_0$. In addition, we randomly permute the generated $X_1,\ldots,X_p$ to avoid any systematic ordering of the samples due to the generation scheme. Figure 10A shows a histogram of the generated synthetic samples, which resembles the real-data histogram in Figure 9A, and Figure 10B shows a plot of the exponentiated synthetic data, which resembles the real data set depicted in Figure 7A. We re-order the synthetic samples according to the order of the real data set shown in Figure 7 in order to reflect the site information of the retention time.
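A minimal R sketch of this generation scheme, under the parameter values above, is given below; the rounding of $p\pi_j$ to integer site counts and the plotting calls are our assumptions, not the authors' exact code.

## Minimal sketch of the synthetic-data generation (rounding of site counts assumed).
set.seed(1)
p   <- 3209
pi3 <- c(0.52, 0.38, 0.10)                 # (pi_1, pi_2, pi_3)
mu_eps <- 8; mu1 <- 1.7; mu2 <- 6.2        # (mu_eps, mu_1, mu_2)
sd_eps <- 0.45; sd_z <- 1.15               # (sigma_eps, sigma_z)

m01 <- round(p * pi3[2]); m02 <- round(p * pi3[3])
m0  <- m01 + m02                           # number of shared-signal sites

z <- c(rnorm(m01, mu1, sd_z), rnorm(m02, mu2, sd_z), rep(0, p - m0))  # Z_k (0 off the signal set)
x <- z + rnorm(p, mu_eps, sd_eps)          # U_k = Z_k + eps_k on the log scale
x <- sample(x)                             # random permutation to avoid systematic ordering

hist(x, breaks = 100)                      # cf. Figure 10A
plot(exp(x), type = "h")                   # cf. Figure 10B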

Figure 10: Histogram of $\{X_1,\ldots,X_p\}$ and plot of $(\exp(X_1),\ldots,\exp(X_p))$, where $\{X_1,\ldots,X_p\}$ are synthetic samples from the finite mixture of three normal distributions. (A) Histogram of X and (B) plot of exp(X).

Appendix C

Approximated null distributions of correlation measures by permutation

In the numerical study, we calculate p-values of the independence test based on the correlation measures using three methods: the asymptotic null distribution, the approximated null distribution generated by the permutation method, and the empirical null distribution of the simulated samples under the null. As shown in the numerical study, the sizes and powers of the rank-based measures (Kendall's tau and TRC) are well approximated by the permutation method, but the sizes and powers of Pearson's correlation are poorly estimated by permutation for relatively large noise levels (σϵ = 1, 1.5).

In this section, we visually compare the approximated null distributions of each correlation measure for noise levels σϵ = 1, 1.5. Pearson's correlation and Kendall's tau are transformed by the following equations so that their asymptotic distributions can be compared with the approximated null distributions obtained from the generated samples:

$$z_\rho=\frac{\sqrt{p-3}}{2}\log\!\left(\frac{1+\hat{\rho}}{1-\hat{\rho}}\right) \quad\text{and}\quad z_\tau=\hat{\tau}\times\sqrt{9p(p-1)/(4p+10)}.$$
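For instance, with an assumed dimension p and illustrative estimates (values below are hypothetical), these transformations can be computed as:

## Illustrative computation of the transformed statistics (assumed values).
p       <- 3209
rho_hat <- 0.30   # assumed Pearson correlation estimate
tau_hat <- 0.20   # assumed Kendall's tau estimate

z_rho <- sqrt(p - 3) / 2 * log((1 + rho_hat) / (1 - rho_hat))   # Fisher z-transform
z_tau <- tau_hat * sqrt(9 * p * (p - 1) / (4 * p + 10))         # normalized Kendall's tau
c(z_rho = z_rho, z_tau = z_tau)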

In the simulation, we also consider the true null case (γ = 0), in which the two MS data sets are independent. Thus, the approximated null distribution obtained by permutation in the null case can be considered a good approximation of the true null distribution. To investigate the quality of this approximation visually, we apply kernel density estimation to the correlation measures calculated from the permutations. The estimated null densities of Pearson's correlation, Kendall's tau, TRC with m0, and TRC with $\hat{m}$ are depicted in Figures 11–14, respectively. From these visualizations, the null distribution of Pearson's correlation is poorly estimated by the permutation when the observed samples come from the non-null cases (γ > 0). On the other hand, the null distributions of the rank-based measures (Kendall's tau and TRC) are well estimated by the permutation. In addition, from Figure 13 and Figure 14, we can see that the mean of the null distribution of the TRC tau with $\hat{m}$ is not centered at 0 and is close to 0.05, while the null distribution of the TRC tau with m0 is centered at 0. This could be caused by the proposed selection rule, which tends to choose a relatively large m in order to include more true shared signals, as described in the numerical study.
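The following R sketch illustrates the permutation approximation and the kernel density estimation for Kendall's tau. The input vectors are simulated stand-ins (their parameters are assumptions), and the TRC itself is omitted here since its implementation is provided in the trc package.

## Minimal sketch of the permutation null approximation (illustrative inputs).
set.seed(1)
p <- 500
u <- rnorm(p, 8, 1)                 # assumed log-intensity replicate 1
v <- 0.5 * u + rnorm(p, 4, 1)       # assumed log-intensity replicate 2 (non-null case)

B <- 200
perm_tau <- replicate(B, cor(u, sample(v), method = "kendall"))  # permuted Kendall's tau values

z_tau <- perm_tau * sqrt(9 * p * (p - 1) / (4 * p + 10))  # transformed statistics
plot(density(z_tau))                                      # kernel density estimate (cf. Figure 12)

tau_obs <- cor(u, v, method = "kendall")
p_value <- (1 + sum(abs(perm_tau) >= abs(tau_obs))) / (B + 1)  # permutation p-value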

Figure 11: Estimated density of the transformed Pearson's correlation Zρ from the approximated null distribution by the permutation. (A) Zρ (σϵ = 1) and (B) Zρ (σϵ = 1.5).

Figure 12: Estimated density of the transformed Kendall's tau Zτ from the approximated null distribution by the permutation. (A) Zτ (σϵ = 1) and (B) Zτ (σϵ = 1.5).

Figure 13: Estimated density of the TRC tau with the true m0, $\hat{\tau}(m_0)$, from the approximated null distribution by the permutation. (A) $\hat{\tau}(m_0)$ (σϵ = 1) and (B) $\hat{\tau}(m_0)$ (σϵ = 1.5).

Figure 14: Estimated density of the TRC tau with $\hat{m}$ chosen by the proposed selection rule, $\hat{\tau}(\hat{m})$, from the approximated null distribution by the permutation. (A) $\hat{\tau}(\hat{m})$ (σϵ = 1) and (B) $\hat{\tau}(\hat{m})$ (σϵ = 1.5).

References

Adam, B. L., Y. Qu, J. W. Davis, W. D. Ward, M. A. Clements, L. H. Cazares, O. J. Semmes, P. F. Schellhammer and Y. Yasui (2002): "Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men," Cancer Res., 62, 3609–3614.

Andellini, M., V. Cannata, S. Gazzellini, B. Bernardic and A. Napolitano (2015): "Test-retest reliability of graph metrics of resting state MRI functional brain networks: a review," J. Neurosci. Meth., 253, 183–192. doi: 10.1016/j.jneumeth.2015.05.020.

Anderle, M., S. Roy, H. Lin, C. Becker and K. Joho (2004): "Quantifying reproducibility for differential proteomics: noise analysis for protein liquid chromatography-mass spectrometry of human serum," Bioinformatics, 20, 3575–3582. doi: 10.1093/bioinformatics/bth446.

Benaglia, T., D. Chauveau, D. Hunter and D. S. Young (2009): "mixtools: an R package for analyzing finite mixture models," J. Stat. Softw., 32, 1–29. doi: 10.18637/jss.v032.i06.

Dempster, A. P., N. M. Laird and D. B. Rubin (1977): "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc. B, 39, 1–38. doi: 10.1111/j.2517-6161.1977.tb01600.x.

Efron, B. and V. Petrosian (1992): "A simple test of independence for truncated data with applications to redshift surveys," Astrophys. J., 399, 345–352. doi: 10.1086/171931.

Hastings, C. A., S. M. Norton and S. Roy (2002): "New algorithms for processing and peak detection in liquid chromatography/mass spectrometry data," Rapid Commun. Mass Sp., 16, 462–467. doi: 10.1002/rcm.600.

Kim, Y., J. Lim and D.-H. Park (2015): "Testing independence of bivariate interval-censored data using modified Kendall's τ statistic," Biometrical J., 57, 1131–1145. doi: 10.1002/bimj.201300162.

Lin, S. (2010): "Space oriented rank-based data integration," Stat. Appl. Genet. Mol. Biol., 9, Article 20. doi: 10.2202/1544-6115.1534.

McLachlan, G. J. (1987): "On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture," J. Roy. Stat. Soc. C-App., 36, 318–324. doi: 10.2307/2347790.

Oakes, D. A. (1982): "A concordance test for independence in the presence of censoring," Biometrics, 38, 451–455. doi: 10.2307/2530458.

Oakes, D. (2008): "On consistency of Kendall's τ under censoring," Biometrika, 95, 997–1001. doi: 10.1093/biomet/asn037.

Rapin, J., A. Souloumiac, J. Bobin, A. Larue, C. Junot, M. Ouethrani and J.-L. Starck (2016): "Application of non-negative matrix factorization to LC/MS data," Signal Process., 123, 75–83. doi: 10.1016/j.sigpro.2015.12.014.

Shou, H., A. Eloyan, S. Lee, V. Zipunnikov, M. B. Crainiceanu, M. B. Nebel, B. Caffo, M. A. Lindquist and C. M. Crainiceanu (2013): "Quantifying the reliability of image replication studies: the image intra-class correlation coefficient (I2C2)," Cogn. Affect. Behav. Neurosci., 13, 714–724. doi: 10.3758/s13415-013-0196-0.

Somandepalli, K., C. Kelly, P. T. Reiss, X.-N. Zuo, R. C. Craddock, C.-G. Yan, E. Petkova, F. X. Castellanos, M. P. Milham and A. D. Martino (2015): "Short-term test-retest reliability of resting state fMRI metrics in children with and without attention-deficit/hyperactivity disorder," Dev. Cogn. Neurosci., 15, 83–93. doi: 10.1016/j.dcn.2015.08.003.

Tabb, D. L., L. Vega-Montoto, P. A. Rudnick, A. M. Variyath, A. J. Ham, D. M. Bunk, L. E. Kilpatrick, D. D. Billheimer, R. K. Blackman, H. L. Cardasis, S. A. Carr, K. R. Clauser, J. D. Jaffe, K. A. Kowalski, T. A. Neubert, F. E. Regnier, B. Schilling, T. J. Tegeler, M. Wang, P. Wang, J. R. Whiteaker, L. J. Zimmerman, S. J. Fisher, B. W. Gibson, C. R. Kingsinger, M. Mesri, H. Rodriguez, S. E. Stein, P. Tempst, A. G. Paulovich, D. C. Liebler and C. Spiegelman (2010): "Repeatability and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry," J. Proteome Res., 9, 761–776. doi: 10.1021/pr9006365.

van den Berg, R. A., H. C. J. Hoefsloot, J. A. Westerhuis, A. K. Smilde and M. J. van der Werf (2006): "Centering, scaling, and transformations: improving the biological information content of metabolomics data," BMC Genomics, 7, 142. doi: 10.1186/1471-2164-7-142.

Vannucci, M., N. Sha and P. J. Brown (2005): "NIR and mass spectra classification: Bayesian methods for wavelet-based feature selection," Chemometr. Intell. Lab., 77, 139–148. doi: 10.1016/j.chemolab.2004.10.009.

Weier, D. R. and A. P. Basu (1980): "An investigation of Kendall's τ modified for censored data with applications," J. Stat. Plan. Infer., 4, 381–390. doi: 10.1016/0378-3758(80)90023-3.

Yu, D., S. J. Lee, W. J. Lee, S. C. Kim, J. Lim and S. W. Kwon (2015): "Classification of spectral data using fused lasso logistic regression," Chemometr. Intell. Lab., 142, 70–77. doi: 10.1016/j.chemolab.2015.01.006.

Zhurov, K. O., A. N. Kozhinov, L. Fornelli and Y. O. Tsybin (2014): "Distinguishing analyte from noise components in mass spectra of complex samples: where to cut the noise?" Anal. Chem., 86, 3308–3316. doi: 10.1021/ac403278t.

Zhvansky, E. S., S. I. Pekov, A. A. Sorokin, V. A. Shurkhay, V. A. Eliferov, A. A. Potapov, E. N. Nikolaev and I. A. Popov (2019): "Metrics for evaluating the stability and reproducibility of mass spectra," Sci. Rep., 9, 914. doi: 10.1038/s41598-018-37560-0.

Zuo, X.-N. and X.-X. Xing (2014): "Test-retest reliabilities of resting-state FMRI measurements in human brain functional connectomics: a systems neuroscience perspective," Neurosci. Biobehav. Rev., 45, 100–118. doi: 10.1016/j.neubiorev.2014.05.009.

Published Online: 2019-05-30

©2019 Walter de Gruyter GmbH, Berlin/Boston
