Abstract
Mass spectrometry is an important high-throughput technique for profiling small molecular compounds in biological samples and is widely used to identify potential diagnostic and prognostic compounds associated with disease. Data generated by mass spectrometry commonly contain many missing values, which arise when a compound is absent from a sample or is present at a concentration below the detection limit. Several strategies are available for statistically analyzing data with missing values. The accelerated failure time (AFT) model assumes that all missing values result from censoring below a detection limit. Under a mixture model, missing values can result from a combination of censoring and the absence of a compound. We compare the power and estimation properties of a mixture model with those of an AFT model. Based on simulated data, we found the AFT model to have greater power to detect differences in means and point mass proportions between groups. However, the AFT model yielded biased estimates, with the bias increasing as the proportion of observations in the point mass increased, whereas estimates from the mixture model were unbiased except when all missing observations came from censoring. These findings suggest using the AFT model for hypothesis testing and the mixture model for estimation. We demonstrate this approach through application to glycomics data of serum samples from women with ovarian cancer and matched controls.
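As a rough illustration of how the two models treat missing values, the following sketch writes out a single-group log-likelihood for each. It is not the authors' implementation: it assumes that log-transformed abundances are normally distributed, that there is one known detection limit, and that there are no covariates. The function and parameter names (`aft_loglik`, `mixture_loglik`, `tau`, `limit`) are hypothetical.

```python
import math
from statistics import NormalDist


def aft_loglik(observed, n_missing, mu, sigma, limit):
    """Censored-normal log-likelihood (assumed log-scale data).

    Every missing value is treated as left-censored below `limit`.
    """
    dist = NormalDist(mu, sigma)
    # Detected values contribute their density.
    ll = sum(math.log(dist.pdf(y)) for y in observed)
    # Each missing value contributes P(Y < limit).
    if n_missing:
        ll += n_missing * math.log(dist.cdf(limit))
    return ll


def mixture_loglik(observed, n_missing, tau, mu, sigma, limit):
    """Point-mass mixture log-likelihood (assumed log-scale data).

    With probability `tau` the compound is absent (point mass); otherwise
    the value is drawn from the normal and is missing only if it falls
    below `limit`.
    """
    dist = NormalDist(mu, sigma)
    # Detected values must come from the continuous component.
    ll = sum(math.log((1.0 - tau) * dist.pdf(y)) for y in observed)
    # A missing value is either truly absent or censored.
    if n_missing:
        ll += n_missing * math.log(tau + (1.0 - tau) * dist.cdf(limit))
    return ll


# Hypothetical data: four detected log-abundances and three missing values.
observed = [2.1, 2.5, 3.0, 3.4]
ll_aft = aft_loglik(observed, 3, mu=2.8, sigma=0.6, limit=2.0)
ll_mix = mixture_loglik(observed, 3, tau=0.2, mu=2.8, sigma=0.6, limit=2.0)
```

Setting `tau` to 0 reduces the mixture likelihood to the censored-normal one, which is the boundary case in which all missing observations come from censoring.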
We wish to thank Drs. Renee Ruhaak and Carlito Lebrilla for their dedicated work in generating the ovarian cancer glycomics data. The project described was supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), through grant #UL1 TR000002. This work was supported by NIH/NIA grant P01AG025532 and the Ovarian Cancer Research Fund.
©2013 by Walter de Gruyter Berlin Boston