Multiple Testing Issues in Discriminating Compound-Related Peaks and Chromatograms from High Frequency Noise, Spikes and Solvent-Based Noise in LC - MS Data Sets

Stephen O Nyangoma, Antoine A. H. C. van Kampen, Theo H Reijmers, Natalia I Govorukhina, Ate G. J. van der Zee, Lucinda J Billingham, Rainer Bischoff and Ritsert C. Jansen
Published/Copyright: September 8, 2007

Liquid Chromatography - Mass Spectrometry (LC-MS) is a powerful method for the sensitive detection and quantification of proteins and peptides in complex biological fluids such as serum. LC-MS produces complex data sets consisting of some hundreds of millions of data points per sample, at a resolution of 0.1 amu in the m/z domain and 7000 data points in the time domain. However, the detection of lower-abundance proteins in these data is hampered by artefacts such as high frequency noise and spikes. Moreover, not all of the tens of thousands of chromatograms produced per sample are relevant to the search for biomarkers. Two critical pre-processing questions therefore arise in analysing LC-MS data: (1) which of the thousands of chromatograms per sample are relevant for the detection of biomarkers, and (2) which of the thousands of signals per chromatogram are truly compound-related? Each question involves assessing the significance (deviation from noise) of multiple observations, so the issue of multiple comparisons arises. Current methods disregard this multiplicity and provide no concrete threshold for significance. Because the number of tests to be performed is large, the probability of one or more false positives under such procedures is high and must be controlled. Since the cut-offs for declaring a chromatogram (or a signal) to be compound-related can hugely influence which proteins are detected, it seems natural to define thresholds that are neither arbitrary nor subjective. We suggest choosing thresholds guided by the critical aim of controlling the False Discovery Rate (FDR) in multiple hypothesis testing for significance over the large set of features produced per sample. This involves the use of regression diagnostics to characterize the signals of a chromatogram (e.g. as outliers or influential) and to suggest suitable test statistics for the multiple testing procedures (MTP) that discriminate noise and spikes from true signals. The role of Generalized Linear Models (GLM) in this MTP is investigated. The method is applied to LC-MS datasets from trypsin-digested serum spiked with varying levels of horse heart cytochrome C (cytoc).
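
To make the idea of an FDR-guided threshold concrete, the following is a minimal sketch of the Benjamini-Hochberg step-up rule, one common way to control the FDR. The abstract does not say which FDR procedure the authors use, so the choice of this rule, the function name `benjamini_hochberg`, the level `q`, and the example p-values are all assumptions made for illustration. The sketch also assumes one p-value is already available per chromatogram (or per signal), which in the paper would come from test statistics suggested by the regression diagnostics.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Flag which hypotheses are rejected by the BH step-up rule at level q.

    Illustrative helper, not the authors' implementation.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values, one per chromatogram (made up for illustration).
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p_vals, q=0.05)
print(flags)
```

With these made-up values only the two smallest p-values are declared significant, although five of the ten fall below an unadjusted 0.05 cut-off. This shows how a multiplicity-aware threshold differs from a fixed one.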

Published Online: 2007-9-8

©2011 Walter de Gruyter GmbH & Co. KG, Berlin/Boston

Downloaded on 9.9.2025 from https://www.degruyterbrill.com/document/doi/10.2202/1544-6115.1295/html