Shrinkage Estimation of Effect Sizes as an Alternative to Hypothesis Testing Followed by Estimation in High-Dimensional Biology: Applications to Differential Gene Expression
Zahra Montazeri
Research on analyzing microarray data has focused on the problem of identifying differentially expressed genes to the neglect of the problem of how to integrate evidence that a gene is differentially expressed with information on the extent of its differential expression. Consequently, researchers currently prioritize genes for further study either on the basis of volcano plots or, more commonly, according to simple estimates of the fold change after filtering the genes with an arbitrary statistical significance threshold. While the subjective and informal nature of the former practice precludes quantification of its reliability, the latter practice is equivalent to using a hard-threshold estimator of the expression ratio that is not known to perform well in terms of mean-squared error, the sum of estimator variance and squared estimator bias. On the basis of two distinct simulation studies and data from different microarray studies, we systematically compared the performance of several estimators representing both current practice and shrinkage. We find that the threshold-based estimators usually perform worse than the maximum-likelihood estimator (MLE) and they often perform far worse as quantified by estimated mean-squared risk. By contrast, the shrinkage estimators tend to perform as well as or better than the MLE and never much worse than the MLE, as expected from what is known about shrinkage. However, a Bayesian measure of performance based on the prior information that few genes are differentially expressed indicates that hard-threshold estimators perform about as well as the local false discovery rate (FDR) estimator, the best of the shrinkage estimators studied.
Based on the ability of the latter to leverage information across genes, we conclude that the use of the local-FDR estimator of the fold change instead of informal or threshold-based combinations of statistical tests and non-shrinkage estimators can be expected to substantially improve the reliability of gene prioritization at very little risk of doing so less reliably. Since the proposed replacement of post-selection estimates with shrunken estimates applies as well to other types of high-dimensional data, it could also improve the analysis of SNP data from genome-wide association studies.
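The contrast between a hard-threshold estimator and a shrinkage estimator can be illustrated with a small simulation. The following is only a sketch under stated assumptions and does not reproduce the estimators or simulation designs of the study. It assumes each gene's estimated log fold change is expressed in standard-error units (so the MLE is normal with unit variance around the true effect), a two-sided cut at 1.96, and a two-group model in which the null proportion `p0` and the alternative standard deviation `tau` are treated as known rather than estimated from the data. The shrinkage rule used here, (1 - local FDR) times the MLE, is one simple assumed form of a local-FDR estimator. All function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def hard_threshold(x, z_cut=1.96):
    """Keep the MLE only if it passes the significance cut; otherwise report 0."""
    return np.where(np.abs(x) > z_cut, x, 0.0)

def local_fdr_shrink(x, p0, tau):
    """Weight the MLE by 1 - local FDR under an assumed two-group normal model."""
    f0 = normal_pdf(x, 1.0)                      # null genes: true effect is 0
    f1 = normal_pdf(x, np.sqrt(1.0 + tau ** 2))  # non-null: effect ~ N(0, tau^2)
    lfdr = p0 * f0 / (p0 * f0 + (1.0 - p0) * f1)
    return (1.0 - lfdr) * x, lfdr

def risk_at(theta, estimator, n_rep=200_000, seed=0):
    """Monte Carlo bias, variance and MSE of an estimator at a fixed true effect."""
    rng = np.random.default_rng(seed)
    est = estimator(theta + rng.standard_normal(n_rep))
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)  # equals var + bias**2
    return bias, var, mse

def sparse_simulation(n_genes=20_000, p0=0.9, tau=2.0, seed=1):
    """Across-gene mean-squared error when most genes have no true effect."""
    rng = np.random.default_rng(seed)
    is_null = rng.random(n_genes) < p0
    theta = np.where(is_null, 0.0, rng.normal(0.0, tau, n_genes))
    x = theta + rng.standard_normal(n_genes)     # MLE for each gene
    shrunk, _ = local_fdr_shrink(x, p0, tau)
    estimates = {"mle": x, "hard": hard_threshold(x), "lfdr": shrunk}
    return {name: float(np.mean((e - theta) ** 2)) for name, e in estimates.items()}

if __name__ == "__main__":
    # Frequentist risk at a moderate fixed effect near the significance cut
    for name, est in [("mle", lambda x: x), ("hard", hard_threshold)]:
        bias, var, mse = risk_at(2.0, est)
        print(f"{name:5s} bias={bias:+.3f} var={var:.3f} mse={mse:.3f}")
    # Average error across genes under the assumed sparse two-group model
    print(sparse_simulation())
```

At a true effect close to the cut, the hard-threshold estimator sets roughly half of the estimates to zero, which inflates both its bias and its variance relative to the MLE. The across-gene comparison depends on the assumed `p0` and `tau`, so changing them is a quick way to see how sensitive the ranking of the estimators is to the proportion and size of true effects.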
©2011 Walter de Gruyter GmbH & Co. KG, Berlin/Boston
Articles in the same Issue
- Article
- Epistatic Interactions
- Testing for Gene-Gene Interaction with AMMI Models
- A Bayesian Hierarchical Model for Quantitative Real-Time PCR Data
- Informative or Noninformative Calls for Gene Expression: A Latent Variable Approach
- Detecting Genotyping Error Using Measures of Degree of Hardy-Weinberg Disequilibrium
- Optimisation of HMM Topologies Enhances DNA and Protein Sequence Modelling
- The Apportionment of Total Genetic Variation by Categorical Analysis of Variance
- Dealing with Heterogeneity between Cohorts in Genomewide SNP Association Studies
- An Empirical Bayesian Method for Estimating Biological Networks from Temporal Microarray Data
- Parameter Estimation in Multiple-Hidden I.I.D. Models from Biological Multiple Alignment
- Asymptotic Distribution of the "Orthogonal" Quantitative Transmission Disequilibrium Test in a Structured Population: Exact Formula
- Comparing Spatial Maps of Human Population-Genetic Variation Using Procrustes Analysis
- An Internal Calibration Method for Protein-Array Studies
- Weighted-LASSO for Structured Network Inference from Time Course Data
- Trilocus Disequilibrium Analysis of Multiallelic Markers in Outcrossing Populations
- Sparse Partial Least Squares Classification for High Dimensional Data
- Reconstructability Analysis as a Tool for Identifying Gene-Gene Interactions in Studies of Human Diseases
- Sub-Modular Resolution Analysis by Network Mixture Models
- Space Oriented Rank-Based Data Integration
- The Generalized Odds Ratio as a Measure of Genetic Risk Effect in the Analysis and Meta-Analysis of Association Studies
- Network Enrichment Analysis in Complex Experiments
- Shrinkage Estimation of Effect Sizes as an Alternative to Hypothesis Testing Followed by Estimation in High-Dimensional Biology: Applications to Differential Gene Expression
- Buckley-James Boosting for Survival Analysis with High-Dimensional Biomarker Data
- A Random Coefficients Model for Regional Co-Expression Associated with DNA Copy Number
- Locating Multiple Interacting Quantitative Trait Loci with the Zero-Inflated Generalized Poisson Regression
- Classification of Genomic Sequences via Wavelet Variance and a Self-Organizing Map with an Application to Mitochondrial DNA
- Confidently Estimating the Number of DNA Replication Origins
- Generalizing Moving Averages for Tiling Arrays Using Combined P-Value Statistics
- Lasso Logistic Regression, GSoft and the Cyclic Coordinate Descent Algorithm: Application to Gene Expression Data
- Granger Causality Analysis of Human Cell-Cycle Gene Expression Profiles
- Mapping Quantitative Trait Loci in a Non-Equilibrium Population
- On the Optimal Design of Genetic Variant Discovery Studies
- On Optimal Selection of Summary Statistics for Approximate Bayesian Computation
- Assessment of LD Matrix Measures for the Analysis of Biological Pathway Association
- Optimal Tests Shrinking Both Means and Variances Applicable to Microarray Data Analysis
- The Detection of Blur in Affymetrix GeneChips
- Regression-Based Multi-Trait QTL Mapping Using a Structural Equation Model
- Spatial Clustering of Array CGH Features in Combination with Hierarchical Multiple Testing
- Predicting Patient Survival from Longitudinal Gene Expression
- Including Probe-Level Measurement Error in Robust Mixture Clustering of Replicated Microarray Gene Expression
- Reader's Reaction
- An Alternative Model of Type A Dependence in a Gene Set of Correlated Genes
- Letter to the Editor
- Permutation P-values Should Never Be Zero: Calculating Exact P-values When Permutations Are Randomly Drawn