The practical effect of batch on genomic prediction
Hilary S. Parker
Measurements from microarrays and other high-throughput technologies are susceptible to non-biological artifacts like batch effects. It is known that batch effects can alter or obscure the set of significant results and biological conclusions in high-throughput studies. Here we examine the impact of batch effects on predictors built from genomic technologies. To investigate batch effects, we collected publicly available gene expression measurements with known outcomes, and estimated batches using date. Using these data we show (1) the impact of batch effects on prediction depends on the correlation between outcome and batch in the training data, and (2) removing expression measurements most affected by batch before building predictors may improve the accuracy of those predictors. These results suggest that (1) training sets should be designed to minimize correlation between batches and outcome, and (2) methods for identifying batch-affected probes should be developed to improve prediction results for studies with high correlation between batches and outcome.
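As an illustration of the first design recommendation, the association between batch and outcome in a candidate training set can be quantified before any predictor is built. The abstract does not say which measure of correlation was used, so the sketch below assumes Cramér's V on the batch-by-outcome contingency table; the function name `cramers_v` and the date labels `d1` and `d2` are hypothetical, chosen only for this example.

```python
from collections import Counter
from math import sqrt


def cramers_v(batches, outcomes):
    """Cramér's V between two categorical labelings.

    Returns a value in [0, 1]: 0 means batch and outcome are unassociated
    in the sample, 1 means they are perfectly confounded.
    This measure is an assumption for illustration, not the paper's method.
    """
    if len(batches) != len(outcomes):
        raise ValueError("batches and outcomes must have the same length")

    n = len(batches)
    batch_counts = Counter(batches)
    outcome_counts = Counter(outcomes)
    joint_counts = Counter(zip(batches, outcomes))

    # V is undefined when either variable has a single level; report 0 here.
    k = min(len(batch_counts), len(outcome_counts))
    if n == 0 or k < 2:
        return 0.0

    # Pearson chi-squared statistic over every batch/outcome cell.
    chi2 = 0.0
    for b, n_b in batch_counts.items():
        for o, n_o in outcome_counts.items():
            expected = n_b * n_o / n
            observed = joint_counts.get((b, o), 0)
            chi2 += (observed - expected) ** 2 / expected

    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical training sets: two processing dates, binary outcome.
dates = ["d1"] * 4 + ["d2"] * 4
balanced = cramers_v(dates, [0, 0, 1, 1, 0, 0, 1, 1])    # outcome spread evenly across dates
confounded = cramers_v(dates, [0, 0, 0, 0, 1, 1, 1, 1])  # each date holds one outcome only
```

Under these assumptions, a value near 0 corresponds to the balanced design the abstract recommends, while a value near 1 flags a training set in which a predictor could learn the batch signal instead of the outcome.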
©2012 Walter de Gruyter GmbH & Co. KG, Berlin/Boston
Articles in the same Issue
- Article
- Exploring Multicollinearity Using a Random Matrix Theory Approach
- The Beta-Binomial SGoF method for multiple dependent tests
- Detecting Sample Misidentifications in Genetic Association Studies
- Borrowing Information Across Genes and Experiments for Improved Error Variance Estimation in Microarray Data Analysis
- Hierarchical Bayes Model for Predicting Effectiveness of HIV Combination Therapies
- The practical effect of batch on genomic prediction
- Normalization, bias correction, and peak calling for ChIP-seq
- Combining Multiple Laser Scans of Spotted Microarrays by Means of a Two-Way ANOVA Model
- Empirical Bayes Interval Estimates that are Conditionally Equal to Unadjusted Confidence Intervals or to Default Prior Credibility Intervals
- Detection of Differentially Expressed Gene Sets in a Partially Paired Microarray Data Set
- Non-Iterative, Regression-Based Estimation of Haplotype Associations with Censored Survival Outcomes
- Graph Selection with GGMselect
- Sample Size Calculations for Designing Clinical Proteomic Profiling Studies Using Mass Spectrometry
- A New Approach for the Joint Analysis of Multiple Chip-Seq Libraries with Application to Histone Modification
- Software Communication
- GENOVA: Gene Overlap Analysis of GWAS Results