Super Learner

Mark J. van der Laan; Eric C Polley; Alan E. Hubbard

doi:10.2202/1544-6115.1309


Published/Copyright: September 16, 2007


From the journal Statistical Applications in Genetics and Molecular Biology, Volume 6, Issue 1




When learning a model to predict an outcome from a set of covariates, a statistician has many estimation procedures in their toolbox. Examples of such candidate learners include least squares, least angle regression, random forests, and spline regression. Previous articles (van der Laan and Dudoit (2003); van der Laan et al. (2006); Sinisi et al. (2007)) theoretically validated the use of cross-validation to select an optimal learner among many candidate learners. Motivated by this use of cross-validation, we propose a new prediction method, the super learner, that forms a weighted combination of many candidate learners. This article proposes a fast algorithm for constructing a super learner in prediction, which uses V-fold cross-validation to select the weights that combine an initial set of candidate learners. In addition, this paper contains a practical demonstration of the adaptivity of this so-called super learner to various true data-generating distributions. This approach to constructing a super learner generalizes to any parameter that can be defined as a minimizer of a loss function.
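The idea of using V-fold cross-validation to choose weights for combining candidate learners can be illustrated with a minimal sketch. It is not the authors' implementation. The three toy learners (`fit_mean`, `fit_linear`, `fit_knn`), the single covariate, the squared-error loss, the simulated data, and the restriction to convex weights found by a coarse grid search are all assumptions made for illustration; the abstract does not say how the weights are computed.

```python
import random

# Hypothetical candidate learners: each takes training data and returns a predictor.
def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k=5):
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def cv_predictions(learners, xs, ys, v=5):
    """Row i holds each learner's prediction for observation i,
    made by a fit that never saw observation i (V-fold cross-validation)."""
    n = len(xs)
    z = [[0.0] * len(learners) for _ in range(n)]
    for fold in range(v):
        # Folds by index modulo v; assumes the rows are already in random order.
        train = [i for i in range(n) if i % v != fold]
        valid = [i for i in range(n) if i % v == fold]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for k, fit in enumerate(learners):
            model = fit(tx, ty)
            for i in valid:
                z[i][k] = model(xs[i])
    return z

def cv_risk(weights, z, ys):
    """Cross-validated mean squared error of the weighted combination."""
    total = 0.0
    for row, y in zip(z, ys):
        pred = sum(w * p for w, p in zip(weights, row))
        total += (y - pred) ** 2
    return total / len(ys)

def simplex_grid(k, steps):
    """All k-tuples of non-negative integers that sum to `steps`."""
    if k == 1:
        yield (steps,)
        return
    for first in range(steps + 1):
        for rest in simplex_grid(k - 1, steps - first):
            yield (first,) + rest

def super_learner(learners, xs, ys, v=5, steps=20):
    z = cv_predictions(learners, xs, ys, v)
    best_w, best_risk = None, float("inf")
    # Assumption: convex weights chosen by grid search over the simplex.
    for counts in simplex_grid(len(learners), steps):
        w = tuple(c / steps for c in counts)
        risk = cv_risk(w, z, ys)
        if risk < best_risk:
            best_w, best_risk = w, risk
    # Refit every candidate on all the data and combine with the chosen weights.
    models = [fit(xs, ys) for fit in learners]
    predict = lambda x: sum(w * m(x) for w, m in zip(best_w, models))
    return predict, best_w, best_risk, z

# Simulated data (an assumption for the demo): quadratic signal plus noise.
random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(200)]
ys = [x * x + random.gauss(0, 0.3) for x in xs]
learners = [fit_mean, fit_linear, fit_knn]
predict, weights, risk, z = super_learner(learners, xs, ys)
print("weights:", weights, "CV risk:", round(risk, 4))
```

Because the grid includes the weight vectors that put all mass on a single learner, the selected combination has cross-validated risk no larger than that of the best individual candidate in this sketch. A finer grid or a constrained least-squares fit could replace the grid search.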

Keywords: cross-validation; loss-based estimation; machine learning; prediction

Published Online: 2007-09-16


