Classification of Genomic Sequences via Wavelet Variance and a Self-Organizing Map with an Application to Mitochondrial DNA

Agnieszka E Jach; Juan M Marín

doi:10.2202/1544-6115.1562

Published/Copyright: July 2, 2010

From the journal Statistical Applications in Genetics and Molecular Biology Volume 9 Issue 1

We present a new methodology for discriminating genomic symbolic sequences that combines wavelet analysis with a self-organizing map algorithm. Wavelets are used to extract the variation, across a range of scales, in the oligonucleotide patterns of a sequence. This variation is quantified by the estimated wavelet variance, which yields a feature vector. Feature vectors obtained from many genomic sequences, possibly of different lengths, are then classified with a nonparametric self-organizing map scheme. When applied to nearly 200 entire mitochondrial DNA sequences, or their fragments, the method predicts species taxonomic group membership very well and allows the results to be visualized. When only thousands of nucleotides are available, wavelet-based feature vectors of short oligonucleotide patterns are more efficient in discrimination than frequency-based feature vectors of long patterns. This new data analysis strategy could be extended to numeric genomic data. The routines needed to perform the computations are readily available in two R packages.
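The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' R routines. It assumes a Haar wavelet filter (the abstract does not name the filter), uses the standard unbiased MODWT wavelet variance estimator, and the function names `indicator_sequence`, `haar_wavelet_variance`, and `feature_vector` are hypothetical, chosen only for illustration. Each oligonucleotide pattern is turned into a binary indicator sequence, and the estimated wavelet variances at levels 1 to J are concatenated over patterns.

```python
def indicator_sequence(seq, pattern):
    """Binary sequence: 1.0 where `pattern` starts at position t, else 0.0."""
    k = len(pattern)
    return [1.0 if seq[t:t + k] == pattern else 0.0
            for t in range(len(seq) - k + 1)]


def haar_wavelet_variance(x, max_level):
    """Unbiased Haar MODWT wavelet variance estimates for levels 1..max_level.

    Assumption: Haar filter. At level j the equivalent MODWT filter has
    width 2**j, so each coefficient is (sum of the most recent 2**(j-1)
    values minus sum of the preceding 2**(j-1) values) / 2**j.
    """
    n = len(x)
    if n < 2 ** max_level:
        raise ValueError("sequence too short for the requested level")
    # Prefix sums give every block sum in constant time.
    csum = [0.0]
    for value in x:
        csum.append(csum[-1] + value)
    variances = []
    for j in range(1, max_level + 1):
        half = 2 ** (j - 1)
        width = 2 * half
        total = 0.0
        # Use only coefficients unaffected by the boundary (t >= width - 1).
        for t in range(width - 1, n):
            recent = csum[t + 1] - csum[t + 1 - half]
            older = csum[t + 1 - half] - csum[t + 1 - width]
            w = (recent - older) / width
            total += w * w
        variances.append(total / (n - width + 1))
    return variances


def feature_vector(seq, patterns, max_level):
    """Concatenate wavelet variances of each pattern's indicator sequence."""
    features = []
    for pattern in patterns:
        x = indicator_sequence(seq, pattern)
        features.extend(haar_wavelet_variance(x, max_level))
    return features


if __name__ == "__main__":
    # Toy input, not real mitochondrial DNA.
    toy = "ACGTTGCAACGT" * 10
    print(feature_vector(toy, ["A", "CG"], max_level=3))
```

Because the feature vector has a fixed length (number of patterns times number of levels) regardless of the sequence length, vectors from sequences of different lengths can be passed to the same classifier, which is the property the abstract relies on before the self-organizing map step.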

Keywords: oligonucleotide patterns; binary sequences; MODWT; estimated wavelet variance; supervised Kohonen’s map

Published Online: 2010-7-2
