Alignment-free Sequence Comparison for Biologically Realistic Sequences of Moderate Length
Conrad J. Burden
The D2 statistic, defined as the number of matches of words of some pre-specified length k between two sequences, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as the variability in D2 may be dominated by terms that reflect only the noise within each individual sequence. We examine the extent of the problem and the effectiveness of overcoming it by using two mean-centred variants of this statistic, D2* and D2c. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of independent and identically distributed letters. We show that D2 and D2c, and to a somewhat lesser extent D2*, perform well in tests to classify moderate-length query sequences as putative cis-regulatory modules.
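The definition of D2 given above can be illustrated with a minimal Python sketch. It assumes that overlapping words are counted and that a match is an exact string match of length k. The function names `kmer_counts` and `d2` are hypothetical and are not taken from the paper. The mean-centred variants D2* and D2c are not shown.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Equivalent to summing, over every word w, the product of the
    number of occurrences of w in each sequence.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words that are absent from seq_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())


print(d2("AAAA", "AAA", 2))  # 3 occurrences of "AA" times 2 occurrences -> 6
```

Counting words first and then multiplying the counts avoids comparing every pair of positions directly, which is one reason the statistic is cheap to compute.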
©2012 Walter de Gruyter GmbH & Co. KG, Berlin/Boston
Articles in the same Issue
- Article
- The Inheritance Procedure: Multiple Testing of Tree-structured Hypotheses
- Optimality Criteria for the Design of 2-Color Microarray Studies
- Stopping-Time Resampling and Population Genetic Inference under Coalescent Models
- A Mixture-Model Approach for Parallel Testing for Unequal Variances
- Fast Identification of Biological Pathways Associated with a Quantitative Trait Using Group Lasso with Overlaps
- MicroRNA Transcription Start Site Prediction with Multi-objective Feature Selection
- A Context Dependent Pair Hidden Markov Model for Statistical Alignment
- Fast Wavelet Based Functional Models for Transcriptome Analysis with Tiling Arrays
- Alignment-free Sequence Comparison for Biologically Realistic Sequences of Moderate Length
- Transcriptional Network Inference from Functional Similarity and Expression Data: A Global Supervised Approach
- Improving Hidden Markov Models for Classification of Human Immunodeficiency Virus-1 Subtypes through Linear Classifier Learning