Characterizing the D2 Statistic: Word Matches in Biological Sequences

Sylvain Forêt; Susan R Wilson; Conrad J Burden

doi:10.2202/1544-6115.1447


Published/Copyright: October 8, 2009


From the journal Statistical Applications in Genetics and Molecular Biology, Volume 8, Issue 1


Word matches are often used in sequence comparison methods, either as a measure of sequence similarity or in the first search steps of algorithms such as BLAST or BLAT. The D2 statistic is the number of matches of words of k letters between two sequences. Recent advances have been made in the characterization of this statistic and in the approximation of its distribution. Here, these results are extended to the case of approximate word matches. We compute the exact value of the variance of the D2 statistic for the case of a uniform letter distribution, and introduce a method to provide accurate approximations of the variance in the remaining cases. This enables the distribution of D2 to be approximated for typical situations arising in biological research. We apply these results to the identification of cis-regulatory modules, and show that this method detects such sequences with a high accuracy. The ability to approximate the distribution of D2 for both exact and approximate word matches will enable the use of this statistic in a more precise manner for sequence comparison, database searches, and identification of transcription factor binding sites.
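As an illustration of the definition given in the abstract, the following Python sketch counts k-letter word matches between two sequences. It is not the authors' implementation. Several details are assumptions made for the example: overlapping words are counted at every position, an "approximate" match is taken to mean two words that differ in at most t positions (Hamming distance), and the function names `d2_exact` and `d2_approx` are hypothetical.

```python
from collections import Counter


def d2_exact(seq_a: str, seq_b: str, k: int) -> int:
    """Count pairs (i, j) where the k-word starting at i in seq_a
    equals the k-word starting at j in seq_b (overlapping words allowed)."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    # Each word contributes (occurrences in A) * (occurrences in B) matches.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def d2_approx(seq_a: str, seq_b: str, k: int, t: int) -> int:
    """Count pairs of k-words that differ in at most t positions.

    Assumption: 'approximate match' means Hamming distance <= t.
    Brute force over all position pairs, so only suitable for short inputs.
    """
    total = 0
    for i in range(len(seq_a) - k + 1):
        word_a = seq_a[i:i + k]
        for j in range(len(seq_b) - k + 1):
            word_b = seq_b[j:j + k]
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= t:
                total += 1
    return total
```

With t = 0 the approximate count reduces to the exact one, which gives a simple consistency check. The counting version of `d2_exact` avoids comparing every pair of positions, while `d2_approx` compares all pairs directly and is meant only to make the definition concrete.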

Keywords: D2; alignment free sequence comparison; biological sequences

Published Online: 2009-10-8


Articles in this issue

Article
Sparse Canonical Correlation Analysis with Application to Genomic Data Integration
Orthology-Based Multilevel Modeling of Differentially Expressed Mouse and Human Gene Pairs
Sequential Analysis for Microarray Data Based on Sensitivity and Meta-Analysis
Dimension Reduction of Microarray Data in the Presence of a Censored Survival Response: A Simulation Study
A Nonlinear Mixed-Effects Model for Estimating Calibration Intervals for Unknown Concentrations in Two-Color Microarray Data with Spike-Ins
Composite Likelihood Modeling of Neighboring Site Correlations of DNA Sequence Substitution Rates
A Multiple Testing Approach to High-Dimensional Association Studies with an Application to the Detection of Associations between Risk Factors of Heart Disease and Genetic Polymorphisms
Hypothesis Tests for Point-Mass Mixture Data with Application to `Omics Data with Many Zero Values
Inferring Dynamic Genetic Networks with Low Order Independencies
Normalization Method for Transcriptional Studies of Heterogeneous Samples - Simultaneous Array Normalization and Identification of Equivalent Expression
A Bayesian Analysis Strategy for Cross-Study Translation of Gene Expression Biomarkers
Modified FDR Controlling Procedure for Multi-Stage Analyses
Detecting Outlier Samples in Microarray Data
Survival Analysis with High-Dimensional Covariates: An Application in Microarray Studies
Two-Stage Model-Based Clustering for Liquid Chromatography Mass Spectrometry Data Analysis
Score Statistics for Mapping Quantitative Trait Loci
Impact of Population Stratification on Family-Based Association Tests with Longitudinal Measurements
A Multilocus Model for Constructing a Linkage Disequilibrium Map in Human Populations
Testing of Chromosomal Clumping of Gene Properties
Balanced Gradient Boosting from Imbalanced Data for Clinical Outcome Prediction
Univariate Shrinkage in the Cox Model for High Dimensional Data
Multilevel Comparison of Dendrograms: A New Method with an Application for Genetic Classifications
Weighted Multiple Hypothesis Testing Procedures
Incorporating Duplicate Genotype Data into Linear Trend Tests of Genetic Association: Methods and Cost-Effectiveness
Increase of Rejection Rate in Case-Control Studies with the Differential Genotyping Error Rates
A Parametric Model for Analyzing Anticipation in Genetically Predisposed Families
Bayesian Unsupervised Learning with Multiple Data Types
Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data
A Non-Homogeneous Hidden-State Model on First Order Differences for Automatic Detection of Nucleosome Positions
Adaptive Transmission Disequilibrium Test for Family Trio Design
Model Selection Based on FDR-Thresholding Optimizing the Area under the ROC-Curve
Estimation of Selection Intensity under Overdominance by Bayesian Methods
A Multivariate Growth Curve Model for Ranking Genes in Replicated Time Course Microarray Data
Rotation Testing in Gene Set Enrichment Analysis for Small Direct Comparison Experiments
Ancestral Recombination Graphs under Non-Random Ascertainment, with Applications to Gene Mapping
Prediction of Motifs Based on a Repeated-Measures Model for Integrating Cross-Species Sequence and Expression Data
Identifying Individuals in a Complex Mixture of DNA with Unknown Ancestry
A Statistical Model for Genetic Mapping of Viral Infection by Integrating Epidemiological Behavior
Calculating Asymptotic Significance Levels of the Constrained Likelihood Ratio Test with Application to Multivariate Genetic Linkage Analysis
Modeling Dependence in Methylation Patterns with Application to Ovarian Carcinomas
M-quantile Regression Analysis of Temporal Gene Expression Data
MC-Normalization: A Novel Method for Dye-Normalization of Two-Channel Microarray Data
Characterizing the D2 Statistic: Word Matches in Biological Sequences
Transmission Disequilibrium Test Power and Sample Size in the Presence of Locus Heterogeneity
A Regularized Regression Approach for Dissecting Genetic Conflicts that Increase Disease Risk in Pregnancy
Statistical Screening Method for Genetic Factors Influencing Susceptibility to Common Diseases in a Two-Stage Genome-Wide Association Study
A Unified Mixed Effects Model for Gene Set Analysis of Time Course Microarray Experiments
