Abstract
Differentially expressed genes (DEGs) are usually detected in RNA sequencing data with a two-group comparison test. However, the accuracy of such tests is low when the sample size is small. To address this, we propose a method based on fuzzy clustering: data with expression patterns similar to those of DEGs are generated artificially, and genes that are highly likely to be classified into the same cluster as these initial cluster data are identified. An advantage of the proposed method is that it does not perform any test. Furthermore, it maintains a certain level of accuracy even when the sample size is biased, and we show that such a situation may even improve its accuracy. We compared the proposed method with the conventional method in simulations in which we varied the sample size and the difference between the expression levels of group 1 and group 2 in the DEGs, and assessed the resulting accuracy of the proposed method. The results show that the proposed method is superior in all cases under the conditions simulated. We also show that the effect of the difference between group 1 and group 2 on the accuracy is more prominent when the sample size is biased.
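The idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the data model, the number of clusters, or the membership threshold, so everything below is an assumption made for illustration. The sketch assumes log-scale Gaussian expression values, DEGs that are only up-regulated in group 2, two clusters, a fuzzifier `m = 2`, and a cutoff of 0.5 on the membership in the cluster that holds the artificial profiles. The names `fuzzy_c_means`, `delta`, `sd`, and `n_seed` are hypothetical.

```python
import numpy as np

def fuzzy_c_means(X, init_centers, m=2.0, n_iter=100, tol=1e-6):
    """Standard fuzzy c-means updates, started from the given centers.

    Returns (centers, U), where U[i, k] is the membership of row i in cluster k.
    """
    centers = np.asarray(init_centers, dtype=float)
    U = None
    for _ in range(n_iter):
        # Distances from every profile to every center (floored to avoid division by 0).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)); each row sums to 1.
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Center update: mean of the profiles weighted by u_ik^m.
        W = U_new ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        converged = U is not None and np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

rng = np.random.default_rng(0)
n1, n2 = 3, 3              # samples in group 1 and group 2 (assumed)
n_genes, n_deg = 200, 20   # toy data: the first n_deg genes are true DEGs
delta, sd = 4.0, 0.5       # hypothetical log-scale shift and noise level

# Toy expression matrix: DEGs are shifted upward in the group 2 columns.
expr = rng.normal(8.0, sd, size=(n_genes, n1 + n2))
expr[:n_deg, n1:] += delta

# Artificial profiles that mimic the DEG pattern (the "initial cluster data").
n_seed = 30
seeds = rng.normal(8.0, sd, size=(n_seed, n1 + n2))
seeds[:, n1:] += delta

# Center each profile so that clustering reflects the pattern, not the level.
X = np.vstack([expr, seeds])
X = X - X.mean(axis=1, keepdims=True)

# Start one center at the average real gene and one at the average artificial profile.
init = np.vstack([X[:n_genes].mean(axis=0), X[n_genes:].mean(axis=0)])
centers, U = fuzzy_c_means(X, init)

# The cluster to which the artificial profiles belong most strongly.
seed_cluster = int(U[n_genes:].mean(axis=0).argmax())

# Call a gene a DEG candidate if it most likely shares that cluster.
detected = np.flatnonzero(U[:n_genes, seed_cluster] > 0.5)
print(len(detected), "candidate DEGs")
```

No hypothesis test is involved in this sketch: genes are ranked only by their fuzzy membership, so the cutoff plays the role that a significance threshold plays in a test-based pipeline. With weaker shifts or noisier data the two clusters overlap, and the choice of cutoff and fuzzifier would matter much more than it does here.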
- Research ethics: Not applicable.
- Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Competing interests: The authors state no competing interests.
- Research funding: None declared.
- Data availability: The raw data can be obtained on request from the corresponding author.
© 2024 Walter de Gruyter GmbH, Berlin/Boston