Sparse factor model for co-expression networks with an application using prior biological knowledge

Yuna Blum; Magalie Houée-Bigot; David Causeur

doi:10.1515/sagmb-2015-0002

Published/Copyright: May 11, 2016

From the journal Statistical Applications in Genetics and Molecular Biology Volume 15 Issue 3

Abstract

Inference on gene regulatory networks from high-throughput expression data is one of the main current challenges in systems biology. Such networks can give considerable insight into interactions between genes. Because gene-gene interactions are often viewed as joint contributions to known biological mechanisms, inference on the dependence among gene expressions is expected to be consistent to some extent with the functional characterization of genes, which can be derived from ontologies (GO, KEGG, …). The present paper introduces a sparse factor model as a general framework either to account for prior knowledge on joint contributions of modules of genes to latent biological processes or to infer the corresponding co-expression network. We propose an ℓ₁-regularized EM algorithm to fit a sparse factor model for correlation. We demonstrate how it helps extract modules of genes and, more generally, improves gene clustering performance. The method is compared to alternative estimation procedures for sparse factor models of relevance networks in a simulation study. The integration of biological knowledge based on the Gene Ontology (GO) is also illustrated on a liver expression dataset generated to understand adiposity variability in chicken.
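To make the link between sparse loadings and gene modules concrete, the following is a minimal sketch and not the authors' EM algorithm. It assumes a factor model in which the correlation matrix is R = BBᵀ + Ψ, with B the loadings and Ψ a diagonal matrix chosen so that diag(R) = 1, and it uses entrywise soft-thresholding (the standard proximal map of an ℓ₁ penalty) as a stand-in for the regularization step. The function names, the loadings values, and the threshold `lam` are all hypothetical.

```python
def soft_threshold(x, lam):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def sparsify(loadings, lam):
    """Apply soft-thresholding entrywise to a loadings matrix (list of rows)."""
    return [[soft_threshold(b, lam) for b in row] for row in loadings]


def factor_correlation(loadings):
    """Correlation implied by a factor model, R = B B^T + Psi.

    Psi is diagonal and absorbs the specific variances, so the diagonal
    of R is set to 1 and off-diagonal entries come from B B^T only.
    """
    m = len(loadings)
    corr = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                corr[i][j] = 1.0
            else:
                corr[i][j] = sum(a * b for a, b in zip(loadings[i], loadings[j]))
    return corr


# Hypothetical dense loadings for 4 genes (rows) on 2 latent factors (columns).
dense = [
    [0.9, 0.1],
    [0.8, -0.05],
    [0.1, 0.7],
    [-0.1, 0.85],
]

# Shrink small loadings to exactly zero (illustrative threshold).
sparse = sparsify(dense, lam=0.2)

# Genes loading on disjoint factors end up with zero implied correlation,
# so the co-expression network splits into two modules: {0, 1} and {2, 3}.
corr = factor_correlation(sparse)
```

In this toy example each gene keeps a nonzero loading on a single factor after thresholding, so the implied correlation matrix is block-diagonal and the blocks can be read as gene modules. In the paper's setting the zero pattern of the loadings can alternatively be fixed in advance from prior knowledge such as GO annotations.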

Keywords: factor model; gene ontology; high dimension; regularized estimation; relevance network; sparsity

Funding source: Agence Nationale de la Recherche

Award Identifier / Grant number: FatInteger ANR-11-BSV7-0004

Funding statement: Agence Nationale de la Recherche, (Grant/Award Number: ‘FatInteger ANR-11-BSV7-0004’).

Published Online: 2016-5-11

Published in Print: 2016-6-1

https://doi.org/10.1515/sagmb-2015-0002
