---
title: "dmGsea User's Guide"
author:
- name: Zongli Xu
  affiliation: Biostatistics & Computational Biology Branch, NIEHS
- name: Alison A. Motsinger-Reif
  affiliation: Biostatistics & Computational Biology Branch, NIEHS
- name: Liang Niu
  affiliation: Division of Biostatistics and Bioinformatics, Univ. of Cincinnati
package: dmGsea
abstract: >
  A brief introduction to the dmGsea R package for gene set enrichment analysis.
vignette: >
  %\VignetteIndexEntry{dmGsea User's Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    toc_float: true
---

# Introduction

In DNA methylation data, genes are often represented by a variable number of
correlated probes, and a single probe can map to multiple genes. This complex
data structure poses significant challenges for gene set enrichment analysis
(GSEA) and can lead to biased enrichment results. The `r Biocpkg("dmGsea")`
package offers several functions with novel methods specifically designed to
perform efficient gene set enrichment analysis while addressing probe
dependency and probe number bias. Compared to alternative packages for DNA
methylation data, these methods make use of probe dependency information,
provide higher statistical power, control type I error rates well, and are
computationally more efficient. The package fully supports enrichment analysis
for Illumina DNA methylation array data, and can be readily extended to other
types of omics data when appropriate probe annotation information is provided.

# List of functions

# Example Analysis

The following examples briefly demonstrate how to perform gene set enrichment
analysis using dmGsea functions.

## Example 1: Differentially methylated probes from EWAS

```{r example1, eval=TRUE, results="hide", message=FALSE, warning=FALSE}
library(dmGsea)
# generate example data
annopkg <- "IlluminaHumanMethylation450kanno.ilmn12.hg19"
anno <- minfi::getAnnotation(eval(annopkg))
# use a subset of the data in the example to speed up execution
anno <- anno[1:10000, ]
probe.p <- data.frame(Name=rownames(anno), p=runif(nrow(anno)))
probe.p$p[1:500] <- probe.p$p[1:500]/100000
Data4cor <- matrix(runif(nrow(probe.p)*100), ncol=100)
rownames(Data4cor) <- rownames(anno)
# gene set enrichment analysis with the threshold-based method
# for the top-ranked 1000 genes
# (arguments are passed with `=`; `probe.p <- probe.p` would be an assignment)
gsGene(probe.p=probe.p, Data4Cor=Data4cor, arrayType="450K", nTopGene=1000,
    outGenep=TRUE, method="Threshold", gSetName="KEGG", species="Human",
    outfile="gs1", ncore=1)
file.remove("gs1_KEGG_KEGG.csv")
```

- To perform GSEA using significant genes, set `FDRthre = 0.05` to apply an
  FDR threshold of 0.05.
- To perform GSEA using a ranking-based method, specify `method = "Ranking"`.
- If the argument `Data4Cor` is not provided, GSEA will be performed without
  using the methylation data matrix to account for between-CpG correlation.
- Use `gsPG()` to perform enrichment analysis based on probe group-level
  p-values.
- To perform GSEA directly with probe-level p-values, without combining them
  into gene-level statistics, use the `gsProbe()` function. It applies a
  noncentral hypergeometric test to adjust for the bias introduced by variable
  numbers of probes per gene.

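
The probe number bias mentioned above can be seen in a small null simulation.
The sketch below uses only base R and hypothetical settings (2000 genes, 1 to
30 probes per gene, a probe-level cutoff of 0.01). It illustrates the bias
itself and is not the adjustment that `gsProbe()` performs.

```r
set.seed(1)
n_genes <- 2000
# hypothetical number of probes mapped to each gene
n_probes <- sample(1:30, n_genes, replace = TRUE)
# under the null every probe p-value is uniform; keep the smallest per gene
min_p <- vapply(n_probes, function(k) min(runif(k)), numeric(1))
# call a gene "significant" if any of its probes has p < 0.01
sig <- min_p < 0.01
# fraction of significant genes among genes with few vs. many probes
few <- mean(sig[n_probes <= 5])
many <- mean(sig[n_probes >= 25])
c(few = few, many = many)
```

Although no gene is truly associated, genes with many probes are selected far
more often than genes with few probes, so gene sets rich in probe-dense genes
would appear enriched unless the test accounts for probe number.
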
## Example 2: Enrichment analysis for arrays other than 450K and EPIC

```{r example2, eval=TRUE, results="hide", message=FALSE, warning=FALSE}
# generate example dataset
kegg <- getKEGG(species="Human")
gene1 <- unique(as.vector(unlist(kegg[1:5])))
gene2 <- unique(as.vector(unlist(kegg[6:length(kegg)])))
# give each gene between 1 and 10 probes
gene1 <- rep(gene1, sample(1:10, length(gene1), replace=TRUE))
gene2 <- rep(gene2, sample(1:10, length(gene2), replace=TRUE))
# probes of genes in the first five pathways get small p-values
p11 <- runif(length(gene1))*(1e-3)
p2 <- runif(length(gene2))
geneid <- c(gene1, gene2)
p <- c(p11, p2)
Name <- paste0("cg", seq_along(p))
probe.p <- data.frame(Name=Name, p=p)
# user-supplied probe-to-gene annotation
GeneProbeTable <- data.frame(Name=Name, entrezid=geneid)
dat <- matrix(runif(length(p)*100), ncol=100)
rownames(dat) <- Name
# enrichment analysis
gsGene(probe.p=probe.p, Data4Cor=dat, GeneProbeTable=GeneProbeTable,
    method="Threshold", gSetName="KEGG", species="Human", outfile="gs5",
    ncore=1)
file.remove("gs5_KEGG_KEGG.csv")
```

- To perform GSEA using a ranking-based method, specify `method = "Ranking"`.

## Example 3: Enrichment analysis with user provided geneset

```{r example3, eval=TRUE, results="hide", message=FALSE, warning=FALSE}
# generate example dataset
userGeneset <- getKEGG(species="Human")
# enrichment analysis
gsGene(probe.p=probe.p, Data4Cor=dat, GeneProbeTable=GeneProbeTable,
    method="Threshold", geneSet=userGeneset, species="Human", outfile="gs7",
    ncore=1)
file.remove("gs7_userSet_userSet.csv")
```

- To perform GSEA using a ranking-based method, specify `method = "Ranking"`.

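
The `method` argument used in the examples above switches between two
families of tests. As a rough, package-independent illustration, the base R
sketch below runs a hypergeometric over-representation test (threshold-based)
and a Wilcoxon rank-sum test (a simple stand-in for a ranking-based test) on
hypothetical toy data. The gene names, set size, and cutoff are assumptions
made for the example, and neither test is claimed to be what dmGsea computes
internally.

```r
set.seed(42)
# hypothetical toy data: 1000 genes and one gene set of 50 genes
genes <- paste0("g", 1:1000)
gene_set <- genes[1:50]
p <- runif(length(genes))
names(p) <- genes
# make the gene set enriched by shrinking its p-values
p[gene_set] <- p[gene_set] / 100

# threshold-based: hypergeometric test on genes with p < 0.05
sig <- names(p)[p < 0.05]
k <- length(intersect(sig, gene_set))  # significant genes inside the set
ora_p <- phyper(k - 1, length(gene_set), length(genes) - length(gene_set),
    length(sig), lower.tail = FALSE)   # P(overlap >= k)

# ranking-based: do set members have smaller p-values than the rest?
in_set <- names(p) %in% gene_set
rank_p <- wilcox.test(p[in_set], p[!in_set], alternative = "less")$p.value

c(ora = ora_p, ranking = rank_p)
```

The first test only sees which genes pass the cutoff, while the second uses
the position of every gene in the ranking, which is the practical difference
between the two settings of `method`.
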
## Example 4: Enrichment analysis for gene expression type of data that do not need to combine test statistics

```{r example4, eval=TRUE, results="hide", message=FALSE, warning=FALSE}
# generate example dataset
kegg <- getKEGG(species="Human")
gene <- unique(as.vector(unlist(kegg)))
p <- runif(length(gene))
names(p) <- gene
# signed statistics: magnitude from the p-value, random direction
stats <- -log(p)*sample(c(1,-1), length(p), replace=TRUE)
# traditional GSEA analysis, enrichment toward higher or lower end of statistics
stats <- sort(stats, decreasing=TRUE)
gsRank(stats=stats, gSetName="KEGG", scoreType="std", outfile="gs9", nperm=1e4,
    ncore=1)
file.remove("gs9_KEGG_KEGG.csv")
file.remove("gsGene_genep.csv")
# enrichment of genes with higher statistics
stats <- sort(abs(stats), decreasing=TRUE)
```

# Gene set and pathway databases

The package includes built-in support for KEGG, GO, MSigDB, and Reactome gene
sets for both human and mouse pathways. All functions also offer options to
incorporate custom, user-provided gene sets.

## Kyoto Encyclopedia of Genes and Genomes (KEGG)

The [KEGG](https://www.kegg.jp/) pathway database is a widely used resource
that provides a comprehensive collection of manually curated biological
pathways. These pathways cover various biological processes, including
metabolism, cellular processes, genetic information processing, and human
diseases. KEGG pathways integrate information about molecular interactions,
reactions, and relationships between genes, proteins, and other molecules,
helping researchers understand complex biological functions at a systems level.

## Gene Ontology (GO)

[GO](https://geneontology.org/) is a widely used framework for describing the
roles of genes and their products (proteins, RNAs) in biological systems.
Unlike pathway databases that focus on specific molecular interactions, GO
provides a standardized vocabulary for annotating gene functions across species
in three main categories: biological process (BP), molecular function (MF),
and cellular component (CC).

## The Molecular Signatures Database (MSigDB)

[MSigDB](https://www.gsea-msigdb.org/gsea/msigdb) is a comprehensive collection
of gene sets used for interpreting high-throughput gene expression data in
biological research. It is a key resource for gene set enrichment analysis
(GSEA), helping researchers identify biological pathways, processes, and
mechanisms that are overrepresented in a given dataset. MSigDB is divided into
several major collections, each of which contains different sub-categories.
Here’s a list of all the main categories and their sub-categories:

## Reactome

[Reactome](https://reactome.org/) is a freely accessible, curated database of
biological pathways that provides detailed insights into molecular processes
across a wide range of organisms. It covers various cellular and biochemical
processes, such as signal transduction, metabolism, gene expression, and immune
responses. Each pathway in Reactome is represented as a network of molecular
interactions, where entities like proteins, small molecules, and complexes
participate in reactions, including binding, transport, and modifications.
These reactions are organized hierarchically, from individual molecular events
to larger biological processes. Reactome pathways are extensively curated by
domain experts and are used in functional analysis of high-throughput omics
data (e.g., gene expression, proteomics). Reactome integrates experimental
data, enabling researchers to explore the molecular mechanisms behind diseases,
drug responses, and other biological phenomena.

# Types of gene set enrichment analyses

- Threshold-based GSEA, or over-representation analysis (ORA), requires a
  threshold (e.g., on p-value or fold change) to define "significant" genes.
  It tests whether the overlap between that predefined list of significant
  genes (e.g., differentially expressed genes, DEGs) and a gene set is
  statistically significant.
- Ranking-based GSEA, or functional class scoring (FCS), ranks all genes in a
  dataset by a continuous metric (e.g., p-value, fold change, t-statistic) and
  assesses whether the genes in a predefined gene set cluster at the top or
  bottom of this ranked list.

The choice between threshold-based and ranking-based GSEA depends on the
hypothesis being tested, the type of data, and the goals of the analysis.

- Threshold-based GSEA: use for well-defined, significant subsets of
  genes/features, when you have a justifiable cutoff and are interested in
  strong, specific signals.
- Ranking-based GSEA: use when you want to incorporate the full dataset, avoid
  arbitrary cutoffs, or have a hypothesis that requires considering a
  continuous spectrum of feature significance.

# Method options to combine p-value:

```{r session-info}
sessionInfo()
```