---
title: "dmGsea User's Guide"
author:
- name: Zongli Xu
affiliation: Biostatistics & Computational Biology Branch, NIEHS
- name: Alison A. Motsinger-Reif
affiliation: Biostatistics & Computational Biology Branch, NIEHS
- name: Liang Niu
affiliation: Division of Biostatistics Bioinformatics, Univ. of Cincinnati
package: dmGsea
abstract: >
A brief introduction of dmGsea R package for gene set enrichment analysis.
vignette: >
%\VignetteIndexEntry{dmGsea User's Guide}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
output:
BiocStyle::html_document:
toc_float: true
---
# Introduction
In DNA methylation data, genes are often represented by variable
number of correlated probes, and a single probe can map to multiple genes.
This complex data structure poses significant challenges for gene set
enrichment analysis (GSEA), and can lead to biased enrichment results.
The `r Biocpkg("dmGsea")` package offers several functions with novel
methods specifically designed to perform efficient gene set enrichment
analysis while addressing probe dependency and probe number bias.
Compared to alternative packages for DNA methylation data, these methods
effectively utilize probe dependency information, provide higher statistical
power, can well control type I errer rates, and are computationally more
efficient. The package fully supports enrichment analysis for Illumina DNA
methylation array data, and is easily extendable to other types of omics
data when provided with appropriate probe annotation information.
# List of functions
- `gsGene()`: GSEA based on aggregates association signals at gene level
- `gsPG()`: GSEA using summary statistics for independent probe groups based
on gene annotation
- `gsProbe()`: GSEA using probe level p-values
- `gsRank()`: Fast ranking-based GSEA with gene level statistics
# Example Analysis
The following examples are brief demonstrations on how to perform gene set
enrichment analysis using
dmGsea functions.
## Example 1: Differentially methylated probes from EWAS
```{r example1,eval=TRUE, results="hide", message=FALSE, warning=FALSE}
require(dmGsea)
#generating example data
annopkg <- "IlluminaHumanMethylation450kanno.ilmn12.hg19"
anno <- minfi::getAnnotation(eval(annopkg))
#Use a subset of the data in the example to speed up execution
anno <- anno[1:10000,]
probe.p <- data.frame(Name=rownames(anno),p=runif(nrow(anno)))
probe.p$p[1:500] <- probe.p$p[1:500]/100000
Data4cor <- matrix(runif(nrow(probe.p)*100),ncol=100)
rownames(Data4cor) <- rownames(anno)
#geneset enrichment analysis with threshold based method
#for top ranked 1000 genes
gsGene(probe.p <- probe.p,Data4Cor=Data4cor,arrayType="450K",nTopGene=1000,
outGenep=TRUE, method="Threshold",gSetName="KEGG",species="Human",
outfile="gs1",ncore=1)
file.remove("gs1_KEGG_KEGG.csv")
```
- To perform GSEA using significant genes, set `FDRthre = 0.05` to apply
an FDR threshold of 0.05.
- To perform GSEA using a ranking-based method, specify `method = "Ranking"`.
- If the argument `Data4Cor` is not provided, GSEA will be performed without
using the methylation data matrix to account for between-CpG correlation.
- Use `gsPG()` to perform enrichment analysis based on probe group-level
p-values.
- To perform GSEA directly with probe-level p-values without combining them
into gene-level statistics, use the `gsProbe()` function. It applies a
noncentral hypergeometric test to adjust for bias introduced
by variable numbers of probes per gene.
## Example 2: Enrichment analysis for arrays other than 450K and EPIC
```{r example2, eval=TRUE, results="hide", message=FALSE, warning=FALSE}
#generate example dataset
kegg <- getKEGG(species="Human")
gene1 <- unique(as.vector(unlist(kegg[1:5])))
gene2 <- unique(as.vector(unlist(kegg[6:length(kegg)])))
gene1 <- rep(gene1,sample(1:10,length(gene1),replace=TRUE))
gene2 <- rep(gene2,sample(1:10,length(gene2),replace=TRUE))
p11 <- runif(length(gene1))*(1e-3)
p2 <- runif(length(gene2))
geneid <- c(gene1,gene2)
p <- c(p11,p2)
Name <- paste0("cg",1:length(p))
probe.p <- data.frame(Name=Name,p=p)
GeneProbeTable <- data.frame(Name=Name,entrezid=geneid)
dat <- matrix(runif(length(p)*100),ncol=100)
rownames(dat) <- Name
#enrichment analysis
gsGene(probe.p=probe.p,Data4Cor=dat,GeneProbeTable=GeneProbeTable,
method="Threshold",gSetName="KEGG",species="Human",outfile="gs5",
ncore=1)
file.remove("gs5_KEGG_KEGG.csv")
```
- To perform GSEA using a ranking-based method, specify `method = "Ranking"`.
## Example 3: Enrichment analysis with user provided geneset
```{r example3, eval=TRUE, results="hide", message=FALSE, warning=FALSE}
#generatin example dataset
userGeneset <- getKEGG(species="Human")
#enrichment analysis
gsGene(probe.p=probe.p,Data4Cor=dat,GeneProbeTable=GeneProbeTable,
method="Threshold",geneSet=userGeneset,species="Human",outfile="gs7",
ncore=1)
file.remove("gs7_userSet_userSet.csv")
```
- To perform GSEA using a ranking-based method, specify `method = "Ranking"`.
## Example 4: Enrichment analysis for gene expression type of data that do not
## need to combine test statistics
```{r example4, eval=TRUE, results="hide", message=FALSE, warning=FALSE}
#generatin example dataset
kegg <- getKEGG(species="Human")
gene <- unique(as.vector(unlist(kegg)))
p <- runif(length(gene))
names(p) <- gene
stats <- -log(p)*sample(c(1,-1),length(p),replace=TRUE)
#traditional GSEA analysis, enrichment toward higher or lower end of statstics
stats <- sort(stats,decr=TRUE)
gsRank(stats=stats,gSetName="KEGG",scoreType="std",outfile="gs9",nperm=1e4,
ncore=1)
file.remove("gs9_KEGG_KEGG.csv")
file.remove("gsGene_genep.csv")
#enrichment of genes with higher statistics
stats <- sort(abs(stats),decr=TRUE)
```
# Gene set and pathway databases
The package includes built-in support for KEGG, GO, MSigDB, and Reactome gene
sets for both human and mouse pathways. All functions also offer options to
incorporate custom, user-provided gene sets.
## Kyoto Encyclopedia of Genes and Genomes (KEGG)
The [KEGG](https://www.kegg.jp/) pathway database is a widely used resource
that provides a comprehensive collection of manually curated biological
pathways. These pathways cover various biological processes, including
metabolism, cellular processes, genetic information processing, and human
diseases. KEGG pathways integrate information about molecular interactions,
reactions, and relationships between genes,proteins, and other molecules,
helping researchers understand complex biological functions at a systems level.
## Gene Ontology (GO)
[GO](https://geneontology.org/) is a widely used framework for describing the
roles of genes and their products (proteins, RNAs) in biological systems.
Unlike pathway databases that focus on specific molecular
interactions, GO provides a standardized vocabulary for annotating gene
functions across species in three main categories:
- Biological Process (BP): Describes the biological goals a gene or protein
contributes to, such as cell division or metabolic processes.
- Molecular Function (MF): Refers to the specific biochemical activities of
a gene product, like binding or catalysis.
- Cellular Component (CC): Indicates where in the cell a gene product is
active, such as the nucleus, membrane,or cytoplasm.
## The Molecular Signatures Database (MSigDB)
[MSigDB](https://www.gsea-msigdb.org/gsea/msigdb) is a comprehensive collection
of gene sets used for interpreting high-throughput gene expression data in
biological research. It is a key resource for gene set enrichment analysis
(GSEA), helping researchers identify biological pathways, processes, and
mechanisms that are overrepresented in a given dataset.
The Molecular Signatures Database (MSigDB) is divided into several major
collections, each of which contains different sub-categories. Here’s a
list of all the main categories and their sub-categories:
- H: Hallmark Gene Sets. These gene sets represent fundamental biological
processes, combining several similar gene sets into cohesive biological
themes.
- C1: Positional Gene Sets. Gene sets based on the chromosomal location of
genes.
- C2: Curated Gene Sets. C2.CP: Canonical Pathways: Gene sets derived from
well-known pathway databases, including KEGG, Reactome, BioCarta, and others.
C2.CGP: Chemical and Genetic Perturbations: Gene sets derived from published
studies of chemical/genetic perturbations, often based on experimental data.
- C3: Regulatory Target Gene Sets. C3.TFT: Transcription Factor Targets:
Gene sets defined by transcription factor binding motifs. C3.MIR: microRNA
Targets: Gene sets representing genes targeted by specific microRNAs.
- C4: Computational Gene Sets. C4.CGN: Cancer Gene Neighborhoods: Gene
sets computationally derived based on the relationships between genes in
cancer studies. C4.CM: Cancer Modules: Gene sets based on
modules of genes that co-vary across different cancers.
- C5: GO Gene Sets. C5.BP: Biological Process: Gene sets from the biological
process branch of Gene Ontology (GO). C5.CC: Cellular Component: Gene sets
from the cellular component branch of GO. C5.MF: Molecular
Function: Gene sets from the molecular function branch of GO.
- C6: Oncogenic Signatures. Gene sets representing signatures of oncogenic
pathway activation, often based on experimental perturbation of cancer-related
genes.
- C7: Immunologic Signatures. Gene sets derived from immunological studies,
such as immune cell expression profiles, cytokine treatments, and immune
responses.
- C8: Cell Type Signatures. Gene sets representing the expression profiles
of different cell types,
including cell states, tissue types, and developmental stages.
## Reactome
[Reactome](https://reactome.org/) is a freely accessible, curated database of
biological pathways that provides detailed insights into molecular processes
across a wide range of organisms. It covers various cellular and biochemical
processes, such as signal transduction, metabolism, gene expression, and immune
responses. Each pathway in Reactome is represented as a network of molecular
interactions, where entities like proteins, small molecules, and complexes
participate in reactions, including binding, transport, and modifications.
These reactions are organized hierarchically, from individual molecular
events to larger biological processes.Reactome pathways are extensively
curated by domain experts and are used in functional analysis of
high-throughput omics data (e.g., gene expression, proteomics). It integrates
experimental data, enabling researchers to explore the molecular mechanisms
behind diseases, drug responses, and other biological phenomena.
# Types of gene set enrichment analyses
Threshold-Based GSEA or Over-Representation Analysis. It requires a threshold
to define "significant" genes (e.g., p-value, fold change) and tests whether
the overlap between a predefined list of differentially expressed genes (DEGs)
and a gene set is statistically significant.
Ranking-Based or functional class scoring (FCS) GSEA, ranks all genes in a
dataset based on a continuous metric (e.g., P-value, fold change, t-statistic)
and assesses whether the genes in a predefined gene set cluster at the top or
bottom of this ranked list.
The choice between threshold-based and ranking-based Gene Set Enrichment
Analysis (GSEA) depends on the nature of the hypothesis being tested, the type
of data, and the goals of the analysis. Threshold-Based GSEA: Use for
well-defined, significant subsets of genes/features when you have a justifiable
cutoff and are interested in strong, specific signals. Ranking-Based GSEA: Use
when you want to incorporate the full dataset, avoid arbitrary cutoffs, or
have a hypothesis that requires considering a continuous spectrum of feature
significance.
# Method options to combine p-value:
- Fisher’s Method is sensitive to small p-values and works well when you
expect strong evidence in a few tests.
- Inverse Chi-square Method (invchisq) is more flexible than Fisher’s,
allowing weighting and handling of dependent or independent p-values.
- Stouffer’s Method balances contributions from small and large p-values
and works well with correlated p-values.
- Tippett’s Method is highly sensitive to the smallest p-value and is best
when one test is expected to show a strong signal.
```{r session info}
sessionInfo()
```