Self-training Gene Clustering Pipeline (SGCP
) is a framework for gene co-expression network construction and analysis. The goal in these networks is to group the genes in a way that those with similar expression pattern fall within the same network cluster, commonly called module. SGCP
consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner. But unlike all existing frameworks, it further incorporates a novel step that leverages Gene Ontology (GO) information in a semi-supervised clustering method that further improves the quality of the computed modules. SGCP
results in highly enriched modules. Preprint of manuscript describing SGCP
in details is available
in here.
SGCP
installationTo install the package SGCP
use the following. For more information please visit here). In this manual guide, SGCP
also relies on two more packages; SummarizedExperiment
and org.Hs.eg.db
.
library(BiocManager)
BiocManager::install('SGCP')
Let’s start.
library(SGCP)
SGCP
General InputSGCP
has three main input; gene expression, geneID, and genome wide annotation
database.
expData is a matrix or a dataframe of size m*n
where m
and n
are the number of genes and samples respectively and it can be either DNA-microarray or RNA-seq . In another words, in expData
, rows and columns correspond to genes and samples respectively and the entry i,j
is an expression value for gene i
in sample j
. SGCP
does not perform any normalization or correction for batch effects and it is assumed that these
preprocessing steps have been already performed.
geneID is a vector of size m
where entry at index i
denotes the
gene identifier for gene i
. Note that there is one-to-one correspondence
between the rows in expData
and geneID
vector where index i
in geneId
indicates the gene identifier for row i
in expData
.
annotation_db the name of a genome wide annotation package of the organism
of interest for gene ontology (GO) enrichment step. annotation_db
must be
installed by user prior using SGCP
.
Here, are some important annotation_db
along with its corresponding identifiers.
organism | annotation_db | gene identifier |
---|---|---|
Homo sapiens (Hs) | org.Hs.eg.db | Entrez Gene identifiers |
Drosophila melanogaster (Dm) | org.Dm.eg.db | Entrez Gene identifiers |
Rattus norvegicus (Rn) | org.Rn.eg.db | Entrez Gene identifiers |
Mus musculus (Mm) | org.Mm.eg.db | Entrez Gene identifiers |
Arabidopsis thaliana (At) | org.At.tair.db | TAIR identifiers |
Note that genes: * must have expression values across all samples (i.e. no missing value). * must have non-zero variance across all the samples. * must have exactly one unique identifier, say geneID. * must have GO annotation.
Note that SGCP
depends on GOstats
for GO enrichment, thus (geneID) and (annotation_db) must be compatible to
this package standard.
SGCP
Input ExampleFor illustrative purposes we will use an example of a gene expression provided with SGCP
.
This data originally represent 5000 genes in 57 samples for Homo sapiens,
for more information see (Cheng et al.) Here, to ease the computation, 1000 genes that have the most variance selected with 5 samples have been selected.
For this input example we need to install the following packages.
The input example is organized as an object of SummarizedExperiment
. The assay
field shows the expression matrix in which rows and columns correspond to gene and samples respectively. rowData
denotes the gene Entrez ids for the gene. Note that there is 1-to-1 correspondence between rows in assays
and rowData
fields, thus rowData
at index i
shows the gene Entrez id for gene at row i
in assays
field. We call this object cheng
.
annotation_db
is org.Hs.eg.db, and is required to be installed if is not, see
link. Let’s look at the input example.
library(SummarizedExperiment)
data(cheng)
cheng
## class: SummarizedExperiment
## dim: 1500 10
## metadata(0):
## assays(1): Normalized gene expression Of ischemic cardiomyopathy (ICM)
## rownames(1500): 7503 100462977 ... 1152 11082
## rowData names(2): SYMBOL ENTREZID
## colnames(10): SRR7426784 SRR7426785 ... SRR7426792 SRR7426793
## colData names(1): sampleName
print("gene expression...")
## [1] "gene expression..."
print("rownames and colnames correspond to gene Entrez ids and sample names")
## [1] "rownames and colnames correspond to gene Entrez ids and sample names"
head(assay(cheng))
## SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503 4.608516 4.822920 11.477016 11.164324 10.631023 5.372338
## 100462977 6.566498 11.170815 10.721761 7.120300 10.991193 12.889641
## 4878 6.925341 9.098948 9.550381 7.742856 7.539139 6.818301
## 4879 6.020677 10.545173 6.398164 8.415020 7.936110 5.492805
## 6192 9.256628 9.511192 4.276300 3.705296 3.392204 9.581542
## 9086 9.098932 9.110217 3.858846 3.077543 2.923131 9.426101
## SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503 4.618841 4.628351 4.398288 4.686936
## 100462977 12.024684 12.295130 7.481746 6.661691
## 4878 6.181450 6.083118 9.985733 7.622827
## 4879 5.255798 6.266800 8.812797 8.678046
## 6192 9.211812 9.109689 9.309379 9.158489
## 9086 9.567030 9.557452 9.316588 9.059718
print(" \n gene ids...")
## [1] " \n gene ids..."
print("rownames are the gene Entrez ids")
## [1] "rownames are the gene Entrez ids"
head(rowData(cheng))
## DataFrame with 6 rows and 2 columns
## SYMBOL ENTREZID
## <character> <character>
## 7503 XIST 7503
## 100462977 MTRNR2L1 100462977
## 4878 NPPA 4878
## 4879 NPPB 4879
## 6192 RPS4Y1 6192
## 9086 EIF1AY 9086
This data has the dimension of 1500 genes and 10 samples.
Now, we are ready to create the three main inputs. Using assay
and rowData
functions we can have access to expression matrix and gene Entrez ids respectively.
message("expData")
## expData
expData <- assay(cheng)
head(expData)
## SRR7426784 SRR7426785 SRR7426786 SRR7426787 SRR7426788 SRR7426789
## 7503 4.608516 4.822920 11.477016 11.164324 10.631023 5.372338
## 100462977 6.566498 11.170815 10.721761 7.120300 10.991193 12.889641
## 4878 6.925341 9.098948 9.550381 7.742856 7.539139 6.818301
## 4879 6.020677 10.545173 6.398164 8.415020 7.936110 5.492805
## 6192 9.256628 9.511192 4.276300 3.705296 3.392204 9.581542
## 9086 9.098932 9.110217 3.858846 3.077543 2.923131 9.426101
## SRR7426790 SRR7426791 SRR7426792 SRR7426793
## 7503 4.618841 4.628351 4.398288 4.686936
## 100462977 12.024684 12.295130 7.481746 6.661691
## 4878 6.181450 6.083118 9.985733 7.622827
## 4879 5.255798 6.266800 8.812797 8.678046
## 6192 9.211812 9.109689 9.309379 9.158489
## 9086 9.567030 9.557452 9.316588 9.059718
dim(expData)
## [1] 1500 10
message(" \n geneID")
##
## geneID
geneID <- rowData(cheng)
geneID <- geneID$ENTREZID
head(geneID)
## [1] "7503" "100462977" "4878" "4879" "6192" "9086"
length(geneID)
## [1] 1500
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
annotation_db <- "org.Hs.eg.db"
SGCP
Pipeline Parameters and WorkflowSGCP
is based on 5
main steps to produce the final modules.
Each step offers parameters that can be adjusted by the user as follow. Each step
is implemented in a single function.
adjacencyMatrix
function)
calibration
: boolean, default FALSE, if TRUE SGCP
performs calibration step
for more information see link.norm
: boolean, default TRUE, if TRUE SGCP
divides each genes (rows) by its norm2.tom
: boolean, default TRUE, if TRUE SGCP
adds TOM to the network.saveAdja
: boolean, default FALSE, if TRUE, the adjacency matrix will be saved .adjaNameFile
: string indicates the name of file for saving adjacency matrix.hm
: string indicates the name of file for saving adjacency matrix heat map.clustering
function)
kopt
: an integer, default NULL, denotes the optimal number of clusters k by the user.method
: either “relativeGap”, “secondOrderGap”, “additiveGap”, or NULL,
default NULL. Defines the method for number of cluster.func.GO
: a function for gene ontology validation, default is sum.func.conduct
: a function for conductance validation, default is min.maxIter
: an integer, identifies the maximum number of iteration in kmeans.numStart
: an integer, identifies the number of start in kmeans.saveOrig
: boolean, default TRUE, if TRUE, keeps the transformed matrix.n_egvec
: either “all” or an integer, default = 100, indicates the
number of columns of transformed matrix to be kept.sil
: boolean, default FALSE, if TRUE, calculates silhouette index per gene.
at the end of this step, initial clusters are produced.geneOntology
function)
direction
: test direction, default c(“over”, “under”), for over-represented, or under-represented GO termsontology
: GO ontologies, default c(“BP”, “CC”, “MF”), BP: Biological Process, CC: Cellular Component, MF: Molecular Function.hgCutoff
: a numeric value in (0,1) as the p-value baseline, default 0.05, GO terms smaller than hgCutoff
value are kept.cond
: boolean, default TRUE, if TRUE conditional hypergeometric test is performed.semiLabeling
function)
cutoff
: a numeric in (0, 1) default NULL, is a base line for GO term significancy to identify remarkable and unremarkable genes.percent
: a number in (0,1) default 0.1, indicate the percentile for finding top GO terms.stp
: a number in (0,1) default 0.01, indicates increasing value to be added to percent
parameter.semiSupevised
function)
model
: either “knn” or “lr” for classification model, knn: k nearest neighbors, lr: logistic regression.kn
: an integer default NULL indicating the number of neighbors in knn,
if kn
is NULL, then kn
= 20 : (20 + 2 * k) if 2 * k < 30 otherwise 20 : 30,
at the end of this step, final modules are produced.At the end, SGCP
performs on more step of Gene Ontology Enrichment on the final modules.
Detailed of the steps are available in the manuscript in SGCP.
In Network Construction, user can identify of any of steps to be
added to the network. In Network Clustering, If kopt
is not null,
SGCP
will find clusters based on kopt
. Otherwise, if method
is not null,
SGCP
will pick k based on the selected method. Otherwise, if geneID
and
annotation_db
is null, SGCP
will determine the optimal method and its
corresponding number of cluster based on conductance validation. It picks a
method that func.conduct
(default min) on its cluster is minimum. Otherwise,
SGCP
will use gene ontology validation (by default) to find the optimal method and its
corresponding number of clusters. To this end, it performs gene ontology enrichment
on the cluster with minimum conductance index per method and pick the one that
has the maximum func.GO
(default sum) over -log10 of p-values. In
Gene Semi-labeling step, if cutoff
is not NULL, SGCP
considers genes
associated to GO terms more significant than\(~\)cutoff
\(~\) as remarkable. Otherwise,
SGCP
collects all GO terms from all clusters and picks the percent
(default 0.1)
mot significant GO terms among them. If Genes associated to these significant terms
come from more than a single cluster, SGCP
takes these genes as remarkable.
Otherwise, it adds\(~\)stp
\(~\)to\(~\)percent
\(~\)and repeat this process until remarkable genes
come from at least two clusters. In Semi-supervised Classification
remarkable clusters are the clusters that have at least one remarkable gene.
ezSGCP
FunctionezSGCP
function implements the SGCP
pipeline in one function. Parameters are the same as
here except calib
corresponds to calibration
parameter in
Network Construction method_k
, f.GO
, f.conduct
,
maxIteration
, numberStart
parameters correspond to method
, func.GO
,
func.conduct
, maxIter
, numStart
in Network Clustering respectively.
dir
, onto
, hgCut
, condTest
correspond to direction
, ontology
,
hgCutoff
, and cond
in Gene Ontology Enrichment respectively. semilabel
also is a boolean parameter, that if is FALSE, SGCP
will stop after initial clusters. For full parameter description, see the help document.
Below shows how to call this function. Because, this may take up to 15 minutes to be completed, I have already stored the result as sgcp
and commented the function. For your practice, uncomment the function and run it. In this call, I set the sil
parameter to TRUE to get the gene silhouette index.
# sgcp <- ezSGCP(expData = expData, geneID = geneID, annotation_db = annotation_db,
# eff.egs = FALSE , saveOrig = FALSE, sil = TRUE, hm= NULL)
data(sgcp)
summary(sgcp, show.data=TRUE)
## Length Class Mode
## semilabel 1 -none- logical
## clusterLabels 3 data.frame list
## clustering 13 -none- list
## initial.GO 2 -none- list
## semiLabeling 2 -none- list
## semiSupervised 3 -none- list
## final.GO 2 -none- list
If you run the function, you will see the following during the execution, which tells you the current state of ezSGCP
function.
## starting network construction step...
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
## it may take time...
## network is created, done!...
## starting network clustering step...
## calculating normalized Laplacian
## it may take time...
## calculating eigenvalues/vectors
## it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3
## Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method secondOrderGap...
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method additiveGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index
## it may take time...
## network clustering is done...
## starting initial gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
## starting semi-labeling stage...
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..
## starting semi-supervised step...
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
## it may take time
## Loading required package: ggplot2
## Loading required package: lattice
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 4 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
## 24-nearest neighbor model
## Training set outcome distribution:
## 1 2
## 498 667
## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
## 1 2
## 130 205
## semi-supervised done!..
## starting final gene ontology enrichment step...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
## ezSGCP done!..
As it is seen, SGCP
print out information of each step.
In this example, there are three potential number of clusters, 2, 3, and 5, and based on gene ontology validation, 2 is selected as the optimal number of cluster.
We found out
SGCP
performs well with default parameters in most of cases. However, users have option to change the setting according to their need. For instance, there are cases thatcalib
increases the final module enrichment. Similarly, adding TOM is not always helpful. By defaultSGCP
assumes that user does not know the exact number of cluster, nor does it know the method for it. Therefore, it uses gene ontology validation to identify the initial clusters. We highly recommend users to performSGCP
on the three potential metod “relativeGap”, “secondOrderGap”, “additiveGap”.
SGCP
allows user to visualize the result. SGCP_ezPLOT
takes sgcp
result from
ezSGCP
function along with expData
and geneID
. It returns two files; excel
and pdf depending on the initial call. User also can keep the plotting object
by setting keep = TRUE. Here, we set silhouette_in
to TRUE to see the silhouette index plot.
message("PCA of expression data without label")
SGCP_ezPLOT(sgcp = sgcp, expreData = expData, silhouette_index = TRUE, keep = FALSE)
Note that in order to store files in xlsx format, excel must be installed on your system.
For the detailed parameter explanation visit the help document.
SGCP
In DetailThe ezSGCP
function explained above is a wrapper that calls several other SGCP
function in the following order. Now, we will go through each step in detail using example data.
adjacencyMatrix
function, performs Network Construction stepclustering
function, performs Network Clustering stepgeneOntology
function, performs Gene Ontology Enrichment step (produces initial clusters)semiLabeling
function, performs Semi-labeling stepsemiSupervised
function, performs Semi-supervised stepgeneOntology
function, performs Gene Ontology Enrichment step (produces final modules)Lets remove sgcp
for space efficiency.
rm(sgcp)
Here, we will skip the detail of function’s input and output as it is described in the help document. SGCP
allows user to visualize the PCA of the input gene expression data using SGCP_plot_pca
function.
message("Plotting PCA of expression data")
## Plotting PCA of expression data
pl <- SGCP_plot_pca(m = expData, clusLabs = NULL, tit = "PCA plot", ps = .5)
print(pl)
adjacencyMatrix
functionNetwork Construction step is implemented using adjacencyMatrix
function. By default, this function divides each vector of genes by its norm 2 and used Gaussian kernel metric for similarity function. It then adds the information of the second order of the genes’ neighborhood in the form of TOM to the network. Code below shows how to use this function.
resAdja <- adjacencyMatrix(expData = expData, hm = NULL)
## normalization...
## Gaussian kernel...
## it may take time...
## TOM...
## it may take time...
## network is created, done!...
resAdja[0:10, 0:5]
## 1 2 3 4 5
## 1 0.000000e+00 1.561704e-23 1.684533e-18 5.096042e-21 2.304447e-34
## 2 1.561704e-23 0.000000e+00 3.073427e-08 9.883634e-11 1.560804e-14
## 3 1.684533e-18 3.073427e-08 0.000000e+00 2.990119e-04 3.639579e-13
## 4 5.096042e-21 9.883634e-11 2.990119e-04 0.000000e+00 2.664746e-14
## 5 2.304447e-34 1.560804e-14 3.639579e-13 2.664746e-14 0.000000e+00
## 6 1.880926e-37 1.859059e-16 1.732464e-15 6.205653e-17 6.756836e-01
## 7 7.700277e-16 3.234158e-06 1.891155e-02 3.050233e-04 7.223951e-11
## 8 4.549591e-34 2.956936e-14 5.344484e-13 1.850700e-14 6.616417e-01
## 9 2.052374e-34 1.424767e-14 2.918285e-13 2.113249e-14 7.306804e-01
## 10 5.767117e-34 3.294000e-14 6.190725e-13 3.546475e-14 7.471861e-01
resAdja
is the adjacency matrix of the gene co-expression network. We can visualize the heatmap of the adjacency matrix using\(~\)SGCP_plot_heatMap
function as follow.To see the heatmap, uncomment the following line of codes.
#message("Plotting adjacency heatmap")
#pl <- SGCP_plot_heatMap(m = resAdja, tit = "Adjacency Heatmap",
# xname = "genes", yname = "genes")
#print(pl)
clustering
functionNetwork Clustering step is implemented using clustering
function. This function takes the adjacency matrix , geneID
, and annotation_db
, and returns a list of following, depending on the initial call. I set the sil
parameter to TRUE
to get the silhouette index per gene.
dropped.indices
: a vector of indices of dropped genes. Some genes may be noise, and therefore dropped throughout this function.
geneID
: a vector of geneID.
method
: indicates the selected method for number of cluster.
k
: number of clusters.
Y
: transformed matrix with 2*k columns.
X
: eigenvalues correspond to 2*k columns in Y.
cluster
: an object of class “kmeans”.
clusterLabels
: a vector containing the cluster label per gene, there is a 1-to-1 correspondence between geneID
and clusterLabes
.
conductance
: a list containing mean and median, and individual cluster conductance index for clusters per method. Index in clusterConductance
field denotes the cluster label and the value shows the conductance index.
cvGOdf
: a dataframe used for gene ontology validation. For each method, it returns the gene ontology enrichment result on the cluster with minimum conductance index.
cv
: an string indicates the validation method for number of cluster, “cvGO”: if gene ontology validation used, “cvConductance”: if conductance validation used, “userMethod”: if user defined the method, “userkopt”: if user defines the kopt.
clusterNumberPlot
: plotting object for relativeGap"“,”secondOrderGap“, and”additiveGap".
silhouette
: a dataframe that indicates the silhouette indices for genes.
original
: a list with matrix transformation and corresponding eigenvalues and n_egvec
, where n_egvec
top columns of transformation is kept.
Code below shows how to use this function. Again, since it takes time, I have commented the code, and use the storing resClus
as the output of this function. For your practice uncomment and run it.
# resClus <- clustering(adjaMat = resAdja, geneID = geneID, annotation_db = annotation_db,
# eff.egs = FALSE , saveOrig = FALSE, sil = TRUE)
data(resClus)
summary(resClus)
## Length Class Mode
## dropped.indices 0 -none- numeric
## geneID 1500 -none- character
## method 1 -none- character
## k 1 -none- numeric
## Y 6000 -none- numeric
## X 4 -none- numeric
## cluster 9 kmeans list
## clusterLabels 1500 -none- character
## conductance 3 -none- list
## cvGOdf 9 data.frame list
## cv 1 -none- character
## clusterNumberPlot 3 -none- list
## silhouette 3 data.frame list
# lets drop noisy genes from expData
geneID <- resClus$geneID
if(length(resClus$dropped.indices)>0 ){ expData <- expData[-resClus$dropped.indices, ]}
# removing the adjacency matrix
rm(resAdja)
If you run the code, you will see the following will be printout for you.
## calculating normalized Laplacian
## it may take time...
## calculating eigenvalues/vectors
## it may take time...
## n_egvec is 100
## number of clusters for relativeGap method is 2
## number of clusters for secondOrderGap method is 5
## number of clusters for additiveGap method is 3
## Gene Ontology Validation...
## method relativeGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method secondOrderGap...
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method additiveGap....
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## method relativeGap is selected using GO validation and k is 2
## calculating the Silhouette index
## it may take time...
## network clustering is done...
We saw that three methods (“relativeGap”, “secondOrderGap”, “additiveGap”) resulted in three distinct potential number of clusters. SGCP
picked “secondOrderGap” after gene ontology validation which is 2. cv
field is “cvGO”, which indicates that k is found based on gene ontology validation. In original
field, you can have the n_egvec first columns ( eigenvectors) and eigenvalues of the transformation matrix. This might be useful for further investigation.
Now we can see the plot of PCA on the expression and transformed data with and without the labels using SGCP_plot_pca
. For practice, uncomment the following and see the resulting PCAs.
message("Plotting PCA of expression data with label")
## Plotting PCA of expression data with label
# pl <- SGCP_plot_pca(m = expData, clusLabs = NULL, tit = "PCA plot without label", ps = .5)
# print(pl)
message("Plptting PCA of transformed data without label")
## Plptting PCA of transformed data without label
# pl <- SGCP_plot_pca(m = resClus$Y, clusLabs = NULL, tit = "PCA plot without label", ps = .5)
# print(pl)
message("Plotting PCA of transformed data with label")
## Plotting PCA of transformed data with label
# pl <- SGCP_plot_pca(m = resClus$Y, clusLabs = resClus$clusterLabels, tit = "PCA plot with label", ps = .5)
# print(pl)
In resClus
the plotting objects of methods of number of clusters is available.
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
print(resClus$clusterNumberPlot$relativeGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
print(resClus$clusterNumberPlot$secondOrderGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance"){
message("plotting relativeGap, secondOrderGap, additiveGap methods ...")
print(resClus$clusterNumberPlot$additiveGap)
}
## plotting relativeGap, secondOrderGap, additiveGap methods ...
In SGCP
, we can illustrate the conductance index per cluster using SGCP_plot_conductance
function. Conductance index is calculated if both kopt
and method
are NULL
. In other words, when one of these parameter is known, SGCP
does not need perform validation to find the cluster and therefore will skip the conductance computation.
# checking if conductance index is calculated
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance" ){
message("plotting cluster conductance index...")
pl <- SGCP_plot_conductance(conduct = resClus$conductance,
tit = "Cluster Conductance Index",
xname = "clusterLabel", yname = "conductance index")
print(pl)}
## plotting cluster conductance index...
In here we obsereved that cluster with lower conductance index tend to have higher enrichment. This analysis confirms this observation. In clustering
function, the cluster with minimum conductance for method relativeGap (cluster rg1), secondOrderGap (cluster sg4), and additiveGap (cluster ag1) are compared using their gene ontology enrichment. And among them cluster rg1 has higher enrichment. Interestingly, this cluster also has lower conductance index in compare with ag1, and sg4.
SGCP
also can plot the silhouette index per gene if sil
parameter is TRUE
in clustering
function.
# checking if silhouette index is calculated
if(resClus$cv == "cvGO" || resClus$cv == "cvConductance" ){
message("plotting gene silhouette index...")
pl <- SGCP_plot_silhouette(resClus$silhouette, tit = "Gene Silhouette Index",
xname = "genes", yname = "silhouette index")
print(pl)}
## plotting gene silhouette index...
Silhouette index ranges between -1 and 1. Closer to 1, the better the gene is explained by the cluster. As it is seen, majority of genes have Silhouette index very close to 1 which means that genes are well-described by the clusters based on Silhouette index perspective. Interestingly, in worse case scenario, genes have zero silhouette index, and there is no negative index for any gene.
geneOntology
functionGene Ontology Enrichment step is implemented using geneOntology
function. This function returns the gene ontology enrichment on the clusters as GOresults
. It also returns the FinalGOTermGenes
list which identifies the gene IDs correspond to the GO terms found GOresults
. Code below shows how to use this function. Again, since it takes time, I have commented the code, and use the storing resInitialGO
as the output of this function. For your practice uncomment and run it.
# resInitialGO <- geneOntology(geneUniv = geneID, clusLab = resClus$clusterLabels,
# annotation_db = annotation_db)
data(resInitialGO)
head(resInitialGO$GOresults)
## clusterNum GOtype GOID Pvalue OddsRatio ExpCount Count Size
## 1 1 underMF GO:0005201 5.280629e-15 0.1084677 41.36170 10 72
## 2 1 underMF GO:0005198 2.081705e-11 0.2448824 62.04255 29 108
## 3 1 underCC GO:0031012 1.276834e-10 0.3605085 107.53125 67 186
## 4 1 underCC GO:0030312 2.159404e-10 0.3663565 108.10938 68 187
## 5 1 underCC GO:0005576 1.076738e-09 0.5227524 350.34375 294 606
## 6 1 underCC GO:0071944 1.653199e-09 0.5292955 446.31250 390 772
## Term
## 1 extracellular matrix structural constituent
## 2 structural molecule activity
## 3 extracellular matrix
## 4 external encapsulating structure
## 5 extracellular region
## 6 cell periphery
If you run the code, you will see the following will be printout for you.
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
Above shows the enrichment dataframe returned from GOstats (link)[https://bioconductor.org/packages/release/bioc/html/GOstats.html] package. clusterNum
indicates the cluster label.
SGCP
has four functions that illustrate the gene ontology enrichment result; SGCP_plot_jitter
, SGCP_plot_density
, SGCP_plot_bar
, and SGCP_plot_pie
. Note that p-values are log-transformed, and therefore, The higher the point, the more significant the GO term is.
message("plotting initial GO p-values jitters...")
## plotting initial GO p-values jitters...
pl <- SGCP_plot_jitter(df = resInitialGO$GOresults,
tit = "Initial GO p-values",
xname = "cluster", yname = "-log10 p-value", ps = 3)
print(pl)
message("plotting initial GO p-values density")
## plotting initial GO p-values density
pl <- SGCP_plot_density(df = resInitialGO$GOresults,
tit = "Initial GO p-values Density",
xname = "cluster", yname = "-log10 p-value")
print(pl)
## Picking joint bandwidth of 0.199
message("plotting initial GO p-values mean")
## plotting initial GO p-values mean
pl <- SGCP_plot_bar(df = resInitialGO$GOresults, tit = "Cluster Performance",
xname = "cluster", yname = "mean -log10 p-value")
print(pl)
message("plotting initial GO p-values pie chart...")
## plotting initial GO p-values pie chart...
pl <- SGCP_plot_pie(df = resInitialGO$GOresults, tit = "Initial GO Analysis",
xname = "cluster", yname = "count", posx = 1.8)
print(pl)
Label beside each segment indicates the log-transformed p-value of the most significant term found in that segment.
semiLabeling
functionGene Semi-labeling step is implemented using semiLabeling
function. This function takes the result from geneOntology
function and returns a dataframe that identify remarkable and unremarkable genes along with the cutoff value. Code below shows how to use this function.
resSemiLabel <- semiLabeling(geneID = geneID,
df_GO = resInitialGO$GOresults,
GOgenes = resInitialGO$FinalGOTermGenes)
## cutoff value is 9.28130152105493e-05
## semiLabeling done!..
print(head(resSemiLabel$geneLabel))
## geneID label
## 1 7503 1
## 2 100462977 <NA>
## 3 4878 2
## 4 4879 2
## 5 6192 1
## 6 9086 <NA>
table(resSemiLabel$geneLabel$label)
##
## 1 2
## 667 498
Above table shows the genes and their corresponding cluster label if is remarkable otherwise NA. we say a cluster is remarkable if it has at least one remarkable gene. Therefore, both clusters are remarkable here. In here, we discussed that if number of clusters is larger than 2, then SGCP
will wipe out unremarkable clusters through the Semi-labeling and Semi-supervised steps. Gene in these clusters will be placed in remarkable clusters.. In other words, number of modules may drop but the module enrichment will increase. These steps, in fact, change the original clusters produced by clustering
function and produce a different set of clusters. Therefore, SGCP
returns to set of clusters, (i) immediately as the clustering
function output, and (ii) after semiSupervised
function. To distinguish these two sets, we call (i) and (ii) as clusters and modules respectively. In the manuscript, we also distinguish them using pSGCP
and SGCP
.
semiSupervised
functionGene Semi-supervised step is implemented using semiSupervised
function. Final module labeling is available in FinalLabeling
which is a dataframe of geneID with its corresponding semi-label and final label. Code below shows how to use this function.
resSemiSupervised <- semiSupervised(specExp = resClus$Y,
geneLab = resSemiLabel$geneLabel,
model = "knn", kn = NULL)
## creating train and test sets based on remarkable and unremarkable genes...
## number of remarkable genes is 1165
## number of unremarkable genes is 335
## performing knn...
## it may take time
## Loading required package: ggplot2
## Loading required package: lattice
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 4 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
## 21-nearest neighbor model
## Training set outcome distribution:
##
## 1 2
## 667 498
## class assignment for unremarkable genes...
## top class labels, and bottom number of predicted genes
## prediction
## 1 2
## 205 130
## semi-supervised done!..
print(head(resSemiSupervised$FinalLabeling))
## geneID semiLabel FinalLabel
## 1 7503 1 1
## 2 100462977 <NA> 2
## 3 4878 2 2
## 4 4879 2 2
## 5 6192 1 1
## 6 9086 <NA> 1
print(table(resSemiSupervised$FinalLabeling$FinalLabel))
##
## 1 2
## 872 628
In this step, k-nearest neighbor model is selected. It hyper-parameter is selected based on cross validation on accuracy metric. Table above shows the gene semi label and final labeling.
Now you can see the PCA plot with the final labeling using SGCP_plot_pca
function. For practice uncomment the following to see the resulting PCAs.
# message("Plotting PCA of transformed data with label")
# pl <- SGCP_plot_pca(m = expData,
# clusLabs = resSemiSupervised$FinalLabeling$FinalLabel,
# tit = "PCA plot with label", ps = .5)
print(pl)
# message("Plotting PCA of transformed data with label")
# pl <- SGCP_plot_pca(m = resClus$Y,
# clusLabs = resSemiSupervised$FinalLabeling$FinalLabel,
# tit = "PCA plot with label", ps = .5)
#print(pl)
geneOntology
functionFinally, Gene Ontology Enrichment step is performed one more time on the final modules for module enrichment.A gain, since it takes time, I have commented the code, and use the storing resFinalGO
as the output of this function. For your practice uncomment and run it.
# resFinalGO <- geneOntology(geneUniv = geneID, clusLab = resSemiSupervised$FinalLabeling$FinalLabel,
# annotation_db = annotation_db)
data(resFinalGO)
head(resFinalGO$GOresults)
## clusterNum GOtype GOID Pvalue OddsRatio ExpCount Count Size
## 1 1 underMF GO:0005201 4.792355e-15 0.1081310 41.41277 10 72
## 2 1 underMF GO:0005198 1.872816e-11 0.2440998 62.11915 29 108
## 3 1 underCC GO:0031012 1.118311e-10 0.3593320 107.65761 67 186
## 4 1 underCC GO:0030312 1.893638e-10 0.3651603 108.23641 68 187
## 5 1 underCC GO:0005576 8.138271e-10 0.5201268 350.75543 294 606
## 6 1 underCC GO:0071944 1.165513e-09 0.5259400 446.83696 390 772
## Term
## 1 extracellular matrix structural constituent
## 2 structural molecule activity
## 3 extracellular matrix
## 4 external encapsulating structure
## 5 extracellular region
## 6 cell periphery
If you run the code, you will see the following will be printout for you.
## GOenrichment for cluster 1
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## GOenrichment for cluster 2
## calling GOstats for overBP...
## identifying genes for each GOTerm...
## calling GOstats for overCC...
## identifying genes for each GOTerm...
## calling GOstats for overMF...
## identifying genes for each GOTerm...
## calling GOstats for underBP...
## identifying genes for each GOTerm...
## calling GOstats for underCC...
## identifying genes for each GOTerm...
## calling GOstats for underMF...
## identifying genes for each GOTerm...
## gene ontology done!..
Now let see the corresponding plots for final module enrichment.
message("plotting final GO p-values jitters...")
## plotting final GO p-values jitters...
pl <- SGCP_plot_jitter(df = resFinalGO$GOresults, tit = "Final GO p-values",
xname = "module", yname = "-log10 p-value", ps = 3)
print(pl)
message("plotting final GO p-values density...")
## plotting final GO p-values density...
pl <- SGCP_plot_density(df = resFinalGO$GOresults,
tit = "Final GO p-values Density",
xname = "module", yname = "-log10 p-value")
print(pl)
## Picking joint bandwidth of 0.197
rm(pl)
message("plotting final GO p-values mean...")
## plotting final GO p-values mean...
pl <- SGCP_plot_bar(df = resFinalGO$GOresults, tit = "Module Performance",
xname = "module", yname = "mean -log10 p-value")
print(pl)
rm(pl)
message("plotting final GO p-values pie xhart...")
## plotting final GO p-values pie xhart...
pl <- SGCP_plot_pie(df = resFinalGO$GOresults, tit = "Final GO Analysis",
xname = "module", yname = "count", posx = 1.9)
print(pl)
rm(pl)
Comparing this result with initial clusters, it is observed that gene ontology enrichment has been increased slightly. We found out, if all initial clusters are remarkable, like here, Semi-labeling and Semi-supervised steps does not change the genes cluster labels fundamentally, and initial clusters converges to final clusters. Otherwise, SGCP
will wiped out the non-significant clusters and increases the enrichment of remarkable clusters. And, final modules are different from initial clusters. You can see the detail in manuscript SGCP manuscript.
rm(resClus, resInitialGO, resSemiLabel, resSemiSupervised, resFinalGO, pl)
sI <- sessionInfo()
sI
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] caret_6.0-94 lattice_0.22-5
## [3] ggplot2_3.4.4 org.Hs.eg.db_3.18.0
## [5] AnnotationDbi_1.64.0 SummarizedExperiment_1.32.0
## [7] Biobase_2.62.0 GenomicRanges_1.54.0
## [9] GenomeInfoDb_1.38.0 IRanges_2.36.0
## [11] S4Vectors_0.40.0 BiocGenerics_0.48.0
## [13] MatrixGenerics_1.14.0 matrixStats_1.0.0
## [15] SGCP_1.2.0 knitr_1.44
## [17] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.7
## [4] magrittr_2.0.3 magick_2.8.1 farver_2.1.1
## [7] rmarkdown_2.25 zlibbioc_1.48.0 vctrs_0.6.4
## [10] memoise_2.0.1 RCurl_1.98-1.12 htmltools_0.5.6.1
## [13] S4Arrays_1.2.0 cellranger_1.1.0 pROC_1.18.4
## [16] SparseArray_1.2.0 parallelly_1.36.0 sass_0.4.7
## [19] bslib_0.5.1 plyr_1.8.9 lubridate_1.9.3
## [22] rootSolve_1.8.2.4 cachem_1.0.8 lifecycle_1.0.3
## [25] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.6-1.1
## [28] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.11
## [31] future_1.33.0 digest_0.6.33 Exact_3.2
## [34] colorspace_2.1-0 RSpectra_0.16-1 RSQLite_2.3.1
## [37] labeling_0.4.3 timechange_0.2.0 fansi_1.0.5
## [40] httr_1.4.7 abind_1.4-5 compiler_4.3.1
## [43] proxy_0.4-27 bit64_4.0.5 withr_2.5.1
## [46] DBI_1.1.3 MASS_7.3-60 lava_1.7.2.1
## [49] DelayedArray_0.28.0 gld_2.6.6 ModelMetrics_1.2.2.2
## [52] Category_2.68.0 tools_4.3.1 zip_2.3.0
## [55] future.apply_1.11.0 nnet_7.3-19 glue_1.6.2
## [58] nlme_3.1-163 grid_4.3.1 reshape2_1.4.4
## [61] generics_0.1.3 recipes_1.0.8 gtable_0.3.4
## [64] class_7.3-22 data.table_1.14.8 lmom_3.0
## [67] utf8_1.2.4 XVector_0.42.0 foreach_1.5.2
## [70] pillar_1.9.0 stringr_1.5.0 genefilter_1.84.0
## [73] splines_4.3.1 dplyr_1.1.3 survival_3.5-7
## [76] bit_4.0.5 annotate_1.80.0 tidyselect_1.2.0
## [79] RBGL_1.78.0 GO.db_3.18.0 Biostrings_2.70.0
## [82] bookdown_0.36 xfun_0.40 expm_0.999-7
## [85] hardhat_1.3.0 timeDate_4022.108 stringi_1.7.12
## [88] yaml_2.3.7 boot_1.3-28.1 evaluate_0.22
## [91] codetools_0.2-19 tibble_3.2.1 Rgraphviz_2.46.0
## [94] BiocManager_1.30.22 graph_1.80.0 cli_3.6.1
## [97] rpart_4.1.21 xtable_1.8-4 DescTools_0.99.50
## [100] munsell_0.5.0 jquerylib_0.1.4 Rcpp_1.0.11
## [103] readxl_1.4.3 globals_0.16.2 png_0.1-8
## [106] parallel_4.3.1 XML_3.99-0.14 gower_1.0.1
## [109] blob_1.2.4 GOstats_2.68.0 AnnotationForge_1.44.0
## [112] bitops_1.0-7 listenv_0.9.0 mvtnorm_1.2-3
## [115] GSEABase_1.64.0 ipred_0.9-14 scales_1.2.1
## [118] prodlim_2023.08.28 e1071_1.7-13 ggridges_0.5.4
## [121] purrr_1.0.2 openxlsx_4.2.5.2 crayon_1.5.2
## [124] rlang_1.1.1 KEGGREST_1.42.0