This vignette introduces the usage of the Bioconductor package POWSC (POWer analysis for SCrna-seq experiment), which is specifically designed for power estimation and sample size estimation for scRNA-seq data. POWSC is a simulation-based method, to provide power evaluation and sample size recommen- dation for single-cell RNA-sequencing DE analysis. POWSC consists of a data simulator that creates realistic expres- sion data, and a power assessor that provides a comprehensive evaluation and visualization of the power and sam- ple size relationship. The data simulator in POWSC outperforms two other state-of-art simulators in capturing key characteristics of real datasets. The power assessor in POWSC provides a variety of power evaluations including stratified and marginal power analyses for DEs characterized by two forms (phase transition or magnitude tuning), under different comparison scenarios. In addition, POWSC offers information for optimizing the tradeoffs between sample size and sequencing depth with the same total reads.
POWSC 1.10.0
To install this package, start R (version > “4.1”) and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("POWSC")
We use one scRNA-seq template data about ES-MEF cell types (GSE29087). In this case, we adopt one cell type (fibro) for the two-group comparison corresponding to the 1st scenario.
suppressMessages(library(POWSC))
data("es_mef_sce")
sce = es_mef_sce[, colData(es_mef_sce)$cellTypes == "fibro"]
set.seed(12)
rix = sample(1:nrow(sce), 500)
sce = sce[rix, ]
est_Paras = Est2Phase(sce)
sim_size = c(100, 200) # A numeric vector
pow_rslt = runPOWSC(sim_size = sim_size, est_Paras = est_Paras,per_DE=0.05, DE_Method = "MAST", Cell_Type = "PW")
Determining the sample size for adequate power to detect statistical significance is a crucial step at the design stage for high-throughput experiments. Even though a number of methods and tools are available for sample size calculation for microarray and RNA-seq under the context of differential expression, this topic in the field of single-cell RNA sequencing is understudied. Moreover, the unique data characteristics present in scRNA-seq including sparsity and heterogeneity gain the challenge.
POWSC
is an R package designed for power assessment and sample size estimation in scRNA-seq experiment. It contains three main functionalities:
(1). Parameter Estimation: adopted and modified from the core of SC2P.
(2). Data Simulation: consider two cases: paired-wise comparison and multiple comparisons.
(3). Power Evaluation: provide both stratified (detailed) and marginal powers.
For the model assumption, we adopt the mixture of zero-inflated poisson (ZIP) and log-normal poisson (LNP).
The power assessor in POWSC provides a variety of power evaluations including stratified and marginal power analyses for DEs characterized by two forms (phase transition or magnitude tuning), under different comparison scenarios. In addition, POWSC offers information for optimizing the tradeoffs between sample size and sequencing depth with the same total reads.
In the context of differential expression (DE) analysis, scientists are usually interested in two different scenarios: (1) within cell type: comparing the same cell types across biological conditions such as case vs. control, which reveals the expression change of a particular cell type under different contexts. (2) between cell types: comparing different cell types under the same condition, which identifies biomarkers to distinguish cell types. In either case, the experiment starts from a number of cells randomly picked from a tissue sample consisting of a mixture of different cell types. The only factor one can control is the total number of cells.
In the first scenario, the numbers of cells for a particular cell type in different biological conditions are often similar, barring significant changes in cell composition. It uses one cell type as the benchmark data and perturbs the genes with mixture proportion (\(\pi\)) and log fold changes (\(lfc\)) as the DE genes according two forms.
# Users can customize how many cells they need as to change the number of n.
simData = Simulate2SCE(n=100, estParas1 = est_Paras, estParas2 = est_Paras)
de = runDE(simData$sce, DE_Method = "MAST")
estPower1 = Power_Disc(de, simData = simData)
estPower2 = Power_Cont(de, simData = simData)
In the second scenario, the numbers of cells for distinct cell types can be very different, so the power for DE highly depends on the mixing proportions. Here, we use 1000*57 expression matrix sampled from the patient ID AB_S7 from the GSE67835 data. The original data GSE67835 can be download here: https://www.dropbox.com/s/55zdktqfqiwfs3l/GSE67835.RData?dl=0. In this case, we utilize the most abundant three cell types including oligodendrocytes, hybrid, and neurons. The underlying cell type proportion is 20%, 30%, 50% respectively. We simulate a series of datasets corresponding to different sample sizes. For each dataset, we then perform pairwise comparison and report the power evaluation for each comparison
data("GSE67835_AB_S7_sce")
sim_size = 200 # use a large sample is preferrable
cell_per = c(0.2, 0.3, 0.5)
col = colData(sce)
exprs = assays(sce)$counts
(tb = table(colData(sce)$Patients, colData(sce)$cellTypes))
# use AB_S7 patient as example and take three cell types: astrocytes hybrid and neurons
estParas_set = NULL
celltypes = c("oligodendrocytes", "hybrid", "neurons")
for (cp in celltypes){
print(cp)
ix = intersect(grep(cp, col$cellTypes), grep("AB_S7", col$Patients))
tmp_mat = exprs[, ix]
tmp_paras = Est2Phase(tmp_mat)
estParas_set[[cp]] = tmp_paras
}
#########
######### Simulation part
#########
sim = SimulateMultiSCEs(n = sim_size, estParas_set = estParas_set, multiProb = cell_per)
#########
######### DE analysis part
#########
DE_rslt = NULL
for (comp in names(sim)){
tmp = runDE(sim[[comp]]$sce, DE_Method = "MAST")
DE_rslt[[comp]] = tmp
}
#########
######### Summarize the power result
#########
pow_rslt = pow1 = pow2 = pow1_marg = pow2_marg = NULL
TD = CD = NULL
for (comp in names(sim)){
tmp1 = Power_Disc(DE_rslt[[comp]], sim[[comp]])
tmp2 = Power_Cont(DE_rslt[[comp]], sim[[comp]])
TD = c(TD, tmp2$TD); CD = c(CD, tmp2$CD)
pow1_marg = c(pow1_marg, tmp1$power.marginal)
pow2_marg = c(pow2_marg, tmp2$power.marginal)
pow_rslt[[comp]] = list(pow1 = tmp1, pow2 = tmp2)
pow1 = rbind(pow1, tmp1$power)
pow2 = rbind(pow2, tmp2$power)
}
#########
######### Demonstrate the result by heatmap
#########
library(RColorBrewer); library(pheatmap)
breaksList = seq(0, 1, by = 0.01)
colors = colorRampPalette(rev(brewer.pal(n = 7, name = "RdYlBu")))(length(breaksList))
dimnames(pow1) = list(names(sim), names(tmp1$CD))
dimnames(pow2) = list(names(sim), names(tmp2$CD))
pheatmap(pow2, display_numbers = TRUE, color=colors, show_rownames = TRUE,
cellwidth = 30, cellheight = 40, legend = TRUE,
border_color = "grey96", na_col = "grey",
cluster_row = FALSE, cluster_cols = FALSE,
breaks = seq(0, 1, 0.01),
main = "")
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pheatmap_1.0.12 RColorBrewer_1.1-3
## [3] POWSC_1.10.0 MAST_1.28.0
## [5] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
## [7] GenomicRanges_1.54.0 GenomeInfoDb_1.38.0
## [9] IRanges_2.36.0 S4Vectors_0.40.0
## [11] MatrixGenerics_1.14.0 matrixStats_1.0.0
## [13] Biobase_2.62.0 BiocGenerics_0.48.0
## [15] BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 xfun_0.40 bslib_0.5.1
## [4] ggplot2_3.4.4 lattice_0.22-5 vctrs_0.6.4
## [7] tools_4.3.1 bitops_1.0-7 generics_0.1.3
## [10] tibble_3.2.1 fansi_1.0.5 pkgconfig_2.0.3
## [13] Matrix_1.6-1.1 data.table_1.14.8 lifecycle_1.0.3
## [16] GenomeInfoDbData_1.2.11 compiler_4.3.1 stringr_1.5.0
## [19] progress_1.2.2 statmod_1.5.0 munsell_0.5.0
## [22] htmltools_0.5.6.1 sass_0.4.7 RCurl_1.98-1.12
## [25] yaml_2.3.7 pillar_1.9.0 crayon_1.5.2
## [28] jquerylib_0.1.4 limma_3.58.0 DelayedArray_0.28.0
## [31] cachem_1.0.8 magick_2.8.1 abind_1.4-5
## [34] tidyselect_1.2.0 digest_0.6.33 stringi_1.7.12
## [37] dplyr_1.1.3 reshape2_1.4.4 bookdown_0.36
## [40] fastmap_1.1.1 grid_4.3.1 colorspace_2.1-0
## [43] cli_3.6.1 SparseArray_1.2.0 magrittr_2.0.3
## [46] S4Arrays_1.2.0 utf8_1.2.4 prettyunits_1.2.0
## [49] scales_1.2.1 rmarkdown_2.25 XVector_0.42.0
## [52] hms_1.1.3 evaluate_0.22 knitr_1.44
## [55] rlang_1.1.1 Rcpp_1.0.11 glue_1.6.2
## [58] BiocManager_1.30.22 jsonlite_1.8.7 R6_2.5.1
## [61] plyr_1.8.9 zlibbioc_1.48.0