If you have been following the vignette on how to create supercells,
you may wonder whether SuperCellCyto can replace stratified sampling
as a way to avoid overcrowding UMAP/tSNE plots.
The short answer is yes.
We call this stratified summarising.
To do this, we set the sample column of our data not to the biological
sample each cell came from, but to the column we want to stratify the
data by.
For example, when drawing a UMAP or tSNE plot, we commonly subsample each
cluster or cell type to avoid crowding the plot.
Instead of subsampling, we can generate supercells for each cluster or cell
type by passing the column that denotes the cluster or cell type each cell
belongs to as the sample_colname parameter.
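For contrast, the conventional stratified subsampling that this approach replaces can be sketched in base R as follows. This is only an illustrative sketch: the toy data frame, its cluster and value columns, and the n_per_cluster cap are hypothetical and not part of SuperCellCyto.

```r
set.seed(1)

# Hypothetical toy data: three clusters of very different sizes.
toy <- data.frame(
    cluster = rep(c("a", "b", "c"), times = c(1000, 200, 50)),
    value = rnorm(1250)
)

# Hypothetical cap on the number of cells kept per cluster.
n_per_cluster <- 100

# Split the row indices by cluster, then draw at most n_per_cluster rows
# from each. sample.int() on the group length avoids the pitfall of
# sample() treating a single index as the range 1:index.
idx <- unlist(lapply(
    split(seq_len(nrow(toy)), toy$cluster),
    function(i) i[sample.int(length(i), min(length(i), n_per_cluster))]
))
sub <- toy[idx, ]

# Cells kept per cluster after subsampling.
table(sub$cluster)
```

Subsampling discards every cell that is not drawn, whereas stratified summarising aggregates all cells of a stratum into supercells.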
Let’s illustrate stratified summarising using a toy dataset clustered with k-means.
library(SuperCellCyto)
set.seed(42)
# Simulate some data
dat <- simCytoData()
markers_col <- paste0("Marker_", seq_len(10))
cell_id_col <- "Cell_Id"
# Run kmeans
clust <- kmeans(
    x = dat[, markers_col, with = FALSE],
    centers = 5
)
clust_col <- "kmeans_clusters"
dat[[clust_col]] <- paste0("cluster_", clust$cluster)
To perform stratified summarising, we supply the cluster column
(kmeans_clusters in the example above) as runSuperCellCyto’s
sample_colname parameter.
supercells <- runSuperCellCyto(
    dt = dat,
    markers = markers_col,
    sample_colname = clust_col,
    cell_id_colname = cell_id_col
)
Now, if we look at the supercell_expression_matrix, each row
(each supercell) is labelled with the cluster it belongs to, and
not the biological sample it came from:
# Inspect the first 3 and last 3 rows of the expression matrix for selected columns.
rbind(
    head(supercells$supercell_expression_matrix, n = 3),
    tail(supercells$supercell_expression_matrix, n = 3)
)[, c("kmeans_clusters", "SuperCellId", "Marker_10")]
#> kmeans_clusters SuperCellId Marker_10
#> <char> <char> <num>
#> 1: cluster_4 SuperCell_1_Sample_cluster_4 14.64662
#> 2: cluster_4 SuperCell_2_Sample_cluster_4 14.66858
#> 3: cluster_4 SuperCell_3_Sample_cluster_4 14.41837
#> 4: cluster_5 SuperCell_498_Sample_cluster_5 16.99003
#> 5: cluster_5 SuperCell_499_Sample_cluster_5 17.09864
#> 6: cluster_5 SuperCell_500_Sample_cluster_5 15.85447
If we compare the number of supercells created with the number of cells
in each cluster, we find that each cluster yields approximately
n_cells_in_the_cluster/20 supercells, where 20 is the value of the gam
parameter we used for runSuperCellCyto (the default).
# Count the cells in each cluster and divide by 20, the gam value.
table(dat$kmeans_clusters) / 20
#>
#> cluster_1 cluster_2 cluster_3 cluster_4 cluster_5
#> 120.25 130.30 119.75 129.70 500.00
table(supercells$supercell_expression_matrix$kmeans_clusters)
#>
#> cluster_1 cluster_2 cluster_3 cluster_4 cluster_5
#> 120 130 120 130 500
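To get a feel for how gam controls the size of the result, the sketch below estimates the number of supercells per cluster for a few gam values. The cluster sizes are back-calculated from the table above (each value multiplied by 20), the gam values other than 20 are hypothetical, and the use of round() is an assumption chosen to match the "approximately n/gam" behaviour seen here, not a documented guarantee of the package.

```r
# Cluster sizes recovered from the output above (e.g. 120.25 * 20 = 2405).
cluster_sizes <- c(
    cluster_1 = 2405, cluster_2 = 2606, cluster_3 = 2395,
    cluster_4 = 2594, cluster_5 = 10000
)

# Hypothetical gam values to compare; 20 is the default used in this vignette.
gam_values <- c(gam_5 = 5, gam_10 = 10, gam_20 = 20, gam_40 = 40)

# Assumption: the supercell count per stratum is roughly n_cells / gam.
expected <- sapply(gam_values, function(g) round(cluster_sizes / g))

# One row per cluster, one column per gam value.
expected
```

The gam_20 column reproduces the supercell counts reported above. A larger gam gives fewer, coarser supercells per cluster, and a smaller gam gives more.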
sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 parallel stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] future_1.67.0 Seurat_5.3.0
#> [3] SeuratObject_5.2.0 sp_2.2-0
#> [5] bluster_1.19.0 scater_1.37.0
#> [7] ggplot2_4.0.0 BiocSingular_1.25.0
#> [9] scran_1.37.0 scuttle_1.19.0
#> [11] SingleCellExperiment_1.31.1 SummarizedExperiment_1.39.2
#> [13] Biobase_2.69.1 GenomicRanges_1.61.5
#> [15] Seqinfo_0.99.2 IRanges_2.43.4
#> [17] S4Vectors_0.47.4 BiocGenerics_0.55.1
#> [19] generics_0.1.4 MatrixGenerics_1.21.0
#> [21] matrixStats_1.5.0 qs2_0.1.5
#> [23] flowCore_2.21.0 data.table_1.17.8
#> [25] BiocParallel_1.43.4 SuperCellCyto_0.99.2
#> [27] BiocStyle_2.37.1
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.22 splines_4.5.1 later_1.4.4
#> [4] tibble_3.3.0 polyclip_1.10-7 fastDummies_1.7.5
#> [7] lifecycle_1.0.4 edgeR_4.7.5 globals_0.18.0
#> [10] lattice_0.22-7 MASS_7.3-65 magrittr_2.0.4
#> [13] limma_3.65.4 plotly_4.11.0 sass_0.4.10
#> [16] rmarkdown_2.30 jquerylib_0.1.4 yaml_2.3.10
#> [19] metapod_1.17.0 httpuv_1.6.16 sctransform_0.4.2
#> [22] spam_2.11-1 spatstat.sparse_3.1-0 reticulate_1.43.0
#> [25] cowplot_1.2.0 pbapply_1.7-4 RColorBrewer_1.1-3
#> [28] abind_1.4-8 Rtsne_0.17 purrr_1.1.0
#> [31] ggrepel_0.9.6 irlba_2.3.5.1 listenv_0.9.1
#> [34] spatstat.utils_3.2-0 goftest_1.2-3 RSpectra_0.16-2
#> [37] spatstat.random_3.4-2 dqrng_0.4.1 fitdistrplus_1.2-4
#> [40] parallelly_1.45.1 codetools_0.2-20 DelayedArray_0.35.3
#> [43] tidyselect_1.2.1 farver_2.1.2 ScaledMatrix_1.17.0
#> [46] viridis_0.6.5 spatstat.explore_3.5-3 jsonlite_2.0.0
#> [49] BiocNeighbors_2.3.1 progressr_0.16.0 ggridges_0.5.7
#> [52] survival_3.8-3 tools_4.5.1 ica_1.0-3
#> [55] Rcpp_1.1.0 glue_1.8.0 gridExtra_2.3
#> [58] SparseArray_1.9.1 xfun_0.53 dplyr_1.1.4
#> [61] withr_3.0.2 BiocManager_1.30.26 fastmap_1.2.0
#> [64] digest_0.6.37 rsvd_1.0.5 R6_2.6.1
#> [67] mime_0.13 scattermore_1.2 tensor_1.5.1
#> [70] spatstat.data_3.1-8 dichromat_2.0-0.1 tidyr_1.3.1
#> [73] FNN_1.1.4.1 httr_1.4.7 htmlwidgets_1.6.4
#> [76] S4Arrays_1.9.1 uwot_0.2.3 pkgconfig_2.0.3
#> [79] gtable_0.3.6 RProtoBufLib_2.21.0 lmtest_0.9-40
#> [82] S7_0.2.0 XVector_0.49.1 htmltools_0.5.8.1
#> [85] dotCall64_1.2 bookdown_0.44 scales_1.4.0
#> [88] png_0.1-8 spatstat.univar_3.1-4 knitr_1.50
#> [91] reshape2_1.4.4 nlme_3.1-168 cachem_1.1.0
#> [94] zoo_1.8-14 stringr_1.5.2 KernSmooth_2.23-26
#> [97] miniUI_0.1.2 vipor_0.4.7 pillar_1.11.1
#> [100] grid_4.5.1 vctrs_0.6.5 RANN_2.6.2
#> [103] promises_1.3.3 stringfish_0.17.0 cytolib_2.21.0
#> [106] beachmat_2.25.5 xtable_1.8-4 cluster_2.1.8.1
#> [109] beeswarm_0.4.0 evaluate_1.0.5 tinytex_0.57
#> [112] magick_2.9.0 cli_3.6.5 locfit_1.5-9.12
#> [115] compiler_4.5.1 rlang_1.1.6 crayon_1.5.3
#> [118] future.apply_1.20.0 labeling_0.4.3 plyr_1.8.9
#> [121] ggbeeswarm_0.7.2 stringi_1.8.7 deldir_2.0-4
#> [124] viridisLite_0.4.2 SuperCell_1.0.1 lazyeval_0.2.2
#> [127] spatstat.geom_3.6-0 Matrix_1.7-4 RcppHNSW_0.6.0
#> [130] patchwork_1.3.2 statmod_1.5.0 shiny_1.11.1
#> [133] ROCR_1.0-11 igraph_2.1.4 RcppParallel_5.1.11-1
#> [136] bslib_0.9.0