scPipe
is a package designed to process single-cell
RNA-sequencing (scRNA-seq) data generated by different protocols. It is
designed for protocols with UMI, but can also adapt to non-UMI
protocols. scPipe
consist of two major components. The
first is data preprocessing with raw fastq as input and a gene count
matrix as output. The second component starts from the gene count
matrix, includes quality control and a shiny app for clustering and
visualization.
CEL-seq2 is a modified version of the CEL-seq method with higher sensitivity and lower cost. It uses a polyT primer with index sequences to capture a cell’s mRNA. The index sequence consists of the a cell specific index, which is designed, and a molecular index, also called unique molecular identifier (UMI), which is a random sequence that should be different for each primer (the number of distinct UMIs will depend on its length). The ouput fastq files from a CEL-seq2 experiment is paired-ended where the first read contains the transcript and second read contains the cellindex+UMI with some polyA sequence.
We begin by creating a folder to store the results.
To process the data, we need the genome fasta file, gff3 exon
annotation and a cell barcode annotation. The barcode annotation should
be a .csv
file with at least two columns, where the first
column has the cell id and the second column contains the barcode
sequence. We use ERCC spike-in genes for this demo. All files can be
found in the extdata
folder of the scPipe
package:
# file path:
ERCCfa_fn = system.file("extdata", "ERCC92.fa", package = "scPipe")
ERCCanno_fn = system.file("extdata", "ERCC92_anno.gff3", package = "scPipe")
barcode_annotation_fn = system.file("extdata", "barcode_anno.csv", package = "scPipe")
The read structure for this example is paired-ended, with the first
longer read mapping to transcripts and second shorter read consisting of
6bp UMIs followed by 8bp cell barcodes. NOTE: by
convention, scPipe
always assumes read1
refers
to the read with the transcript sequence, which is usually the longer
read. These data are also available in the in extdata
folder:
The pipeline starts with fastq file reformatting. We move the barcode
and UMI sequences to the read name and leave the transcript sequence as
is. This outputs a read name that looks like
@[barcode_sequence]*[UMI_sequence]#[readname]
… The read
structure of read 2 in our example dataset has the 8 bp long cell
barcode starting at position 6 and the 6 bp long UMI sequence starting
at the first position. So the read structure will be :
list(bs1=-1, bl1=0, bs2=6, bl2=8, us=0, ul=6)
.
bs1=-1, bl1=0
means we don’t have an index in read 1 so we
set a negative value as its start position and give it zero length.
bs2=6, bl2=8
means we have an index in read two which
starts at position 6 in the read and is 8 bases in length.
us=0, ul=6
means we have a UMI at the start of read 2 which
is 6 bases long. NOTE: we use a zero based index
system, so the indexing of the sequence starts at zero.
sc_trim_barcode(file.path(data_dir, "combined.fastq.gz"),
fq_R1,
fq_R2,
read_structure = list(bs1=-1, bl1=0, bs2=6, bl2=8, us=0, ul=6))
## trimming fastq file...
## pass QC: 46009
## removed_have_N: 0
## removed_low_qual: 0
## time elapsed: 95 milliseconds
Next we align reads to the genome. This example uses
Rsubread
but any aligner that support RNA-seq alignment and
gives standard BAM output can be used here. The Rsubread
package is not available on Windows, so the following alignment step
will only run on Mac OSX or Linux.
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.21.1
##
## //================================= setting ==================================\\
## || ||
## || Index name : ERCC_index ||
## || Index space : base space ||
## || Index split : no-split ||
## || Repeat threshold : 100 repeats ||
## || Gapped index : no ||
## || ||
## || Free / total memory : 48.3GB / 125.4GB ||
## || ||
## || Input files : 1 file in total ||
## || o ERCC92.fa ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Check the integrity of provided reference sequences ... ||
## || There were 1 notes for reference sequences. ||
## || The notes can be found in the log file, '/tmp/RtmpZDFvs7/ERCC_index.log'. ||
## || Scan uninformative subreads in reference sequences ... ||
## || 1 uninformative subreads were found. ||
## || These subreads were excluded from index building. ||
## || Estimate the index size... ||
## || 8%, 0 mins elapsed, rate=115.5k bps/s ||
## || 49%, 0 mins elapsed, rate=653.7k bps/s ||
## || 66%, 0 mins elapsed, rate=702.0k bps/s ||
## || 3.0 GB of memory is needed for index building. ||
## || Build the index... ||
## || 8%, 0 mins elapsed, rate=12.3k bps/s ||
## || 49%, 0 mins elapsed, rate=73.3k bps/s ||
## || 66%, 0 mins elapsed, rate=97.4k bps/s ||
## || Save current index block... ||
## || [ 0.0% finished ] ||
## || [ 10.0% finished ] ||
## || [ 20.0% finished ] ||
## || [ 30.0% finished ] ||
## || [ 40.0% finished ] ||
## || [ 50.0% finished ] ||
## || [ 60.0% finished ] ||
## || [ 70.0% finished ] ||
## || [ 80.0% finished ] ||
## || [ 90.0% finished ] ||
## || [ 100.0% finished ] ||
## || ||
## || Total running time: 0.2 minutes. ||
## || Index /tmp/RtmpZDFvs7/ERCC_index was successfully built. ||
## || ||
## \\============================================================================//
Rsubread::align(index=file.path(data_dir, "ERCC_index"),
readfile1=file.path(data_dir, "combined.fastq.gz"),
output_file=file.path(data_dir, "out.aln.bam"), phredOffset=64)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.21.1
##
## //================================= setting ==================================\\
## || ||
## || Function : Read alignment (RNA-Seq) ||
## || Input file : combined.fastq.gz ||
## || Output file : out.aln.bam (BAM) ||
## || Index name : ERCC_index ||
## || ||
## || ------------------------------------ ||
## || ||
## || Threads : 1 ||
## || Phred offset : 64 ||
## || Min votes : 3 / 10 ||
## || Max mismatches : 3 ||
## || Max indel length : 5 ||
## || Report multi-mapping reads : yes ||
## || Max alignments per multi-mapping read : 1 ||
## || ||
## \\============================================================================//
##
## //=============== Running (18-Dec-2024 18:59:59, pid=1539763) ================\\
## || ||
## || Check the input reads. ||
## || The input file contains base space reads. ||
## || Initialise the memory objects. ||
## || Estimate the mean read length. ||
## || The range of Phred scores observed in the data is [1,1] ||
## || Create the output BAM file. ||
## || Check the index. ||
## || Init the voting space. ||
## || Global environment is initialised. ||
## || Load the 1-th index block... ||
## || The index block has been loaded. ||
## || Start read mapping in chunk. ||
## || 23% completed, 0.6 mins elapsed, rate=118.7k reads per second ||
## || 43% completed, 0.6 mins elapsed, rate=141.2k reads per second ||
## || 62% completed, 0.6 mins elapsed, rate=148.5k reads per second ||
## || 82% completed, 0.6 mins elapsed, rate=154.3k reads per second ||
## || 72% completed, 0.6 mins elapsed, rate=0.6k reads per second ||
## || 78% completed, 0.6 mins elapsed, rate=0.7k reads per second ||
## || 83% completed, 0.6 mins elapsed, rate=0.7k reads per second ||
## || 88% completed, 0.6 mins elapsed, rate=0.8k reads per second ||
## || 92% completed, 0.6 mins elapsed, rate=0.8k reads per second ||
## || 97% completed, 0.6 mins elapsed, rate=0.8k reads per second ||
## || 102% completed, 0.6 mins elapsed, rate=0.9k reads per second ||
## || 107% completed, 0.6 mins elapsed, rate=0.9k reads per second ||
## || 113% completed, 0.6 mins elapsed, rate=1.0k reads per second ||
## || ||
## || Completed successfully. ||
## || ||
## \\==================================== ====================================//
##
## //================================ Summary =================================\\
## || ||
## || Total reads : 46,009 ||
## || Mapped : 46,009 (100.0%) ||
## || Uniquely mapped : 46,009 ||
## || Multi-mapping : 0 ||
## || ||
## || Unmapped : 0 ||
## || ||
## || Indels : 0 ||
## || ||
## || Running time : 0.6 minutes ||
## || ||
## \\============================================================================//
## out.aln.bam
## Total_reads 46009
## Mapped_reads 46009
## Uniquely_mapped_reads 46009
## Multi_mapping_reads 0
## Unmapped_reads 0
## Indels 0
After the read alignment, we assign reads to exons based on the
annotation provided using the sc_exon_mapping
function.
Currently scPipe
only supports annotation in
gff3
or bed
format.
sc_exon_mapping(file.path(data_dir, "out.aln.bam"),
file.path(data_dir, "out.map.bam"),
ERCCanno_fn)
## adding annotation files...
## time elapsed: 0 milliseconds
##
## annotating exon features...
## updating progress every 3 minutes...
## 46009 reads processed, 46k reads/sec
## number of read processed: 46009
## unique map to exon: 46009 (100.00%)
## ambiguous map to multiple exon: 0 (0.00%)
## map to intron: 0 (0.00%)
## not mapped: 0 (0.00%)
## unaligned: 0 (0.00%)
Next we use the sc_demultiplex
function to split the
reads into separate .csv
files for each cell in the
/count
subfolder. Each file contains three columns and each
row corresponds to a distinct read. The first column contains the gene
ID that the read maps to, the second column gives the UMI sequence for
that read and the third column is the distance to the transcript end
position (TES) in bp. This file will be used for UMI deduplication and
to generate a gene count matrix by calling the
sc_gene_counting
function.
## demultiplexing reads by barcode...
## Warning: mitochondrial chromosome not found using chromosome name `MT`.
## time elapsed: 93 milliseconds
## summarising gene counts...
## time elapsed: 65 milliseconds
We have now completed the preprocessing steps. The gene count matrix
is available as a .csv
file in
data_dir/gene_count.csv
and quality control statistics are
saved in the data_dir/stat
folder. These data are useful
for later quality control (QC).
The greatest difference between scRNA-seq protocols is in their read
structure, which is specified by the read_structure
argument of the sc_trim_barcode
function in
scPipe
. The major difference between CEL-seq and
Drop-seq/Chromium 10x is that the cell barcode is unknown in advance for
the latter protocols, so an extra step sc_detect_bc
is
required before running sc_demultiplex
, to identify and
generate the cell barcode annotation. scPipe
is not
optimized for the full-lengh protocols like SMART-seq, but it can
process the data generated by such protocols by setting the
has_UMI
to FALSE
in the
sc_demultiplex
function.
The easiest way to create a SingleCellExperiment object from
the output of scPipe
preprocessing is using
create_sce_by_dir
function, that will read in the gene
count matrix together with the QC information available in the
stat
folder.
## [1] 92 10
The dataset we analysed above used ERCC simulated reads from 10 cells
of perfect quality. In order to demonstrate QC on a more reaslitic
example, we have included an experimental dataset in the
scPipe
software generated by Dr Christine Biben. This
experiment profiles gene expression in 383 blood cells for a subset of
1000 genes (to save space). We first create a
SingleCellExperiment object directly without using the
create_sce_by_dir
function:
data("sc_sample_data")
data("sc_sample_qc")
sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) # generate new sce with gene count matrix
QC_metrics(sce) = sc_sample_qc
demultiplex_info(sce) = cell_barcode_matching
## organism/gene_id_type not provided. Make a guess:mmusculus_gene_ensembl/ensembl_gene_id
There are several plots we can create to assess the overall quality of the experiment. The first is to look at the cell barcode demultiplexing statistics. To do this we generate a bar plot that shows the percentage of reads that uniquely match to the cell barcodes, as well as the unmatched proportion of reads and their alignment rates to introns and exons. If we observe a large proportion of unmatched reads that map to exons, this indicates a failure of the cell barcode demultiplexing.
A second plot shows the duplication rate which can be used to evaluate read depth. UMIs are routinely used to mark individual molecules and after PCR amplification, the same molecule with have multiple copys, which can be identified and removed if they are observed to have the same UMI sequence. Therefore, the copy number of UMIs is an indication of the PCR amplification rate.
## `geom_smooth()` using formula = 'y ~ x'
Next we calculate QC metrics and use the detect_outlier
function to identify poor quality cells. The detect_outlier
function has argument comp
to define the maximum component
of the gaussian mixture model. Using the default value of 1 is generally
sufficient, but in cases where the data are heterogeneous in terms of
quality control metrics, setting this value to 2 or 3 can give better
results. This function will remove low quality cells if
type="low"
, large cells if type="high"
or both
when type="both"
. The conf
argument specifies
the lower and upper confidence intervals for outliers and
detect_outlier
is insensistive to the interval values.
We can plot the alignment statistics per sample as a barplot using
and generate pairwise plots of the QC metrics as follows:
The final step will be to remove the low quality cells by the
remove_outliers
function.
## [1] 1000 359
Since the scater and scran packages both use the SingleCellExperiment class, it will be easy to further process this data using these packages for normalization and visualization. Other packages such as SC3 may be useful for clustering and MAST and edgeR for differential expression analysis.
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=en_US.UTF-8
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scPipe_2.7.1 SingleCellExperiment_1.29.1
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [7] IRanges_2.41.2 S4Vectors_0.45.2
## [9] BiocGenerics_0.53.3 generics_0.1.3
## [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9 httr2_1.0.7
## [4] biomaRt_2.63.0 rlang_1.1.4 magrittr_2.0.3
## [7] compiler_4.5.0 RSQLite_2.3.9 mgcv_1.9-1
## [10] dir.expiry_1.15.0 png_0.1-8 vctrs_0.6.5
## [13] stringr_1.5.1 pkgconfig_2.0.3 crayon_1.5.3
## [16] Rsubread_2.21.1 fastmap_1.2.0 dbplyr_2.5.0
## [19] XVector_0.47.0 labeling_0.4.3 Rsamtools_2.23.1
## [22] rmarkdown_2.29 UCSC.utils_1.3.0 purrr_1.0.2
## [25] bit_4.5.0.1 xfun_0.49 zlibbioc_1.53.0
## [28] cachem_1.1.0 jsonlite_1.8.9 progress_1.2.3
## [31] blob_1.2.4 DelayedArray_0.33.3 reshape_0.8.9
## [34] BiocParallel_1.41.0 parallel_4.5.0 prettyunits_1.2.0
## [37] R6_2.5.1 RColorBrewer_1.1-3 bslib_0.8.0
## [40] stringi_1.8.4 GGally_2.2.1 reticulate_1.40.0
## [43] rtracklayer_1.67.0 jquerylib_0.1.4 Rcpp_1.0.13-1
## [46] bookdown_0.41 knitr_1.49 org.Mm.eg.db_3.20.0
## [49] R.utils_2.12.3 splines_4.5.0 Matrix_1.7-1
## [52] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10
## [55] codetools_0.2-20 curl_6.0.1 lattice_0.22-6
## [58] tibble_3.2.1 plyr_1.8.9 withr_3.0.2
## [61] basilisk.utils_1.19.0 KEGGREST_1.47.0 evaluate_1.0.1
## [64] ggstats_0.7.0 BiocFileCache_2.15.0 xml2_1.3.6
## [67] mclust_6.1.1 Biostrings_2.75.3 pillar_1.10.0
## [70] BiocManager_1.30.25 filelock_1.0.3 RCurl_1.98-1.16
## [73] ggplot2_3.5.1 hms_1.1.3 munsell_0.5.1
## [76] scales_1.3.0 glue_1.8.0 tools_4.5.0
## [79] BiocIO_1.17.1 robustbase_0.99-4-1 data.table_1.16.4
## [82] GenomicAlignments_1.43.0 XML_3.99-0.17 grid_4.5.0
## [85] tidyr_1.3.1 colorspace_2.1-1 AnnotationDbi_1.69.0
## [88] nlme_3.1-166 GenomeInfoDbData_1.2.13 basilisk_1.19.0
## [91] Rhtslib_3.3.1 restfulr_0.0.15 cli_3.6.3
## [94] rappdirs_0.3.3 S4Arrays_1.7.1 dplyr_1.1.4
## [97] gtable_0.3.6 DEoptimR_1.1-3-1 hash_2.2.6.3
## [100] R.methodsS3_1.8.2 sass_0.4.9 digest_0.6.37
## [103] SparseArray_1.7.2 farver_2.1.2 org.Hs.eg.db_3.20.0
## [106] rjson_0.2.23 memoise_2.0.1 htmltools_0.5.8.1
## [109] R.oo_1.27.0 lifecycle_1.0.4 httr_1.4.7
## [112] MASS_7.3-61 bit64_4.5.2