1 The scpdata package

scpdata disseminates mass spectrometry (MS)-based single-cell proteomics (SCP) data sets formatted using the scp data structure. The data structure is described in the scp vignette.

In this vignette, we describe how to access the SCP data sets. To start, we load the scpdata package.

library("scpdata")

2 Load data from ExperimentHub

The data is stored using the ExperimentHub infrastructure. We first create a connection with ExperimentHub.

eh <- ExperimentHub()

You can list all data sets available in scpdata using the query function.

query(eh, "scpdata")
#> ExperimentHub with 21 records
#> # snapshotDate(): 2023-10-24
#> # $dataprovider: MassIVE, PRIDE, SlavovLab website
#> # $species: Homo sapiens, Mus musculus, Rattus norvegicus, Gallus gallus
#> # $rdataclass: QFeatures
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH3899"]]' 
#> 
#>            title             
#>   EH3899 | specht2019v2      
#>   EH3900 | specht2019v3      
#>   EH3901 | dou2019_lysates   
#>   EH3902 | dou2019_mouse     
#>   EH3903 | dou2019_boosting  
#>   ...      ...               
#>   EH7713 | brunner2022       
#>   EH8301 | leduc2022_pSCoPE  
#>   EH8302 | leduc2022_plexDIA 
#>   EH8303 | woo2022_macrophage
#>   EH8304 | woo2022_lung
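
The listing can be narrowed by passing several search terms to query. The sketch below assumes the hub metadata still contain the "leduc2022" records shown above; the search term is only an illustration, and the result depends on the ExperimentHub snapshot in use.

```r
library("scpdata")
library("ExperimentHub")

eh <- ExperimentHub()

# A record is kept only if every term matches its metadata,
# so this restricts the scpdata records to those mentioning "leduc2022".
leduc <- query(eh, c("scpdata", "leduc2022"))

# Titles of the matching records
leduc$title
```

The identifiers of the matching records are available with names(leduc) and can be used to retrieve a data set from the hub.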

Another way to get information about the available data sets is to call scpdata(), which retrieves the metadata of all available data sets. For example, we can display the data set titles along with their descriptions to decide which data set to use.

info <- scpdata()
knitr::kable(info[, c("title", "description")])
|        | title              | description |
|:-------|:-------------------|:------------|
| EH3899 | specht2019v2       | SCP expression data for monocytes (U-937) and macrophages at PSM, peptide and protein level |
| EH3900 | specht2019v3       | SCP expression data for more monocytes (U-937) and macrophages at PSM, peptide and protein level |
| EH3901 | dou2019_lysates    | SCP expression data for Hela digests (0.2 or 10 ng) at PSM and protein level |
| EH3902 | dou2019_mouse      | SCP expression data for C10, SVEC or Raw cells at PSM and protein level |
| EH3903 | dou2019_boosting   | SCP expression data for C10, SVEC or Raw cells and 3 boosters (0, 5 or 50 ng) at PSM and protein level |
| EH3904 | zhu2018MCP         | Near SCP expression data for micro-dissection rat brain samples (50, 100, or 200 µm width) at PSM level |
| EH3905 | zhu2018NC_hela     | Near SCP expression data for HeLa samples (aproximately 12, 40, or 140 cells) at PSM level |
| EH3906 | zhu2018NC_lysates  | Near SCP expression data for HeLa lysates (10, 40 and 140 cell equivalent) at PSM level |
| EH3907 | zhu2018NC_islets   | Near SCP expression data for micro-dissected human pancreas samples (control patients or type 1 diabetes) at PSM level |
| EH3908 | cong2020AC         | SCP expression data for Hela cells at PSM, peptide and protein level |
| EH3909 | zhu2019EL          | SCP expression data for chicken utricle samples (1, 3, 5 or 20 cells) at PSM, peptide and protein level |
| EH6011 | liang2020_hela     | Expression data for HeLa cells (0, 1, 10, 150, 500 cells) at PSM, peptide and protein level |
| EH7085 | schoof2021         | Single-cell proteomics data from OCI-AML8227 cell culture to reconstruct the cellular hierarchy. |
| EH7295 | williams2020_lfq   | Single-cell label free proteomics data from a MCF10A cell line culture. |
| EH7296 | williams2020_tmt   | Single-cell proteomics data from three acute myeloid leukemia cell line culture (MOLM-14, K562, CMK). |
| EH7712 | derks2022          | Single-cell and bulk (100-cell) proteomics data of PDAC, melanoma cells and monocytes. |
| EH7713 | brunner2022        | Single-cell proteomics data of cell cycle stages in HeLa. |
| EH8301 | leduc2022_pSCoPE   | Single-cell proteomics data of 878 melanoma cells and 877 monocytes (pSCoPE). |
| EH8302 | leduc2022_plexDIA  | Single-cell proteomics data of 126 melanoma cells (plexDIA). |
| EH8303 | woo2022_macrophage | Single-cell proteomics data from LPS-treated macrophages. |
| EH8304 | woo2022_lung       | Single-cell proteomics data from primary human lung cells. |
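
Because scpdata() returns the metadata as a table, it can be filtered like any other table. The following is a minimal sketch that keeps the data sets whose description mentions HeLa. The variable name is_hela and the choice of search term are illustrative assumptions, not part of the package.

```r
library("scpdata")

info <- scpdata()

# The descriptions spell the cell line both "Hela" and "HeLa",
# so the match ignores case.
is_hela <- grepl("hela", info$description, ignore.case = TRUE)

# Show only the matching data sets
info[is_hela, c("title", "description")]
```

The same approach should work for any other metadata column, for instance info$species, provided the column is present in the retrieved metadata.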

To get one of the data sets (e.g. dou2019_lysates), you can either retrieve it from ExperimentHub using its identifier:

scp <- eh[["EH3901"]]
#> see ?scpdata and browseVignettes('scpdata') for documentation
#> loading from cache
scp
#> An instance of class QFeatures containing 4 assays:
#>  [1] Hela_run_1: SingleCellExperiment with 24562 rows and 10 columns 
#>  [2] Hela_run_2: SingleCellExperiment with 24310 rows and 10 columns 
#>  [3] peptides: SingleCellExperiment with 13934 rows and 20 columns 
#>  [4] proteins: SingleCellExperiment with 1641 rows and 20 columns

or you can use the corresponding built-in function from scpdata:

scp <- dou2019_lysates()
#> see ?scpdata and browseVignettes('scpdata') for documentation
#> loading from cache
scp
#> An instance of class QFeatures containing 4 assays:
#>  [1] Hela_run_1: SingleCellExperiment with 24562 rows and 10 columns 
#>  [2] Hela_run_2: SingleCellExperiment with 24310 rows and 10 columns 
#>  [3] peptides: SingleCellExperiment with 13934 rows and 20 columns 
#>  [4] proteins: SingleCellExperiment with 1641 rows and 20 columns
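
Once loaded, the object can be explored with the standard QFeatures and SummarizedExperiment accessors. The sketch below assumes the assay names printed above (in particular "proteins"); other data sets use different assay names, so check names(scp) first.

```r
library("scpdata")
library("SummarizedExperiment")

scp <- dou2019_lysates()

# Names of the assays stored in the QFeatures object
names(scp)

# Sample annotations shared across all assays
colData(scp)

# Extract a single assay; "proteins" is the name shown in the summary above
prot <- scp[["proteins"]]
dim(prot)

# Quantitative values (features in rows, samples in columns)
head(assay(prot))

# Feature annotations for this assay
head(rowData(prot))
```

Extracting an assay with [[ returns a SingleCellExperiment, so any function that accepts that class can be applied to prot directly.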

3 Data sets information

Each data set is extensively documented in its own manual page (e.g. ?dou2019_lysates). There you can find information about the data content, the acquisition protocol, the data collection procedure, and the data sources and references.
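
As a small illustration, the manual pages can be opened from an R session with base R's help system. dou2019_lysates is used here only as an example topic.

```r
library("scpdata")

# Open the manual page of one data set (equivalent to ?dou2019_lysates)
help("dou2019_lysates", package = "scpdata")

# List every documented topic in the package, including all data sets
help(package = "scpdata")
```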

4 Data manipulation

For more information about manipulating the data sets, see the scp package. The scp vignette guides you through a typical SCP data processing workflow. Once your data is loaded from scpdata, you can skip section 2, "Read in SCP data", of the scp vignette.

Session information

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SingleCellExperiment_1.24.0 scpdata_1.10.0             
 [3] ExperimentHub_2.10.0        AnnotationHub_3.10.0       
 [5] BiocFileCache_2.10.0        dbplyr_2.3.4               
 [7] QFeatures_1.12.0            MultiAssayExperiment_1.28.0
 [9] SummarizedExperiment_1.32.0 Biobase_2.62.0             
[11] GenomicRanges_1.54.0        GenomeInfoDb_1.38.0        
[13] IRanges_2.36.0              S4Vectors_0.40.0           
[15] BiocGenerics_0.48.0         MatrixGenerics_1.14.0      
[17] matrixStats_1.0.0           BiocStyle_2.30.0           

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0              dplyr_1.1.3                  
 [3] blob_1.2.4                    Biostrings_2.70.1            
 [5] filelock_1.0.2                bitops_1.0-7                 
 [7] fastmap_1.1.1                 lazyeval_0.2.2               
 [9] RCurl_1.98-1.12               promises_1.2.1               
[11] digest_0.6.33                 mime_0.12                    
[13] lifecycle_1.0.3               cluster_2.1.4                
[15] ellipsis_0.3.2                ProtGenerics_1.34.0          
[17] KEGGREST_1.42.0               interactiveDisplayBase_1.40.0
[19] RSQLite_2.3.1                 magrittr_2.0.3               
[21] compiler_4.3.1                rlang_1.1.1                  
[23] sass_0.4.7                    tools_4.3.1                  
[25] igraph_1.5.1                  utf8_1.2.4                   
[27] yaml_2.3.7                    knitr_1.44                   
[29] S4Arrays_1.2.0                bit_4.0.5                    
[31] curl_5.1.0                    DelayedArray_0.28.0          
[33] abind_1.4-5                   withr_2.5.1                  
[35] purrr_1.0.2                   grid_4.3.1                   
[37] fansi_1.0.5                   xtable_1.8-4                 
[39] MASS_7.3-60                   cli_3.6.1                    
[41] rmarkdown_2.25                crayon_1.5.2                 
[43] generics_0.1.3                httr_1.4.7                   
[45] BiocBaseUtils_1.4.0           DBI_1.1.3                    
[47] cachem_1.0.8                  zlibbioc_1.48.0              
[49] AnnotationDbi_1.64.0          AnnotationFilter_1.26.0      
[51] BiocManager_1.30.22           XVector_0.42.0               
[53] vctrs_0.6.4                   Matrix_1.6-1.1               
[55] jsonlite_1.8.7                bookdown_0.36                
[57] bit64_4.0.5                   clue_0.3-65                  
[59] jquerylib_0.1.4               glue_1.6.2                   
[61] BiocVersion_3.18.0            later_1.3.1                  
[63] tibble_3.2.1                  pillar_1.9.0                 
[65] rappdirs_0.3.3                htmltools_0.5.6.1            
[67] GenomeInfoDbData_1.2.11       R6_2.5.1                     
[69] evaluate_0.22                 shiny_1.7.5.1                
[71] lattice_0.22-5                png_0.1-8                    
[73] memoise_2.0.1                 httpuv_1.6.12                
[75] bslib_0.5.1                   Rcpp_1.0.11                  
[77] SparseArray_1.2.0             xfun_0.40                    
[79] MsCoreUtils_1.14.0            pkgconfig_2.0.3              

5 License

This vignette is distributed under a CC BY-SA license.