DelayedDataFrame 1.18.0
As genetic and genomic data sets grow, the annotation files that
accompany them are also becoming much larger than expected. Memory
in R has been an obstacle to fast and efficient data processing,
because most R and Bioconductor packages are built on in-memory data
manipulation. With on-disk formats such as HDF5 and GDS, and with the
DelayedArray interface that represents on-disk data with different
back-ends as R-user-friendly arrays (e.g., HDF5Array, GDSArray),
high-throughput genetic and genomic data can now be loaded and
manipulated easily within R. However, the annotation files for the
samples and features of these high-throughput data are also growing,
and with an ordinary data.frame or DataFrame it is increasingly
challenging to run an analysis within R. We have therefore developed
DelayedDataFrame, which behaves very much like data.frame and
DataFrame, except that column data can optionally be stored on disk
(e.g., as DelayedArray objects with any back-end). Common operations
such as constructing, subsetting, splitting, and combining work in
the same way as for DataFrame. This enables efficient on-disk reading
and processing of large-scale annotation files and, at the same time,
saves significant memory while keeping the familiar DataFrame
metaphor in R and Bioconductor.
Download the package from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DelayedDataFrame")
The development version is also available from GitHub:
BiocManager::install("Bioconductor/DelayedDataFrame")
Load the package into the R session before use:
library(DelayedDataFrame)
DelayedDataFrame extends the DataFrame data structure with an
additional slot called lazyIndex, which saves the mapping indexes
for each column of the data inside the DelayedDataFrame. It is
similar to data.frame in terms of construction, subsetting,
splitting, and combining. Row names behave as in DataFrame: they are
not assigned automatically, but only when specified explicitly in the
constructor, DelayedDataFrame(, row.names=...), or with the setter
function rownames()<-.
Here we use GDSArray data as an example to show the characteristics
of DelayedDataFrame. GDSArray is a Bioconductor package that
represents GDS files as objects derived from the DelayedArray class
in the DelayedArray package. It carries the on-disk data path and
represents the GDS nodes in a DelayedArray-derived data structure.
The GDSArray() constructor takes 2 arguments: the file path and the
GDS node name inside the GDS file.
library(GDSArray)
## Loading required package: gdsfmt
file <- SeqArray::seqExampleFileName("gds")
gdsnodes(file)
## [1] "sample.id" "variant.id"
## [3] "position" "chromosome"
## [5] "allele" "genotype/data"
## [7] "genotype/~data" "genotype/extra.index"
## [9] "genotype/extra" "phase/data"
## [11] "phase/~data" "phase/extra.index"
## [13] "phase/extra" "annotation/id"
## [15] "annotation/qual" "annotation/filter"
## [17] "annotation/info/AA" "annotation/info/AC"
## [19] "annotation/info/AN" "annotation/info/DP"
## [21] "annotation/info/HM2" "annotation/info/HM3"
## [23] "annotation/info/OR" "annotation/info/GP"
## [25] "annotation/info/BN" "annotation/format/DP/data"
## [27] "annotation/format/DP/~data" "sample.annotation/family"
varid <- GDSArray(file, "annotation/id")
DP <- GDSArray(file, "annotation/info/DP")
We use the two GDSArray objects to construct a DelayedDataFrame
object.
ddf <- DelayedDataFrame(varid, DP) ## only accommodate 1D GDSArrays with same length
The slots of a DelayedDataFrame can be accessed with the lazyIndex(),
nrow(), and rownames() (if not NULL) functions. In a newly
constructed DelayedDataFrame object, the initial value of the
lazyIndex slot is NULL for all columns.
lazyIndex(ddf)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
nrow(ddf)
## [1] 1348
rownames(ddf)
## NULL
lazyIndex slot
The lazyIndex slot is of class LazyIndex, which is defined in the
DelayedDataFrame package and extends the SimpleList class. The
listData slot saves the unique indexes across all columns, and the
index slot saves, for each column of the DelayedDataFrame object, the
position of its index in the listData slot. In the example above, for
a newly constructed DelayedDataFrame object, the index of every
column is NULL, so both columns point to the single NULL value that
sits in the first position of the listData slot of lazyIndex.
lazyIndex(ddf)@listData
## [[1]]
## NULL
lazyIndex(ddf)@index
## [1] 1 1
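The bookkeeping described above (unique indexes stored once, plus a per-column position into that store) can be sketched in base R. This is only an illustration of the idea under simplifying assumptions, not the package's implementation; the function name dedup_index is hypothetical.

```r
# Toy sketch: collapse a list of per-column row indexes into
# (a) the unique indexes and (b) each column's position among them.
# dedup_index() is a hypothetical helper, not part of DelayedDataFrame.
dedup_index <- function(col_indexes) {
  uniq <- list()
  pos <- integer(length(col_indexes))
  for (i in seq_along(col_indexes)) {
    hit <- 0L
    for (j in seq_along(uniq)) {
      if (identical(uniq[[j]], col_indexes[[i]])) {
        hit <- j
        break
      }
    }
    if (hit == 0L) {
      # the list() wrapper is needed so that a NULL index is stored, not dropped
      uniq[length(uniq) + 1L] <- list(col_indexes[[i]])
      hit <- length(uniq)
    }
    pos[i] <- hit
  }
  list(listData = uniq, index = pos)
}

# Two columns, neither subsetted yet: one shared NULL entry.
fresh <- dedup_index(list(NULL, NULL))
fresh$index   # 1 1

# Two columns sharing rows 1:5 and a third column left untouched.
mixed <- dedup_index(list(1:5, 1:5, NULL))
mixed$index   # 1 1 2
```

Storing each distinct index once keeps the overhead small when many columns have been subsetted in the same way.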
Whenever an operation is done (e.g., subsetting), the listData slot
inside the DelayedDataFrame stays the same, but the lazyIndex slot
is updated, so that the show method and any further statistical
calculation are applied to the subsetted data. For example, here we
subset the DelayedDataFrame object ddf to keep only the first 20 rows
and see how lazyIndex works. As shown below, after subsetting, the
listData slot in ddf1 stays the same as in ddf, but the subsetting
operation is recorded in the lazyIndex slot, and the lazyIndex, nrows,
and rownames (if not NULL) slots are all updated. In this sense the
subsetting operation is delayed.
ddf1 <- ddf[1:20,]
identical(ddf@listData, ddf1@listData)
## [1] TRUE
lazyIndex(ddf1)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
nrow(ddf1)
## [1] 20
The lazyIndex is realized only when functions like DataFrame() or
as.list() are called, and the realized object is then returned.
We will show the realization in the following coercion method section.
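The delayed behaviour can be mimicked with a small base R model: the column data are never modified, each subsetting step only composes a row index, and the data are subsetted once at realization. All names here (lazy_table, lazy_subset, lazy_realize) are hypothetical, and the model assumes a single index shared by all columns, which is simpler than the LazyIndex class.

```r
# Toy model of delayed subsetting; not the DelayedDataFrame implementation.
# `cols` holds the full columns; `index` records the selected rows
# (NULL means "all rows").
lazy_table <- function(cols) list(cols = cols, index = NULL)

lazy_subset <- function(x, i) {
  current <- if (is.null(x$index)) seq_along(x$cols[[1]]) else x$index
  x$index <- current[i]   # compose with any earlier subsetting
  x
}

lazy_realize <- function(x) {
  cols <- x$cols
  if (!is.null(x$index)) {
    cols <- lapply(cols, function(col) col[x$index])
  }
  as.data.frame(cols, stringsAsFactors = FALSE)
}

tbl  <- lazy_table(list(id = letters[1:10], depth = 101:110))
sub1 <- lazy_subset(tbl, 3:8)    # rows 3..8 of the original
sub2 <- lazy_subset(sub1, 1:2)   # first two of those rows

identical(tbl$cols, sub2$cols)   # TRUE: the columns were never touched
sub2$index                       # 3 4
out <- lazy_realize(sub2)        # the data are subsetted only here
```

Subsetting an already subsetted object only rewrites the small index vector, which is why the cost of each step stays independent of the size of the underlying columns.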
The common methods for data.frame and DataFrame are also defined for
the DelayedDataFrame class, so that they behave similarly on
DelayedDataFrame objects.
Coercion methods between DelayedDataFrame and other data structures
are defined. When coercing from ANY to DelayedDataFrame, the
lazyIndex slot is added automatically, with an initial NULL index
for each column.
as(letters, "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## X
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z
as(DataFrame(letters), "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## letters
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z
(a <- as(list(a=1:5, b=6:10), "DelayedDataFrame"))
## DelayedDataFrame with 5 rows and 2 columns
## a b
## <integer> <integer>
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
When a DelayedDataFrame is coerced into another data structure, the
lazyIndex slot is realized and the new data structure is returned.
For example, when a DelayedDataFrame is coerced into a DataFrame
object, the listData slot is updated according to the lazyIndex
slot.
df1 <- as(ddf1, "DataFrame")
df1@listData
## $varid
## <20> DelayedArray object of type "character":
## [1] [2] [3] . [19]
## "rs111751804" "rs114390380" "rs1320571" . "rs61751002"
## [20]
## "rs6691840"
##
## $DP
## <20> DelayedArray object of type "integer":
## [1] [2] [3] [4] . [17] [18] [19] [20]
## 3251 2676 7610 3383 . 6040 6589 5089 6871
dim(df1)
## [1] 20 2
[
Two-dimensional [ subsetting of DelayedDataFrame objects works with
integer, character, and logical subscripts.
ddf[, 1, drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206
ddf[, "DP", drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## DP
## <GDSArray>
## 1 3251
## 2 2676
## 3 7610
## ... ...
## 1346 823
## 1347 1257
## 1348 48
ddf[, c(TRUE,FALSE), drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206
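The three calls above select the same column because integer, character, and logical subscripts all resolve to the same column position. A minimal base R sketch of that resolution follows; normalize_cols is a hypothetical helper written for illustration, not a function of the package, and it does not handle negative or out-of-range subscripts.

```r
# Hypothetical helper: turn a column subscript of any of the three
# supported types into integer column positions.
normalize_cols <- function(j, col_names) {
  if (is.character(j)) {
    match(j, col_names)                       # look names up
  } else if (is.logical(j)) {
    which(rep_len(j, length(col_names)))      # recycle, then keep TRUEs
  } else {
    as.integer(j)                             # already positions
  }
}

cols <- c("varid", "DP")
by_int  <- normalize_cols(1, cols)
by_name <- normalize_cols("varid", cols)
by_lgl  <- normalize_cols(c(TRUE, FALSE), cols)
```

All three results are the single position 1, the first column.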
When [ is used on an already subsetted DelayedDataFrame object, the
lazyIndex, nrows, and rownames (if not NULL) slots are updated.
(a <- ddf1[1:10, 2, drop=FALSE])
## DelayedDataFrame with 10 rows and 1 column
## DP
## <GDSArray>
## 1 3251
## 2 2676
## 3 7610
## ... ...
## 8 9076
## 9 9605
## 10 9707
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## index of each column:
## [1] 1
nrow(a)
## [1] 10
[[
[[ subsetting takes an integer or character column subscript and
returns the corresponding column in its original data format.
ddf[[1]]
## <1348> GDSArray object of type "character":
## [1] [2] [3] . [1347]
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## [1348]
## "rs5771206"
ddf[["varid"]]
## <1348> GDSArray object of type "character":
## [1] [2] [3] . [1347]
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## [1348]
## "rs5771206"
identical(ddf[[1]], ddf[["varid"]])
## [1] TRUE
rbind/cbind
When doing rbind, the lazyIndex of each input argument is realized
and a new DelayedDataFrame with a NULL lazyIndex is returned.
ddf2 <- ddf[21:40, ]
(ddfrb <- rbind(ddf1, ddf2))
## DelayedDataFrame with 40 rows and 2 columns
## varid DP
## <DelayedArray> <DelayedArray>
## 1 rs111751804 3251
## 2 rs114390380 2676
## 3 rs1320571 7610
## ... ... ...
## 38 rs1886116 3641
## 39 rs115917561 3089
## 40 rs61751016 4109
lazyIndex(ddfrb)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
cbind of DelayedDataFrame objects keeps the existing lazyIndex of
each input argument and carries it into the new DelayedDataFrame
object.
(ddfcb <- cbind(varid = ddf1[,1, drop=FALSE], DP=ddf1[, 2, drop=FALSE]))
## DelayedDataFrame with 20 rows and 2 columns
## varid DP.DP
## <GDSArray> <GDSArray>
## 1 rs111751804 3251
## 2 rs114390380 2676
## 3 rs1320571 7610
## ... ... ...
## 18 rs115614983 6589
## 19 rs61751002 5089
## 20 rs6691840 6871
lazyIndex(ddfcb)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
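The different treatment of the index by rbind and cbind can be illustrated with a base R toy model. It is a sketch under simplifying assumptions (plain vectors as columns, one optional row index per column), and the names toy, realize, toy_rbind, and toy_cbind are hypothetical rather than taken from the package.

```r
# Toy container: full columns plus one row index per column (NULL = all rows).
toy <- function(cols, idx = rep(list(NULL), length(cols))) {
  list(cols = cols, idx = idx)
}

# Apply each column's pending index to its data.
realize <- function(x) {
  Map(function(col, i) if (is.null(i)) col else col[i], x$cols, x$idx)
}

# Row-binding must materialize both inputs; the result starts with NULL indexes.
toy_rbind <- function(x, y) toy(Map(c, realize(x), realize(y)))

# Column-binding only concatenates columns and their pending indexes.
toy_cbind <- function(x, y) toy(c(x$cols, y$cols), c(x$idx, y$idx))

top    <- toy(list(id = letters[1:6], dp = 11:16), idx = list(1:2, 1:2))
bottom <- toy(list(id = letters[1:6], dp = 11:16), idx = list(5:6, 5:6))
rb <- toy_rbind(top, bottom)
rb$idx        # two NULL entries: nothing is pending any more
rb$cols$id    # "a" "b" "e" "f"

left  <- toy(list(id = letters[1:6]), idx = list(1:2))
right <- toy(list(dp = 11:16), idx = list(1:2))
cb <- toy_cbind(left, right)
length(cb$cols$id)   # 6: the full column is still stored
realize(cb)$dp       # 11 12
```

In this model rbind has to touch the data because the combined rows no longer correspond to positions in any single stored column, whereas cbind leaves every column and its index as they were.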
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GDSArray_1.22.0 gdsfmt_1.38.0 DelayedDataFrame_1.18.0
## [4] DelayedArray_0.28.0 SparseArray_1.2.0 S4Arrays_1.2.0
## [7] abind_1.4-5 IRanges_2.36.0 MatrixGenerics_1.14.0
## [10] matrixStats_1.0.0 Matrix_1.6-1.1 S4Vectors_0.40.0
## [13] BiocGenerics_0.48.0 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.7 compiler_4.3.1 BiocManager_1.30.22
## [4] crayon_1.5.2 Biostrings_2.70.0 bitops_1.0-7
## [7] GenomicRanges_1.54.0 parallel_4.3.1 jquerylib_0.1.4
## [10] yaml_2.3.7 fastmap_1.1.1 lattice_0.22-5
## [13] R6_2.5.1 XVector_0.42.0 GenomeInfoDb_1.38.0
## [16] knitr_1.44 bookdown_0.36 GenomeInfoDbData_1.2.11
## [19] bslib_0.5.1 rlang_1.1.1 cachem_1.0.8
## [22] xfun_0.40 sass_0.4.7 cli_3.6.1
## [25] zlibbioc_1.48.0 digest_0.6.33 grid_4.3.1
## [28] SeqArray_1.42.0 evaluate_0.22 RCurl_1.98-1.12
## [31] rmarkdown_2.25 tools_4.3.1 htmltools_0.5.6.1