HiCool
The HiCool R/Bioconductor package provides an end-to-end interface to process and normalize Hi-C paired-end fastq reads into .(m)cool files.

- Read alignment is performed by the underlying hicstuff python library (https://github.com/koszullab/hicstuff).
- Pairs are filtered using the strategy described in Cournac et al., BMC Genomics 2012, as implemented in hicstuff.
- The cooler (https://github.com/open2c/cooler) library is used to parse pairs into a multi-resolution, balanced .mcool file. .(m)cool is a compact, indexed HDF5 file format specifically tailored for efficiently storing Hi-C data. The .(m)cool file format was developed by Abdennur and Mirny and published in 2019.
- All these dependencies are installed and managed from R in a basilisk environment.

The main processing function offered in this package is HiCool().

To process .fastq reads into .pairs and .mcool files, one needs to provide:

- The path to each fastq file (r1 and r2);
- The genome reference, as a .fasta sequence file, a path to a pre-computed bowtie2 index or a supported genome ID (hg38, mm10, dm6, R64-1-1, WBcel235, GRCz10, Galgal4);
- The restriction enzyme(s) used for Hi-C (restriction).

x <- HiCool(
    r1 = '<PATH-TO-R1.fq.gz>',
    r2 = '<PATH-TO-R2.fq.gz>',
    restriction = '<RE1(,RE2)>',
    binning = "<minimum resolution>",
    genome = '<GENOME_ID>'
)
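
Before launching a long run, it can help to check which of the three accepted forms a genome value takes. The sketch below is only an illustration: classify_genome() is a hypothetical helper, not part of HiCool, and the extension and index-file checks are assumptions about how the forms might be told apart (bowtie2 index files conventionally end in .bt2, or .bt2l for large indexes).

```r
# Supported genome IDs, as listed above.
supported_ids <- c("hg38", "mm10", "dm6", "R64-1-1", "WBcel235", "GRCz10", "Galgal4")

# Hypothetical helper (not a HiCool function): guess which form `genome` takes.
classify_genome <- function(genome) {
    if (genome %in% supported_ids) {
        return("supported ID")
    }
    # Assumption: fasta files are recognised by their extension.
    if (grepl("\\.(fa|fasta)$", genome, ignore.case = TRUE)) {
        return("fasta file")
    }
    # Assumption: a bowtie2 index is given as a prefix, with files such as
    # <prefix>.1.bt2 sitting in the same directory.
    files <- list.files(dirname(genome))
    is_index <- startsWith(files, basename(genome)) & grepl("\\.bt2l?$", files)
    if (any(is_index)) {
        return("bowtie2 index")
    }
    "unrecognized"
}

classify_genome("R64-1-1")       # "supported ID"
classify_genome("genome.fasta")  # "fasta file"
```

A value classified as "unrecognized" is worth double-checking before starting HiCool(), since mapping is the slowest step of the workflow.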
Here is a concrete example of Hi-C data processing.

- Example fastq files are retrieved using the HiContactsData package.
- Two restriction enzymes are used: DpnII and HinfI.
- The final .mcool file will have three levels of resolution, from 1000bp to 8000bp.
- The genome used for mapping is R64-1-1, the yeast genome reference.
- Output files are saved in the directory passed as output (here ./HiCool/).

library(HiCool)
hcf <- HiCool(
    r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'),
    r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'),
    restriction = 'DpnII,HinfI',
    binning = 1000,
    genome = 'R64-1-1',
    output = './HiCool/'
)
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> HiCool :: Initializing processing of fastq files [tmp folder: /tmp/Rtmp8JxNUb/03KUOG]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /tmp/RtmplcS2om/Rbuild35f77067ecccdb/HiCool/vignettes/HiCool/15c427129b3999_7833^mapped-R64-1-1^03KUOG.html
#> HiCool :: All processing successfully achieved. Congrats!
hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/15c427129b3999_7833^mapped-R64-1-1^03KUOG.mcool
#> resolution: 1000
#> pairs file: ./HiCool//pairs/15c427129b3999_7833^mapped-R64-1-1^03KUOG.pairs
#> metadata(3): log args stats
S4Vectors::metadata(hcf)
#> $log
#> [1] "./HiCool//logs/15c427129b3999_7833^mapped-R64-1-1^03KUOG.log"
#>
#> $args
#> $args$r1
#> [1] "/home/biocbuild/.cache/R/ExperimentHub/15c427129b3999_7833"
#>
#> $args$r2
#> [1] "/home/biocbuild/.cache/R/ExperimentHub/15c427540af966_7834"
#>
#> $args$genome
#> [1] "/tmp/Rtmp8JxNUb/R64-1-1"
#>
#> $args$binning
#> [1] "1000"
#>
#> $args$restriction
#> [1] "DpnII,HinfI"
#>
#> $args$iterative
#> [1] TRUE
#>
#> $args$balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#>
#> $args$threads
#> [1] 1
#>
#> $args$output
#> [1] "./HiCool/"
#>
#> $args$exclude_chr
#> [1] "Mito|chrM|MT"
#>
#> $args$keep_bam
#> [1] FALSE
#>
#> $args$scratch
#> [1] "/tmp/Rtmp8JxNUb"
#>
#> $args$wd
#> [1] "/tmp/RtmplcS2om/Rbuild35f77067ecccdb/HiCool/vignettes"
#>
#>
#> $stats
#> $stats$nFragments
#> [1] 1e+05
#>
#> $stats$nPairs
#> [1] 64761
#>
#> $stats$nDangling
#> [1] 9266
#>
#> $stats$nSelf
#> [1] 1910
#>
#> $stats$nDumped
#> [1] 32
#>
#> $stats$nFiltered
#> [1] 53553
#>
#> $stats$nDups
#> [1] 613
#>
#> $stats$nUnique
#> [1] 52940
#>
#> $stats$threshold_uncut
#> [1] 7
#>
#> $stats$threshold_self
#> [1] 7
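
The counts in $stats can be cross-checked by hand. The sketch below copies the values printed above into a plain list and tests two relations that hold in this run. These relations are inferred from the numbers themselves, not taken from the HiCool or hicstuff documentation, so treat them as an assumption rather than a guarantee.

```r
# Values copied from the $stats element printed above.
stats <- list(
    nFragments = 1e5, nPairs = 64761, nDangling = 9266, nSelf = 1910,
    nDumped = 32, nFiltered = 53553, nDups = 613, nUnique = 52940
)

# Assumed relation 1: filtered pairs = all pairs minus dangling, self and dumped pairs.
kept_after_filtering <- with(stats, nPairs - nDangling - nSelf - nDumped)

# Assumed relation 2: unique pairs = filtered pairs minus duplicates.
kept_after_dedup <- with(stats, nFiltered - nDups)

stopifnot(
    kept_after_filtering == stats$nFiltered,
    kept_after_dedup == stats$nUnique
)

# Percentages summarising how many pairs survive each step.
summary_pct <- round(100 * c(
    filtered = stats$nFiltered / stats$nPairs,
    unique = stats$nUnique / stats$nPairs,
    duplicates = stats$nDups / stats$nFiltered
), 1)
print(summary_pct)
```

With a real run, the same list can be taken from S4Vectors::metadata(hcf)$stats instead of being typed in.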

Extra optional arguments can be passed to the hicstuff workhorse library:

- iterative (default: TRUE): By default, hicstuff first truncates your set of reads to 20bp and attempts to align the truncated reads, then moves on to aligning 40bp-truncated reads for those which could not be mapped, etc. This procedure is slower than a traditional mapping but allows more pairs to be rescued. Set to FALSE if you want to perform standard alignment of fastq files without iterative alignment;
- balancing_args (default: " --min-nnz 10 --mad-max 5 "): Specify here any balancing argument to be used by cooler when normalizing the binned contact matrices. The full list of options is available on the cooler documentation website;
- threads (default: 1L): Number of CPUs to use to process data;
- exclude_chr (default: 'Mito|chrM|MT'): List here any chromosome you wish to remove from the final contact matrix file;
- keep_bam (default: FALSE): Set to TRUE if you wish to keep the pair of .bam files;
- scratch (default: tempdir()): Points to a temporary directory to be used for processing.

The important files generated by HiCool are the following:

- A log file: <output_folder>/logs/<prefix>^mapped-<genome>^<hash>.log
- A multi-resolution contact matrix file: <output_folder>/matrices/<prefix>^mapped-<genome>^<hash>.mcool
- A .pairs file: <output_folder>/pairs/<prefix>^mapped-<genome>^<hash>.pairs
- Diagnosis plots: <output_folder>/plots/<prefix>^mapped-<genome>^<hash>_*.pdf

The diagnosis plots illustrate how pairs were filtered during the processing, using a strategy described in Cournac et al., BMC Genomics 2012. The event_distance chart represents the frequency of ++, +-, -+ and -- pairs in the library, as a function of the number of restriction sites between each end of the pairs, and shows the inferred filtering threshold. The event_distribution chart indicates the proportion of each type of pairs (e.g. dangling, uncut, abnormal, …) and the total number of pairs retained (3D intra + 3D inter).
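
The file naming scheme listed above can be unpacked with base R. The sketch below splits the matrix path from the example run into its prefix, genome and hash, then rebuilds the matching log and pairs paths. parse_hicool_name() is a hypothetical helper written for this illustration, not a HiCool function, and it assumes that neither the prefix nor the genome contains a "^" character.

```r
# Matrix path copied from the example run shown earlier.
mcool_path <- "./HiCool//matrices/15c427129b3999_7833^mapped-R64-1-1^03KUOG.mcool"

# Hypothetical helper: split "<prefix>^mapped-<genome>^<hash>.<ext>" into its parts.
parse_hicool_name <- function(path) {
    file <- basename(path)
    stem <- sub("\\.[^.]+$", "", file)               # drop the file extension
    parts <- strsplit(stem, "^", fixed = TRUE)[[1]]  # split on the literal "^"
    stopifnot(length(parts) == 3, startsWith(parts[2], "mapped-"))
    list(
        prefix = parts[1],
        genome = sub("^mapped-", "", parts[2]),
        hash = parts[3]
    )
}

parts <- parse_hicool_name(mcool_path)
str(parts)

# Rebuild the sibling file paths under the same output folder.
output_folder <- dirname(dirname(mcool_path))
stem <- sprintf("%s^mapped-%s^%s", parts$prefix, parts$genome, parts$hash)
log_path <- file.path(output_folder, "logs", paste0(stem, ".log"))
pairs_path <- file.path(output_folder, "pairs", paste0(stem, ".pairs"))
```

The plot files carry an extra suffix after the hash, so they are better located with list.files() on the plots folder than with this helper.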

Notes:

- The .pairs file format is defined by the 4DN consortium;
- The .(m)cool file format is defined by cooler authors in the supporting publication.

Processing Hi-C sequencing libraries into .pairs and .mcool files requires several dependencies, to (1) align reads to a reference genome, (2) manage alignment files (SAM), (3) filter pairs, (4) bin them to a specific resolution and (5) balance the resulting contact matrices.

All system dependencies are internally managed by basilisk.utils. HiCool maintains a conda environment containing:

- python 3.12.11
- numpy 1.26.4
- bowtie2 2.5.4
- chromosight 1.6.3
- cooler 0.10.3
- hicstuff 3.2.4
- pairtools 1.1.3
- samtools 1.22.1

The first time HiCool() is executed, a fresh conda environment will be created and the required dependencies automatically installed. This ensures compatibility between the different system dependencies needed to process Hi-C fastq files.

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] HiContactsData_1.11.0 ExperimentHub_2.99.5 AnnotationHub_3.99.6
#> [4] BiocFileCache_2.99.5 dbplyr_2.5.0 BiocGenerics_0.55.0
#> [7] generics_0.1.4 HiCool_1.9.1 HiCExperiment_1.9.1
#> [10] BiocStyle_2.37.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 httr2_1.2.0
#> [3] rlang_1.1.6 magrittr_2.0.3
#> [5] matrixStats_1.5.0 compiler_4.5.1
#> [7] RSQLite_2.4.1 dir.expiry_1.17.0
#> [9] png_0.1-8 vctrs_0.6.5
#> [11] stringr_1.5.1 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0
#> [15] XVector_0.49.0 rmdformats_1.0.4
#> [17] rmarkdown_2.29 sessioninfo_1.2.3
#> [19] tzdb_0.5.0 UCSC.utils_1.5.0
#> [21] strawr_0.0.92 purrr_1.1.0
#> [23] bit_4.6.0 xfun_0.52
#> [25] cachem_1.1.0 GenomeInfoDb_1.45.7
#> [27] jsonlite_2.0.0 blob_1.2.4
#> [29] rhdf5filters_1.21.0 DelayedArray_0.35.2
#> [31] Rhdf5lib_1.31.0 BiocParallel_1.43.4
#> [33] parallel_4.5.1 R6_2.6.1
#> [35] bslib_0.9.0 stringi_1.8.7
#> [37] RColorBrewer_1.1-3 reticulate_1.42.0
#> [39] GenomicRanges_1.61.1 jquerylib_0.1.4
#> [41] Rcpp_1.1.0 Seqinfo_0.99.1
#> [43] bookdown_0.43 SummarizedExperiment_1.39.1
#> [45] knitr_1.50 IRanges_2.43.0
#> [47] Matrix_1.7-3 tidyselect_1.2.1
#> [49] dichromat_2.0-0.1 abind_1.4-8
#> [51] yaml_2.3.10 codetools_0.2-20
#> [53] curl_6.4.0 lattice_0.22-7
#> [55] tibble_3.3.0 withr_3.0.2
#> [57] InteractionSet_1.37.0 Biobase_2.69.0
#> [59] basilisk.utils_1.21.2 KEGGREST_1.49.1
#> [61] evaluate_1.0.4 Biostrings_2.77.2
#> [63] pillar_1.11.0 BiocManager_1.30.26
#> [65] filelock_1.0.3 MatrixGenerics_1.21.0
#> [67] stats4_4.5.1 plotly_4.11.0
#> [69] vroom_1.6.5 BiocVersion_3.22.0
#> [71] S4Vectors_0.47.0 ggplot2_3.5.2
#> [73] scales_1.4.0 glue_1.8.0
#> [75] lazyeval_0.2.2 tools_4.5.1
#> [77] BiocIO_1.19.0 data.table_1.17.8
#> [79] rhdf5_2.53.1 grid_4.5.1
#> [81] tidyr_1.3.1 crosstalk_1.2.1
#> [83] AnnotationDbi_1.71.0 basilisk_1.21.5
#> [85] cli_3.6.5 rappdirs_0.3.3
#> [87] S4Arrays_1.9.1 viridisLite_0.4.2
#> [89] dplyr_1.1.4 gtable_0.3.6
#> [91] sass_0.4.10 digest_0.6.37
#> [93] SparseArray_1.9.0 htmlwidgets_1.6.4
#> [95] farver_2.1.2 memoise_2.0.1
#> [97] htmltools_0.5.8.1 lifecycle_1.0.4
#> [99] httr_1.4.7 bit64_4.6.0-1