Identify reproducible genomic peaks from replicate ChIP-seq experiments

Konstantin Krismer

2023-10-24

IDR2D extends the original IDR method (Li et al. 2011), which was designed for ChIP-seq peaks, i.e., one-dimensional genomic data. The package applies the method to two-dimensional genomic data, such as interactions between two genomic loci (also called anchors). Genomic interaction data is generated by genome-wide methods such as Hi-C (Berkum et al. 2010), ChIA-PET (Fullwood and Ruan 2009), and HiChIP (Yan et al. 2014). This vignette covers the one-dimensional case: identifying reproducible peaks from replicate ChIP-seq experiments.

Input data

Load example data:

rep1_df <- idr2d:::chipseq$rep1_df
rep2_df <- idr2d:::chipseq$rep2_df
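For readers preparing their own peak files, the following is a minimal sketch of the expected input layout. It assumes the four-column format (chromosome, start, end, ranking value) described in the documentation of estimate_idr1d; the data frame name toy_rep1_df, the column names, and the peak coordinates are made up for illustration and are not part of the example data set.

```r
# Hypothetical toy peaks; the assumption is that column position matters
# (chromosome, start, end, ranking value), not the column names.
toy_rep1_df <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(1000L, 5000L, 200L),
  end   = c(1500L, 5600L, 900L),
  value = c(35.2, 12.7, 80.1),
  stringsAsFactors = FALSE
)

# Basic sanity checks worth running before an analysis
stopifnot(
  ncol(toy_rep1_df) >= 4,
  is.character(toy_rep1_df$chr),
  is.numeric(toy_rep1_df$value),
  all(toy_rep1_df$start < toy_rep1_df$end)
)

# Inspect the structure of the data frame
str(toy_rep1_df)
```

A second replicate would be prepared the same way; the two data frames do not need to contain the same number of peaks.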

Example data - replicate 1

Only the first 1000 peaks are shown.

Example data - replicate 2

Only the first 1000 peaks are shown.

Analysis

Load the package:

library(idr2d)

Estimate IDR:

idr_results <- estimate_idr1d(rep1_df, rep2_df, 
                              value_transformation = "log")
rep1_idr_df <- idr_results$rep1_df

Note that the appropriate value transformation depends on the semantics of the value column (the fourth column for one-dimensional data) in rep1_df and rep2_df. This column is used to rank the peaks, with highly significant peaks at the top of the list and the least significant peaks (i.e., most likely noise) at the bottom. The ranking is established by sorting the transformed value column in descending order, so high values must indicate high significance. The log transformation used above preserves the order of the values and therefore suits value columns in which higher values are more significant. If the value column instead contains p-values or measures derived from p-values (such as FDRs or q-values), where lower values are more significant, the values must be transformed to comply with this assumption; in that case, the log_additive_inverse transformation (-log(x)) is recommended.
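To make the effect of the transformation on the ranking concrete, here is a small base R sketch. It does not call the package; the vector fdr and the helper desc_rank are hypothetical names introduced only for this illustration.

```r
# Hypothetical FDR values for five peaks (lower = more significant)
fdr <- c(0.20, 0.001, 0.05, 0.5, 0.01)

# Helper: rank in descending order, so rank 1 is the largest value
desc_rank <- function(x) rank(-x)

# -log(x) is strictly decreasing, so the smallest FDR gets the
# largest transformed value and ends up at the top of the list
rank_additive_inverse <- desc_rank(-log(fdr))

# A plain log is strictly increasing, so it keeps the original order
# and would wrongly place the largest FDR at the top
rank_plain_log <- desc_rank(log(fdr))

data.frame(fdr, rank_additive_inverse, rank_plain_log)
```

With -log(x), the peak with FDR 0.001 receives rank 1, whereas with a plain log the peak with FDR 0.5 does. This is why the choice of transformation has to match the meaning of the value column.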

Results

Only the first 1000 observations are shown.

Summary

summary(idr_results)
## analysis type: IDR1D
## number of interactions in replicate 1: 20978
## number of interactions in replicate 2: 20979
## number of reproducible interactions: 500
## number of interactions with significant IDR (IDR < 0.05): 101
## number of interactions with highly significant IDR (IDR < 0.01): 69
## percentage of interactions with significant IDR (IDR < 0.05): 0.48 %
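The reported percentage can be checked by hand, and the same threshold can be applied to a results table. The sketch below rests on two assumptions: that the percentage is computed relative to the number of entries in replicate 1, and that the results data frame has a column named idr in which entries without an estimated IDR are NA. The data frame toy_idr_df is made up for illustration.

```r
# Counts taken from the summary output above
n_rep1 <- 20978
n_significant <- 101

# Percentage of replicate 1 entries with IDR < 0.05
# (assumption: the denominator is the replicate 1 count)
pct_significant <- round(100 * n_significant / n_rep1, 2)
pct_significant

# Hypothetical results table; the column name "idr" is an assumption
toy_idr_df <- data.frame(
  chr   = c("chr1", "chr1", "chr2", "chr3"),
  start = c(1000L, 5000L, 200L, 700L),
  end   = c(1500L, 5600L, 900L, 1300L),
  idr   = c(0.003, 0.04, NA, 0.6),
  stringsAsFactors = FALSE
)

# Keep only rows with a defined IDR below the 0.05 threshold
significant <- toy_idr_df[!is.na(toy_idr_df$idr) & toy_idr_df$idr < 0.05, ]
nrow(significant)
```

Checking for NA explicitly avoids rows of missing values that would otherwise appear when subsetting with a logical vector containing NA.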

Distribution of IDRs

draw_idr_distribution_histogram(rep1_idr_df)

Rank - IDR dependence

draw_rank_idr_scatterplot(rep1_idr_df)

Value - IDR dependence

draw_value_idr_scatterplot(rep1_idr_df)

Additional information

Most of the functionality of the IDR2D package is also offered through the website at https://idr2d.mit.edu.

For a more detailed discussion of IDR2D, see the IDR2D paper:

IDR2D identifies reproducible genomic interactions
Konstantin Krismer, Yuchun Guo, and David K. Gifford
Nucleic Acids Research, Volume 48, Issue 6, 06 April 2020, Page e31; DOI: https://doi.org/10.1093/nar/gkaa030

References

Berkum, N. L. van, E. Lieberman-Aiden, L. Williams, M. Imakaev, A. Gnirke, L. A. Mirny, J. Dekker, and E. S. Lander. 2010. “Hi-C: a method to study the three-dimensional architecture of genomes.” J Vis Exp, no. 39 (May).

Fullwood, M. J., and Y. Ruan. 2009. “ChIP-based methods for the identification of long-range chromatin interactions.” J. Cell. Biochem. 107 (1): 30–39.

Li, Qunhua, James B. Brown, Haiyan Huang, and Peter J. Bickel. 2011. “Measuring Reproducibility of High-Throughput Experiments.” Ann. Appl. Stat. 5 (3): 1752–79. https://doi.org/10.1214/11-AOAS466.

Yan, H., J. Evans, M. Kalmbach, R. Moore, S. Middha, S. Luban, L. Wang, et al. 2014. “HiChIP: a high-throughput pipeline for integrative analysis of ChIP-Seq data.” BMC Bioinformatics 15 (August): 280.