A step-by-step guide to analyzing CAGE data using R/Bioconductor

Malte Thodberg1 and Albin Sandelin2

1Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen
2Department of Biology and Biotech Research and Innovation Centre, University of Copenhagen

1 November 2024

Abstract

Cap Analysis of Gene Expression (CAGE) is one of the most popular 5’-end sequencing methods. In a single experiment, CAGE can be used to locate and quantify the expression of both Transcription Start Sites (TSSs) and enhancers. This workflow is a case study on how to use the CAGEfightR package to orchestrate analysis of CAGE data within the Bioconductor project. This workflow starts from BigWig-files and covers both basic CAGE analyses such as identifying, quantifying and annotating TSSs and enhancers, advanced analysis such as finding interacting TSS-enhancer pairs and enhancer clusters, to differential expression analysis and alternative TSS usage. R-code, discussion and references are intertwined to help provide guidelines for future CAGE studies of the same kind.

1 Background
2 Materials and methods
- 2.1 Dataset
- 2.2 R-packages
3 Workflow
4 Discussion
5 Author information
6 Competing interests
7 Grant information
8 Acknowledgments

R version: R version 4.4.1 (2024-06-14)

Bioconductor version: 3.20

CAGEfightR version: 1.26.0

1 Background

Transcriptional regulation is one of the most important aspects of gene expression. Transcription Start Sites (TSSs) are focal points of this process: The TSS act as an integration point for a wide range of molecular cues from surrounding genomic areas to determine transcription and ultimately expression levels. These include proximal factors such as chromatin accessibility, chromatin modification, DNA methylation and transcription factor binding, and distal factors such as enhancer activity and chromatin confirmation (Smale and Kadonaga 2003; Kadonaga 2012; Lenhard, Sandelin, and Carninci 2012; Haberle and Stark 2018).

Cap Analysis of Gene Expression (CAGE) has emerged as one of the dominant high-throughput assays for studying TSSs (Adiconis et al. 2018). CAGE is based on “cap trapping”: capturing capped full-length RNAs and sequencing only the first 20-30 nucleotides from the 5’-end, so-called CAGE tags (Takahashi et al. 2012). When mapped to a reference genome, the 5’-ends of CAGE tag identify the actual TSS for respective RNA with basepair-level accuracy. Basepair-accurate TSSs identified this way are referred to as CAGE Transcription Start Sites (CTSSs). RNA polymerase rarely initiates from just a single nucleotide: this is manifested in CAGE data by the fact that CTSSs are mostly found in tightly spaced groups on the same strand. The majority of CAGE studies have merged such CTSSs into genomic blocks typically referred to as Tag Clusters (TCs), using a variety of clustering methods (see below). TCs are often interpreted as TSSs in the more general sense, given that most genes have many CTSSs, but only a few TCs that represent a few major transcripts with highly similar CTSSs (Carninci et al. 2006; Sandelin et al. 2007). Since the number of mapped CAGE tags in a given TC is indicative of the number of RNAs from that region, the number of CAGE tags falling in given TC can be used as a measure of expression (Kawaji et al. 2014).

As CAGE tags can be mapped to a reference genome without the need for transcript annotations, it can detect TSSs of known mRNAs, but also novel alternative TSSs (that might be condition or tissue dependent) (Carninci et al. 2006; Consortium, Pmi, and Dgt 2014). Since CAGE captures all capped RNAs, it can also identify long non-coding RNA (lincRNA) (Hon et al. 2017). It was previously shown that active enhancers are characterized by balanced bidirectional transcription producing enhancer RNAs (eRNAs), making it possible to predict enhancer regions and quantify their expression levels from CAGE data alone (Kim et al. 2010; Andersson et al. 2014). Thus, CAGE data can predict the locations and activity of mRNAs, lincRNAs and enhancers in a single assay, providing a comprehensive view of transcriptional regulation across both inter- and intragenic regions.

Bioconductor contains a vast collection of tools for analyzing transcriptomics datasets, in particular the widely used RNA-Seq and microarray assays(Huber et al. 2015). Only a few packages are dedicated to analyzing 5’-end data in general and CAGE data in particular: TSRchitect (Taylor Raborn, Brendel, and Sridharan, n.d.), icetea (Bhardwaj 2019), CAGEr (Haberle et al. 2015) and CAGEfightR (Thodberg et al. 2019) (Table 1).

CAGEr was the first package solely dedicated to the analysis of CAGE data and was recently updated to more closely adhere to Bioconductor S4-class standards. CAGEr takes as input aligned reads in the form of BAM-files and can identify, quantify, characterize and annotate TSSs. TSSs are found in individual samples using either simple clustering of CTSSs (greedy or distance-based clustering) or the more advanced density-based paraclu clustering method(Frith et al. 2008), and can be aggregated across samples to a set of consensus clusters. Several specialized routines for CAGE data is available, such as G-bias correction of mapped tags, power law normalization of CTSS counts and fine-grained TSS shifts. Finally, CAGEr offers easy interface to the large collection of CAGE data from the FANTOM consortium (Consortium, Pmi, and Dgt 2014). TSRchitect and icetea are two more recent additions to Bioconductor. While being less comprehensive, they aim to be more general and handle more types of 5’-end methods that are conceptually similar to CAGE (RAMPAGE, PEAT, PRO-Cap, etc. (Adiconis et al. 2018)). Both packages can identify, quantify and annotate TSSs, with TSRchitect using an X-means algorithm and icetea using a sliding window approach. icetea offers the unique feature of mapping reads to a reference genome by interfacing with Rsubread. Both CAGEr and icetea offers built-in capabilities for differential expression (DE) analysis via the popular DESeq2 or edgeR packages (Love, Huber, and Anders 2014; Robinson, McCarthy, and Smyth 2010).

CAGEfightR is a recent addition to Bioconductor focused on analyzing CAGE data, but applicable to most 5’-end data. It aims to be general and flexible to allow for easy interfacing with the wealth of other Bioconductor packages. CAGEfightR takes CTSSs stored in BigWig-files as input and uses only standard Bioconductor S4-classes (GenomicRanges, SummarizedExperiment, InteractionSet(Lawrence et al. 2013; Lun, Perry, and Ing-Simmons 2016)) making it easy for users to learn and combine CAGEfightR with functions from other Bioconductor packages (e.g. instead of providing custom wrappers around other packages such as differential expression analysis). In addition to TSS analysis, CAGEfightR is the only package on Bioconductor to also offer functions for enhancer analysis based on CAGE (and similarly scoped) data. This includes enhancer identification and quantification, linking enhancers to TSSs via correlation of expression and finding enhancer clusters, often referred to as stretch- or super enhancers.

Table 1: Comparison of Bioconductor packages for CAGE data analysis.
Analysis	icetea	TSRchitect	CAGEr	CAGEfightR
Simplest input	FASTQ	BAM	BAM	BigWig/bedGraph
Paired-end input	+	+	-	-
CTSS extraction	-	+	+	-
TSS calling	Sliding window	X-means	Distance or Paraclu	Slice-reduce
TSS shapes	-	Shape index	IQR and TSS shifts	IQR, entropy, etc.
Differential Expression	+	-	+	-
Enhancer calling	-	-	-	+
TSS-enhancer correlation	-	-	-	+
Super enhancers	-	-	-	+

In this workflow, we illustrate how the CAGEfightR package can be used to orchestrate an end-to-end analysis of CAGE data by making it easy to interface with a wide range of different Bioconductor packages. Highlights include TSS and enhancer candidate identification, differential expression, alternative TSS usage, enrichment of motifs, GO/KEGG terms and calculating TSS-enhancer correlations.

2 Materials and methods

2.1 Dataset

This workflow uses data from “Identification of Gene Transcription Start Sites and Enhancers Responding to Pulmonary Carbon Nanotube Exposure in Vivo” by Bornholdt et al (Bornholdt et al. 2017). This study uses mouse as a model system to investigate how carbon nanotubes affect lung tissue when inhaled. Inhaled nanotubes were previously found to produce a similar response to asbestos, potentially triggering an inflammatory response in the lung tissue leading to drastic changes in gene expression.

The dataset consists of CAGE data from mouse lung biopsies: 5 mice whose lungs were instilled with water (Ctrl) and 6 mice whose lungs were instilled with nanotubes (Nano), with CTSSs for each sample stored in BigWig-files, shown in Table 2:

Table 2: Overview of samples in the nanotube exposure experiment.
Group	Biological Replicates
Ctrl	5 mice
Nano	6 mice

The data is acquired via the nanotubes data package:

library(nanotubes)
data(nanotubes)

2.2 R-packages

This workflow uses a large number of R-packages: Bioconductor packages are primarily used for data analysis while packages from the tidyverse are used to wrangle and plot results. All these packages are loaded prior to beginning the workflow:

# CRAN packages for data manipulation and plotting
library(pheatmap)
library(ggseqlogo)
library(viridis)
library(magrittr)
library(ggforce)
library(ggthemes)
library(tidyverse)

# CAGEfightR and related packages
library(CAGEfightR)
library(GenomicRanges)
library(SummarizedExperiment)
library(GenomicFeatures)
library(BiocParallel)
library(InteractionSet)
library(Gviz)

# Bioconductor packages for differential expression
library(DESeq2)
library(limma)
library(edgeR)
library(statmod)
library(BiasedUrn)
library(sva)

# Bioconductor packages for enrichment analyses
library(TFBSTools)
library(motifmatchr)
library(pathview)

# Bioconductor data packages
library(BSgenome.Mmusculus.UCSC.mm9)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
library(org.Mm.eg.db)
library(JASPAR2016)

We also set some script-wide settings for later convenience:

# Rename these for easier access
bsg <- BSgenome.Mmusculus.UCSC.mm9
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
odb <- org.Mm.eg.db

# Script-wide settings
register(MulticoreParam(3)) # Parallel execution when possible
theme_set(theme_light()) # White theme for ggplot2 figures

3 Workflow

The workflow is divided into 3 parts covering different parts of a typical CAGE data analysis:

Shows how to use CAGEfightR to import CTSSs and find and quantify TSS and enhancer candidates.
Shows examples of how to perform genomic analyses of CAGE clusters using core Bioconductor packages such as GenomicRanges and Biostrings. This part covers typical analyses made from CAGE data, from summarizing cluster annotation, TSS shapes and core promoter sequence analysis to more advanced spatial analyses (finding TSS-enhancer correlation links and clusters of enhancers).
Shows how CAGEfightR can be used to prepare data for differential expression analysis with popular R packages, including DESeq2, limma and edgeR (Love, Huber, and Anders 2014; Ritchie et al. 2015; Robinson, McCarthy, and Smyth 2010). Borrowing from RNA-Seq terminology, differential expression can be assessed at multiple different levels: TSS- and enhancer-level, gene-level and differential TSS usage(Soneson, Love, and Robinson 2015). Once differential expression results have been obtained, they can be combined with other sources of information such as motifs from JASPAR (Mathelier et al. 2016) and GO/KEGG terms(Ashburner et al. 2000; The Gene Ontology Consortium 2019; Kanehisa and Goto 2000).

3.1 Part 1: Locating, quantifying and annotating TSSs and enhancers

Before starting the analysis, we recommend gathering all information (Filenames, groups, batches, preparation data, etc.) about the samples to be analyzed in a single data.frame, often called the design matrix. CAGEfightR can keep track of the design matrix throughout the analysis:

knitr::kable(nanotubes, 
      caption = "The initial design matrix for the nanotubes experiment")

Table 3: The initial design matrix for the nanotubes experiment
	Class	Name	BigWigPlus	BigWigMinus
C547	Ctrl	C547	mm9.CAGE_7J7P_NANO_KON_547.plus.bw	mm9.CAGE_7J7P_NANO_KON_547.minus.bw
C548	Ctrl	C548	mm9.CAGE_ULAC_NANO_KON_548.plus.bw	mm9.CAGE_ULAC_NANO_KON_548.minus.bw
C549	Ctrl	C549	mm9.CAGE_YM4F_Nano_KON_549.plus.bw	mm9.CAGE_YM4F_Nano_KON_549.minus.bw
C559	Ctrl	C559	mm9.CAGE_RSAM_NANO_559.plus.bw	mm9.CAGE_RSAM_NANO_559.minus.bw
C560	Ctrl	C560	mm9.CAGE_CCLF_NANO_560.plus.bw	mm9.CAGE_CCLF_NANO_560.minus.bw
N13	Nano	N13	mm9.CAGE_KTRA_Nano_13.plus.bw	mm9.CAGE_KTRA_Nano_13.minus.bw
N14	Nano	N14	mm9.CAGE_RSAM_NANO_14.plus.bw	mm9.CAGE_RSAM_NANO_14.minus.bw
N15	Nano	N15	mm9.CAGE_RFQS_Nano_15.plus.bw	mm9.CAGE_RFQS_Nano_15.minus.bw
N16	Nano	N16	mm9.CAGE_CCLF_NANO_16.plus.bw	mm9.CAGE_CCLF_NANO_16.minus.bw
N17	Nano	N17	mm9.CAGE_RSAM_NANO_17.plus.bw	mm9.CAGE_RSAM_NANO_17.minus.bw
N18	Nano	N18	mm9.CAGE_CCLF_NANO_18.plus.bw	mm9.CAGE_CCLF_NANO_18.minus.bw

3.1.1 Obtaining CTSSs

CAGEfightR starts analysis from simple CTSSs, which are the number of CAGE tag 5’-ends mapping to each basepair (bp) in the genome. CTSSs are normally stored as either BED-files, bedGraph-files or BigWig-files. As CTSSs are sparse (only are small fraction of all bps are CTSSs), these files are relatively small and thereby easily shared, many studies will make CTSSs available via online repositories such as GEO.

When preparing CTSSs from new CAGE libraries, tags are normally first barcode split, trimmed and filtered before mapping to a reference genome using standard command line tools. The exact steps will depend on the given CAGE protocol in use. CTSSs can subsequently be extracted from the resulting BAM-files one library at a time, for example using genomecov from bedtools with the -5 setting. TSRchitect and CAGEr also include functions (inputToTSS(), getCTSSs(), respectively) for obtaining CTSSs from BAM-files from within R, with CAGEr having the option of correcting for G-bias when mapping.

CAGEfightR can analyze many types of 5’-end data, as long as they can be represented in a format similar to CTSSs.

3.1.2 Importing CTSSs

CAGEfightR uses the convenient BigWigFile and BigWigFileList containers for handling CTSSs stored in BigWig-files (one file for each strand), as these allow for fast random access, inspection and summarization of the genome information stored in the files (e.g. via import(), seqinfo() and summary()). First, we need to tell CAGEfightR where to find the BigWig-files containing CTSSs on the hard drive. Normally, one would supply the paths to each file (e.g. /CAGEdata/BigWigFiles/Sample1_plus.bw), but here we will use data from the nanotubes data package:

# Setup paths to file on hard drive
bw_plus <- system.file("extdata", nanotubes$BigWigPlus, 
                        package = "nanotubes", 
                        mustWork = TRUE)
bw_minus <- system.file("extdata", nanotubes$BigWigMinus, 
                        package = "nanotubes", 
                        mustWork = TRUE)

# Save as named BigWigFileList
bw_plus <- BigWigFileList(bw_plus)
bw_minus <- BigWigFileList(bw_minus)
names(bw_plus) <- names(bw_minus) <- nanotubes$Name

The first step is quantifying CTSS usage across all samples. This is often one of the most time-consuming steps in a CAGEfightR analysis, but it can be sped up by using multiple cores (if available, see Materials and Methods). We also need to specify the genome, which we can get from the BSgenome.Mmusculus.UCSC.mm9 genome package:

CTSSs <- quantifyCTSSs(plusStrand = bw_plus,
                       minusStrand = bw_minus,
                       genome = seqinfo(bsg),
                       design = nanotubes)
#> Checking design...
#> Checking supplied genome compatibility...
#> Iterating over 1 genomic tiles in 11 samples using 3 worker(s)...
#> Importing CTSSs from plus strand...
#> Importing CTSSs from minus strand...
#> Merging strands...
#> Formatting output...
#> ### CTSS summary ###
#> Number of samples: 11
#> Number of CTSSs: 9.339 millions
#> Sparsity: 81.68 %
#> Type of rowRanges: StitchedGPos
#> Final object size: 280.88 MB

The circa 9 million CTSSs are stored as a RangedSummarizedExperiment, which is the standard container of high-throughput experiments in Bioconductor. We can inspect both the ranges and the CTSS counts:

# Get a summary
CTSSs
#> class: RangedSummarizedExperiment 
#> dim: 9338802 11 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(11): C547 C548 ... N17 N18
#> colData names(4): Class Name BigWigPlus BigWigMinus

# Extract CTSS positions
rowRanges(CTSSs)
#> StitchedGPos object with 9338802 positions and 0 metadata columns:
#>                 seqnames       pos strand
#>                    <Rle> <integer>  <Rle>
#>         [1]         chr1   3024556      +
#>         [2]         chr1   3025704      +
#>         [3]         chr1   3025705      +
#>         [4]         chr1   3028283      +
#>         [5]         chr1   3146133      +
#>         ...          ...       ...    ...
#>   [9338798] chrUn_random   5810899      -
#>   [9338799] chrUn_random   5813784      -
#>   [9338800] chrUn_random   5880838      -
#>   [9338801] chrUn_random   5893536      -
#>   [9338802] chrUn_random   5894263      -
#>   -------
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

# Extract CTSS counts
assay(CTSSs, "counts") %>%
    head
#> 6 x 11 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 11 column names 'C547', 'C548', 'C549' ... ]]
#>                           
#> [1,] . . 1 . . . . . . . .
#> [2,] . . . 1 . . . . . . .
#> [3,] . . . . 1 . . . . . .
#> [4,] . . . . 1 . . . . . .
#> [5,] . . . . . . 1 . . . .
#> [6,] . 1 . . . . . . . . .

3.1.3 Unidirectional and bidirectional clustering for finding TSS and enhancer candidates:

CAGEfightR finds clusters by calculating the pooled CTSS signal across all samples: We first normalize CTSS counts in each sample to Tags-Per-Million (TPM) values, and then sum TPM values across samples:

CTSSs <- CTSSs %>%
    calcTPM() %>%
    calcPooled()
#> Calculating library sizes...
#> Calculating TPM...

This will add several new pieces of information to CTSSs: The total number of tags in each library, a new assay called TPM, and the pooled signal for each CTSS.

We can use unidirectional clustering to locate unidirectional clusters, often simply called Tag Clusters (TCs), which are candidates for TSSs. The quickTSSs will both locate and quantify TCs in a single function call:

TCs <- quickTSSs(CTSSs)
#> Using existing score column!
#> 
#>  - Running clusterUnidirectionally:
#> Splitting by strand...
#> Slice-reduce to find clusters...
#> Calculating statistics...
#> Preparing output...
#> Tag clustering summary:
#>   Width   Count Percent
#>   Total 3602099 1e+02 %
#>     >=1 2983433 8e+01 %
#>    >=10  577786 2e+01 %
#>   >=100   40842 1e+00 %
#>  >=1000      38 1e-03 %
#> 
#>  - Running quantifyClusters:
#> Finding overlaps...
#> Aggregating within clusters...

Note: quickTSSs runs CAGEfightR with default settings. If you have larger or more noisy datasets you most likely want to do a more robust analysis with different settings. See the CAGEfightR vignette for more information.

Many of the identified TCs will be very lowly expressed. To obtain likely biologically relevant TSSs, we keep only TCs expressed at more than 1 TPM in at least 5 samples (5 samples being the size of the smallest experimental group):

TSSs <- TCs %>%
    calcTPM() %>%
    subsetBySupport(inputAssay = "TPM", 
                    unexpressed = 1, 
                    minSamples = 4)
#> Calculating support...
#> Calculating library sizes...
#> Warning in calcTotalTags(object = object, inputAssay = inputAssay, outputColumn
#> = outputColumn): object already has a column named totalTags in colData: It
#> will be overwritten!
#> Calculating TPM...
#> Subsetting...
#> Removed 3573214 out of 3602099 regions (99.2%)

This removed a large number of very lowly expressed TCs, leaving us with almost 30.000 TCs for analysis. For simplicity, we will refer to these TCs as TSS candidates, as each TC can be seen as a measure of the location and activity of the TSS of a transcript or gene. Note that this is a simplification, since a TC technically groups together several bp-accurate CTSSs.

Then we turn to bidirectional clustering for identifying bidirectional clusters (BCs). Similarly, we can use quickEnhancers to locate and quantify BCs (BCs are quantified by summing tags on both strands of the cluster):

BCs <- quickEnhancers(CTSSs)
#> Using existing score column!
#> 
#>  - Running clusterBidirectionally:
#> Pre-filtering bidirectional candidate regions...
#> Retaining for analysis: 68%
#> Splitting by strand...
#> Calculating windowed coverage on plus strand...
#> Calculating windowed coverage on minus strand...
#> Calculating balance score...
#> Slice-reduce to find bidirectional clusters...
#> Calculating statistics...
#> Preparing output...
#> # Bidirectional clustering summary:
#> Number of bidirectional clusters: 106779
#> Maximum balance score: 1
#> Minimum balance score: 0.950001090872574
#> Maximum width: 1866
#> Minimum width: 401
#> 
#>  - Running subsetByBidirectionality:
#> Calculating bidirectionality...
#> Subsetting...
#> Removed 73250 out of 106779 regions (68.6%)
#> 
#>  - Running quantifyClusters:
#> Finding overlaps...
#> Aggregating within clusters...

Note: quickEnhancers runs CAGEfightR with default settings. If you have larger or more noisy datasets you most likely want to do a more robust analysis with different settings. See the CAGEfightR vignette for more information.

Again, we are not usually interested in very lowly expressed BCs. As BCs are overall lowly expressed compared to TCs, we will simply filter out BCs without at least 1 count in 5 samples:

BCs <- subsetBySupport(BCs, 
                       inputAssay = "counts", 
                       unexpressed = 0, 
                       minSamples = 4)
#> Calculating support...
#> Subsetting...
#> Removed 20017 out of 33529 regions (59.7%)

3.1.4 Annotating clusters with transcript models

After having located unidirectional and bidirectional clusters, we can annotate them according to known transcript and gene models. These types of annotation are store via TxDb-objects in Bioconductor. Here we will simply use UCSC transcripts included in the TxDb.Mmusculus.UCSC.mm9.knownGene package, but the CAGEfightR vignette includes examples of how to obtain a TxDb object from other sources (GFF/GTF files, AnnotationHub, etc.).

Starting with the TSS candidates, we can not only annotate a TSS with overlapping transcripts, but also in what part of a transcript a TSS lies by using a hierarchical annotation scheme. As some TSS candidates might be very wide, we only use the TSS peak for annotation purposes:

# Annotate with transcript IDs
TSSs <- assignTxID(TSSs, txModels = txdb, swap = "thick")
#> Extracting transcripts...
#> Finding hierachical overlaps...
#> ### Overlap Summary: ###
#> Features overlapping transcripts: 87.65 %
#> Number of unique transcripts: 31898

# Annotate with transcript context
TSSs <- assignTxType(TSSs, txModels = txdb, swap = "thick")
#> Finding hierachical overlaps with swapped ranges...
#> ### Overlap summary: ###
#>       txType count percentage
#> 1   promoter 13395       46.4
#> 2   proximal  2246        7.8
#> 3    fiveUTR  2112        7.3
#> 4   threeUTR  1200        4.2
#> 5        CDS  3356       11.6
#> 6       exon   161        0.6
#> 7     intron  2810        9.7
#> 8  antisense  1294        4.5
#> 9 intergenic  2311        8.0

Almost half of the TSSs were found at annotated promoters, while the other half is novel compared to the UCSC known transcripts.

Transcript annotation is particularly useful for enhancer analysis, as bidirectional clustering might also detect bidirectional promoters. Therefore, a commonly used filtering approached is to only consider BCs in intergenic or intronic regions as enhancer candidates:

# Annotate with transcript context
BCs <- assignTxType(BCs, txModels = txdb, swap = "thick")
#> Finding hierachical overlaps with swapped ranges...
#> ### Overlap summary: ###
#>       txType count percentage
#> 1   promoter   766        5.7
#> 2   proximal  1649       12.2
#> 3    fiveUTR    67        0.5
#> 4   threeUTR   596        4.4
#> 5        CDS   420        3.1
#> 6       exon    71        0.5
#> 7     intron  6815       50.4
#> 8  antisense     0        0.0
#> 9 intergenic  3128       23.1

# Keep only non-exonic BCs as enhancer candidates
Enhancers <- subset(BCs, txType %in% c("intergenic", "intron"))

This leaves almost 10000 BCs for analysis. Again, for simplificity, we will refer to these non-exonic BCs as enhancer candidates for the remainder of the workflow.

3.1.5 Merging into a single dataset

For many downstream analyses, in particular normalization and differential expression, it is useful to combine both TSS and enhancers candidates into a single dataset. This ensures that clusters do not overlap, so that each CAGE tag is counted only once.

We must first ensure that the enhancer and TSS candidates have the same information attached to them, since CAGEfightR will only allow merging of clusters if they have the same sample and cluster information:

# Clean colData
TSSs$totalTags <- NULL
Enhancers$totalTags <- NULL

# Clean rowData
rowData(TSSs)$balance <- NA
rowData(TSSs)$bidirectionality <- NA
rowData(Enhancers)$txID <- NA

# Add labels for making later retrieval easy
rowData(TSSs)$clusterType <- "TSS"
rowData(Enhancers)$clusterType <- "Enhancer"

Then the clusters can be merged: As enhancers could technically be detected as two separate TSSs, we only keep the enhancer if a TSS and enhancer candidate overlaps:

RSE <- combineClusters(object1 = TSSs, 
                       object2 = Enhancers, 
                       removeIfOverlapping = "object1")
#> Removing overlapping features from object1: 374
#> Keeping assays: counts
#> Keeping columns: score, thick, support, txID, txType, balance, bidirectionality, clusterType
#> Merging metadata...
#> Stacking and re-sorting...

We finally calculate the total number of tags and TPM-scaled counts for the final merged dataset:

RSE <- calcTPM(RSE)
#> Calculating library sizes...
#> Calculating TPM...

3.2 Part 2: Genomic analysis of TSSs and enhancers

3.2.1 Genome browser-figures of TSSs and enhancers

First we can simply plot some examples of TSSs and enhancers in a genome browser-style figure using the Gviz package (Hahne and Ivanek 2016). It takes a bit of code to setup, but the resulting tracks can be reused for later examples:

# Genome track
axis_track <- GenomeAxisTrack()

# Annotation track
tx_track <- GeneRegionTrack(txdb, 
                            name = "Gene Models", 
                            col = NA,
                            fill = "bisque4", 
                            shape = "arrow", 
                            showId = TRUE)

A good general strategy for quickly generating genome browser plots is to first define a region of interest, and then only plotting data within that region using subsetByOverlaps. The following code demonstrates this using the first TSS candidate:

# Extract 100 bp around the first TSS
plot_region <- RSE %>% 
    rowRanges() %>% 
    subset(clusterType == "TSS") %>% 
    .[1] %>%
    add(100) %>%
    unstrand()

# CTSS track
ctss_track <- CTSSs %>%
    rowRanges() %>%
    subsetByOverlaps(plot_region) %>%
    trackCTSS(name = "CTSSs")
#> Splitting pooled signal by strand...
#> Preparing track...

# Cluster track
cluster_track <- RSE %>%
    subsetByOverlaps(plot_region) %>%
    trackClusters(name = "Clusters", 
                  col = NA, 
                  showId = TRUE)
#> Setting thick and thin features...
#> Merging and sorting...
#> Preparing track...

# Plot tracks together
plotTracks(list(axis_track, 
                ctss_track,
                cluster_track,
                tx_track),
           from = start(plot_region), 
           to = end(plot_region), 
           chromosome = as.character(seqnames(plot_region)))

Figure 1: Genome browser example of TSS candidate

The top track shows the pooled CTSS signal and the middle track shows the identified TSS candidate. The thick bar indicates the TSS candidate peak (the overall most used CTSSs within the TSS candidate). The bottom track shows the known transcript model at this genomic location. In this case, the CAGE-defined TSS candidate corresponds well to the annotation.

We can also plot the first enhancer candidate:

# Extract 100 bp around the first enhancer
plot_region <- RSE %>% 
    rowRanges() %>% 
    subset(clusterType == "Enhancer") %>% 
    .[1] %>%
    add(100) %>%
    unstrand()

# CTSS track
ctss_track <- CTSSs %>%
    rowRanges() %>%
    subsetByOverlaps(plot_region) %>%
    trackCTSS(name = "CTSSs")
#> Splitting pooled signal by strand...
#> Preparing track...

# Cluster track
cluster_track <- RSE %>%
    rowRanges %>%
    subsetByOverlaps(plot_region) %>%
    trackClusters(name = "Clusters", 
                  col = NA, 
                  showId = TRUE)
#> Setting thick and thin features...
#> Merging and sorting...
#> Preparing track...

# Plot tracks together
plotTracks(list(axis_track, 
                ctss_track,
                cluster_track,
                tx_track),
           from = start(plot_region), 
           to = end(plot_region), 
           chromosome = as.character(seqnames(plot_region)))

Figure 2: Genome browser example of enhancer candidate

Here we see the bidirectional pattern characteristic of active enhancers. The enhancer candidate is seen in the middle track. The midpoint in thick marks the maximally balanced point within the enhancer candidate.

3.2.2 Location and expression of TSSs and enhancers

In addition to looking at single examples of TSS and enhancer candidates, we also want to get an overview of the number and expression of clusters in relation to transcript annotation. First we extract all of the necessary data from the RangedSummarizedExperiment into an ordinary data.frame:

cluster_info <- RSE %>%
    rowData() %>%
    as.data.frame()

Then we use ggplot2 to plot the number and expression levels of clusters in each annotation category:

# Number of clusters
ggplot(cluster_info, aes(x = txType, fill = clusterType)) +
    geom_bar(alpha = 0.75, position = "dodge", color = "black") +
    scale_fill_colorblind("Cluster type") +
    labs(x = "Cluster annotation", y = "Frequency") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Figure 3: Number of clusters within each annotation category

# Expression of clusters
ggplot(cluster_info, aes(x = txType, 
                         y = log2(score / ncol(RSE)), 
                         fill = clusterType)) +
    geom_violin(alpha = 0.75, draw_quantiles = c(0.25, 0.50, 0.75)) +
    scale_fill_colorblind("Cluster type") +
    labs(x = "Cluster annotation", y = "log2(TPM)") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Figure 4: Expression of clusters within each annotation category

We find that TSS candidates at annotated promoters are generally highly expressed. Most novel TSS candidates are expressed at lower levels, except for some TSS candidates in 5’-UTRs. Enhancer candidates are expressed at much lower levels than TSS candidates.

3.2.3 Analysing TSS shapes and sequences

A classic analysis of CAGE data is to divide TSSs into Sharp and Broad classes, which show different core promoter regions and different expression patterns across tissues(Carninci et al. 2006).

CAGEfightR can calculate several shape statistics that summarize the shape of a TSS. The Interquantile Range (IQR) can be used to find sharp and broad TSSs. The IQR measures the bp-distance holding e.g. 10-90% of the pooled expression in a TSS candidate, which dampens the effect of possible straggler tags that can greatly extend the width of a TSS candidate without contributing much to expression. As lowly expressed TSS candidates cannot show much variation in shape due to their low width and number of tags, we here focus on highly expressed TSS candidates (average TPM >= 10):

# Select highly expressed TSSs
highTSSs <- subset(RSE, clusterType == 'TSS' & score / ncol(RSE) >= 10)

# Calculate IQR as 10%-90% interval 
highTSSs <- calcShape(highTSSs, 
                      pooled = CTSSs, 
                      shapeFunction = shapeIQR, 
                      lower = 0.10, 
                      upper = 0.90)
#> Splitting by strand...
#> Applying function to each cluster...
#> Preparing output output...

We can then plot the bimodal distribution of IQRs. We use a zoom-in panel to highlight the distinction between the two classes:

highTSSs %>%
    rowData() %>%
    as.data.frame() %>%
    ggplot(aes(x = IQR)) +
    geom_histogram(binwidth = 1, 
                   fill = "hotpink", 
                   alpha = 0.75) +
    geom_vline(xintercept = 10, 
               linetype = "dashed", 
               alpha = 0.75, 
               color = "black") +
    facet_zoom(xlim = c(0,100)) +
    labs(x = "10-90% IQR", 
         y = "Frequency")

Figure 5: Bimodal distribution of Interquartile Ranges (IQRs) of highly expressed TSSs

We see most TSS candidates are either below or above 10 bp IQR (dashed line), so we use this cutoff to classify TSS candidates into Sharp and Broad:

# Divide into groups
rowData(highTSSs)$shape <- ifelse(rowData(highTSSs)$IQR < 10, "Sharp", "Broad")

# Count group sizes
table(rowData(highTSSs)$shape)
#> 
#> Broad Sharp 
#>  9555   812

We can now investigate the core promoter sequences of the two classes of TSS candidates. We first need to extract the surrounding promoter sequence for each TSS candidate: We define this as the TSS candidate peak -40/+10 bp and extract them using the BSgenome.Mmusculus.UCSC.mm9 genome package:

promoter_seqs <- highTSSs %>%
    rowRanges() %>%
    swapRanges() %>%
    promoters(upstream = 40, downstream = 10) %>%
    getSeq(bsg, .)

This returns a DNAStringSet-object which we can plot as a sequence logo (Schneider and Stephens 1990) via the ggseqlogo package(Wagih 2017):

promoter_seqs %>%
    as.character %>%
    split(rowData(highTSSs)$shape) %>%
    ggseqlogo(data = ., ncol = 2, nrow = 1) +
    theme_logo() +
    theme(axis.title.x = element_blank(),
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank())
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
#> of ggplot2 3.3.4.
#> ℹ The deprecated feature was likely used in the ggseqlogo package.
#>   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

Figure 6: Sequence logos of core promoter regions of Sharp and Broad classes of TSSs

As expected, we observe that Sharp TSS candidates tend to have a stronger TATA-box upstream of the TSS peak compared to Broad TSS candidates.

3.2.4 Finding candidates for interacting TSSs and enhancers

In addition to simply identifying enhancers, it is also interesting to try and infer what genes they might be regulating. CAGE data can itself not provide direct evidence that an enhancer is physically interacting with a TSS. This would require specialized chromatin confirmation capture assays such as HiC, 4C, 5C, etc. However, previous studies have shown that TSSs and enhancers that are close to each other and have highly correlated expression are more likely to be interacting. We can therefore use distance and correlation of expression between TSSs and enhancers to identify TSSs-enhancer links as candidates for physical interactions(Andersson et al. 2014).

To do this with CAGEfightR, we first need to indicate the two types of clusters as a factor with two levels:

rowData(RSE)$clusterType <- RSE %>%
    rowData() %>%
    use_series("clusterType") %>%
    as_factor() %>%
    fct_relevel("TSS")

We can then calculate all pairwise correlations between TSSs and enhancer within a distance of 50 kbp. Here we use the non-parametric Kendall’s tau as a measure of correlation, but other functions for calculating correlation can be supplied (e.g. one could calculate Pearson’s r on log-transformed TPM values to only capture linear relationships):

# Find links and calculate correlations
all_links <- RSE %>%
    swapRanges() %>%
    findLinks(maxDist = 5e4L,
              directional = "clusterType",
              inputAssay = "TPM",
              method = "kendall")
#> Finding directional links from TSS to Enhancer...
#> Calculating 41311 pairwise correlations...
#> Preparing output...
#> # Link summary:
#> Number of links: 41311
#> Summary of pairwise distance:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>     205    8832   21307   22341   35060   50000

# Show links
all_links
#> GInteractions object with 41311 interactions and 4 metadata columns:
#>              seqnames1   ranges1        seqnames2   ranges2 | orientation
#>                  <Rle> <IRanges>            <Rle> <IRanges> | <character>
#>       [1]         chr1   6204746 ---         chr1   6226837 |  downstream
#>       [2]         chr1   7079251 ---         chr1   7083527 |  downstream
#>       [3]         chr1   9535519 ---         chr1   9554735 |  downstream
#>       [4]         chr1   9538162 ---         chr1   9554735 |  downstream
#>       [5]         chr1  20941781 ---         chr1  20990601 |  downstream
#>       ...          ...       ... ...          ...       ... .         ...
#>   [41307]  chr9_random    193165 ---  chr9_random    217926 |    upstream
#>   [41308]  chr9_random    193165 ---  chr9_random    242951 |    upstream
#>   [41309]  chr9_random    223641 ---  chr9_random    217926 |  downstream
#>   [41310]  chr9_random    223641 ---  chr9_random    242951 |    upstream
#>   [41311] chrUn_random   3714359 --- chrUn_random   3718258 |    upstream
#>            distance   estimate   p.value
#>           <integer>  <numeric> <numeric>
#>       [1]     22090 -0.0603023  0.805434
#>       [2]      4275  0.3659942  0.128613
#>       [3]     19215 -0.2132007  0.392330
#>       [4]     16572  0.3411211  0.171112
#>       [5]     48819  0.1407053  0.565461
#>       ...       ...        ...       ...
#>   [41307]     24760  0.4770843 0.0423302
#>   [41308]     49785  0.1809068 0.4599290
#>   [41309]      5714 -0.0366988 0.8758961
#>   [41310]     19309 -0.2613098 0.2857948
#>   [41311]      3898 -0.1705606 0.4937737
#>   -------
#>   regions: 38454 ranges and 8 metadata columns
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

The output is a GInteractions-object from the InteractionSet package(Lun, Perry, and Ing-Simmons 2016): For each TSS-enhancer link, both the distance and orientation (upstream/downstream relative to TSS) is calculated in addition to the correlation estimate and p-value. If one were to extract a set of highly correlated links for further analysis, it would be appropriate to correct the p-values for multiple testing, e.g. with the p.adjust(). For now, we are only interested in the top positive correlations, so we subset and sort the links:

# Subset to only positive correlation
cor_links <- subset(all_links, estimate > 0)

# Sort based on correlation
cor_links <- cor_links[order(cor_links$estimate, decreasing = TRUE)]

We can then visualize the correlation patterns across a genomic region, here using the most correlated TSS-enhancer link:

# Extract region around the link of interest
plot_region <- cor_links[1] %>% 
    boundingBox() %>% 
    linearize(1:2) %>%
    add(1000L)

# Cluster track
cluster_track <- RSE %>%
    subsetByOverlaps(plot_region) %>%
    trackClusters(name = "Clusters", 
                  col = NA, 
                  showId = TRUE)
#> Setting thick and thin features...
#> Merging and sorting...
#> Preparing track...


# Link track
link_track <- cor_links %>%
    subsetByOverlaps(plot_region) %>%
    trackLinks(name = "Links",
               interaction.measure = "p.value",
               interaction.dimension.transform = "log",
               col.outside = "grey",
               plot.anchors = FALSE,
               col.interactions = "black")

# Plot tracks together
plotTracks(list(axis_track, 
                link_track,
                cluster_track,
                tx_track),
           from = start(plot_region), 
           to = end(plot_region), 
           chromosome = as.character(seqnames(plot_region)))

Figure 7: Genome browser example of TSS-enhancer link

The top track shows the correlation between 3 TSS candidates around the Atp1b1 gene. The most significant correlation is seen between the upstream TSS candidate and the most distal enhancer candidate.

A word of caution on calculating correlations between TSSs and enhancers in this manner: As we are calculating the correlation of expression across biological replicates from two conditions (Ctrl and Nano), high correlations could also arise from a TSS and enhancer candidate responding in the same direction in response to the treatment. This means that the correlation observed when combining all samples across conditions can be different from the correlation calculated within each condition (This unintuitive phenomenon is known as Simpson’s Paradox). To avoid including such cases, one could analyze each condition separately to find TSS-enhancer links within each state. As an extension of this approach, one could also look at TSS-enhancer links that show different strengths of correlation in different states. Analyses of this type are referred to as differential coexpression analysis.

3.2.5 Finding stretches of enhancers

Several studies have found that groups or stretches of closely spaced enhancers tend to show different chromatin characteristics and functions compared to singleton enhancers. Such groups of enhancers are often referred to as “super enhancers” or “stretch enhancers”(Pott and Lieb 2015).

CAGEfightR can detect such enhancer stretches based on CAGE data. CAGEfightR groups nearby enhancers and calculates the average pairwise correlation between them, shown below (again using Kendall’s tau):

# Subset to only enhancers
Enhancers <- subset(RSE, clusterType == "Enhancer")

# Find stretches within 12.5 kbp
stretches <- findStretches(Enhancers, 
                           inputAssay = "TPM",
                           mergeDist = 12500L,
                           minSize = 5L,
                           method = "kendall")
#> Finding stretches...
#> Calculating correlations...
#> # Stretch summary:
#> Number of stretches: 95
#> Total number of clusters inside stretches: 587 / 9943
#> Minimum clusters: 5
#> Maximum clusters: 15
#> Minimum width: 7147
#> Maximum width: 92531
#> Summary of average pairwise correlations:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#> -0.10038  0.01351  0.08107  0.09097  0.16171  0.37105

Similarly to TSSs and enhancers, we can also annotate stretches based on their relation with known transcripts:

# Annotate transcript models
stretches <- assignTxType(stretches, txModels = txdb)
#> Finding hierachical overlaps...
#> ### Overlap summary: ###
#>       txType count percentage
#> 1   promoter    50       52.6
#> 2   proximal     0        0.0
#> 3    fiveUTR     6        6.3
#> 4   threeUTR     5        5.3
#> 5        CDS     3        3.2
#> 6       exon     2        2.1
#> 7     intron    15       15.8
#> 8  antisense     0        0.0
#> 9 intergenic    14       14.7

# Sort by correlation
stretches <- stretches[order(stretches$aveCor, decreasing = TRUE)]

# Show the results
stretches
#> GRanges object with 95 ranges and 4 metadata columns:
#>                             seqnames              ranges strand |
#>                                <Rle>           <IRanges>  <Rle> |
#>     chr11:98628005-98647506    chr11   98628005-98647506      * |
#>    chr7:139979437-140003112     chr7 139979437-140003112      * |
#>     chr15:31261340-31279984    chr15   31261340-31279984      * |
#>   chr11:117733009-117752208    chr11 117733009-117752208      * |
#>      chr7:97167988-97188451     chr7   97167988-97188451      * |
#>                         ...      ...                 ...    ... .
#>   chr15:101076561-101093429    chr15 101076561-101093429      * |
#>     chr16:91373912-91399202    chr16   91373912-91399202      * |
#>    chr7:132619265-132644381     chr7 132619265-132644381      * |
#>     chr15:79181690-79208915    chr15   79181690-79208915      * |
#>     chr10:94708643-94729408    chr10   94708643-94729408      * |
#>                                         revmap nClusters     aveCor     txType
#>                                  <IntegerList> <integer>  <numeric>   <factor>
#>     chr11:98628005-98647506 6600,6601,6602,...         6   0.371053   promoter
#>    chr7:139979437-140003112 4220,4221,4222,...         5   0.328631   promoter
#>     chr15:31261340-31279984 7962,7963,7964,...         5   0.301604   intron  
#>   chr11:117733009-117752208 6785,6786,6787,...         6   0.284399   promoter
#>      chr7:97167988-97188451 4022,4023,4024,...         6   0.262200   promoter
#>                         ...                ...       ...        ...        ...
#>   chr15:101076561-101093429 8320,8321,8322,...         5 -0.0549688 intergenic
#>     chr16:91373912-91399202 8643,8644,8645,...         7 -0.0598361 fiveUTR   
#>    chr7:132619265-132644381 4160,4161,4162,...         5 -0.0626249 promoter  
#>     chr15:79181690-79208915 8144,8145,8146,...         5 -0.0981772 promoter  
#>     chr10:94708643-94729408 5823,5824,5825,...         5 -0.1003807 intron    
#>   -------
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

The returned GRanges contains the the location, number of enhancers and average correlation for each stretch. Stretches are found in a variety of context, some being intergenic and others spanning various parts of genes. Let us plot one of the top intergenic stretches:

# Extract region around a stretch of enhancers
plot_region <- stretches["chr17:26666593-26675486"] + 1000

# Cluster track
cluster_track <- RSE %>%
    subsetByOverlaps(plot_region) %>%
    trackClusters(name = "Clusters", 
                  col = NA, 
                  showId = TRUE)
#> Setting thick and thin features...
#> Merging and sorting...
#> Preparing track...

# CTSS track
ctss_track <- CTSSs %>%
    subsetByOverlaps(plot_region) %>%
    trackCTSS(name = "CTSSs")
#> Splitting pooled signal by strand...
#> Preparing track...

# Stretch enhancer track
stretch_track <- stretches %>%
    subsetByOverlaps(plot_region) %>%
    AnnotationTrack(name = "Stretches", fill = "hotpink", col = NULL)

# Plot tracks together
plotTracks(list(axis_track, 
                stretch_track,
                cluster_track,
                ctss_track),
           from = start(plot_region), 
           to = end(plot_region), 
           chromosome = as.character(seqnames(plot_region)))

Figure 8: Genome browser example of enhancer stretch

This stretch is composed of at least 5 enhancer candidates, each of which shows bidirectional transcription.

3.3 Part 3: Differential Expression analysis of TSSs, enhancers and genes

3.3.1 Normalization of expression and EDA

Before performing statistical tests for various measures of Differential Expression (DE), it is important to first conduct a thorough Exploratory Data Analysis (EDA) to identify what factors we need to include in the final DE model.

Here we will use DESeq2 (Love, Huber, and Anders 2014) for normalization and EDA since it offers easy to use functions for performing basic analyses. Other popular tools such as edgeR (Robinson, McCarthy, and Smyth 2010) and limma (Ritchie et al. 2015) offer similar functionality, as well as more specialized packages for EDA such as EDASeq.

DESeq2 offers sophisticated normalization and transformation of count data in the form of the variance stabilized transformation: this adds a dynamic pseudo-count to normalized expression values before log transforming to dampen the inherent mean-variance relationship of count data. This is particularly useful for CAGE data, as CAGE can detect even very lowly expressed TSSs and enhancers.

Note: Due to their overall lower expression, enhancer candidate tags make up only a small proportion of the total number of tags. As proper estimation of normalization factors assumes a large number non-DE features, both TSS and enhancer candidates should normally be included in a DE analysis.

First, we fit a “blind” version of the variance stabilizing transformation, since we do not yet know what design is appropriate for this particular study:

# Create DESeq2 object with blank design
dds_blind <- DESeqDataSet(RSE, design = ~ 1)

# Normalize and log transform
vst_blind <- vst(dds_blind, blind = TRUE)

A very useful first representation is a Principal Components Analysis (PCA) plot summarizing variance across the entire experiment as Principle Components (PCs):

plotPCA(vst_blind, "Class")
#> using ntop=500 top features by variance

Figure 9: PCA-plot of variance stabilized expression

We observe that PC2 separates the samples according to the experimental group (Nano vs Ctrl). However, PC1 also separates samples into two groups. This is suggestive of an unwanted yet systematic effect on expression, often referred as a batch effect. Batch effect can arise for a multitude of reasons, e.g. libraries being prepared by different labs or people or using slightly different reagent pools. Often, batch effects co-ocur with the date libraries are prepared, and indeed Bornholdt et al suggests this as the cause of the batch effect in the original study.

We do not want to mistake this unwanted variation for biological variation when we test for DE. To prevent this, we can include the batch effect as a factor in the final DE model. First, we define the batch variable:

# Extract PCA results
pca_res <- plotPCA(vst_blind, "Class", returnData = TRUE)
#> using ntop=500 top features by variance

# Define a new variable using PC1
batch_var <- ifelse(pca_res$PC1 > 0, "A", "B")

# Attach the batch variable as a factor to the experiment
RSE$Batch <- factor(batch_var)

# Show the new design
RSE %>%
    colData() %>%
    subset(select = c(Class, Batch)) %>%
    knitr::kable(caption = "Design matrix after adding new batch covariate.")

Table 4: Design matrix after adding new batch covariate.
	Class	Batch
C547	Ctrl	B
C548	Ctrl	B
C549	Ctrl	B
C559	Ctrl	A
C560	Ctrl	A
N13	Nano	B
N14	Nano	A
N15	Nano	B
N16	Nano	A
N17	Nano	A
N18	Nano	A

As an alternative to manually defining the batch variable, tools such as sva and RUVSeq can be used to estimate unknown batch effects from the data.

3.3.2 Cluster-level differential expression

Following our short EDA above, we are ready to specify the final design for the experiment: We want to take into account both the Class and Batch of samples:

# Specify design
dds <- DESeqDataSet(RSE, design = ~ Batch + Class)

# Fit DESeq2 model
dds <- DESeq(dds)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing

We can now extract estimated effects (log fold changes) and statistical significance (p-values) for the Nano-vs-Ctrl comparison, implicitly correcting for the batch effect:

# Extract results
res <- results(dds,
               contrast = c("Class", "Nano", "Ctrl"),
               alpha = 0.05, 
               independentFiltering = TRUE, 
               tidy = TRUE) %>%
    bind_cols(as.data.frame(rowData(RSE))) %>%
    as_tibble()

# Show the top hits
res %>%
    top_n(n = -10, wt = padj) %>%
    dplyr::select(cluster = row, 
                  clusterType, 
                  txType, 
                  baseMean, 
                  log2FoldChange, 
                  padj) %>%
    knitr::kable(caption = "Top differentially expressed TSS and enhancer candidates")

Table 5: Top differentially expressed TSS and enhancer candidates
cluster	clusterType	txType	baseMean	log2FoldChange
chr1:73977049-73977548;-	TSS	intron	1183.3740	2.838367
chr2:32243097-32243468;-	TSS	promoter	30799.5953	3.741789
chr3:144423689-144423778;-	TSS	promoter	191.0431	3.709530
chr4:125840648-125840820;-	TSS	proximal	1063.4328	3.867574
chr4:137325466-137325712;-	TSS	intron	176.7636	3.912592
chr7:53971039-53971170;-	TSS	promoter	8720.5204	6.696838
chr9:120212846-120213294;+	TSS	promoter	316.0582	2.404706
chr11:83222553-83222887;+	TSS	proximal	228.5560	6.098838
chr12:105649334-105649472;+	TSS	CDS	175.1364	3.345411
chr19:56668148-56668332;+	TSS	CDS	103.8795	-2.254371

It always a good idea to inspect a few diagnostics plots to make sure the DESeq2 analysis was successful. One such example is an MA-plot (another useful plot is the p-value histogram):

ggplot(res, aes(x = log2(baseMean), 
                y = log2FoldChange, 
                color = padj < 0.05)) +
    geom_point(alpha = 0.25) +
    geom_hline(yintercept = 0, 
               linetype = "dashed", 
               alpha = 0.75) +
    facet_grid(clusterType ~ .)

Figure 10: Diagnostic MA plot of the differential expression analysis

We can see that we overall find more DE TSS compared to enhancer candidates, which is expected since TSS candidates are more highly expressed. Many enhancer candidates are filtered away for the final DESeq2 analysis (The “Independent Filtering” Step), as their expression level is too low to detect any DE: This increases power for detecting DE at higher expression levels for the remaining TSS and enhancer candidates.

We can tabulate the total number of DE TSS and enhancer candidates:

table(clusterType = rowRanges(RSE)$clusterType, 
      DE = res$padj < 0.05)
#>            DE
#> clusterType FALSE  TRUE
#>    TSS      22071  6385
#>    Enhancer  3034   199

3.3.3 Correcting expression estimates for batch effects

In addition to looking at estimates and significance for each cluster, we might also want to look at individual expression values for some top hits. However, we then need to also correct the expression estimates themselves for batch effects, just like we did for log fold changes and p-values (using the same model of course).

Here we use ComBat(Johnson, Li, and Rabinovic 2007) from the sva package which is suitable for removing simple batch effects from small experiments. For more advanced setups, removeBatchEffect from limma can remove arbitrarily complex batch effects. The RUVSeq package and fsva from sva can be used to remove unknown batch effects.

We again use the variance stabilizing transformation to prepare the data for ComBat (this makes count data resemble expression estimates obtained from microarrays, as ComBat was originally developed for microarrays):

# Guided / non-blind  variance stabilizing transformation
vst_guided <- varianceStabilizingTransformation(dds, blind = FALSE)

To run ComBat we need two additional pieces of information: i) A design matrix describing the biological or wanted effects and ii) the known but unwanted batch effect. We first specify the design matrix, and then run ComBat:

# Design matrix of wanted effects
bio_effects <- model.matrix(~ Class, data = colData(RSE))

# Run ComBat
assay(RSE, "ComBat") <- ComBat(dat = assay(vst_guided),
                                    batch = RSE$Batch,
                                    mod = bio_effects)
#> Found 253 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
#> Found2batches
#> Adjusting for1covariate(s) or covariate level(s)
#> Standardizing Data across genes
#> Fitting L/S model and finding priors
#> Finding parametric adjustments
#> Adjusting the Data

We can redo the PCA-plot, to see the global effect of the batch effect correction:

# Overwrite assay 
assay(vst_guided) <- assay(RSE, "ComBat")

# Plot as before
plotPCA(vst_guided, "Class")
#> using ntop=500 top features by variance

Figure 11: PCA-plot of batch corrected expression

Now Nano and Ctrl are separated along PC1 (compared to PC2 before correction). As PC1 captures the most variance, this indicates that the batch effect has been removed and that the experimental group is now the main contributor to variance of expression.

Then we extract the top 10 DE enhancer candidates using the following tidyverse code:

# Find top 10 DE enhancers
top10 <- res %>%
    filter(clusterType == "Enhancer", padj < 0.05) %>%
    group_by(log2FoldChange >= 0) %>%
    top_n(n = 5, wt = abs(log2FoldChange)) %>%
    pull(row)

# Extract expression values in tidy format
tidyEnhancers <- assay(RSE, "ComBat")[top10, ] %>%
    t() %>%
    as_tibble(rownames = "Sample") %>%
    mutate(Class = RSE$Class) %>%
    gather(key = "Enhancer", 
           value = "Expression", 
           -Sample, -Class, 
           factor_key = TRUE)

Finally, we can plot the batch-corrected expression of each top enhancer candidate:

ggplot(tidyEnhancers, aes(x = Class, 
                          y = Expression, 
                          fill = Class)) +
    geom_dotplot(stackdir = "center", 
                 binaxis = "y", 
                 dotsize = 3) +
    facet_wrap(~ Enhancer, 
               ncol = 2, 
               scales = "free_y")
#> Bin width defaults to 1/30 of the range of the data. Pick better value with
#> `binwidth`.

Figure 12: Expression profile of top 10 differentially expressed enhancer candidates

3.3.4 Enrichment of DNA-binding motifs

A typical question following identification of differentially expressed TSS and enhancer candidates, is what Transcription Factors (TFs) might be involved in their regulation. To shed light on this question we can annotate TSSs and enhancers with DNA-binding motifs from the JASPAR database(Mathelier et al. 2016).

First we extract the sequences around TSSs and enhancers. Here we simply define it as +/- 500 bp around TSS candidate peak or enhancer candidate midpoint:

cluster_seqs <- RSE %>% 
    rowRanges() %>%
    swapRanges() %>%
    unstrand() %>%
    add(500) %>% 
    getSeq(bsg, names = .)

Secondly, we use the TFBSTools(Tan and Lenhard 2016) package to obtain motifs as Position Frequency Matrices (PFMs) from the JASPAR2016 database:

# Extract motifs as PFMs
motif_pfms <- getMatrixSet(JASPAR2016, opts = list(species = "10090"))

# Look at the IDs and names of the first few motifs:
head(name(motif_pfms))
#>    MA0004.1    MA0006.1    MA0029.1    MA0063.1    MA0067.1    MA0078.1 
#>      "Arnt" "Ahr::Arnt"     "Mecom"    "Nkx2-5"      "Pax2"     "Sox17"

Thirdly, we use the motifmatchr package (Schep 2018) to find hits in the promoter sequences:

# Find matches
motif_hits <- matchMotifs(motif_pfms, subject = cluster_seqs)

# Matches are returned as a sparse matrix:
motifMatches(motif_hits)[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "lgCMatrix"
#>      MA0004.1 MA0006.1 MA0029.1 MA0063.1 MA0067.1
#> [1,]        .        .        .        .        |
#> [2,]        .        .        .        .        .
#> [3,]        |        .        |        .        .
#> [4,]        .        .        .        .        .
#> [5,]        .        |        .        |        .

Finally, we can do a simple Fisher’s Exact test to see if a motif co-occurs more with DE TSS and enhancer candidates than we would expect be chance. Here we will look at the FOS::JUN motif (MA0099.2):

# 2x2 table for fishers
table(FOSJUN = motifMatches(motif_hits)[,"MA0099.2"],
      DE = res$padj < 0.05) %>%
    print() %>%
    fisher.test()
#>        DE
#> FOSJUN  FALSE  TRUE
#>   FALSE 22144  5596
#>   TRUE   2961   988
#> 
#>  Fisher's Exact Test for Count Data
#> 
#> data:  .
#> p-value = 5.839e-12
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  1.220330 1.427821
#> sample estimates:
#> odds ratio 
#>   1.320361

A significant odds ratio above 1 indicate that FOS::JUN is a candidate TF (or, more technically correct, a candidate TF dimer) in regulation of the nanotube response. This is not surprising given that FOS::JUN is part of the TNF-alpha inflammatory pathway (see more below). A similar motif was also found by Bornholdt et al in the original study.

Of course, this is a just a very quick and simple analysis of motif enrichment. One could easily have used different regions around TSS and enhancer candidates and/or split the enrichment analysis between TSSs and enhancers. Other Bioconductor packages like PWMEnrich, rGADEM and motifcounter implement more advanced statistical methods for calculating enrichment of known motifs. rGADEM, BCRANK and motifRG can also be used to calculate enrichment of novel motifs, sometimes referred to as motif discovery.

3.3.5 Gene-level differential expression

While CAGE data is naturally analyzed at the level of clusters (TSS and enhancer candidates) it is in many cases interesting to also look at gene-level expression estimates. A prime example of this is looking at enrichment of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms (Ashburner et al. 2000; The Gene Ontology Consortium 2019; Kanehisa and Goto 2000) which are only defined at gene-level. CAGEfightR includes functions for annotating clusters with gene models and summarizing expression to gene-level.

We can annotate clusters with gene IDs in the same manner as transcript IDs:

RSE <- assignGeneID(RSE, geneModels = txdb)
#> Extracting genes...
#>   84 genes were dropped because they have exons located on both strands
#>   of the same reference sequence or on more than one reference sequence,
#>   so cannot be represented by a single genomic range.
#>   Use 'single.strand.genes.only=FALSE' to get all the genes in a
#>   GRangesList object, or use suppressMessages() to suppress this message.
#> Overlapping while taking distance to nearest TSS into account...
#> Finding hierachical overlaps...
#> ### Overlap Summary: ###
#> Features overlapping genes: 81.34 %
#> Number of unique genes: 13760

And then use CAGEfightR to sum counts of TSS candidates within genes:

GSE <- RSE %>%
    subset(clusterType == "TSS") %>%
    quantifyGenes(genes = "geneID", inputAssay = "counts")
#> Quantifying genes: 13567

The result is a RangedSummarizedExperiment where the ranges are a GRangesList holding the TSS candidates that were summed within each gene:

rowRanges(GSE["100038347",])
#> GRangesList object of length 1:
#> $`100038347`
#> GRanges object with 2 ranges and 9 metadata columns:
#>                            seqnames            ranges strand |     score
#>                               <Rle>         <IRanges>  <Rle> | <numeric>
#>   chr7:80884953-80885056;+     chr7 80884953-80885056      + |   11.0585
#>   chr7:80885120-80885677;+     chr7 80885120-80885677      + | 1162.3447
#>                                thick   support        txID   txType   balance
#>                            <IRanges> <integer> <character> <factor> <numeric>
#>   chr7:80884953-80885056;+  80885000         5  uc009hrf.2 proximal        NA
#>   chr7:80885120-80885677;+  80885256        11  uc009hrf.2 promoter        NA
#>                            bidirectionality clusterType      geneID
#>                                   <numeric>    <factor> <character>
#>   chr7:80884953-80885056;+               NA         TSS   100038347
#>   chr7:80885120-80885677;+               NA         TSS   100038347
#>   -------
#>   seqinfo: 35 sequences (1 circular) from mm9 genome

The gene IDs in this case are Entrez IDs (which are widely used by Bioconductor packages). We can translate these systematic IDs into more human-readable symbols using the org.Mm.eg.db annotation package:

# Translate symbols
rowData(GSE)$symbol <- mapIds(odb, 
                              keys = rownames(GSE), 
                              column = "SYMBOL", 
                              keytype = "ENTREZID")
#> 'select()' returned 1:1 mapping between keys and columns

Having obtained a gene-level count matrix we can now perform gene-level DE analysis. Here we use limma-voom, since limma makes it easy to perform a subsequent enrichment analysis. Other tools such as DESeq2 (above) or edgeR (see below) could also have been used.

Note: limma is a powerful tool for DE analysis of count-based data. However, since it depends on log transforming counts, it is not always suitable for analyzing datasets where features have very low counts. This is usually not a problem for gene-level analysis, but can be a problem for enhancers, which are generally very lowly expressed.

Similarly to the DESeq2 analysis above, we first build the necessary object and then normalize the expression values:

# Create DGElist object
dge <- DGEList(counts = assay(GSE, "counts"),
               genes = as.data.frame(rowData(GSE)))

# Calculate normalization factors
dge <- calcNormFactors(dge)

Then we apply the voom-transformation to model the mean-variance trend, for which we also need to specify the design matrix (in this case the design must contain both wanted and unwanted effects!). The same design matrix is then used for fitting the gene-wise models:

# Design matrix
mod <- model.matrix(~ Batch + Class, data = colData(GSE))

# Model mean-variance using voom
v <- voom(dge, design = mod)

# Fit and shrink DE model
fit <- lmFit(v, design = mod)
eb <- eBayes(fit, robust = TRUE)

# Summarize the results
dt <- decideTests(eb)

We can then report both an overall summary of differential gene expression, and look at the first few top hits:

# Global summary
dt %>% 
    summary() %>% 
    knitr::kable(caption = "Global summary of differentially expressed genes.")

Table 6: Global summary of differentially expressed genes.
	(Intercept)	BatchB	ClassNano
Down	51	2572	1505
NotSig	463	8278	10373
Up	13053	2717	1689

# Inspect top htis
topTable(eb, coef = "ClassNano") %>%
    dplyr::select(symbol, nClusters, AveExpr, logFC, adj.P.Val) %>%
    knitr::kable(caption = "Top differentially expressed genes.")

Table 7: Top differentially expressed genes.
	symbol	nClusters	AveExpr	logFC	adj.P.Val
66938	Sh3d21	3	5.871004	3.075745	0.0e+00
245049	Myrip	2	4.371325	2.414055	7.0e-07
12722	Clca3a1	1	3.020528	3.692198	7.0e-07
382864	Colq	3	2.770158	-3.426911	1.1e-06
20716	Serpina3n	5	6.384175	1.872782	3.0e-06
72275	2200002D01Rik	2	7.208031	1.693257	5.5e-06
381813	Prmt8	4	4.553612	1.409006	5.8e-06
170706	Tmem37	2	5.503908	1.679690	5.8e-06
18654	Pgf	1	4.862055	2.337045	5.8e-06
20361	Sema7a	1	7.612236	1.473680	5.9e-06

3.3.6 Enrichment of GO- and KEGG-terms

In addition to looking at individual top genes, we can look at how the DE genes relate to known databases of gene function to gain insight in what biological processes might be affected in the experiment.

limma makes it easy to perform such an enrichment analysis following a DE analysis. As we have genes indexed by Entrez IDs, we can directly use goana to find enriched GO-terms: goana uses a biased urn-model to estimate enrichment of GO-terms, while taking into account the expression levels of DE genes:

# Find enriched GO-terms
GO <- goana(eb, coef = "ClassNano", species = "Mm", trend = TRUE)

# Show top hits
topGO(GO, ontology = "BP", number = 10) %>%
    knitr::kable(caption = "Top enriched or depleted GO-terms.")

Table 8: Top enriched or depleted GO-terms.
	Term	Ont	N	Up	Down	P.Down
GO:0042221	response to chemical	BP	2910	502	292	0.9936890
GO:0006952	defense response	BP	1405	278	131	0.9931316
GO:0009605	response to external stimulus	BP	1898	351	204	0.7855123
GO:0006954	inflammatory response	BP	697	159	65	0.9592339
GO:0002376	immune system process	BP	2112	376	193	0.9997049
GO:0006955	immune response	BP	1378	265	122	0.9987402
GO:0050900	leukocyte migration	BP	337	94	26	0.9914427
GO:0097529	myeloid leukocyte migration	BP	204	67	16	0.9692137
GO:0006950	response to stress	BP	3221	528	289	0.9999989
GO:0043207	response to external biotic stimulus	BP	1279	248	126	0.9578744

And similarly for KEGG terms we can use kegga:

# Find enriched KEGG-terms
KEGG <- kegga(eb, coef = "ClassNano", species = "Mm", trend = TRUE)

# Show top hits
topKEGG(KEGG, number = 10) %>%
    knitr::kable(caption = "Top enriched or depleted KEGG-terms.")

Table 9: Top enriched or depleted KEGG-terms.
	Pathway	N	Up	Down	P.Up	P.Down
mmu04060	Cytokine-cytokine receptor interaction	175	56	13	0.0000000	0.9619628
mmu05171	Coronavirus disease - COVID-19	195	59	4	0.0000000	0.9999998
mmu04061	Viral protein interaction with cytokine and cytokine receptor	64	27	5	0.0000000	0.8594840
mmu04820	Cytoskeleton in muscle cells	176	30	43	0.0746273	0.0000015
mmu04668	TNF signaling pathway	113	33	9	0.0000023	0.9038229
mmu00980	Metabolism of xenobiotics by cytochrome P450	48	3	17	0.9581498	0.0000136
mmu03010	Ribosome	122	32	2	0.0000227	0.9999900
mmu04064	NF-kappa B signaling pathway	90	26	5	0.0000229	0.9764089
mmu00600	Sphingolipid metabolism	47	17	3	0.0000498	0.9235392
mmu04657	IL-17 signaling pathway	74	22	2	0.0000806	0.9985557

Both analyses indicate that genes related to the inflammatory response and defense response are upregulated following nanotube exposure (similar enrichments were seen by Bornholdt et al in the original study). This supports the hypothesis that nanotubes induce a response similar to asbestos.

KEGG-terms represent well defined pathways. We can use the pathview package(Luo and Brouwer 2013) to investigate in more detail the genes in a given enriched pathway. For example, we can look at regulation of genes in the TNF signalling pathway:

# Format DE genes for pathview
DE_genes <- dt[, "ClassNano"] %>%
    as.integer() %>%
    set_names(rownames(dt)) %>%
    Filter(f=function(x) x != 0, x=.)

# Run pathview; this will save a png file to a temporary directory
pathview(DE_genes, 
         species = "mmu", 
         pathway.id = "mmu04668", 
         kegg.dir = tempdir())

# Show the png file
grid.newpage()
grid.raster(png::readPNG("mmu04668.pathview.png"))

Figure 13: Detailed view of differentially expressed genes in the KEGG TNF signalling pathway

3.3.7 Differential TSS Usage

In the above two analyses we looked at whether an individual TSS candidate or an individual gene was changing expression between experimental conditions However, we might also want to look at whether a gene shows Differential TSS Usage (DTU): Whether a gene uses different TSS candidates under different conditions. This problem is similar to differential splicing in RNA-Seq, but looking at TSSs rather than transcripts/isoforms(Soneson, Love, and Robinson 2015). Here we will use the edgeR diffSpliceDGE method to test for DTU, although many other packages could have been used, for example diffSplice from limma, DEXSeq, DRIMSeq, etc..

Intuitively, diffSpliceDGE tests whether a given TSS candidate shows the same change as other TSS candidates in the same gene, indicating that TSS candidates are differentially regulated across the gene. This does however not take into account the relative composition of a given TSS candidate, e.g. whether the contribution of a TSS candidate increases from 1%-2% of total gene expression or 25%-50%. A useful preprocessing step is therefore to filter out TSS candidates making only a small contribution to total gene expression before analysis.

We use CAGEfightR to remove TSS candidates that are not expressed as more than 10% of total gene expression in more than 5 samples (we first remove TSS candidates not assigned to genes):

# Filter away lowly expressed
RSE_filtered <- RSE %>%
    subset(clusterType == "TSS" & !is.na(geneID)) %>%
    subsetByComposition(inputAssay = "counts", 
                        genes = "geneID", 
                        unexpressed = 0.1, 
                        minSamples = 5L)
#> Calculating composition...
#> Subsetting...
#> Removed 8001 out of 24500 regions (32.7%)

We can only test for DTU in genes with multiple TSSs. A useful first visualization is therefore to see how many genes use more than one TSS candidate:

RSE_filtered %>% 
    rowData() %>% 
    as.data.frame() %>% 
    as_tibble() %>% 
    dplyr::count(geneID) %>% 
    ggplot(aes(x = n, fill = n >= 2)) + 
    geom_bar(alpha = 0.75) +
    scale_fill_colorblind("Multi-TSS") +
    labs(x = "Number of TSSs per gene", y = "Frequency")

Figure 14: Overview of alternative TSS usage within genes

While most genes utilize only a single TSS candidate, many genes use two or more TSS candidates.

Again, we start by building the necessary R-objects for running edgeR:

# Annotate with symbols like before:
rowData(RSE_filtered)$symbol <- mapIds(odb, 
                                       keys = rowData(RSE_filtered)$geneID,
                                       column = "SYMBOL", 
                                       keytype = "ENTREZID")
#> 'select()' returned 1:1 mapping between keys and columns

# Extract gene info
TSS_info <- RSE_filtered %>%
    rowData() %>%
    subset(select = c(score, txType, geneID, symbol)) %>%
    as.data.frame()

# Build DGEList
dge <- DGEList(counts = assay(RSE_filtered, "counts"),
               genes = TSS_info)

Then we normalize and fit models using the Quasi-likelihood approach, including the diffSpliceDGE step:

# Estimate normalization factors
dge <- calcNormFactors(dge)

# Estimate dispersion and fit GLMs
disp <- estimateDisp(dge, design = mod, tagwise = FALSE)
QLfit <- glmQLFit(disp, design = mod, robust = TRUE)

# Apply diffSpliceDGE
ds <- diffSpliceDGE(QLfit, coef = "ClassNano", geneid = "geneID")
#> Total number of exons:  16499 
#> Total number of genes:  13563 
#> Number of genes with 1 exon:  11098 
#> Mean number of exons in a gene:  1 
#> Max number of exons in a gene:  5

We can look at DTU at two-levels: Whether an individual TSS candidate shows DTU (TSS-level) or whether a individual gene shows DTU in any way (gene-level). First, let us look at individual TSS candidate (TSS-level DTU):

dtu_TSSs <- topSpliceDGE(ds, test = "exon")
dplyr::select(dtu_TSSs, txType, geneID, symbol, logFC, FDR) %>%
    knitr::kable(caption = "Top differentially used TSSs")

Table 10: Top differentially used TSSs
	txType	geneID	symbol	logFC	FDR
chr17:13840650-13840851;-	intron	21646	Tcte2	1.787631	0e+00
chr10:57857044-57857314;+	promoter	110829	Lims1	-1.065344	0e+00
chr17:33966135-33966308;+	intron	66416	Ndufa7	2.197871	0e+00
chr4:141154044-141154185;-	intron	74202	Fblim1	1.701498	0e+00
chr14:70215678-70215876;-	intron	246710	Rhobtb2	2.494688	0e+00
chr15:76428030-76428201;-	intron	94230	Cpsf1	1.457420	0e+00
chr19:57271818-57272125;-	promoter	226251	Ablim1	1.145654	0e+00
chr11:116395161-116395462;+	proximal	20698	Sphk1	1.744934	3e-07
chr1:94960234-94960399;-	intron	16560	Kif1a	2.964803	3e-07
chr17:72131915-72132185;+	intron	78785	Clip4	-1.681990	3e-07

The interpretation of log fold changes here is slightly different from before: These log fold changes are relative to the overall log fold change for all TSS candidates in that gene.

Then we can look at results for each gene (gene-level DTU):

dtu_genes <- topSpliceDGE(ds, test = "Simes")
dplyr::select(dtu_genes, geneID, symbol, NExons, FDR) %>%
    knitr::kable(row.names = FALSE, 
          caption = "Top genes showing any differential TSS usage.")

Table 11: Top genes showing any differential TSS usage.
geneID	symbol	NExons	FDR
21646	Tcte2	4	0e+00
110829	Lims1	3	0e+00
66416	Ndufa7	3	0e+00
74202	Fblim1	3	0e+00
246710	Rhobtb2	3	0e+00
94230	Cpsf1	2	0e+00
226251	Ablim1	3	0e+00
16560	Kif1a	2	2e-07
78785	Clip4	2	2e-07
68801	Elovl5	2	2e-07

We see that the two lists agree, which is not surprising given that the gene-level results are obtained by aggregating TSS-level p-values across genes.

We can look at closer at the TSS usage in on of the top hits: We can visualize the batch-corrected expression (See above) of each TSS candidate in the Fblim1 gene via a heatmap:

RSE_filtered %>%
    subset(geneID == "74202") %>%
    assay("ComBat") %>%
    t() %>%
    pheatmap(color = magma(100), 
             cluster_cols = FALSE,
             main="Fblim1")

Figure 15: Heatmap showing expression of TSSs within Fblim1

Fblim1 has 3 TSS candidates: 2 are used in the Ctrl samples, while the Nano samples also use the chr4:141154044-141154185;- TSS (as also seen in the TSS-level table above). While a heatmap is useful for seeing expression changes, a genome browser view is better to inspect the genomic context of each TSS candidate:

# Extract region at Fblim1
plot_region <- RSE_filtered %>%
    rowRanges() %>%
    subset(geneID == "74202") %>%
    GenomicRanges::reduce(min.gapwidth = 1e6L) %>%
    unstrand() %>%
    add(5e3L)

# Cluster track
cluster_track <- RSE_filtered %>%
    subsetByOverlaps(plot_region) %>%
    trackClusters(name = "Clusters", col = NA, showId = TRUE)
#> Setting thick and thin features...
#> Merging and sorting...
#> Preparing track...

# CTSS tracks for each group
ctrl_track <- CTSSs %>%
    subset(select = Class == "Ctrl") %>%
    calcPooled() %>%
    subsetByOverlaps(plot_region) %>%
    trackCTSS(name = "Ctrl")
#> Warning in calcPooled(.): object already has a column named score in rowData:
#> It will be overwritten!
#> Splitting pooled signal by strand...
#> Preparing track...

nano_track <- CTSSs %>%
    subset(select = Class == "Nano") %>%
    calcPooled() %>%
    subsetByOverlaps(plot_region) %>%
    trackCTSS(name = "Nano")
#> Warning in calcPooled(.): object already has a column named score in rowData:
#> It will be overwritten!
#> Splitting pooled signal by strand...
#> Preparing track...

# Plot tracks together
plotTracks(list(axis_track,
                tx_track,
                cluster_track,
                Ctrl=ctrl_track,
                nano_track),
           from = start(plot_region),
           to = end(plot_region),
           chromosome = as.character(seqnames(plot_region)))

Figure 16: Genome-browser example of differential TSS usage within Fblim1

The Fblim1 gene uses two annotated TSS candidates, but the Nano samples also uses a novel intronic TSS candidate.

4 Discussion

This workflow is intended to provide an outline of the basic building blocks of CAGE data analysis, going from clustering over spatial analyses to differential expression. More advanced analyses can be strung together from these basic elements: Finding enhancers linked to differentially expressed TSSs, enhancer stretches composed of differentially expressed enhancers, comparing DNA binding motif enrichments between enhancers and TSSs, differential coexpression, etc.

One aspect not covered in this workflow is the utility of CAGE data (and 5’-end data in general) in providing accurate TSSs for studying other types of data. For example, having accurate TSSs is highly beneficial in analyzing chromatin genomics data such as ChIP-Seq for modified histones, since the location of nucleosomes and TSSs are closely related (Andersson et al. 2014; Duttke et al. 2015; Thodberg et al. 2018). CAGE can be combined with chromatin confirmation assays such as HiC to find new enhancers that are both co-expressed and physically interacting with TSSs. Many genome-wide association studies are finding that disease-related genetic variants are found in intergenic regions, that are often poorly annotated. The accurate enhancer locations provided by CAGE can greatly aid interpretation of such variants (Boyd et al. 2018). The adherence of CAGEfightR to standard Bioconductor classes facilitates these inter-assay analyses by making it easy to mix-and-match multiple packages developed for different experimental assays.

5 Author information

MS and AS conceived the project and wrote the paper.

6 Competing interests

The authors develop and maintain the CAGEfightR Bioconductor package.

7 Grant information

Work in the Sandelin Lab was supported by the Novo Nordisk Foundation, Lundbeck foundation, Danish Innovation Fund, Danish Cancer Society and Independent Research Fund Denmark.

8 Acknowledgments

We acknowledge all members of the Sandelin Lab and Andersson Lab for advice, discussion and input on all aspects related to CAGE data analysis.

Adiconis, Xian, Adam L. Haber, Sean K. Simmons, Ami Levy Moonshine, Zhe Ji, Michele A. Busby, Xi Shi, et al. 2018. “Comprehensive comparative analysis of 5’-end RNA-sequencing methods.” Nature Methods 15 (7): 505–11. https://doi.org/10.1038/s41592-018-0014-2.

Andersson, Robin, Claudia Gebhard, Irene Miguel-Escalada, Ilka Hoof, Jette Bornholdt, Mette Boyd, Yun Chen, et al. 2014. “An atlas of active enhancers across human cell types and tissues.” Nature 507 (7493): 455–61. https://doi.org/10.1038/nature12787.

Ashburner, M, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, et al. 2000. “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.” Nature Genetics 25 (1): 25–29. https://doi.org/10.1038/75556.

Bhardwaj, Vivek. 2019. “icetea: Integrating Cap Enrichment with Transcript Expression Analysis.” https://doi.org/https://doi.org/doi:10.18129/B9.bioc.icetea.

Bornholdt, Jette, Anne Thoustrup Saber, Berit Lilje, Mette Boyd, Mette Jørgensen, Yun Chen, Morana Vitezic, et al. 2017. “Identification of Gene Transcription Start Sites and Enhancers Responding to Pulmonary Carbon Nanotube Exposure in Vivo.” ACS Nano 11 (4): 3597–3613. https://doi.org/10.1021/acsnano.6b07533.

Boyd, Mette, Malte Thodberg, Morana Vitezic, Jette Bornholdt, Kristoffer Vitting-Seerup, Yun Chen, Mehmet Coskun, et al. 2018. “Characterization of the enhancer and promoter landscape of inflammatory bowel disease from human colon biopsies.” Nature Communications 9 (1): 1661. https://doi.org/10.1038/s41467-018-03766-z.

Carninci, Piero, Albin Sandelin, Boris Lenhard, Shintaro Katayama, Kazuro Shimokawa, Jasmina Ponjavic, Colin A M Semple, et al. 2006. “Genome-wide analysis of mammalian promoter architecture and evolution.” Nature Genetics 38 (6): 626–35. https://doi.org/10.1038/ng1789.

Consortium, The Fantom, Riken Pmi, and Clst Dgt. 2014. “A promoter-level mammalian expression atlas.” Nature 507 (7493): 462–70. https://doi.org/10.1038/nature13182.

Duttke, Sascha H. C., Scott A. Lacadie, Mahmoud M. Ibrahim, Christopher K. Glass, David L. Corcoran, Christopher Benner, Sven Heinz, James T. Kadonaga, and Uwe Ohler. 2015. “Human promoters are intrinsically directional.” Molecular Cell 57 (4): 674–84. https://doi.org/10.1016/j.molcel.2014.12.029.

Frith, Martin C, Eivind Valen, Anders Krogh, Yoshihide Hayashizaki, Piero Carninci, and Albin Sandelin. 2008. “A code for transcription initiation in mammalian genomes.” Genome Research 18 (1): 1–12. https://doi.org/10.1101/gr.6831208.

Haberle, Vanja, and Alexander Stark. 2018. “Eukaryotic core promoters and the functional basis of transcription initiation.” Nature Reviews Molecular Cell Biology 19 (10): 621–37. https://doi.org/10.1038/s41580-018-0028-8.

Haberle, V., a. R. R. Forrest, Y. Hayashizaki, P. Carninci, and B. Lenhard. 2015. “CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses.” Nucleic Acids Research, 1–11. https://doi.org/10.1093/nar/gkv054.

Hahne, Florian, and Robert Ivanek. 2016. “Visualizing Genomic Data Using Gviz and Bioconductor.” In Statistical Genomics: Methods and Protocols, edited by Ewy Mathé and Sean Davis, 335–51. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4939-3578-9_16.

Hon, Chung-Chau, Jordan A. Ramilowski, Jayson Harshbarger, Nicolas Bertin, Owen J L Rackham, Julian Gough, Elena Denisenko, et al. 2017. “An atlas of human long non-coding RNAs with accurate 5’ ends.” Nature 543 (7644): 199–204. https://doi.org/10.1038/nature21374.

Huber, Wolfgang, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nature Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.

Johnson, W Evan, Cheng Li, and Ariel Rabinovic. 2007. “Adjusting batch effects in microarray expression data using empirical Bayes methods.” Biostatistics (Oxford, England) 8 (1): 118–27. https://doi.org/10.1093/biostatistics/kxj037.

Kadonaga, James T. 2012. “Perspectives on the RNA polymerase II core promoter.” Wiley Interdisciplinary Reviews. Developmental Biology 1 (1): 40–51. https://doi.org/10.1002/wdev.21.

Kanehisa, M, and S Goto. 2000. “KEGG: kyoto encyclopedia of genes and genomes.” Nucleic Acids Research 28 (1): 27–30. https://doi.org/10.1016/j.meegid.2016.07.022.

Kawaji, Hideya, Marina Lizio, Masayoshi Itoh, Mutsumi Kanamori-Katayama, Ai Kaiho, Hiromi Nishiyori-Sueki, Jay W. Shin, et al. 2014. “Comparison of CAGE and RNA-seq transcriptome profiling using clonally amplified and single-molecule next-generation sequencing.” Genome Research 24: 708–17. https://doi.org/10.1101/gr.156232.113.

Kim, Tae-Kyung, Martin Hemberg, Jesse M. Gray, Allen M. Costa, Daniel M. Bear, Jing Wu, David A. Harmin, et al. 2010. “Widespread transcription at neuronal activity-regulated enhancers.” Nature 465 (7295): 182–7. https://doi.org/10.1038/nature09033.

Lawrence, Michael, Wolfgang Huber, Herve Pages, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T. Morgan, and Vincent J. Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Computational Biology 9 (8): 1–10. https://doi.org/10.1371/journal.pcbi.1003118.

Lenhard, Boris, Albin Sandelin, and Piero Carninci. 2012. “Metazoan promoters: emerging characteristics and insights into transcriptional regulation.” Nature Reviews. Genetics 13 (4): 233–45. https://doi.org/10.1038/nrg3163.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 1–21. https://doi.org/10.1186/s13059-014-0550-8.

Lun, Aaron T L, Malcolm Perry, and Elizabeth Ing-Simmons. 2016. “Infrastructure for genomic interactions: Bioconductor classes for Hi-C, ChIA-PET and related experiments.” F1000Research 5 (0): 950. https://doi.org/10.12688/f1000research.8759.2.

Luo, Weijun, and Cory Brouwer. 2013. “Pathview: An R/Bioconductor package for pathway-based data integration and visualization.” Bioinformatics 29 (14): 1830–1. https://doi.org/10.1093/bioinformatics/btt285.

Mathelier, Anthony, Oriol Fornes, David J. Arenillas, Chih Yu Chen, Grégoire Denay, Jessica Lee, Wenqiang Shi, et al. 2016. “JASPAR 2016: A major expansion and update of the open-access database of transcription factor binding profiles.” Nucleic Acids Research 44 (D1): D110–D115. https://doi.org/10.1093/nar/gkv1176.

Pott, Sebastian, and Jason D. Lieb. 2015. “What are super-enhancers?” Nature Genetics 47 (1): 8–12. https://doi.org/10.1038/ng.3167.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics (Oxford, England) 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Sandelin, Albin, Piero Carninci, Boris Lenhard, Jasmina Ponjavic, Yoshihide Hayashizaki, and David A. Hume. 2007. “Mammalian RNA polymerase II core promoters: Insights from genome-wide studies.” Nature Reviews Genetics 8 (6): 424–36. https://doi.org/10.1038/nrg2026.

Schep, Alicia. 2018. “motifmatchr: Fast Motif Matching in R.” https://doi.org/https://doi.org/doi:10.18129/B9.bioc.motifmatchr.

Schneider, T D, and R M Stephens. 1990. “Sequence logos: a new way to display consensus sequences.” Nucleic Acids Research 18 (20): 6097–6100. https://doi.org/10.1007/978-3-319-16958-3_6.

Smale, Stephen T., and James T. Kadonaga. 2003. “The RNA Polymerase II Core Promoter.” Annual Review of Biochemistry 72 (1): 449–79. https://doi.org/10.1146/annurev.biochem.72.121801.161520.

Soneson, Charlotte, Michael I. Love, and Mark D. Robinson. 2015. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4: 1521. https://doi.org/10.12688/f1000research.7563.2.

Takahashi, Hazuki, Sachi Kato, Mitsuyoshi Murata, and Piero Carninci. 2012. “CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks.” Methods in Molecular Biology (Clifton, N.J.) 786 (3 C): 181–200. https://doi.org/10.1007/978-1-61779-292-2_11.

Tan, Ge, and Boris Lenhard. 2016. “TFBSTools: An R/bioconductor package for transcription factor binding site analysis.” Bioinformatics 32 (10): 1555–6. https://doi.org/10.1093/bioinformatics/btw024.

Taylor Raborn, R., VP. Brendel, and K. Sridharan. n.d. “TSRchitect: Promoter identification from large-scale TSS profiling data.” https://doi.org/10.18129/B9.bioc.TSRchitect.

The Gene Ontology Consortium. 2019. “The Gene Ontology Resource: 20 years and still GOing strong.” Nucleic Acids Research 47 (D1): D330–D338. https://doi.org/10.1093/nar/gky1055.

Thodberg, Malte, Axel Thieffry, Jette Bornholdt, Mette Boyd, Christian Holmberg, Ajuna Azad, Christopher T Workman, et al. 2018. “Comprehensive profiling of the fission yeast transcription start site activity during stress and media response.” Nucleic Acids Research, December, 1–21. https://doi.org/10.1093/nar/gky1227.

Thodberg, Malte, Axel Thieffry, Kristoffer Vitting-Seerup, Robin Andersson, and Albin Sandelin. 2019. “CAGEfightR: analysis of 5’-end data using R/Bioconductor.” BMC Bioinformatics 20 (1): 487. https://doi.org/10.1186/s12859-019-3029-5.

Wagih, Omar. 2017. “Ggseqlogo: A versatile R package for drawing sequence logos.” Bioinformatics 33 (22): 3645–7. https://doi.org/10.1093/bioinformatics/btx469.