G4SNVHunter 0.99.4
G-quadruplexes (G4s) are non-canonical secondary structures that can form in guanine-rich DNA or RNA regions (Varshney et al. 2020). These structures are significantly enriched in functional regions like gene promoters (Huppert and Balasubramanian 2007) and telomeres (Phan and Mergny 2002), and their regulatory roles are closely linked to their structures rather than their sequences (Zhang et al. 2024).
Single nucleotide variants (SNVs) are a common type of genetic variation. If such a variant occurs within a G4 region, it may affect the formation of that G4 structure.
G4SNVHunter
is designed to evaluate the potential impact of SNVs on G4
formation, but it can also be used to assess single nucleotide polymorphisms
(SNPs). This package leverages the highly effective G4 structure prediction
algorithm, G4Hunter (Bedrat, Lacroix, and Mergny 2016), which was conceptualized by Mergny et al.
(Bedrat, Lacroix, and Mergny 2016). One of G4Hunter’s notable advantages is that it will evaluate
the propensity of G4 structures within a specified window, effectively taking
into account the influence of flanking sequences on G4 formation, not just the
G4 sequence itself.
G4SNVHunter
only requires users to provide genomic sequence data
(DNAStringSet
object) and substitution information of variants
(GRanges
object) to quickly determine which G4 structures may be affected by
SNVs (or SNPs). We simplify the analysis process by integrating complex
functionality into a simple function that can be used even by those
inexperienced in G4 research or basic programming skills. Users can
subsequently design ‘wet’ experiments based on the results of G4SNVHunter
to further explore the mechanisms by which variants affect biological
regulatory functions through their impact on G4 structures.
As the first step, you need to install G4SNVHunter
, which can be done with a
simple command,
please ensure that your network connection is stable.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Currently, G4SNVHunter depends on the devel version of Bioconductor.
# Note: Bioconductor devel requires the latest R version (4.5 or higher).
# Or, you can install G4SNVHunter from GitHub instead.
BiocManager::install(version='devel')
# Install the G4SNVHunter package and its dependencies for this tutorial.
BiocManager::install("G4SNVHunter", dependencies = TRUE)
Then load G4SNVHunter
to access its functions.
library(G4SNVHunter)
During this tutorial, we might need to use a few additional packages.
Since we specified dependencies = TRUE
when installing G4SNVHunter
package,
these additional packages have already been installed.
We can load them directly.
library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicRanges)
library(DT)
library(rtracklayer)
G4SNVHunter
requires you to provide two types of data:
the genomic sequences (a DNAStringSet
object)
the set of SNVs or SNPs (a GRanges
object)
While the input formats for these data are quite flexible, they must ultimately
be converted into the appropriate formats: DNAStringSet
for sequences and
GRanges
for SNVs.
Detailed descriptions on how to prepare and format these inputs are provided below.
The genomic sequences refer to the sequence of the chromosome or fragment where your SNVs (or SNPs) are located. They can be complete chromosome sequences or large fragments extracted from chromosomes. We will predict G4 structures from the sequences you provide and analyze whether these SNVs can affect their formation.
Please note that the coordinates of your SNVs must be relative to the genomic
sequence you provide, and they should be in 1-based
coordinates.
The genomic sequences need to be processed into a DNAStringSet
object. We
provide a built-in function, loadSequence
, to facilitate this processing. Of
course, you can also load your customized sequence file and convert it into
a DNAStringSet
object without using loadSequence
function.
loadSequence
mainly accepts three types of input:
A two-column data.frame
, with the first column containing the sequence
identifiers and the second column containing their corresponding sequences.
This should be specified using the genome_seq
parameter.
The path to a stored FASTA file. The FASTA file must have a .fa
, .fna
,
or .fasta
extension. This should be specified using the seq_path
parameter.
A text file (.txt
) that stores the sequence identifiers and their
corresponding sequences. The first column should contain the sequence
identifiers, and the second column should contain the sequences. This should
also be specified using the seq_path
parameter.
Note: Please do not include column names in the file!
Here are some examples for you to load sequences into a DNAStringSet
object
using loadSequence
function:
Load from a data.frame
object
seq_df <- data.frame(chr = c("seq1", "seq2"),
sequence = c(paste0(rep("G", 100), collapse = ""),
paste0(rep("A", 100), collapse = "")))
seq <- loadSequence(genome_seq = seq_df)
Load from a fasta
file
# File path to the sequences in fasta format
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
seq <- loadSequence(seq_path = fa_path)
Load from a txt
file
# File path to the sequences in txt format
txt_path <- system.file("extdata", "seq.txt", package = "G4SNVHunter")
seq <- loadSequence(seq_path = txt_path)
We can also obtain genome sequences for some common species from Bioconductor Annotation Packages. While this is convenient, it requires you to install some related packages in advance. For example:
# Load sequence for chromosome 21 (hg19)
hg19 <- BSgenome.Hsapiens.UCSC.hg19
chr21_seq <- DNAStringSet(hg19$chr21)
# Chromosome names are needed for analysis
names(chr21_seq) <- "chr21"
The SNV (or SNP) data needs to be processed into a GRanges
object. Since the
form of SNV data is too flexible, we do not provide a function to load SNVs
here. However, you can easily load and convert SNV data into a GRanges
object
yourself.
For example,
# Path to SNPs
snp_path <- system.file("extdata", "snp.txt", package = "G4SNVHunter")
# Load SNPs into memory
snp <- read.table(snp_path, sep = "\t", header = FALSE)
# Convert snp to GRanges
snp <- GRanges(seqnames = snp$V1,
ranges = IRanges(start = snp$V2, width = 1),
rsid = snp$V3,
ref = snp$V4,
alt = snp$V5)
print(snp)
## GRanges object with 15 ranges and 3 metadata columns:
## seqnames ranges strand | rsid ref alt
## <Rle> <IRanges> <Rle> | <character> <character> <character>
## [1] chr21 15397335 * | rs1244782663 G A
## [2] chr21 39931215 * | rs910271214 G A
## [3] chr21 22915361 * | rs1210166884 A G
## [4] chr21 42644592 * | rs554035836 A G
## [5] chr21 10743898 * | rs955290928 A G
## ... ... ... ... . ... ... ...
## [11] chr21 30836340 * | rs1212424980 T C
## [12] chr21 36426864 * | rs1478442431 G T
## [13] chr21 27584238 * | rs1191559536 T C
## [14] chr21 36106373 * | rs561521189 C T
## [15] chr21 21840835 * | rs1301205088 T C
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We provide a built-in checkSNV
function for you to verify whether your SNV
(or SNP) data meets the requirements of our program.
However, it is not necessary to run this function explicitly, as our program will automatically perform the checks at the appropriate stages. Nevertheless, performing a proactive check is always a good practice.
gr <- GRanges(
seqnames = Rle("seq1"),
ranges = IRanges(c(100, 200, 300), width=1),
ref = c("A", "C", "G"),
alt = c("T", "T", "A")
)
# Check width ('w'), ref ('r'), and alt ('a')
# Returns TRUE if it passed the validation
checkSNV(gr, mode = "wra", ref_col = "ref", alt_col = "alt")
## [1] TRUE
You can use the G4HunterDetect
function to directly predict G4s from a
DNAStringSet
object containing sequences based on G4Hunter algorithm.
For example,
# Sequence file in fasta file format
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
# Load sequences
seq <- loadSequence(seq_path = fa_path)
# Predict G4s
G4_detected <- G4HunterDetect(seq)
## Start processing...
## Merging predicted G4s...
## Done!
Then, we can examine the prediction results, which are stored as a GRanges
object.
You can use functions like print
to directly view the results. However, in
this instance, we’ll leverage the datatable
function to view the predicted G4s, as this method provides a more
user-friendly display.
Let’s take a look at the G4s,
datatable(as.data.frame(G4_detected), options = list(scrollX = TRUE))
This function will return a GRanges
object containing all potential G4s
predicted by the G4Hunter algorithms. It mainly includes:
The sequence identifier where the G4 is located (seqnames
),
The position of the G4 relative to the sequence you provided
(ranges
, 1-based
coordinate),
The strand on which the G4 is located (strand
), with +
indicating the G4
is on the positive strand, and -
indicating it is on the negative strand,
The G4Hunter score for the G4 (score
),
The maximum G4Hunter score of the windows covering that G4 during the
execution of the G4Hunter algorithm
(max_score
, used to determine if the G4 surpasses the threshold),
The predicted G4 sequence on the positive strand (sequence
, with C-rich
means G4s on the reverse strand).
Please note that score
reflects the precise score of the G4 sequence, while
max_score
represents the highest score obtained by sliding a window across
that G4.
See the illustration below.
Some parameters for prediction can be customized. For example,
# Predict G4 by customizing other parameters
G4_detected <- G4HunterDetect(seq, threshold = 1.5, window_size = 20)
## Start processing...
## Merging predicted G4s...
## Done!
threshold
: G4Hunter will search for G4s in windows above this threshold
(absolute value). Default is 1.5
. Unless there are special needs, we do not
recommend setting the threshold below 1.2
.
window_size
: The window size (bp) for G4Hunter prediction.
Default is 25
.
Another commonly used window size is 20
. However, 25
is generally
preferred unless otherwise required.
include_sequences
: Whether to include the predicted G4 sequences in the
output. Default is TRUE
.
strands
: Indicate which strand requires prediction, with b
for both
strands and p
for positive strand only. Please note that if your genome is
single-stranded, this parameter should be set to p
to prevent the program
from predicting G4s on a non-existent negative strand.
We generally do not recommend modifying certain parameters, such as
window_size
and threshold
, as their default settings are already optimal.
Since we store the prediction results as a GRanges
object, exporting is very
straightforward.
For example, you can export it into a CSV file,
# export as csv format
write.csv(as.data.frame(G4_detected), "/path/to/your/file.csv")
or export it into other formats as well.
# export as bed format
export(G4_detected, "/path/to/your/file.bed", format = "bed")
# export as bigwig format
export(G4_detected, "/path/to/your/file.bw", format = "bigWig")
You can use plotG4Info
function for visualizing the statistical information
of G4s predicted by the G4HunterDetect
function.
plotG4Info(G4_detected)
The left panels (A
, C
, E
) show the overall distribution of the
max_score
(absolute value), score
(absolute value), and length of G4s,
respectively. The right panels (B
, D
, F
) illustrate the distribution of
max_score
, score
, and length for G4 structures on both the positive and
negative strands. The scores for G4s on the positive strand are positive, while
those on the negative strand are negative, resulting in a bimodal distribution
of max_score
and score
. Please note that the positive or negative score
reflects only the type of strand the G4 is on, whereas the capability of the G4
structure to form is determined by the absolute value of the score.
We provide two modes in SNVImpactG4
function for you to assess the potential
impact of SNVs on G4s:
Variant-centric assessment (s
mode)
Sample-centric assessment (m
mode)
In short, variant-centric assessment considers the impact of each SNV on the G4 sequence individually, while sample-centric assessment combines the effects of multiple SNVs on a particular G4 for each sample.
See the illustration below.
We have prepared example data with variants from chromosome 21 of the human genome (hg19), and you can easily load them,
data(snp_gr)
data(snv_gr)
Additionally, you can quickly and conveniently predict G4 sequences on
chromosome 21 using the G4HunterDetect
function.
# Predict the G4s in human chr 21 (hg19)
chr21_G4 <- G4HunterDetect(chr21_seq)
## Start processing...
## Now process: chr21.
## Merging predicted G4s...
## Done!
The default mode of SNVImpactG4
is variant-centric (mode = 's'
), which
requires only the G4 data in GRanges
format (as returned by the
G4HunterDetect
function), the SNV data in GRanges
format, and the column
name for the alternate bases.
Then, the assessment can be done easily by,
snp_eff <- SNVImpactG4(chr21_G4,
snp_gr,
alt_col = "alt")
Let’s view the first three records,
datatable(as.data.frame(snp_eff[1:3]), options = list(scrollX = TRUE))
SNVImpactG4
function will return a GRanges
object that includes detailed
information about the SNVs (core columns and SNV.info.*
columns), their
associated G4 regions (G4.info.*
columns), and the changes in G4 sequences
(mut.G4.seq
and mut.G4.anno.seq
columns) and G4Hunter scores (mut.score
and score.diff
column).
Please note that SNVs located outside of G4 regions will be automatically ignored.
You can set the mode
parameter in SNVImpactG4
function to 'm'
to enable
sample-centric mode. This mode is particularly useful for specific scenarios,
such as analyzing SNVs derived from cancer patients.
When using this mode, you must also specify the column names for sample IDs
(sampleid_col
) and SNV IDs (snvid_col
).
# Column names of the Sample ID and SNV ID must be specified
snv_eff <- SNVImpactG4(chr21_G4,
snv_gr,
alt_col = "alt",
mode = "m",
sampleid_col = "sampleid",
snvid_col = "snv_id")
Let’s view the first three records,
datatable(as.data.frame(snv_eff[1:3]), options = list(scrollX = TRUE))
It will return a GRanges
object containing detailed information about the G4s
(core columns and G4.info.*
columns). The object will also include the IDs of
the SNVs (snv.ids
column) located in the G4 regions, the sample IDs
(sample.ids
column), the mutated G4 sequences (mut.G4.seq
and
mut.G4.anno.seq
columns), and the changes in G4Hunter scores (mut.score
and
score.diff
columns).
Please note that in sample-centric mode, when multiple SNVs from the same
sample occur within a single G4 region, we will consider the combined impact of
these SNVs. For example, the following G4 overlaps with two SNVs
(id_3608
and id_49857
) from sample GQ3Y
, so the final G4Hunter score is
based on the combined effect of these two SNVs.
datatable(as.data.frame(snv_eff[528]), options = list(scrollX = TRUE))
Given that some SNVs may have minimal effects on G4 formation, we need to filter out those SNVs that are worth investigating further for experimental validation or additional analysis. In other words, identifying those variants that have the potential to disrupt the G4 structures.
We provide a convenient function, filterSNVImpact
, to help with this
filtering process.
There are three threshold parameters for you to adjust: raw_score_threshold
,
mut_score_threshold
, and score_diff_threshold
.
If raw_score_threshold
(a positive number, but no more than 4) is specified,
filterSNVImpact
will filter out records where the absolute value of the
original G4Hunter score is below this threshold.
If mut_score_threshold
(a positive number, but no more than 4) is specified,
filterSNVImpact
will retain records where the G4Hunter score of the mutated
G4 sequences does not exceed this threshold.
If score_diff_threshold
(a negative number, but no less than -4) is
specified, filterSNVImpact
will retain records where the decrease in the
G4Hunter score after mutation exceeds this threshold.
For example, if score_diff_threshold = -0.2
, those records will be retained:
\(\left| \text{G4HunterScore}_{\text{mut_seq}} \right| - \left| \text{G4HunterScore}_{\text{raw_seq}} \right| \leq -0.2\)
Our recommendation is that raw_score_threshold
should be greater than 1.5,
and the mut_score_threshold
should be less than 1.2. This is an empirically
based guideline, and you are certainly free to adjust them to be more stringent
or to use these parameters for flexible customization as needed.
Please note that you must specify at least one threshold parameter, but you do not need to specify all of them.
For example, using the results returned in mode m
filtered_snv_eff <- filterSNVImpact(snv_eff,
mut_score_threshold = 1.2,
score_diff_threshold = -0.35)
We can examine the filtered records to identify SNVs that have a substantial impact on G4 formation.
datatable(as.data.frame(filtered_snv_eff), options = list(scrollX = TRUE))
The final results can be saved to a file similarly to how G4 prediction results are saved, as described in Section 5.2.
You can use the plotSNVImpact
function we provide to visualize changes in
G4Hunter scores. This function is suitable for both variant-centric and
sample-centric output data, as well as their filtered outputs.
We can observe that mutated G4 structures generally exhibit a trend of decreased formation potential.
plotSNVImpact(snv_eff)
Let’s examine
how the G4Hunter scores have changed in the filtered_snv_eff
object.
We observe that the scores for the filtered G4s have decreased significantly.
plotSNVImpact(filtered_snv_eff)
Finally, we can use plotImpactSeq
to visualize the G4 sequences affected by
SNVs.
For example, we can plot the sequence logos for the top five G4s with the most significant changes in their G4Hunter scores.
top5_snv_eff <- filtered_snv_eff[order(filtered_snv_eff$score.diff)[1:5]]
plotImpactSeq(top5_snv_eff, ncol = 2)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
## Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Let’s explore a coherent example to predict the potential impact of SNVs on G4 formation.
Suppose you already have the sequences saved as a DNAStringset
object
(chr21_seq
) and the SNVs saved as a GRanges
object (snv_gr
), then execute
the following code,
# First step, predict G4s
chr21_G4 <- G4HunterDetect(chr21_seq)
# Second step, evaluate the impact of SNVs on G4s in 's' mode
snv_eff <- SNVImpactG4(chr21_G4, snv_gr, alt_col = "alt")
# Filter the results under default parameters
filtered_snv_eff <- filterSNVImpact(snv_eff, mut_score_threshold = 1.2)
# export as csv format
write.csv(as.data.frame(filtered_snv_eff), "/path/to/your/file.csv")
The author would like to thank Dr. Wenyong Zhu and Dr. Xiao Sun from Southeast University for their contributions: Dr. Zhu for providing the illustrations and testing the package, and Dr. Sun for reviewing this Vignettes.
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DT_0.33 BSgenome.Hsapiens.UCSC.hg19_1.4.3
## [3] BSgenome_1.75.0 rtracklayer_1.67.0
## [5] BiocIO_1.17.1 Biostrings_2.75.2
## [7] XVector_0.47.0 GenomicRanges_1.59.1
## [9] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [11] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [13] generics_0.1.3 G4SNVHunter_0.99.4
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2
## [3] farver_2.1.2 dplyr_1.1.4
## [5] viridis_0.6.5 bitops_1.0-9
## [7] fastmap_1.2.0 RCurl_1.98-1.16
## [9] GenomicAlignments_1.43.0 XML_3.99-0.17
## [11] digest_0.6.37 lifecycle_1.0.4
## [13] magrittr_2.0.3 compiler_4.5.0
## [15] rlang_1.1.4 sass_0.4.9
## [17] progress_1.2.3 tools_4.5.0
## [19] utf8_1.2.4 yaml_2.3.10
## [21] data.table_1.16.4 knitr_1.49
## [23] labeling_0.4.3 htmlwidgets_1.6.4
## [25] prettyunits_1.2.0 S4Arrays_1.7.1
## [27] curl_6.0.1 DelayedArray_0.33.3
## [29] abind_1.4-8 BiocParallel_1.41.0
## [31] withr_3.0.2 grid_4.5.0
## [33] ggpointdensity_0.1.0 fansi_1.0.6
## [35] colorspace_2.1-1 ggplot2_3.5.1
## [37] scales_1.3.0 tinytex_0.54
## [39] SummarizedExperiment_1.37.0 cli_3.6.3
## [41] rmarkdown_2.29 crayon_1.5.3
## [43] httr_1.4.7 rjson_0.2.23
## [45] RcppRoll_0.3.1 cachem_1.1.0
## [47] ggseqlogo_0.2 zlibbioc_1.53.0
## [49] parallel_4.5.0 BiocManager_1.30.25
## [51] restfulr_0.0.15 matrixStats_1.4.1
## [53] vctrs_0.6.5 Matrix_1.7-1
## [55] jsonlite_1.8.9 bookdown_0.41
## [57] hms_1.1.3 magick_2.8.5
## [59] crosstalk_1.2.1 jquerylib_0.1.4
## [61] glue_1.8.0 codetools_0.2-20
## [63] cowplot_1.1.3 gtable_0.3.6
## [65] UCSC.utils_1.3.0 munsell_0.5.1
## [67] tibble_3.2.1 pillar_1.9.0
## [69] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [71] R6_2.5.1 evaluate_1.0.1
## [73] Biobase_2.67.0 lattice_0.22-6
## [75] Rsamtools_2.23.1 bslib_0.8.0
## [77] Rcpp_1.0.13-1 gridExtra_2.3
## [79] SparseArray_1.7.2 xfun_0.49
## [81] MatrixGenerics_1.19.0 pkgconfig_2.0.3