immApex 1.3.7
Single-cell sequencing is mainstay technology in the field of immunology and oncology that allows researchers to couple RNA quantification and other modalities, like immune cell receptor profiling at the level of an individual cell. A number of workflows and software packages have been created to process and analyze single-cell transcriptomic data. These packages allow users to take the vast dimensionality of the data generated in single-cell-based experiments and distill the data into novel insights. Some of the packages within the R environment that offer single-cell immune profiling support include scRepertoire, immunarch, and immcantation. None of these packages offer support for machine or deep learning applications for immune repertoire profiling.
immApex is meant to serve as an API for supervised and unsupervised learning tasks based on immune receptor sequencing. These functions extract or generate amino acid or nucleotide sequences and prepare them for downstream machine learning. immApex is the underlying structure for the BCR models in Ibex and TCR models in Trex. It should be noted that the tools here are created for immune receptor sequences; they will work more generally for nucleotide or amino acid sequences. The package itself supports AIRR, Adaptive, and 10x formats and interacts with the scRepertoire R package.
More information is available at the immApex GitHub Repo.
suppressMessages(library(immApex))
suppressMessages(library(ggplot2))
suppressMessages(library(viridis))
suppressMessages(library(dplyr))
suppressMessages(library(igraph))
suppressMessages(library(tidygraph))
suppressMessages(library(ggraph))
Depending on the sequencing technology and the version, we might want to expand the length of our sequence embedding approach. The first step in the process is pulling the reference sequences from the ImMunoGeneTics (IMGT) system using getIMGT()
. More information for IMGT can be found at imgt.org. Data from IMGT is under a CC BY-NC-ND 4.0 license. Please be aware that attribution is required for usage and should not be used to create commercial or derivative work.
Parameters for getIMGT()
Here, we will use the getIMGT()
function to get the amino acid sequences for the TRBV region to get all the sequences by V gene allele.
# Function to check IMGT website availability
is_imgt_available <- function() {
tryCatch({
r <- httr::HEAD("https://www.imgt.org", timeout(5))
httr::status_code(r) == 200
}, error = function(e) {
FALSE
})
}
# Run getIMGT only if the website is available
if (is_imgt_available()) {
TRBV_aa <- getIMGT(species = "human",
chain = "TRB",
frame = "inframe",
region = "v",
sequence.type = "aa")
# Display first sequence as an example
TRBV_aa[[1]][1]
} else {
# Display a message if IMGT is not available
"IMGT website is not accessible at the moment."
}
## [1] "IMGT website is not accessible at the moment."
Immune receptor nomenclature can be highly variable across sequencing platforms. When preparing data for models, we can use formatGenes()
to universalize the gene formats into IMGT nomenclature.
Parameters for formatGenes()
Here, we will use the built-in example from Adaptive Biotechnologies and reformat and simplify the v region. formatGenes()
will add 2 columns to the end of the data frame per region selected - 1) v_IMGT will be the formatted gene calls and 2) v_IMGT.check is a binary for if the formatted region appears in the IMGT database. In the example below, “TRBV2-1” is not recognized as a designation within IMGT.
data("immapex_example.data")
Adaptive_example <- formatGenes(immapex_example.data[["Adaptive"]],
region = "v",
technology = "Adaptive",
simplify.format = TRUE)
head(Adaptive_example[,c("aminoAcid","vGeneName", "v_IMGT", "v_IMGT.check")])
## aminoAcid vGeneName v_IMGT v_IMGT.check
## 4490 CASSQDGPSGIETQYF TCRBV04-02 TRBV4-2 1
## 18266 CASSEGSNQPQHF TCRBV02-01 TRBV2-1 0
## 22061 CSASAGDMVTEAFF <NA> <NA> 0
## 22174 CASSQDPGETDTQYF <NA> <NA> 0
## 19117 CATSAWTGELFF <NA> <NA> 0
## 2659 CATSVPGQETQYF <NA> <NA> 0
We can now use inferCDR()
to add additional sequence elements to our example data using the outputs of formatGenes()
and getIMGT()
. Here, we will use the function to isolate the complementarity-determining regions (CDR) 1 and 2. If the gene nomenclature does not match the IMGT the result will be NA for the given sequences. Likewise, if the IMGT nomenclature has been simplified, the first allelic match will be used for sequence extraction.
Parameters for inferCDR
getIMGT()
if (is_imgt_available()) {
Adaptive_example <- inferCDR(Adaptive_example,
chain = "TRB",
reference = TRBV_aa,
technology = "Adaptive",
sequence.type = "aa",
sequences = c("CDR1", "CDR2"))
Adaptive_example[200:210,c("CDR1_IMGT", "CDR2_IMGT")]
} else {
# Display a message if IMGT is not available
"IMGT website is not accessible at the moment."
}
## [1] "IMGT website is not accessible at the moment."
Generating synthetic sequences is a quick way to start testing the model code. generateSequences()
can also generate realistic noise for generative adversarial networks.
Parameters for generateSequences()
sequences <- generateSequences(prefix.motif = "CAS",
suffix.motif = "YF",
number.of.sequences = 1000,
min.length = 8,
max.length = 16)
sequences <- unique(sequences)
head(sequences)
## [1] "CASHMCYF" "CASKDINEQTYF" "CASVIVYF" "CASKVNWNPKPNVQYF"
## [5] "CASIQLAINYF" "CASMCISYF"
If we want to generate nucleotide sequences instead of amino acids, we must to change the sequence.dictionary.
nucleotide.sequences <- generateSequences(number.of.sequences = 1000,
min.length = 8,
max.length = 16,
sequence.dictionary = c("A", "C", "T", "G"))
head(nucleotide.sequences)
## [1] "TCGTGAAC" "ATGCGAAC" "CCGTGCCACGGGTCAA" "AAGAACGGATAGATT"
## [5] "CTAGCCTGGCG" "ACGCCGACAT"
A common approach is to mutate sequences randomly or at specific intervals. This can be particularly helpful if we have fewer sequences or want to test a model for accuracy given new, altered sequences. mutateSequences()
allows us to tune the type of mutation, where along the sequences to introduce the mutation and the overall number of mutations.
Parameters for mutateSequences()
mutated.sequences <- mutateSequences(sequences,
number.of.sequences = 1,
position.start = 3,
position.end = 8)
head(sequences)
## [1] "CASHMCYF" "CASKDINEQTYF" "CASVIVYF" "CASKVNWNPKPNVQYF"
## [5] "CASIQLAINYF" "CASMCISYF"
head(mutated.sequences)
## CASHMCYF CASKDINEQTYF CASVIVYF CASKVNWNPKPNVQYF
## "CASNMCYF" "CASKDFNEQTYF" "CASVIVYT" "CASSVNWNPKPNVQYF"
## CASIQLAINYF CASMCISYF
## "CASIQMAINYF" "CASACISYF"
Beyond encoding single sequences, immApex provides functions to calculate summary statistics and features across a collection of sequences, such as a clonal repertoire.
This function calculates the relative frequency of each residue at each position across a set of sequences.
Parameters for calculateFrequency()
freq.matrix <- calculateFrequency(sequences,
max.length = 20)
head(freq.matrix[, 1:10])
## Pos.1 Pos.2 Pos.3 Pos.4 Pos.5 Pos.6 Pos.7 Pos.8
## A 0 1 0 0.05210421 0.04308617 0.05210421 0.03807615 0.03106212
## R 0 0 0 0.04208417 0.06412826 0.06012024 0.03707415 0.03306613
## N 0 0 0 0.03907816 0.04208417 0.05911824 0.04809619 0.03306613
## D 0 0 0 0.04809619 0.05210421 0.05611222 0.04208417 0.03306613
## C 1 0 0 0.05010020 0.05410822 0.04609218 0.04008016 0.03807615
## Q 0 0 0 0.04108216 0.05811623 0.05811623 0.04308617 0.02905812
## Pos.9 Pos.10
## A 0.03306613 0.02605210
## R 0.03006012 0.02805611
## N 0.03206413 0.02404810
## D 0.03807615 0.02204409
## C 0.02905812 0.02805611
## Q 0.02304609 0.01903808
This function measures the diversity (or entropy) at each position in a set of aligned sequences. It can use several common diversity metrics.
Parameters for calculateEntropy()
shannon.entropy <- calculateEntropy(sequences,
method = "shannon")
head(shannon.entropy)
## Pos1 Pos2 Pos3 Pos4 Pos5 Pos6
## 0.000000 0.000000 0.000000 2.986591 2.984538 2.985312
This function computes position-wise summary statistics for amino acid properties. For each position, it calculates a metric (like the mean) of a specific physicochemical property for all residues found at that position.
Parameters for calculateProperty()
# Calculate the mean of Atchley factors at each position
atchley.profile <- calculateProperty(sequences,
property.set = "atchleyFactors",
summary.fun = "mean")
head(atchley.profile[, 1:6])
## position
## scale Pos.1 Pos.2 Pos.3 Pos.4 Pos.5 Pos.6
## AF1 -1.343 -0.591 -0.228 -0.04847395 0.06555010 0.03563226
## AF2 0.465 -1.302 1.399 -0.04388978 -0.05410321 -0.04387976
## AF3 -0.862 -0.733 -4.760 -0.02414930 0.06490982 -0.09043687
## AF4 -1.020 1.570 0.670 0.02724349 -0.02970441 -0.02113026
## AF5 -0.255 -0.146 -2.647 -0.02846994 0.06192184 -0.03162625
This function quantifies the usage of gene loci (e.g., V and J genes) within a repertoire.
Parameters for calculateGeneUsage()
* loci: A character vector of one or two column names corresponding to the gene loci.
* summary: The output format: “proportion” (default), “count”, or “percent”.
# Using the built-in AIRR data
data_airr <- immapex_example.data[["AIRR"]]
# Calculate paired V-J gene usage as percentages
vj_usage <- calculateGeneUsage(data_airr,
loci = c("v_call", "j_call"),
summary = "percent")
vj_usage[1:5, 1:5]
## y
## x TRAJ10*01 TRAJ11*01 TRAJ12*01 TRAJ13*01 TRAJ13*02
## TRAV1-1*01 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## TRAV1-1*02 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## TRAV1-2*01 0.00000000 0.01666667 0.00000000 0.00000000 0.00000000
## TRAV1-2*03 0.01666667 0.00000000 0.03333333 0.00000000 0.01666667
## TRAV10*01 0.08333333 0.00000000 0.00000000 0.00000000 0.00000000
This function rapidly finds and counts contiguous (or gapped) amino acid or nucleotide motifs of specified lengths across a set of sequences.
Parameters for calculateMotif()
motif.counts <- calculateMotif(sequences,
motif.lengths = 3,
min.depth = 5)
head(motif.counts)
## motif frequency
## 1 SAT 6
## 2 SQI 5
## 3 DIY 5
## 4 DYY 6
## 5 SER 6
## 6 MYF 44
This function calculates a positional probability matrix (PPM) for a group of sequences. This can be used to represent the sequence profile of an antigen-specific repertoire or to generate a sequence logo.
ppm.matrix <- probabilityMatrix(sequences)
head(ppm.matrix)
## Pos.1 Pos.2 Pos.3 Pos.4 Pos.5 Pos.6 Pos.7 Pos.8
## A 0 1 0 0.05210421 0.04308617 0.05210421 0.03807615 0.03106212
## R 0 0 0 0.04208417 0.06412826 0.06012024 0.03707415 0.03306613
## N 0 0 0 0.03907816 0.04208417 0.05911824 0.04809619 0.03306613
## D 0 0 0 0.04809619 0.05210421 0.05611222 0.04208417 0.03306613
## C 1 0 0 0.05010020 0.05410822 0.04609218 0.04008016 0.03807615
## Q 0 0 0 0.04108216 0.05811623 0.05811623 0.04308617 0.02905812
## Pos.9 Pos.10 Pos.11 Pos.12 Pos.13 Pos.14 Pos.15
## A 0.03780069 0.03439153 0.0312500 0.02592593 0.02347418 0.01242236 0
## R 0.03436426 0.03703704 0.0312500 0.03333333 0.01643192 0.01552795 0
## N 0.03665521 0.03174603 0.0359375 0.03148148 0.03521127 0.00931677 0
## D 0.04352806 0.02910053 0.0375000 0.04259259 0.01877934 0.00621118 0
## C 0.03321879 0.03703704 0.0328125 0.02407407 0.03755869 0.02795031 0
## Q 0.02634593 0.02513228 0.0203125 0.03888889 0.02816901 0.01242236 0
## Pos.16
## A 0
## R 0
## N 0
## D 0
## C 0
## Q 0
The PPM can also be converted to a positional weight matrix (PWM) using a log-likelihood ratio.
set.seed(42)
# Generate a sample background frequency
back.freq <- sample(1:1000, 20)
names(back.freq) <- amino.acids
back.freq <- back.freq/sum(back.freq)
pwm.matrix <- probabilityMatrix(sequences,
max.length = 20,
convert.PWM = TRUE,
background.frequencies = back.freq)
head(pwm.matrix)
## Pos.1 Pos.2 Pos.3 Pos.4 Pos.5 Pos.6 Pos.7
## A -6.153093 3.811248 -6.153093 -0.42517268 -0.6936615 -0.4251727 -0.8676909
## R -6.982686 -6.982686 -6.982686 -1.55642111 -0.9603181 -1.0519485 -1.7347584
## N -5.347666 -5.347666 -5.347666 -0.02573756 0.0785991 0.5592249 0.2670442
## D -4.278624 -4.278624 -4.278624 1.33608583 1.4492964 1.5542660 1.1476407
## C 6.733651 -3.230690 -3.230690 2.44173581 2.5506702 2.3238993 2.1268625
## Q -4.854126 -4.854126 -4.854126 0.53819124 1.0285169 1.0285169 0.6053054
## Pos.8 Pos.9 Pos.10 Pos.11 Pos.12 Pos.13
## A -1.15309313 -0.8766248 -1.0065966 -1.1355761 -1.3839637 -1.50303956
## R -1.89522302 -1.8394841 -1.7330959 -1.9651688 -1.8725195 -2.79206392
## N -0.26020282 -0.1142661 -0.3122005 -0.1375035 -0.3155018 -0.15704371
## D 0.80883883 1.1957837 0.6365469 0.9904318 1.1685773 0.08192294
## C 2.05471268 1.8652065 2.0189005 1.8539417 1.4389042 2.04739525
## Q 0.05276441 -0.0801582 -0.1405891 -0.4215716 0.4675443 0.03693548
## Pos.14 Pos.15 Pos.16 Pos.17 Pos.18 Pos.19 Pos.20
## A -2.2574957 -4.080435 -3.1615713 -0.4834994 -0.4834994 -0.4834994 -0.4834994
## R -2.8240540 -4.910027 -3.9911640 -1.3130921 -1.3130921 -1.3130921 -1.3130921
## N -1.7739963 -3.275007 -2.3561438 0.3219281 0.3219281 0.3219281 0.3219281
## D -1.1199922 -2.205965 -1.2871022 1.3909697 1.3909697 1.3909697 1.3909697
## C 1.6649079 -1.158031 -0.2391677 2.4389042 2.4389042 2.4389042 2.4389042
## Q -0.9585288 -2.781468 -1.8626043 0.8154676 0.8154676 0.8154676 0.8154676
These functions help analyze relationships between residues or whole sequences.
adjacencyMatrix()
summarizes transitions between adjacent residues, creating a matrix of co-occurrence counts.
adj.matrix <- adjacencyMatrix(sequences,
normalize = FALSE)
adj.matrix[1:10, 1:10]
## A R N D C Q E G H I
## A 20 28 38 26 1023 28 24 28 32 24
## R 28 22 41 33 27 34 33 27 30 34
## N 38 41 16 14 21 31 23 31 33 29
## D 26 33 14 38 35 18 30 24 32 35
## C 1023 27 21 35 24 28 30 30 27 34
## Q 28 34 31 18 28 20 28 27 20 31
## E 24 33 23 30 30 28 28 28 23 32
## G 28 27 31 24 30 27 28 36 25 35
## H 32 30 33 32 27 20 23 25 28 28
## I 24 34 29 35 34 31 32 35 28 42
buildNetwork()
constructs a similarity network where nodes are sequences and edges connect sequences with an edit distance below a given threshold.
# Building an edge list
g1 <- buildNetwork(data.frame(sequences = sequences),
seq_col = "sequences",
threshold = 2)
# Forming network
graph <- graph_from_edgelist(as.matrix(g1[,1:2]))
E(graph)$weight <- g1[,3]
# Remove isolated nodes for clearer visualization
graph <- delete_vertices(graph, which(degree(graph) == 0))
# Convert to tidygraph for use with ggraph
g_tidy <- as_tbl_graph(graph)
# Plot the network
ggraph(g_tidy, layout = "fr") +
geom_edge_link(aes(width = weight), color = "black", alpha = 0.5) +
geom_node_point(aes(size = degree(g_tidy, mode = "all")),
fill = "steelblue", color= "black", shape = 21) +
theme_void() +
scale_edge_width(range = c(0.1, 0.5)) +
labs(size = "Degree") +
theme(legend.position = "right")
The primary tool for converting biological sequences into a numerical format suitable for machine learning is the sequenceEncoder()
function. It is a versatile wrapper that can generate three different types of numerical representations, controlled by the mode argument. Understanding this function is key to using the package effectively.
Key Arguments:
"onehot"
, "property"
, or "geometric"
.One-hot encoding is a common method that transforms each amino acid at each position into a binary vector. This is useful when the specific identity and position of each residue are important. This functionality is accessed by setting mode = "onehot"
in sequenceEncoder()
or by using the onehotEncoder()
alias.
Mode-Specific Arguments:
# Generate one-hot encoding using the main function
enc_onehot <- sequenceEncoder(sequences, mode = "onehot", verbose = FALSE)
# You can achieve the same result with the alias
enc_onehot_alias <- onehotEncoder(sequences, verbose = FALSE)
# View the first few columns of the flattened matrix output
head(enc_onehot$flattened[, 1:10])
## A_1 R_1 N_1 D_1 C_1 Q_1 E_1 G_1 H_1 I_1
## [1,] 0 0 0 0 1 0 0 0 0 0
## [2,] 0 0 0 0 1 0 0 0 0 0
## [3,] 0 0 0 0 1 0 0 0 0 0
## [4,] 0 0 0 0 1 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 0
## [6,] 0 0 0 0 1 0 0 0 0 0
Instead of a binary vector, this method represents each amino acid using a set of continuous numerical values based on physicochemical properties (e.g., kideraFactors, FASGAI, etc). This can capture the biochemical similarity between amino acids. Access this method by setting mode = "property"
or using the propertyEncoder()
alias.
Mode-Specific Arguments:
Available property.set options include:
# Generate property-based encoding using FASGAI factors
enc_prop <- sequenceEncoder(sequences,
mode = "property",
property.set = "FASGAI",
verbose = FALSE)
# The propertyEncoder() alias is a convenient shortcut
enc_prop_alias <- propertyEncoder(sequences, property.set = "FASGAI", verbose = FALSE)
# View the first few columns of the flattened property matrix
head(enc_prop$flattened[, 1:6])
## F1_1 F2_1 F3_1 F4_1 F5_1 F6_1
## [1,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [2,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [3,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [4,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [5,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [6,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
This approach creates a single, fixed-length vector (an “embedding”) for an entire sequence. It works by averaging the vectors for each amino acid from a substitution matrix (like BLOSUM62) and then applying a geometric rotation. This is useful for tasks where a summary of the entire sequence is needed, similar to approaches like GIANA. Use this with mode = "geometric"
or the geometricEncoder()
alias.
Parameters for geometricEncoder()
# Generate a geometric embedding for each sequence
enc_geo <- sequenceEncoder(sequences,
mode = "geometric",
method = "BLOSUM62",
theta = pi / 3,
verbose = FALSE)
# The alias provides a direct path to this functionality
enc_geo_alias <- geometricEncoder(sequences,
method = "BLOSUM62",
theta = pi / 3,
verbose = FALSE)
# The output is a single summary matrix where each row is a sequence
head(enc_geo$summary)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## 1 -1.640544 -0.65849365 -2.7610572 0.28229128 -0.6282849 -1.6617786 -2.840304
## 2 -1.229861 -0.03648518 -0.7135148 0.06917725 -0.8720085 0.8436963 -1.754380
## 3 -2.056810 -1.18750000 -3.2900635 0.69855716 -1.5612976 -1.0457532 -3.119310
## 4 -1.380529 0.14114564 -1.3470349 -0.41686706 -1.2371393 1.0177881 -1.930168
## 5 -1.462587 -0.73945224 -2.3684144 0.10221415 -1.1720277 -0.1518066 -2.628962
## 6 -1.732051 -1.00000000 -2.8133898 0.42848961 -0.7915951 -1.7400282 -2.883831
## [,8] [,9] [,10] [,11] [,12] [,13]
## 1 0.66955081 -0.9910254 -0.28349365 -1.953044 -0.1172278 -0.12500000
## 2 -0.62799153 -1.6852027 0.08552329 -1.255181 1.0073714 -1.51036297
## 3 0.65280398 -0.3337341 1.82804446 -1.794551 -0.8917468 0.04575318
## 4 -0.03185118 -1.9072913 0.05352540 -1.191386 1.1885413 -1.76778811
## 5 0.37167793 -0.8392773 1.09003464 -1.474767 -0.5365385 -0.53001155
## 6 0.77272030 -1.0664529 1.18048396 -1.761823 -0.5039887 -0.09622504
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## 1 0.216506351 -1.5122595 1.8693103 -1.582532 0.2410254 -0.6785254 -0.82475953
## 2 0.282692070 -0.9166667 1.5877132 -2.043055 -0.7946582 -1.3853630 0.06618572
## 3 0.170753175 -1.8370191 1.6818103 -1.828044 -0.3337341 0.7120191 0.26674682
## 4 0.061898816 -0.7935095 0.6243988 -1.859294 -0.2796075 -1.4346552 0.10989564
## 5 0.008916019 -1.4512820 1.7864214 -2.095687 -0.3701633 -0.4180069 0.17855469
## 6 -0.055555556 -1.1666667 2.0207259 -1.680484 -0.2004275 -0.4960113 -0.02977213
Another approach to transforming a sequence into numerical values is tokenizing it into numbers. This is a common approach for recurrent neural networks where one letter corresponds to a single integer. In addition, we can add start and stop tokens to our original sequences to differentiate between the beginning and end of the sequences.
Parameters for tokenizeSequences()
token.matrix <- tokenizeSequences(input.sequences = c(sequences, mutated.sequences),
add.startstop = TRUE,
start.token = "!",
stop.token = "^",
convert.to.matrix = TRUE)
head(token.matrix[,1:18])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 1 6 2 17 10 14 6 20 15 22 23 23 23 23
## [2,] 1 6 2 17 13 5 11 4 8 7 18 20 15 22
## [3,] 1 6 2 17 21 11 21 20 15 22 23 23 23 23
## [4,] 1 6 2 17 13 21 4 19 4 16 13 16 4 21
## [5,] 1 6 2 17 11 7 12 2 11 4 20 15 22 23
## [6,] 1 6 2 17 14 6 11 17 20 15 22 23 23 23
## [,15] [,16] [,17] [,18]
## [1,] 23 23 23 23
## [2,] 23 23 23 23
## [3,] 23 23 23 23
## [4,] 7 20 15 22
## [5,] 23 23 23 23
## [6,] 23 23 23 23
We have a function called sequenceDecoder()
that extracts sequences from one-hot or property-encoded matrices or arrays. This function can be applied to any generative approach to sequence generation.
Parameters for sequenceDecoder()
"onehot"
or "property"
mode = "property"
, a character vector of property names (e.g., "Atchley"
) that were used for the original encoding.property.matrix <- propertyEncoder(input.sequences = c(sequences, mutated.sequences),
property.set = "FASGAI")
property.sequences <- sequenceDecoder(property.matrix[[1]],
mode = "property",
property.set = "FASGAI",
call.threshold = 1)
head(sequences)
## [1] "CASHMCYF" "CASKDINEQTYF" "CASVIVYF" "CASKVNWNPKPNVQYF"
## [5] "CASIQLAINYF" "CASMCISYF"
head(property.sequences)
##
## "CASHMCYF" "CASKDINEQTYF" "CASVIVYF" "CASKVNWNPKPNVQYF"
##
## "CASIQLAINYF" "CASMCISYF"
A similar approach can be applied when using matrices or arrays derived from one-hot encoding:
sequence.matrix <- onehotEncoder(input.sequences = c(sequences, mutated.sequences))
OHE.sequences <- sequenceDecoder(sequence.matrix,
mode= "onehot")
head(OHE.sequences)
## [1] "CASHMCYF" "CASKDINEQTYF" "CASVIVYF" "CASKVNWNPKPNVQYF"
## [5] "CASIQLAINYF" "CASMCISYF"
A common task is to classify sequences into groups, such as predicting whether an immune receptor binds to a specific antigen. Here, we’ll train a Random Forest model to distinguish between two classes of sequences based on their physicochemical properties
Step 1: Simulate Data and Engineer Features
First, we’ll simulate two distinct classes of sequences using generateSequences()
. We’ll then use propertyEncoder()
with the “Atchley” factors to convert each sequence into a flattened numerical vector. Each element in the vector represents a specific physicochemical property at a specific position, transforming our variable-length sequences into a fixed-size feature matrix suitable for machine learning.
# Step 1a: Generate two distinct classes of sequences
class1.sequences <- generateSequences(prefix.motif = "CAS",
min.length = 3,
number.of.sequences = 500)
class2.sequences <- generateSequences(prefix.motif = "CSG",
min.length = 3,
number.of.sequences = 500)
# Combine sequences and create labels
all.sequences <- c(class1.sequences, class2.sequences)
labels <- as.factor(c(rep("Class1", 500), rep("Class2", 500)))
# Step 1b: Use propertyEncoder to create a feature matrix from Atchley factors
feature.matrix <- propertyEncoder(all.sequences,
property.set = "atchleyFactors",
verbose = FALSE)
# Combine the flattened feature matrix and labels into the final data frame
training.data <- data.frame(feature.matrix$flattened, labels)
Step 2: Train and Evaluate the Model
Now we can train the Random Forest classifier. This model is robust, requires minimal tuning, and can provide insights into which features (in this case, which property at which position) are most important for distinguishing between the classes.
suppressMessages(library(randomForest))
# Train the Random Forest model
set.seed(42) # for reproducibility
rf.model <- randomForest(labels ~ .,
data = training.data,
ntree = 100,
importance = TRUE) # Set importance=TRUE to calculate scores
# Print the confusion matrix to see model performance
print(rf.model)
##
## Call:
## randomForest(formula = labels ~ ., data = training.data, ntree = 100, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0%
## Confusion matrix:
## Class1 Class2 class.error
## Class1 500 0 0
## Class2 0 500 0
A key advantage of Random Forest is the ability to easily extract feature importance. We can create a plot to see which positional Atchley factors were most influential in the model’s predictions.
# Extract feature importance data from the model
importance.data <- as.data.frame(importance(rf.model))
importance.data$Feature <- rownames(importance.data)
# Get the top 15 most important features
top_features <- importance.data[order(importance.data$MeanDecreaseGini, decreasing = TRUE), ][1:15,]
# Plot using ggplot2
ggplot(top_features, aes(x = MeanDecreaseGini, y = reorder(Feature, MeanDecreaseGini))) +
geom_col(aes(fill = MeanDecreaseGini)) +
scale_fill_viridis_c() +
labs(title = "Top 15 Most Important Features",
x = "Mean Decrease in Gini Impurity",
y = "Feature (Property and Position)") +
theme_classic() +
theme(legend.position = "none")
The plot shows the features that the model found most useful for classification. The feature names correspond to specific Atchley factors at specific positions within the sequence - here residues 2 and 3 where the difference was encoded - demonstrating how propertyEncoder()
allows the model to learn from the underlying biochemistry.
Sometimes we don’t have labels, but we want to see if our sequences form natural clusters. We can use dimensionality reduction techniques like Principal Component Analysis (PCA) to visualize the relationships between sequences. The geometricEncoder()
function is perfect for this, as it creates a rich, fixed-length numerical embedding for each sequence.
Step 1: Simulate and Encode a Mixed Population of Sequences
We’ll generate three distinct families of sequences. Then, we will use geometricEncoder()
to transform them into a 20-dimensional numerical matrix.
# Generate three distinct groups of sequences
groupA <- generateSequences(prefix.motif = "CA",
number.of.sequences = 100,
min.length = 2)
groupB <- generateSequences(prefix.motif = "QR",
number.of.sequences = 100,
min.length = 2)
groupC <- generateSequences(prefix.motif = "YY",
number.of.sequences = 100,
min.length = 2)
# Combine them and create a vector of original group IDs for later visualization
mixed.sequences <- c(groupA, groupB, groupC)
original.groups <- as.factor(rep(c("Group A", "Group B", "Group C"), each = 100))
# Use geometricEncoder to create embeddings
geometric.embeddings <- geometricEncoder(mixed.sequences,
method = "BLOSUM62",
verbose = FALSE)
Step 2: Perform PCA and Visualize Clusters
With our geometric.embeddings matrix, we can now perform PCA and plot the results. If the geometric encoding captures the underlying differences in our sequence families, we should see distinct clusters in the plot of the first two principal components.
# Perform PCA on the embedding matrix
pca.result <- prcomp(geometric.embeddings$summary, center = TRUE, scale. = TRUE)
# Prepare data for plotting with ggplot2
pca.data <- data.frame(PC1 = pca.result$x[,1],
PC2 = pca.result$x[,2],
Group = original.groups)
# Plot PC1 vs PC2 using ggplot2 and viridis
ggplot(pca.data, aes(x = PC1, y = PC2, color = Group)) +
geom_point(size = 3, alpha = 0.8) +
scale_color_viridis(discrete = TRUE) +
labs(title = "PCA of Geometric Sequence Embeddings",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_classic()
The resulting plot clearly shows three distinct clusters, demonstrating that the geometricEncoder()
successfully captured the structural differences between our sequence families, allowing for their separation in an unsupervised manner.
Beyond analyzing individual sequences, we can explore the relationships between sequences by building a similarity network. A powerful machine learning application for these networks is community detection, an unsupervised clustering method that finds groups of densely connected nodes. In immunology, this can be used to identify clonal families or groups of sequences with shared features.
Step 1: Simulate Data with Inherent Structure
First, we will simulate a dataset containing three distinct “families” of sequences. Our goal is to see if the network analysis can blindly recover these groups without being given the labels.
set.seed(42)
# Generate three distinct groups of sequences to simulate clonal families
family1 <- unique(generateSequences(prefix.motif = "CASS",
suffix.motif = "YF",
min.length = 6,
number.of.sequences = 60))
family2 <- unique(generateSequences(prefix.motif = "CARS",
suffix.motif = "GF",
min.length = 6,
number.of.sequences = 60))
family3 <- unique(generateSequences(prefix.motif = "CSVA",
suffix.motif = "HF",
min.length = 6,
number.of.sequences = 60))
# Combine into a single data frame
all_sequences_df <- data.frame(
sequence = c(family1, family2, family3),
original_family = c(rep("Family 1", 42), rep("Family 2", 40), rep("Family 3", 46))
)
Step 2: Build the Sequence Network
Next, we use buildNetwork()
to create an edge list based on sequence similarity (edit distance). We then convert this list into a formal graph object using the igraph package, which is the standard for network analysis in R.
# Build the edge list: connect sequences with an edit distance of 3 or less
edge_list <- buildNetwork(all_sequences_df,
seq_col = "sequence",
threshold = 3)
# Replace numerical edge list with the sequences
edge_list_sequences <- as.matrix(data.frame(
from = all_sequences_df$sequence[as.numeric(edge_list$from)],
to = all_sequences_df$sequence[as.numeric(edge_list$to)],
dist = edge_list$dist
))
# Create a graph object from the edge list and node data
# This graph now contains all our sequences as nodes
sequence_graph <- graph_from_data_frame(d = edge_list_sequences,
vertices = all_sequences_df,
directed = FALSE)
E(sequence_graph)$weight <- edge_list_sequences[,3]
Step 3: Perform Community Detection
Now we apply a machine learning algorithm to find communities. The igraph
package offers many algorithms; we will use “Walktrap,” a method that finds communities through a series of short random walks. The idea is that walks are more likely to get “trapped” within densely connected parts of the network (i.e., communities).
# Perform community detection using the Walktrap algorithm
communities <- cluster_walktrap(sequence_graph)
# Add the community membership as an attribute to the graph nodes
V(sequence_graph)$community <- communities$membership
Step 4: Visualize and Interpret the Communities
Finally, we use ggraph
to visualize the network, coloring each node by the community it was assigned to by the algorithm. If the analysis was successful, the colors should align with the original sequence families we simulated.
# Convert to tidygraph for use with ggraph
g_tidy <- as_tbl_graph(sequence_graph)
# Plot the network, coloring nodes by their detected community
ggraph(g_tidy, layout = "fr") +
geom_edge_link(aes(alpha = weight), show.legend = FALSE) +
geom_node_point(aes(color = as.factor(community)), size = 4) +
scale_color_viridis(discrete = TRUE, option = "D") +
labs(title = "Sequence Network with Detected Communities",
color = "Detected\nCommunity") +
theme_void()
The plot clearly visualizes the network structure, and the distinct color groups demonstrate that the community detection algorithm successfully identified and separated the three original sequence families without any prior information. This unsupervised approach is a powerful tool for exploring the hidden structure within a complex repertoire.
This has been a general overview of the capabilities of immApex for processing immune receptor sequences, extracting features, and developing algorithms. If you have any questions, comments, or suggestions, feel free to visit the GitHub repository.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/liblapack.so.3; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.7-1.2 ggraph_2.2.1 tidygraph_1.3.1
## [4] igraph_2.1.4 dplyr_1.1.4 viridis_0.6.5
## [7] viridisLite_0.4.2 ggplot2_3.5.2 immApex_1.3.7
## [10] BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2
## [3] fastmap_1.2.0 SingleCellExperiment_1.31.1
## [5] tweenr_2.0.3 promises_1.3.3
## [7] digest_0.6.37 lifecycle_1.0.4
## [9] processx_3.8.6 magrittr_2.0.3
## [11] compiler_4.5.1 rlang_1.1.6
## [13] sass_0.4.10 tools_4.5.1
## [15] yaml_2.3.10 knitr_1.50
## [17] S4Arrays_1.9.1 labeling_0.4.3
## [19] graphlayouts_1.2.2 curl_6.4.0
## [21] DelayedArray_0.35.2 xml2_1.3.8
## [23] RColorBrewer_1.1-3 abind_1.4-8
## [25] websocket_1.4.4 withr_3.0.2
## [27] purrr_1.1.0 Peptides_2.4.6
## [29] BiocGenerics_0.55.1 grid_4.5.1
## [31] polyclip_1.10-7 hash_2.2.6.3
## [33] stats4_4.5.1 scales_1.4.0
## [35] MASS_7.3-65 dichromat_2.0-0.1
## [37] tinytex_0.57 SummarizedExperiment_1.39.1
## [39] cli_3.6.5 rmarkdown_2.29
## [41] crayon_1.5.3 generics_0.1.4
## [43] httr_1.4.7 cachem_1.1.0
## [45] chromote_0.5.1 ggforce_0.5.0
## [47] stringr_1.5.1 parallel_4.5.1
## [49] rvest_1.0.4 BiocManager_1.30.26
## [51] XVector_0.49.0 matrixStats_1.5.0
## [53] vctrs_0.6.5 Matrix_1.7-3
## [55] jsonlite_2.0.0 bookdown_0.43
## [57] IRanges_2.43.0 S4Vectors_0.47.0
## [59] ggrepel_0.9.6 magick_2.8.7
## [61] jquerylib_0.1.4 tidyr_1.3.1
## [63] glue_1.8.0 ps_1.9.1
## [65] stringi_1.8.7 gtable_0.3.6
## [67] later_1.4.2 GenomicRanges_1.61.1
## [69] tibble_3.3.0 pillar_1.11.0
## [71] htmltools_0.5.8.1 Seqinfo_0.99.2
## [73] R6_2.6.1 evaluate_1.0.4
## [75] lattice_0.22-7 Biobase_2.69.0
## [77] memoise_2.0.1 bslib_0.9.0
## [79] Rcpp_1.1.0 gridExtra_2.3
## [81] SparseArray_1.9.1 xfun_0.52
## [83] MatrixGenerics_1.21.0 pkgconfig_2.0.3