1 Introduction

scoreInvHap genotypes inversions using SNP data. This method computes a similarity score between the available SNPs of an individual and experimental references for the three inversion-genotypes; NN: non-inverted homozygous, NI: inverted heterozygous and II: inverted homozygous. Individuals are classified into the inversion-genotypes with the highest score. There are other approaches to genotype inversions from SNP data: inveRsion, invClust or PFIDO. However, these approaches have limitations including:

  • they are based in population sample inferences. When a new individual is included in the study the whole population sample needs to be reanalyzed. As they depend on dimensionality reduction methods, they are slow in the analysis of several individuals.

  • they can only handle limited number of samples. They need large samples sizes for accurate inferences.

  • They are highly sensitive to SNP array densities. Their accuracy depends on how clear the population sample can be partitioned, which depends on the amount and quality of informative SNPs.

  • They need external information to label the inversion-genotype clusters. External validation of the clustering is performed at the end of the inference. If there are more than three clusters, it is not clear how which inversion genotype should be assigned to the clusters.

scoreInvHap overcomes these difficulties by using a set of reference genotypes from the inversion of interest. The methods is a supervised classifier that genotypes each individual according to their SNP similarities to the reference genotypes across the inverted segment. The classifier in particular uses the linkage desequillibrium (R2) between the SNPs and the inversion genoptypes, and the SNPs of each reference inversion-genotypes.

2 Inversion characterization

The package is loaded as usual

library(scoreInvHap)

scoreInvHap takes as input SNP data formated in either SNPMatrix or VCF Bioconductor’s objects. In the case of having data in SNPMatrix format, a list with two elements is required:

  • genotypes: a SNPMatrix with individuals in rows and SNPs in columns
  • map: a data.frame with the SNPs annotation. It must contain the columns allele.1 and allele.2 with the alleles of the SNPs.

If data is originally available in PLINK files (.bed, .bim), we can use the functions of snpStats to load the data as SNPMatrix object

library(snpStats)

## From a bed 
snps <- read.plink("example.bed")

## From a pedfile
snps <- read.pedfile("example.ped", snps = "example.map")

In both cases, data is readily formatted for scoreInvHap.

If data is available in a variant call format (.vcf) file, we can load it in R into a VCF object, using the VariantAnnotation package. scoreInvHap includes a small vcf in scoreInvHap as a demo. This file contains genotyped and imputed SNPs within inversion 7p11.2 for 30 European individuals from the 1000 Genomes project. We can load the vcf as follows

library(VariantAnnotation)
vcf_file <- system.file("extdata", "example.vcf", package = "scoreInvHap")
vcf <- readVcf(vcf_file, "hg19")
vcf
## class: CollapsedVCF 
## dim: 380 30 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 4 columns: AF, MAF, R2, ER2
## info(header(vcf)):
##        Number Type  Description                                                
##    AF  1      Float Estimated Alternate Allele Frequency                       
##    MAF 1      Float Estimated Minor Allele Frequency                           
##    R2  1      Float Estimated Imputation Accuracy                              
##    ER2 1      Float Empirical (Leave-One-Out) R-square (available only for g...
## geno(vcf):
##   List of length 3: GT, DS, GP
## geno(header(vcf)):
##       Number Type   Description                                                
##    GT 1      String Genotype                                                   
##    DS 1      Float  Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]      
##    GP 3      Float  Estimated Posterior Probabilities for Genotypes 0/0, 0/1...

The object vcf contains 380 SNPs and 30 individuals.

scoreInvHap references were created using 1000 Genomes data. Thus, SNPs are annotated to the hg19 build in the plus strand. We have included a function that checks if input SNPs correspond to scoreInvHap references:

check <- checkSNPs(vcf)
check
## $genos
## class: CollapsedVCF 
## dim: 307 30 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 4 columns: AF, MAF, R2, ER2
## info(header(vcf)):
##        Number Type  Description                                                
##    AF  1      Float Estimated Alternate Allele Frequency                       
##    MAF 1      Float Estimated Minor Allele Frequency                           
##    R2  1      Float Estimated Imputation Accuracy                              
##    ER2 1      Float Empirical (Leave-One-Out) R-square (available only for g...
## geno(vcf):
##   List of length 3: GT, DS, GP
## geno(header(vcf)):
##       Number Type   Description                                                
##    GT 1      String Genotype                                                   
##    DS 1      Float  Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]      
##    GP 3      Float  Estimated Posterior Probabilities for Genotypes 0/0, 0/1...
## 
## $wrongAlleles
## NULL
## 
## $wrongFreqs
## NULL
vcf <- check$genos

The function checkSNPs checks if the alleles in the SNP object are equal to those in scoreInvHap references. The function also tests if the frequencies are similar in the input data and in the references. checkSNPs returns a list with three elements. wrongAlleles are the SNPs with different alleles, wrongFreqs are the SNPs with different allele frequencies and genos is an object with the genotype data without SNPs failing any of the checks.

Now, we illustrate how to perform the inversion genotyping of inv7p11.2 for these data. The inversion calling requires two pieces of information:

  • sample genotypes and their annotation (argument SNPlist),
  • name of the inversion

Currently, there are 20 inversion references included in scoreInvHap. We included a table with their coordinates and scoreInvHap labels at the end of this document.

inv7p11.2 is labeled as inv7_005 in scoreInvHap. We then obtain the inversion genotypes for the 30 individuals in our dataset as follows

res <- scoreInvHap(SNPlist = vcf, inv = "inv7_005")
res
## scoreInvHapRes
## Samples:  30 
## Genotypes' table:
##  II   NI  NN  
##  6    17  7   
## - Inversion genotypes' table:
##  NN   NI  II  
##  7    17  6   
## - Inversion frequency: 48.33%

This function has a parameter called BPPARAM that allows paralell computing using BiocParallel infrastructure. For example, parallel computation using 8 cores would be run by executing

res <- scoreInvHap(SNPlist = vcf, inv = "inv7_005",
                   BPPARAM = BiocParallel::MulticoreParam(8))

The results of scoreInvHap are encapsulated in a object of class scoreInvHapRes. This object contains the classification of the individuals into the inversion-genotypes, as well as the similarity scores. We can retrieve results with the classification and the scores functions

# Get classification
head(classification(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 
##      II      II      NI      NI      II      NN 
## Levels: II NI NN
# Get scores
head(scores(res))
##              IaIa      IaIb      IbIb      NaIa      NaIb      NaNa      NaNb
## HG00096 0.7636176 0.7659578 0.9945596 0.6346131 0.7729563 0.7725742 0.4156252
## HG00097 0.7636176 0.7659578 0.9945596 0.6346131 0.7729563 0.7725742 0.4156252
## HG00099 0.6438462 0.7267291 0.7685422 0.7768338 0.9758600 0.7703418 0.5317241
## HG00100 0.3431981 0.4263283 0.4892990 0.2604141 0.4764038 0.3691768 0.6083141
## HG00101 0.9841178 0.7603894 0.7584714 0.7188997 0.6280646 0.7182172 0.3558992
## HG00102 0.2518848 0.1732271 0.3167020 0.1441663 0.1994766 0.3004813 0.5812806
##              NbIa      NbIb      NbNb
## HG00096 0.3784514 0.5670885 0.3911247
## HG00097 0.3784514 0.5670885 0.3911247
## HG00099 0.3025523 0.5345369 0.2689791
## HG00100 0.6376241 0.6834476 0.5605652
## HG00101 0.4545927 0.4165003 0.3266407
## HG00102 0.5527802 0.5480628 0.7223922

Inversion genotypes is returned as a factor, which includes the individual names present in the snpMatrix or VCF. Thus, inversion classification can be readily used in down-stream association tests.

2.1 Quality control

We can retrieve other values that are useful to evaluate the quality of the inference. For each individual, we can obtain the highest similarity score and the difference between the highest similarity score and the second highest:

# Get max score
head(maxscores(res))
##   HG00096   HG00097   HG00099   HG00100   HG00101   HG00102 
## 0.9945596 0.9945596 0.9758600 0.6834476 0.9841178 0.7223922
# Get difference score
head(diffscores(res))
##   HG00096   HG00097   HG00099   HG00100   HG00101   HG00102 
## 0.2216032 0.2216032 0.1990263 0.0458235 0.2237283 0.1411116

Classification is good when the highest score is close to 1 and the other scores are small. This means that the SNPs in the individual are almost identical to one of the reference genotypes and different from the rest. We can use the difference between the highest score and the second highest score as a quality measure. We can have a visual evaluation of the quality parameters with the function plotScores:

plotScores(res, pch = 16, main = "QC based on scores")

The horizontal line is a threshold for quality, set to 0.1 but can be changed in the parameter minDiff in the function scoreInvHap. This default value considers that the SNPs of the individual are at least 10% more similar to the selected reference than to second one. plotScores can be customized.

Another quality measure is based on the number of SNPs used in the computation. scoreInvHap allows individuals having different SNP coverages within the inverted regions. SNPs with no information are excluded from the computation of the scores. To reflect this we report a SNP call rate:

# Get Number of scores used
head(numSNPs(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 
##     307     307     307     307     307     307
# Get call rate
head(propSNPs(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 
##       1       1       1       1       1       1

The number of SNPs must always be taken into account to evaluate the performance of the computation. It is highly recommended to use, at least, 15 SNPs in any inversion-calling. plotCallRate can be used to visualize the call rate

plotCallRate(res, main = "Call Rate QC")

The vertical line is the minimum recommended call rate to consider a sample. By default, it is set to 0.9 but can be changed with the parameter callRate.

The function classification have two parameters that selects samples based on these two QC parameters. The argument minDiff sets the minimum difference between the highest and the second highest score. The argument callRate sets the minimum call rate of a sample to pass the QC. By default, both arguments are set to 0 so all the samples are included:

## No filtering
length(classification(res))
## [1] 30
## QC filtering
length(classification(res, minDiff = 0.1, callRate = 0.9))
## [1] 27

Finally, the function classification has the argument inversion that, when it is set to FALSE, haplotype-genotypes are called instead of inversion-genotuypes. This is useful for inversions that have more than one haplotype per inversion status:

## Inversion classification
table(classification(res))
## 
## II NI NN 
##  6 17  7
## Haplotype classification
table(classification(res, inversion = FALSE))
## 
## IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb 
##    2    0    4    5    5    0    4    3    4    3

2.2 Imputed data

When SNPs data are imputed, we obtain three different types of results: the best-guess, the dosage and the posterior probabilities. By default, scoreInvHap uses the best-guess when computing the similarity scores. However, we can also use posterior probabilities setting the argument probs to TRUE:

res_imp <- scoreInvHap(SNPlist = vcf, inv = "inv7_005", probs = TRUE)
res_imp
## scoreInvHapRes
## Samples:  30 
## Genotypes' table:
##  II   NI  NN  
##  6    17  7   
## - Inversion genotypes' table:
##  NN   NI  II  
##  7    17  6   
## - Inversion frequency: 48.33%

In this case, the samples were identically classified in both cases:

table(PostProbs = classification(res_imp), 
      BestGuess = classification(res))
##          BestGuess
## PostProbs II NI NN
##        II  6  0  0
##        NI  0 17  0
##        NN  0  0  7

3 Other features

There are two additional parameters of scoreInvHap than can reduce computing time: R2 and BPPARAM. R2 indicates the minimum R2 that a SNP should have with the inversion to be included in the score. The less number of SNPs the less time it takes. By default, all SNPs are included in the computation. On the other hand, BPPARAM requires an instance of BiocParallelParam, which allows to parallelize the computation of the score. You can find more information about this class in its help page (?bpparam) and in the BiocParallel vignette.

4 Inversions included in scoreInvHap

The following table describes the inversion includes in scoreInvHap:

Table 1: Inversions included in scoreInvHap
Length: inversion length in Kb in hg19. Num. Haplos: Number of haplotypes supported by the inversion. In parenthesis, number of standard and inverted haplotypes for inversions with more than two haplotypes.
scoreInvHap Label Locus Length (Kb) Num. Haplos
inv1_004 1p22.1 0.77 2
inv1_008 1q31.1 1.2 2
inv2_002 2p22.3 0.72 2
inv2_013 2q22.1 4.25 2
inv3_003 3q26.1 2.28 4 (3/1)
inv6_002 6p21.33 0.87 2
inv6_006 6q23.1 4.13 2
inv7_003 7p14.3 5.25 2
inv7_005 7p11.2 73.9 4 (2/2)
inv7_011 7q11.22 12.7 2
inv7_014 7q36.1 2.08 2
inv8_001 8p23.1 3925 2
inv11_001 11p12 4.75 2
inv11_004 11q13.2 1.38 3 (2/1)
inv12_004 12q21.2 19.3 2
inv12_006 12q21.2 1.03 3 (2/1)
inv14_005 14q23.3 0.86 2
inv16_009 16p11.2 364.2 2
inv17_007 17q21.31 711 2
inv21_005 21q21.3 1.06 4 (3/1)
invX_006 Xq13.2 90.8 4 (3/1)

This information is also available in inversionGR:

data(inversionGR)
inversionGR
## GRanges object with 22 ranges and 5 metadata columns:
##             seqnames              ranges strand | scoreInvHap.name
##                <Rle>           <IRanges>  <Rle> |      <character>
##    inv1_004     chr1   92131841-92132615      * |         inv1_004
##    inv1_008     chr1 197756784-197757982      * |         inv1_008
##    inv2_002     chr2   33764554-33765272      * |         inv2_002
##    inv2_013     chr2 139004949-139009203      * |         inv2_013
##    inv3_003     chr3 162545362-162547641      * |         inv3_003
##         ...      ...                 ...    ... .              ...
##   inv17_007    chr17   43661775-44372665      * |        inv17_007
##   inv21_005    chr21   28020653-28021711      * |        inv21_005
##    invX_006     chrX   72215927-72306774      * |         invX_006
##   inv16_009    chr16   28424774-28788943      * |        inv16_009
##   inv10_005    chr10   27220925-27656433      * |        inv10_005
##             Cytogenetic.location  Inv.freq Haplotypes  Num.SNPs
##                      <character> <numeric>  <numeric> <numeric>
##    inv1_004               1p22.1     11.23          2         6
##    inv1_008               1q31.3     19.68          2         5
##    inv2_002               2p22.3     15.11          2         6
##    inv2_013               2q22.1     71.47          2        13
##    inv3_003               3q26.1     56.16          4         6
##         ...                  ...       ...        ...       ...
##   inv17_007             17q21.31     23.96          2      3637
##   inv21_005              21q21.3     51.29          4        11
##    invX_006               Xq13.2     13.30          4       135
##   inv16_009              16p11.2     34.49          2       361
##   inv10_005                 <NA>        NA         NA        NA
##   -------
##   seqinfo: 14 sequences from an unspecified genome; no seqlengths

5 sessionInfo

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] snpStats_1.52.0             Matrix_1.6-1.1             
##  [3] survival_3.5-7              VariantAnnotation_1.48.0   
##  [5] Rsamtools_2.18.0            Biostrings_2.70.0          
##  [7] XVector_0.42.0              SummarizedExperiment_1.32.0
##  [9] Biobase_2.62.0              GenomicRanges_1.54.0       
## [11] GenomeInfoDb_1.38.0         IRanges_2.36.0             
## [13] S4Vectors_0.40.0            MatrixGenerics_1.14.0      
## [15] matrixStats_1.0.0           BiocGenerics_0.48.0        
## [17] scoreInvHap_1.24.0          BiocStyle_2.30.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0         dplyr_1.1.3              blob_1.2.4              
##  [4] filelock_1.0.2           bitops_1.0-7             fastmap_1.1.1           
##  [7] RCurl_1.98-1.12          BiocFileCache_2.10.0     GenomicAlignments_1.38.0
## [10] XML_3.99-0.14            digest_0.6.33            lifecycle_1.0.3         
## [13] KEGGREST_1.42.0          RSQLite_2.3.1            magrittr_2.0.3          
## [16] compiler_4.3.1           rlang_1.1.1              sass_0.4.7              
## [19] progress_1.2.2           tools_4.3.1              utf8_1.2.4              
## [22] yaml_2.3.7               rtracklayer_1.62.0       knitr_1.44              
## [25] prettyunits_1.2.0        S4Arrays_1.2.0           bit_4.0.5               
## [28] curl_5.1.0               DelayedArray_0.28.0      xml2_1.3.5              
## [31] abind_1.4-5              BiocParallel_1.36.0      grid_4.3.1              
## [34] fansi_1.0.5              biomaRt_2.58.0           cli_3.6.1               
## [37] rmarkdown_2.25           crayon_1.5.2             generics_0.1.3          
## [40] rjson_0.2.21             httr_1.4.7               DBI_1.1.3               
## [43] cachem_1.0.8             stringr_1.5.0            splines_4.3.1           
## [46] zlibbioc_1.48.0          parallel_4.3.1           AnnotationDbi_1.64.0    
## [49] BiocManager_1.30.22      restfulr_0.0.15          vctrs_0.6.4             
## [52] jsonlite_1.8.7           bookdown_0.36            hms_1.1.3               
## [55] bit64_4.0.5              magick_2.8.1             GenomicFeatures_1.54.0  
## [58] jquerylib_0.1.4          glue_1.6.2               codetools_0.2-19        
## [61] stringi_1.7.12           BiocIO_1.12.0            tibble_3.2.1            
## [64] pillar_1.9.0             rappdirs_0.3.3           htmltools_0.5.6.1       
## [67] BSgenome_1.70.0          GenomeInfoDbData_1.2.11  R6_2.5.1                
## [70] dbplyr_2.3.4             evaluate_0.22            lattice_0.22-5          
## [73] png_0.1-8                memoise_2.0.1            bslib_0.5.1             
## [76] Rcpp_1.0.11              SparseArray_1.2.0        xfun_0.40               
## [79] pkgconfig_2.0.3