---
title: "syntenet as a synteny detection tool"
author: 
  - name: Fabricio Almeida-Silva
    affiliation: 
    - VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
    - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
  - name: Tao Zhao
    affiliation:
    - State Key Laboratory of Crop Stress Biology for Arid Areas/Shaanxi Key Laboratory of Apple, College of Horticulture, Northwest A&F University, Yangling, China
  - name: Kristian K Ullrich
    affiliation:
    - Department of Evolutionary Biology, Max Planck Institute For Evolutionary Biology, Ploen, Germany
  - name: Yves Van de Peer
    affiliation: 
    - VIB-UGent Center for Plant Systems Biology, Ghent, Belgium
    - Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent, Belgium
    - College of Horticulture, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China
    - Center for Microbial Ecology and Genomics, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, South Africa
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: yes
bibliography: vignette_bibliography.bib
vignette: >
  %\VignetteIndexEntry{syntenet as a synteny detection tool}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
)
```

# Introduction

Although `r BiocStyle::Biocpkg("syntenet")` was designed for large-scale
synteny analyses (i.e., with tens of genomes), users might also be interested
in using `r BiocStyle::Biocpkg("syntenet")` as a simple synteny detection
tool. For example, one can use `r BiocStyle::Biocpkg("syntenet")` to
identify syntenic regions in a genome (intraspecies synteny),
or to identify syntenic regions between the genomes of two species 
(interspecies synteny). To detect synteny, `r BiocStyle::Biocpkg("syntenet")` 
uses its own implementation of the MCScanX algorithm [@wang2012mcscanx], 
ported to R thanks to the `r BiocStyle::CRANpkg("Rcpp")` package for 
R and C++ integration.
In this vignette, we will guide you on how to perform intra and interspecies
synteny with `r BiocStyle::Biocpkg("syntenet")`.

```{r load_package, message=FALSE}
library(syntenet)
```

# Detecting intragenome synteny

To detect intragenome (or intraspecies) synteny, you will use the function
`intraspecies_synteny`. As input, you will need:

1. A `GRangesList` object containing the processed annotation for your 
species of interest, as returned by `process_input()`.

2. A list of data frames with the output of similarity search programs 
(e.g., DIAMOND, BLAST, etc). **Only intragenome comparisons must be included.**
This can be obtained with `run_diamond(seq, compare = "intraspecies")`.


To demonstrate the usage of `intraspecies_synteny()`, let's identify syntenic
regions in the genome of *Saccharomyces cerevisiae*. The processed annotation
and DIAMOND output are stored in the example data sets `scerevisiae_annot`
and `scerevisiae_diamond`.

```{r data_intra}
# Load example data sets
data(scerevisiae_annot)
head(scerevisiae_annot)

data(scerevisiae_diamond)
names(scerevisiae_diamond)
head(scerevisiae_diamond$Scerevisiae_Scerevisiae)
```

Once we have the data, we can detect synteny with `intraspecies_synteny()`.
This function returns the path to the *.collinearity* files generated by 
MCScanX [@wang2012mcscanx], which can be read and parsed with the
`parse_collinearity()` function.

```{r intra_syn}
# Detect intragenome synteny
intra_syn <- intraspecies_synteny(
    scerevisiae_diamond, scerevisiae_annot
)

intra_syn # see where the .collinearity file is

# Get anchor pairs from .collinearity file
anchors <- parse_collinearity(intra_syn)
head(anchors)
```

By default, the output of `parse_collinearity()` is a 2-column data 
frame with anchor pairs in syntenic regions. You can also extract data on each
synteny block (or synteny region) by specifying `as = 'blocks'`.

```{r parse_blocks}
# Get synteny block information with `parse_collinearity()`
blocks <- parse_collinearity(intra_syn, as = "blocks")
head(blocks)
```

You can even extract both anchor pairs and synteny block data at once in
a single data frame.

```{r parse_all}
# Get anchors and block data with `parse_collinearity()`
intrasyn_all <- parse_collinearity(intra_syn, as = "all")
head(intrasyn_all)
```

# Detecting intergenome synteny

To detect intergenome (or interspecies) synteny, you will use the function
`interspecies_synteny`, which works very similarly to `intraspecies_synteny()`.
As input, you will need:

1. A `GRangesList` object containing the processed annotation for your 
species of interest (2+ species), as returned by `process_input()`.

2. A list of data frames with the output of similarity search programs 
(e.g., DIAMOND, BLAST, etc). **Only intergenome comparisons must be included.**
This can be obtained with `run_diamond(seq, compare = compare_df)`.

Importantly, to detect intergenome synteny, the MCScanX algorithm requires
bidirectional BLAST hits. For instance, if you're trying to detect synteny
between `spA` and `spB`, you need to perform DIAMOND/BLAST searches in both
directions: `spA_spB` and `spB_spA`. You can create a data frame specifying
these comparisons as follows:

```{r make_bidirectional}
# Create a data frame with comparisons to be made
comp <- data.frame(
    query = "spA",
    db = "spB"
)
comp

# Make comparisons bidirectional
comp_bi <- make_bidirectional(comp)
comp_bi
```

Then, you can peform similarity searches for these comparisons with:

```{r run_diamond_compare, eval = FALSE}
# NOTE: Not executed because object `seq` doesn't exist; for demo only
dmd_inter <- run_diamond(seq, compare = comp_bi)
```

The code above would generate a list of two data frames with DIAMOND tables: 
one named `spA_spB`, and another one named `spB_spA`. 

To demonstrate what this looks like, we will load the example data set 
we will use to detect intergenome synteny. Here, we will detect syntenic
regions between the genomes of the algae *Ostreococcus lucimarinus* and 
*Ostreococcus sp RCC809*. The list of DIAMOND tables is stored in object
`blast_list`, and we will create the processed annotation from objects
`proteomes` and `annotation`.

```{r example_inter}
# Load list of DIAMOND tables
data(blast_list)
names(blast_list)

algae_inter <- blast_list[c(2,3)] # keep only intergenome comparisons
names(algae_inter)


# Get processed annotation
data(proteomes)     # A list of `AAStringSet` objects
data(annotation)    # A `GRangesList` object

pdata <- process_input(proteomes, annotation)
names(pdata$annotation)
```

Now that we have a list of DIAMOND tables with bidirectional comparisons 
between *Olucimarinus* and *OspRCC809*, as well as annotation for these two
genomes, we can detect synteny with `interspecies_synteny()`. Like
`intraspecies_synteny()`, this function will return the path to 
a *.collinearity* file generated by MCScanX, and we can parse this file
with `parse_collinearity()`.

```{r inter_syn}
# Detect interspecies synteny
intersyn <- interspecies_synteny(algae_inter, pdata$annotation)

intersyn # see where the .collinearity file is

# Parse collinearity file
## 1) Get anchor pairs
algae_anchors <- parse_collinearity(intersyn)
head(algae_anchors)

## 2) Get synteny blocks
algae_blocks <- parse_collinearity(intersyn, as = "blocks")
head(algae_blocks)

## 3) Get all data combined (blocks + anchor pairs)
algae_all <- parse_collinearity(intersyn, as = "all")
head(algae_all)
```


# Session information {.unnumbered}

This document was created under the following conditions:

```{r sessionInfo}
sessionInfo()
```

# References {.unnumbered}