% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/createClusterMST.R
\name{createClusterMST}
\alias{createClusterMST}
\alias{createClusterMST,ANY-method}
\alias{createClusterMST,SummarizedExperiment-method}
\alias{createClusterMST,SingleCellExperiment-method}
\title{Minimum spanning trees on cluster centroids}
\usage{
createClusterMST(x, ...)

\S4method{createClusterMST}{ANY}(
  x,
  clusters,
  use.median = FALSE,
  outgroup = FALSE,
  outscale = 1.5,
  endpoints = NULL,
  columns = NULL,
  dist.method = c("simple", "scaled.full", "scaled.diag", "slingshot", "mnn"),
  with.mnn = FALSE,
  mnn.k = 50,
  BNPARAM = NULL,
  BPPARAM = NULL
)

\S4method{createClusterMST}{SummarizedExperiment}(x, ..., assay.type = "logcounts")

\S4method{createClusterMST}{SingleCellExperiment}(
  x,
  clusters = colLabels(x, onAbsence = "error"),
  ...,
  use.dimred = NULL
)
}
\arguments{
\item{x}{A numeric matrix of coordinates where each row represents a cell/sample and each column represents a dimension
(usually a PC or another low-dimensional embedding, but features or genes can also be used).

Alternatively, a \linkS4class{SummarizedExperiment} or \linkS4class{SingleCellExperiment} object
containing such a matrix in its \code{\link{assays}}, as specified by \code{assay.type}.
This will be transposed prior to use.

Alternatively, for \linkS4class{SingleCellExperiment}s, this matrix may be extracted from its \code{\link{reducedDims}},
based on the \code{use.dimred} specification.
In this case, no transposition is performed.

Alternatively, if \code{clusters=NULL}, a numeric matrix of coordinates for cluster centroids,
where each row represents a cluster and each column represents a dimension.
Each row should be named with the cluster name.
This mode can also be used with assays/matrices extracted from SummarizedExperiments and SingleCellExperiments.}

\item{...}{For the generic, further arguments to pass to the specific methods.

For the SummarizedExperiment method, further arguments to pass to the ANY method.

For the SingleCellExperiment method, further arguments to pass to the ANY method
(if \code{use.dimred} is specified) or the SummarizedExperiment method (otherwise).}

\item{clusters}{A factor-like object of the same length as \code{nrow(x)},
specifying the cluster identity for each cell in \code{x}.
If \code{NULL}, \code{x} is assumed to already contain coordinates for the cluster centroids.

Alternatively, a matrix with number of rows equal to \code{nrow(x)}, 
containing soft assignment weights for each cluster (column).
All weights should be positive and sum to 1 for each row.}

\item{use.median}{A logical scalar indicating whether cluster centroid coordinates should be computed using the median rather than mean.}

\item{outgroup}{A logical scalar indicating whether an outgroup should be inserted to split unrelated trajectories.
Alternatively, a numeric scalar specifying the distance threshold to use for this splitting.}

\item{outscale}{A numeric scalar specifying the scaling of the median distance between centroids,
used to define the threshold for outgroup splitting.
Only used if \code{outgroup=TRUE}.}

\item{endpoints}{A character vector of clusters that must be endpoints, i.e., nodes of degree 1 or lower in the MST.}

\item{columns}{A character, logical or integer vector specifying the columns of \code{x} to use.
If \code{NULL}, all provided columns are used by default.}

\item{dist.method}{A string specifying the distance measure to be used, see Details.}

\item{with.mnn}{Logical scalar, deprecated; use \code{dist.method="mnn"} instead.}

\item{mnn.k}{An integer scalar specifying the number of nearest neighbors to consider for the MNN-based distance calculation when \code{dist.method="mnn"}.
See \code{\link[BiocNeighbors]{findMutualNN}} for more details.}

\item{BNPARAM}{A BiocNeighborParam object specifying how the nearest-neighbor search should be performed when \code{dist.method="mnn"},
see the \pkg{BiocNeighbors} package for more details.}

\item{BPPARAM}{A BiocParallelParam object specifying whether the nearest neighbor search should be parallelized when \code{dist.method="mnn"},
see the \pkg{BiocParallel} package for more details.}

\item{assay.type}{An integer or string specifying the assay to use from a SummarizedExperiment \code{x}.}

\item{use.dimred}{An integer or string specifying the reduced dimensions to use from a SingleCellExperiment \code{x}.}
}
\value{
A \link{graph} object containing an MST computed on the cluster centroids.
Each node corresponds to a cluster centroid and has a numeric vector of coordinates in the \code{coordinates} attribute.
The edge weight is set to the distance between centroids (Euclidean by default, see \code{dist.method}) and the confidence is stored as the \code{gain} attribute.
}
\description{
Build a minimum spanning tree (MST) where each node is a cluster centroid and
each edge is weighted by the Euclidean distance between centroids.
This represents the most parsimonious explanation for a particular trajectory
and has the advantage of being directly interpretable with respect to any pre-existing clusters.
}
\section{Computing the centroids}{

By default, the cluster centroid is defined by taking the mean value across all of its cells for each dimension.
If \code{clusters} is a matrix, a weighted mean is used instead.
This treats the column of weights as fractional identities of each cell to the corresponding cluster.

If \code{use.median=TRUE}, the median across all cells in each cluster is used to compute the centroid coordinate for each dimension.
(With a matrix-like \code{clusters}, a weighted median is calculated.)
This protects against outliers but is less stable than the mean.
Enabling this option is advisable if one observes that the default centroid is not located near any of its points due to outliers.
Note that a centroid computed in this manner is not a true medoid, as the latter is more difficult to compute.
}

\section{Introducing an outgroup}{

If \code{outgroup=TRUE}, we add an outgroup to avoid constructing a trajectory between \dQuote{unrelated} clusters (Street et al., 2018).
This is done by adding an extra row/column to the distance matrix corresponding to an artificial outgroup cluster,
where the distance to all of the other real clusters is set to \eqn{\omega/2}.
Large jumps in the MST between real clusters that are more distant than \eqn{\omega} will then be rerouted through the outgroup,
allowing us to break up the MST into multiple subcomponents (i.e., a minimum spanning forest) by removing the outgroup.

The default \eqn{\omega} value is computed by constructing the MST from the original distance matrix,
computing the median edge length in that MST, and then scaling it by \code{outscale}.
This adapts to the magnitude of the distances and the internal structure of the dataset
while also providing some margin for variation across cluster pairs.
The default \code{outscale=1.5} will break any branch that is more than 50\% longer than the median length.
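
As a rough sketch of this default in symbols (the notation \eqn{s}, \eqn{T} and \eqn{d_e} is introduced here for illustration only),
let \eqn{s} be \code{outscale}, \eqn{T} be the set of edges in the MST built from the original distance matrix,
and \eqn{d_e} be the length of edge \eqn{e}. Then:
\deqn{\omega = s \times \mathrm{median}_{e \in T} d_e}{omega = s * median(d_e for e in T)}
For example, a median edge length of 2 with \code{outscale=1.5} gives \eqn{\omega = 3}.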

Alternatively, \code{outgroup} can be set to a numeric scalar in which case it is used directly as \eqn{\omega}.
}

\section{Forcing endpoints}{

If certain clusters are known to be endpoints (e.g., because they represent terminal states), we can specify them in \code{endpoints}.
This ensures that the returned graph will have such clusters as nodes of degree 1, i.e., they terminate the path.
The function uses an exhaustive search to identify the MST with these constraints.
If no configuration can be found, an error is raised; this will occur if all nodes are specified as endpoints, for example.

If \code{outgroup=TRUE}, the function is allowed to connect two endpoints together to create a two-node subcomponent.
This will result in the formation of a minimum spanning forest if there are more than two clusters in \code{x}.
Of course, if there are only two nodes and both are specified as endpoints, a two-node subcomponent will be formed regardless of \code{outgroup}.

Note that edges involving endpoint nodes will have infinite confidence values (see below).
This reflects the fact that they are forced to exist during graph construction.
}

\section{Confidence on the edges}{

For the MST, we obtain a measure of the confidence in each edge by computing the distance gained if that edge were not present.
Ambiguous parts of the tree will be less penalized by the deletion of an edge, manifesting as a small distance gain.
In contrast, parts of the tree with clear structure will receive a large distance gain upon deletion of an obvious edge.

For each edge, we divide the distance gain by the length of the edge to normalize for cluster resolution.
This avoids overly penalizing edges in parts of the tree involving broad clusters
while still retaining sensitivity to detect distance gain in overclustered regions.
As an example, a normalized gain of unity for a particular edge means that its removal
requires an alternative path that increases the distance travelled by that edge's length.

The normalized gain is reported as the \code{"gain"} attribute in the edges of the MST from \code{\link{createClusterMST}}.
Note that the \code{"weight"} attribute represents the edge length.
}

\section{Distance measures}{

Distances between cluster centroids may be calculated in multiple ways:
\itemize{
\item The default is \code{"simple"}, which computes the Euclidean distance between cluster centroids.
\item With \code{"scaled.diag"}, we downscale the distance between the centroids by the sum of the variances of the two corresponding clusters (i.e., the diagonal of the covariance matrix).
This accounts for the cluster \dQuote{width} by reducing the effective distances between broad clusters.
\item With \code{"scaled.full"}, we repeat this scaling with the full covariance matrix.
This accounts for the cluster shape by considering correlations between dimensions, but cannot be computed when there are fewer cells than dimensions.
\item The \code{"slingshot"} option will typically be equivalent to the \code{"scaled.full"} option, 
but switches to \code{"scaled.diag"} in the presence of small clusters (fewer cells than dimensions in the reduced dimensional space). 
\item For \code{"mnn"}, see the more detailed explanation below.
}

If \code{clusters} is a matrix and \code{dist.method} is \code{"scaled.diag"}, \code{"scaled.full"} or \code{"slingshot"},
a weighted covariance is computed to account for the assignment ambiguity.
In addition, a warning will be raised if \code{use.median=TRUE} for these choices of \code{dist.method};
the Mahalanobis distances will not be correctly computed when the centers are medians instead of means.
}

\section{Alternative distances with MNN pairs}{

While distances between centroids are usually satisfactory for gauging cluster \dQuote{closeness}, 
they do not consider the behavior at the boundaries of the clusters.
Two clusters that are immediately adjacent (i.e., intermingling at the boundaries) may have a large distance between their centroids
if the clusters themselves span a large region of the coordinate space.
This may preclude the obvious edge from forming in the MST.

In such cases, we can use an alternative distance calculation based on the distance between mutual nearest neighbors (MNNs).
An MNN pair is defined as two cells in separate clusters that are each other's nearest neighbors in the other cluster.
For each pair of clusters, we identify all MNN pairs and compute the median distance between them.
This distance is then used in place of the distance between centroids to construct the MST.
In this manner, we focus on cluster pairs that are close at their boundaries rather than at their centers.

This mode can be enabled by setting \code{dist.method="mnn"}, while the stringency of the MNN definition can be set with \code{mnn.k}.
Similarly, the performance of the nearest neighbor search can be controlled with \code{BNPARAM} and \code{BPPARAM}.
Note that this mode performs a cell-based search and so cannot be used when \code{x} already contains aggregated profiles.
}

\examples{
# Mocking up a Y-shaped trajectory.
centers <- rbind(c(0,0), c(0, -1), c(1, 1), c(-1, 1))
rownames(centers) <- seq_len(nrow(centers))
clusters <- sample(nrow(centers), 1000, replace=TRUE)
cells <- centers[clusters,]
cells <- cells + rnorm(length(cells), sd=0.5)

# Creating the MST:
mst <- createClusterMST(cells, clusters)
plot(mst)
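
# A small sketch of inspecting the edge attributes described above.
# This assumes that the igraph package is available, as the MST is
# returned as an igraph object; 'weight' holds the edge lengths and
# 'gain' holds the normalized confidence for each edge.
igraph::E(mst)$weight
igraph::E(mst)$gain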

# We could also do it on the centers:
mst2 <- createClusterMST(centers, clusters=NULL)
plot(mst2)
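
# Sketch of splitting unrelated trajectories with an outgroup.
# As an illustrative assumption, a hypothetical distant cluster "5"
# is appended to the centers above; it is expected to end up in its
# own subcomponent, so more than one component should be reported.
centers2 <- rbind(centers, "5"=c(10, 10))
forest <- createClusterMST(centers2, clusters=NULL, outgroup=TRUE)
igraph::components(forest)$no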

# Works if the expression matrix is in a SE:
library(SummarizedExperiment)
se <- SummarizedExperiment(t(cells), colData=DataFrame(group=clusters))
mst3 <- createClusterMST(se, se$group, assay.type=1)
plot(mst3)
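
# Sketch of forcing an endpoint. Cluster "1" sits at the branch point
# of the Y-shaped trajectory, so requiring it to be an endpoint should
# change the topology; its degree can then be checked with igraph
# (assumed to be available, as above).
mst4 <- createClusterMST(centers, clusters=NULL, endpoints="1")
igraph::degree(mst4)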

}
\references{
Ji Z and Ji H (2016).
TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis.
\emph{Nucleic Acids Res.} 44, e117.

Street K et al. (2018).
Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. 
\emph{BMC Genomics} 19, 477.
}
\author{
Aaron Lun
}
