% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/similarity.R
\name{similarity}
\alias{similarity}
\title{Construct a Matrix of Pairwise Set Similarity Coefficients}
\usage{
similarity(x, type = c("jaccard", "overlap", "otsuka"))
}
\arguments{
\item{x}{a named list of sets. Elements must be of type \code{"character"}.}

\item{type}{character; the type of similarity measure to use. One of
\code{"jaccard"}, \code{"overlap"}, or \code{"otsuka"}. May be abbreviated.}
}
\value{
A symmetric \code{\link[Matrix:dgCMatrix-class]{dgCMatrix}}
containing all pairwise set similarity coefficients.
}
\description{
Construct a sparse matrix containing the similarity coefficient of
every pair of sets in a list.
}
\section{Set Similarity}{


If \eqn{A} and \eqn{B} are sets, we define the Jaccard similarity
coefficient \eqn{J} as the size of their intersection divided by the size
of their union (Jaccard, 1912):

\deqn{\text{J}(A, B) = \frac{|A \cap B|}{|A \cup B|}}

The overlap coefficient is defined as the size of the intersection divided
by the size of the smaller set (Simpson, 1943, 1947, 1960; Fallaw, 1979):

\deqn{\text{Overlap}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}}

The Ōtsuka coefficient is defined as the size of the intersection divided
by the geometric mean of the set sizes (Ōtsuka, 1936), which is equivalent
to the cosine similarity of two bit vectors:

\deqn{\text{Ōtsuka}(A, B) = \frac{|A \cap B|}{\sqrt{|A| \times |B|}}}
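
As a worked illustration (a sketch using the sets \code{A} and \code{B} from
the Examples section), \eqn{|A| = 5}, \eqn{|B| = 4}, \eqn{|A \cap B| = 2},
and \eqn{|A \cup B| = 7}, so that:

\deqn{\text{J}(A, B) = \frac{2}{7} \approx 0.286, \quad
  \text{Overlap}(A, B) = \frac{2}{4} = 0.5, \quad
  \text{Ōtsuka}(A, B) = \frac{2}{\sqrt{5 \times 4}} \approx 0.447}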

The Jaccard and Ōtsuka coefficients equal 1 only for aliased sets (sets that
contain the same elements but have different names), so they can identify
aliases. The overlap coefficient equals 1 whenever one set is contained in
the other, so it identifies both aliased sets and subsets. Aliases and
subsets cannot be distinguished from the overlap coefficients alone; this
also requires the set sizes or the matrix of Jaccard (or Ōtsuka)
coefficients.
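
For instance, with the sets from the Examples section, \code{B} and \code{C}
contain the same elements, so all three coefficients equal 1 for that pair.
\code{D} is a proper subset of \code{A} (\eqn{|A| = 5}, \eqn{|D| = 3},
\eqn{|A \cap D| = 3}, \eqn{|A \cup D| = 5}), so only the overlap coefficient
equals 1:

\deqn{\text{Overlap}(A, D) = \frac{3}{3} = 1, \quad
  \text{J}(A, D) = \frac{3}{5} = 0.6, \quad
  \text{Ōtsuka}(A, D) = \frac{3}{\sqrt{5 \times 3}} \approx 0.775}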

Notice the relationship between the similarity coefficients:

\deqn{0 \leq \text{J}(A, B) \leq \text{Ōtsuka}(A, B) \leq
  \text{Overlap}(A, B) \leq 1}
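
This ordering can be justified by comparing the denominators (a short
sketch, assuming both sets are nonempty). The geometric mean of two positive
numbers lies between their minimum and maximum, and neither set can be
larger than the union:

\deqn{\min(|A|, |B|) \leq \sqrt{|A| \times |B|} \leq \max(|A|, |B|) \leq
  |A \cup B|}

Since all three coefficients share the numerator \eqn{|A \cap B|}, a larger
denominator gives a smaller coefficient, which reverses the order.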
}

\section{Optimization}{


Coefficients are computed only for pairs of sets with nonempty
intersections, and only in the lower triangular part of the matrix. As such,
\code{similarity} is efficient even for large similarity matrices, and it
is especially efficient for sparse similarity matrices.
}

\examples{
x <- list("A" = c("a", "b", "c", "d", "e"),
          "B" = c("d", "e", "f", "g"), # overlaps with A
          "C" = c("d", "e", "f", "g"), # aliased with B
          "D" = c("a", "b", "c")) # subset of A

similarity(x) # Jaccard coefficients

similarity(x, type = "overlap") # overlap coefficients

similarity(x, type = "otsuka") # Ōtsuka coefficients
}
\references{
Jaccard, P. (1912). The distribution of the flora in the alpine
zone. \emph{The New Phytologist, 11}(2), 37–50.
doi:\href{https://doi.org/10.1111/j.1469-8137.1912.tb05611.x}{10.1111/j.1469-8137.1912.tb05611.x}.
\url{https://www.jstor.org/stable/2427226}

Ōtsuka, Y. (1936). The faunal character of the Japanese Pleistocene marine
Mollusca, as evidence of the climate having become colder during the
Pleistocene in Japan. \emph{Bulletin of the Biogeographical Society of
Japan, 6}(16), 165–170.

Simpson, G. G. (1943). Mammals and the nature of continents. \emph{American
Journal of Science, 241}(1), 1–31.

Simpson, G. G. (1947). Holarctic mammalian faunas and continental
relationships during the Cenozoic. \emph{Bulletin of the Geological Society
of America, 58}(7), 613–688.

Simpson, G. G. (1960). Notes on the measurement of faunal resemblance.
\emph{American Journal of Science, 258-A}, 300–311.

Fallaw, W. C. (1979). A test of the Simpson coefficient and other binary
coefficients of faunal similarity. \emph{Journal of Paleontology, 53}(4),
1029–1034. \url{http://www.jstor.org/stable/1304126}
}
\seealso{
\code{\link{sparseIncidence}}, \code{\link{clusterSets}}
}
