% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/testMotifEnrich.R
\name{testMotifEnrich}
\alias{testMotifEnrich}
\title{Test motif enrichment using a background set of sequences}
\usage{
testMotifEnrich(
  pwm,
  stringset,
  bg,
  var = "iteration",
  model = c("quasipoisson", "hypergeometric", "poisson", "iteration"),
  sort_by = c("p", "none"),
  mc.cores = 1,
  prior.count = 1,
  seed = 100,
  ...
)
}
\arguments{
\item{pwm}{A Position Weight Matrix or list of PWMs}

\item{stringset}{An XStringSet with equal sequence widths}

\item{bg}{An XStringSet with the same sequence widths as the test XStringset}

\item{var}{A column in the mcols element of bg, usually denoting an iteration
number}

\item{model}{The model used for analysis}

\item{sort_by}{Column to sort results by}

\item{mc.cores}{Passed to \link[parallel]{mclapply}}

\item{prior.count}{Added to all counts to better manage zero counts in
background sequences. For analysis under QuasiPoisson models prior counts
are added as Poisson noise using this value as expected counts}

\item{seed}{Used for reproducibility when adding Poisson noise}

\item{...}{Passed to \link{getPwmMatches} or \link{countPwmMatches}}
}
\value{
A data.frame with columns: \code{sequences}, \code{matches}, \code{expected}, \code{enrichment},
and \code{p}, with additional columns \code{Z}, \code{est_bg_rate} (Poisson), \code{odds_ratio}
(Hypergeometric) or \code{Z}, \code{sd_bg}, \code{n_iter} and \code{iter_p} (Iterations).
The numbers of sequences and matches refer to the test set of sequences,
whilst expected is the expected number of matches under the Poisson or
iterative null distribution. The ratio of matches to expected is given as
\code{enrichment}, along with the Z score and p-value. Whilst the Z score is only
representative under the Poisson model, it is used to directly estimate
p-values under the iterative approach.
Under this latter approach, the sd of the matches found in the background
sequences is also given, along with the number of iterations and the p-values
from permutations testing the one-sided hypothesis hypothesis for enrichment.

It may also be worth noting that if producing background sequences using
\link{makeRMRanges} with \code{replace = TRUE} and \code{force_ol = TRUE}, the
iterative model corresponds to a bootstrap, given that the test sequences
will overlap the background sequences and background ranges are able to be
sampled with replacement.
}
\description{
Test for motif enrichment within a set of sequences using a background set to
derive a NULL hypothesis
}
\details{
This function offers four alternative models for assessing the enrichment of
a motif within a set of sequences, in comparison to a background set of
sequences.
Selection of the BG sequences plays an important role and, in conjunction
with the question being addressed, determines the most appropriate model to
use for testing, as described below.

It should also be noted that the larger the BG set of sequences, the larger
the computational burden, and results can take far longer to return. For
many millions of background sequences, this may run beyond an hour
\subsection{Descriptions of Models and Use Cases}{
\subsection{Hypergeometric Tests}{

Hypergeometric tests are best suited to the use case where the test set of
sequences represents a subset of a larger set, with a specific feature or
behaviour, whilst the BG set may be the remainder of the set without that
feature. For example, the test set may represent ChIP-Seq binding sites
where signal changes in response to treatment, whilst the BG set represents
the sites where no changed signal was observed. Testing is one-sided, for
enrichment of motifs within the test set.

Due to these relatively smaller sized datasets, setting
model = "hypergeometric", will generally return results quickly
}

\subsection{Poisson Tests}{

This approach requires a set of background sequences which should be much
larger than the test set of sequences.
The parameters for a Poisson model are estimated in a per-sequence manner on
the set of BG sequences, and the observed rate of motif-matches within the
test set is then tested using \link[stats]{poisson.test}.
Testing is two-sided.

This approach assumes that all matches follow a Poisson distribution, which
is often true, but data can also be over-dispersed. Given that this model can
also return results relatively quickly, is it primarily suitable for data
exploration, such as quickly checking for expected behaviours, but not for
final results.
}

\subsection{Quasi-Poisson Test}{

The quasipoisson model allows for over-dispersion and will return more
conservative results than using the standard Poisson model.
Under the method currently implemented here, BG sequences should be divided
into blocks (i.e. iterations), identical in size to the test set of sequences.
Model parameters are estimated per iteration across the BG set of sequences,
with the rate of matches in the test set being compared against these blocks.
This ensures more conservative results that if analysing test and bg
sequences as collections of individual sequences.

It is expected that the BG set will matched for the features of interest and
chosen using \link{makeRMRanges} with a large number of iterations, e.g.
n_iter = 1000.
Due to this parameterisation, quasipoisson approaches can be computationally
time-consuming, as this is effectively an iterative approach.
Testing is two-sided.
}

\subsection{Iteration}{

Setting the model as "iteration" performs a non-parametric analysis, with
the exception of returning Z-scores under the Central Limit Theorem.
Mean and SD of matches is found for each iteration, and used to return Z
scores, with p-values returned from both a Z-test and from comparing
observed values directly to sampled values obtained from the BG sequences.
Sampled values are calculated directly and as such, are limited in precision.

As for the QuasiPoisson model, a very large number of iterations is expected
to be used, to ensure the CLT holds, again making this a computationally
demanding test.
Each iteration/block is expected to be identically-sized to the test set,
and matched for any features as appropriate using \code{\link[=makeRMRanges]{makeRMRanges()}}.
}

}
}
\examples{
## Load the example peaks & the sequences
data("ar_er_peaks")
data("ar_er_seq")
sq <- seqinfo(ar_er_peaks)
## Now sample size-matched ranges 10 times larger. In real-world analyses,
## this set should be sampled as at least 1000x larger, ensuring features
## are matched to your requirements. This example masks regions with known N
## content, including centromeres & telomeres
data("hg19_mask")
set.seed(305)
bg_ranges <- makeRMRanges(
  ar_er_peaks, GRanges(sq)[1], exclude = hg19_mask, n_iter = 10
)

## Convert ranges to DNAStringSets
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
bg_seq <- getSeq(genome, bg_ranges)

## Test for enrichment of the ESR1 motif
data("ex_pfm")
esr1 <- ex_pfm$ESR1
testMotifEnrich(esr1, ar_er_seq, bg_seq, model = "poisson")

## Test all motifs
testMotifEnrich(ex_pfm, ar_er_seq, bg_seq, model = "poisson")


}
\seealso{
\code{\link[=makeRMRanges]{makeRMRanges()}}, \code{\link[=getPwmMatches]{getPwmMatches()}}, \code{\link[=countPwmMatches]{countPwmMatches()}}
}
