% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/countComboBarcodes.R
\name{countComboBarcodes}
\alias{countComboBarcodes}
\alias{matrixOfComboBarcodes}
\title{Count combinatorial barcodes}
\usage{
countComboBarcodes(
  fastq,
  template,
  choices,
  substitutions = 0,
  find.best = FALSE,
  strand = c("both", "original", "reverse"),
  num.threads = 1,
  indices = FALSE
)

matrixOfComboBarcodes(files, ..., withDimnames = TRUE, BPPARAM = SerialParam())
}
\arguments{
\item{fastq}{String containing the path to a FASTQ file containing single-end data,
or a connection object to such a file.}

\item{template}{A template for the barcode structure, see \code{?\link{parseBarcodeTemplate}} for details.}

\item{choices}{A \linkS4class{List} of character vectors, one per variable region in \code{template}.
The first vector should contain the potential sequences for the first variable region, 
the second vector for the second variable region and so on.}

\item{substitutions}{Integer scalar specifying the maximum number of substitutions to allow when considering a match.}

\item{find.best}{Logical scalar indicating whether to search each read for the best match.
Defaults to stopping at the first match.}

\item{strand}{String specifying which strand of the read to search.}

\item{num.threads}{Integer scalar specifying the number of threads to use to process a single file.}

\item{indices}{Logical scalar indicating whether integer indices should be used to define each combinatorial barcode.}

\item{files}{A character vector of paths to FASTQ files.}

\item{...}{Further arguments to pass to \code{countComboBarcodes}.}

\item{withDimnames}{A logical scalar indicating whether the rows and columns should be named.}

\item{BPPARAM}{A \linkS4class{BiocParallelParam} object specifying how parallelization is to be performed across files.}
}
\value{
\code{countComboBarcodes} returns a \linkS4class{DataFrame} where each row corresponds to a combinatorial barcode.
It contains \code{combinations}, a nested \linkS4class{DataFrame} that contains the sequences that define each combinatorial barcode;
and \code{counts}, an integer vector containing the frequency of each barcode.
The metadata contains \code{nreads}, an integer scalar of the total number of reads in \code{fastq}.

Each column of \code{combinations} corresponds to a single variable region in \code{template} and one vector in \code{choices}.
By default, the sequences are reported directly as character vectors.
If \code{indices=TRUE}, each column instead contains the indices of the sequences in the corresponding entry of \code{choices}.

\code{matrixOfComboBarcodes} returns a \linkS4class{SummarizedExperiment} containing:
\itemize{
\item An integer matrix named \code{"counts"}, containing counts for each combinatorial barcode in each file of \code{files}.
\item One or more vectors in the \code{rowData} that define each combinatorial barcode, equivalent to \code{combinations}.
\item Column metadata containing a character vector \code{files}, the path to each file;
an integer vector \code{nreads}, containing the total number of reads in each file;
and \code{nmapped}, containing the number of reads assigned to a barcode in the output count matrix.
}
If \code{withDimnames=TRUE}, row names are set to \code{"BARCODE_[ROW]"} and column names are set to \code{basename(files)}.
}
\description{
Count combinatorial barcodes for single-end screen sequencing experiments where entities are distinguished based on random combinations of a small pool of known sequences within a single template.
}
\details{
Certain screen sequencing experiments take advantage of combinatorial complexity to generate a very large pool of unique barcode sequences.
Only a subset of all possible combinatorial barcodes will be used in any given experiment.
This function only counts the combinations that are actually observed, improving efficiency over a more conventional approach (i.e., to generate all possible combinations and use \code{\link{countSingleBarcodes}} to count their frequency).

If \code{strand="both"}, the original read sequence will be searched first.
If no match is found, the sequence is reverse-complemented and searched again.
Other settings of \code{strand} will only search one or the other sequence.
The most appropriate choice depends on both the sequencing protocol and the design (i.e., position and length) of the barcode.

We can handle sequencing errors by setting \code{substitutions} to a value greater than zero.
This will consider substitutions in both the variable regions and the constant flanking regions.

By default, the function will stop at the first match that satisfies the requirements above.
If \code{find.best=TRUE}, we will instead try to find the best match with the fewest mismatches.
If there are multiple matches with the same number of mismatches, the read is discarded to avoid problems with ambiguity.
}
\examples{
# Creating an example dual barcode sequencing experiment.
known.pool <- c("AGAGAGAGA", "CTCTCTCTC",
    "GTGTGTGTG", "CACACACAC")

N <- 1000
barcodes <- sprintf("ACGT\%sACGT\%sACGT",
   sample(known.pool, N, replace=TRUE),
   sample(known.pool, N, replace=TRUE))
names(barcodes) <- seq_len(N)

library(Biostrings)
tmp <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(barcodes), filepath=tmp, format="fastq")

# Counting the combinations.
output <- countComboBarcodes(tmp,
    template="ACGTNNNNNNNNNACGTNNNNNNNNNACGT",
    choices=list(first=known.pool, second=known.pool))
output$combinations
head(output$counts)
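
# A sketch of reporting indices into 'choices' instead of the sequences.
# This reuses 'tmp' and 'known.pool' from above; the name 'by.index'
# is only illustrative.
by.index <- countComboBarcodes(tmp,
    template="ACGTNNNNNNNNNACGTNNNNNNNNNACGT",
    choices=list(first=known.pool, second=known.pool),
    indices=TRUE)
by.index$combinations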

matrixOfComboBarcodes(c(tmp, tmp),
    template="ACGTNNNNNNNNNACGTNNNNNNNNNACGT",
    choices=list(first=known.pool, second=known.pool))
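
# A sketch of tolerating sequencing errors: allow up to one substitution
# and search each read for its best match. The simulated reads above are
# error-free, so this only illustrates the call; 'tolerant' is an
# illustrative name.
tolerant <- countComboBarcodes(tmp,
    template="ACGTNNNNNNNNNACGTNNNNNNNNNACGT",
    choices=list(first=known.pool, second=known.pool),
    substitutions=1, find.best=TRUE)
head(tolerant$counts)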
}
\author{
Aaron Lun
}
