knitr::opts_chunk$set(warning = FALSE, comment = NA,
                      fig.width = 6.25, fig.height = 5)
library(ANCOMBC)
library(tidyverse)
The data_sanity_check function performs essential validations on the input data to ensure its integrity before further processing. It verifies data types, confirms the structure of the input data, and checks for consistency between sample names in the metadata and the feature table, safeguarding against common data input errors.
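To make these kinds of checks concrete, here is a minimal base-R sketch. It is not the ANCOMBC implementation: the helper sketch_sanity_check() and the toy inputs are hypothetical, and the sketch only illustrates checking the input type, confirming that the formula variables exist in the metadata, and comparing sample names between the two tables.

```r
# Hypothetical helper, for illustration only: NOT the ANCOMBC implementation.
sketch_sanity_check = function(feature_table, meta_data, fix_formula) {
  # Input type: the feature table should be a matrix or a data.frame
  if (!(is.matrix(feature_table) || is.data.frame(feature_table))) {
    stop("feature_table must be a matrix or data.frame")
  }
  # Variables named in the formula must be columns of the metadata
  vars = all.vars(stats::as.formula(paste("~", fix_formula)))
  missing_vars = setdiff(vars, colnames(meta_data))
  if (length(missing_vars) > 0) {
    stop("Variables not found in metadata: ",
         paste(missing_vars, collapse = ", "))
  }
  # Sample names: columns of the feature table vs. rows of the metadata
  if (!setequal(colnames(feature_table), rownames(meta_data))) {
    stop("Sample names differ between feature table and metadata")
  }
  invisible(TRUE)
}

# Toy data (made up): 2 taxa x 3 samples
toy_counts = matrix(c(5, 0, 3, 2, 8, 1), nrow = 2,
                    dimnames = list(c("taxon_a", "taxon_b"),
                                    c("s1", "s2", "s3")))
toy_meta = data.frame(age = c(30, 45, 52),
                      sex = c("female", "male", "female"),
                      row.names = c("s1", "s2", "s3"))

# Passes: both variables are present and sample names agree
ok = sketch_sanity_check(toy_counts, toy_meta, "age + sex")

# Fails: bmi_group is not a column of the toy metadata
bad = tryCatch(sketch_sanity_check(toy_counts, toy_meta, "age + bmi_group"),
               error = function(e) conditionMessage(e))
```

Catching the error with tryCatch() keeps the example runnable from top to bottom while still showing the message a failed check would produce.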
Download package.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ANCOMBC")
Load the package.
phyloseq object
The HITChip Atlas dataset contains genus-level microbiota profiles, measured with HITChip, for 1006 Western adults with no reported health complications (Lahti et al. 2014). The dataset is available via the microbiome R package (Lahti et al. 2017) in phyloseq (McMurdie and Holmes 2013) format.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 130 taxa and 1151 samples ]
sample_data() Sample Data: [ 1151 samples by 10 sample variables ]
tax_table() Taxonomy Table: [ 130 taxa by 3 taxonomic ranks ]
List the taxonomic levels available for data aggregation.
[1] "Phylum" "Family" "Genus"
List the variables available in the sample metadata.
[1] "age" "sex" "nationality"
[4] "DNA_extraction_method" "project" "diversity"
[7] "bmi_group" "subject" "time"
[10] "sample"
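The chunks that produced the object summary and the two listings above are not shown here. One way to reproduce them, assuming the microbiome and phyloseq packages are installed, is sketched below; the accessor choices (rank_names() and sample_variables() from phyloseq) are one option among several, and the exact sample count may depend on the installed microbiome version.

```r
# Assumes the microbiome and phyloseq packages are installed.
library(phyloseq)
data(atlas1006, package = "microbiome")

# Print the phyloseq object summary
atlas1006

# Taxonomic ranks available for aggregation
rank_names(atlas1006)

# Variables available in the sample metadata
sample_variables(atlas1006)
```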
Data sanity and integrity check.
# With `group` variable
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
Checking the input data type ...
The input data is of type: phyloseq
PASS
Checking the sample metadata ...
The specified variables in the formula: age, sex, bmi_group
The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample
PASS
Checking other arguments ...
The number of groups of interest is: 6
The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22
PASS
# Without `group` variable
check_results = data_sanity_check(data = atlas1006,
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = NULL,
                                  struc_zero = FALSE,
                                  global = FALSE,
                                  verbose = TRUE)
Checking the input data type ...
The input data is of type: phyloseq
PASS
Checking the sample metadata ...
The specified variables in the formula: age, sex, bmi_group
The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample
PASS
Checking other arguments ...
PASS
tse object
List the taxonomic levels available for data aggregation.
[1] "Phylum" "Family" "Genus"
List the variables available in the sample metadata.
[1] "age" "sex" "nationality"
[4] "DNA_extraction_method" "project" "diversity"
[7] "bmi_group" "subject" "time"
[10] "sample"
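The tse object used below is not constructed in the text shown. A hedged sketch of one way to build it from the phyloseq object follows, assuming the microbiome and mia packages are installed. The name of the converter is an assumption to verify locally, since it has changed across mia releases, so the sketch looks it up rather than hard-coding it.

```r
# Assumes the microbiome and mia packages are installed.
data(atlas1006, package = "microbiome")

# Assumption: newer mia releases export convertFromPhyloseq(), older ones
# makeTreeSummarizedExperimentFromPhyloseq(). Use whichever is available.
converter_name = if ("convertFromPhyloseq" %in% getNamespaceExports("mia")) {
  "convertFromPhyloseq"
} else {
  "makeTreeSummarizedExperimentFromPhyloseq"
}
tse = getExportedValue("mia", converter_name)(atlas1006)

# Taxonomic ranks available for aggregation
mia::taxonomyRanks(tse)

# Variables available in the sample metadata
colnames(SummarizedExperiment::colData(tse))
```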
Data sanity and integrity check.
check_results = data_sanity_check(data = tse,
                                  assay_name = "counts",
                                  tax_level = "Family",
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
Checking the input data type ...
The input data is of type: TreeSummarizedExperiment
PASS
Checking the sample metadata ...
The specified variables in the formula: age, sex, bmi_group
The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample
PASS
Checking other arguments ...
The number of groups of interest is: 6
The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22
PASS
matrix or data.frame
Both abundance data and sample metadata are required for this import method.
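The objects abundance_data and meta_data used in this section are not defined in the text shown. One plausible way to obtain them from the atlas1006 phyloseq object, assuming the microbiome package is installed, is sketched here; other extraction routes would work equally well.

```r
# Assumes the microbiome package is installed.
data(atlas1006, package = "microbiome")

# Taxa-by-samples abundance matrix
abundance_data = microbiome::abundances(atlas1006)

# Sample metadata as a data.frame, one row per sample
meta_data = microbiome::meta(atlas1006)
```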
Note that aggregating taxa to higher taxonomic levels is not supported in this method. Ensure that the data is already aggregated to the desired taxonomic level before proceeding. If aggregation is needed, consider creating a phyloseq or tse object for importing.
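If the counts are only available as a plain matrix, aggregation can also be done by hand before importing. The following base-R sketch uses rowsum() on made-up data; the genus and family names and the lookup vector family_of are hypothetical and stand in for a real taxonomy table.

```r
# Toy genus-level counts (made up): 4 genera x 2 samples
genus_counts = matrix(c(10, 0, 4, 6,
                         2, 5, 1, 3),
                      nrow = 4,
                      dimnames = list(c("genus_a", "genus_b",
                                        "genus_c", "genus_d"),
                                      c("s1", "s2")))

# Hypothetical genus-to-family lookup, in the same order as the rows above
family_of = c("family_x", "family_x", "family_y", "family_y")

# Sum the counts of all genera that belong to the same family
family_counts = rowsum(genus_counts, group = family_of)
family_counts
```

The result has one row per family and keeps the sample columns unchanged, so it can be passed on as already-aggregated abundance data.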
Ensure that the rownames of the metadata correspond to the colnames of the abundance data.
[1] TRUE
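When this comparison does not return TRUE, the metadata rows can be reordered to follow the abundance columns. A small base-R sketch on made-up data shows the idea; the object names toy_abundance and toy_meta are hypothetical.

```r
# Toy inputs (made up): metadata rows are in a different order than the
# abundance columns.
toy_abundance = matrix(1:6, nrow = 2,
                       dimnames = list(c("taxon_a", "taxon_b"),
                                       c("s1", "s2", "s3")))
toy_meta = data.frame(age = c(52, 30, 45),
                      row.names = c("s3", "s1", "s2"))

# Same samples, but not in the same order
aligned_before = identical(rownames(toy_meta), colnames(toy_abundance))

# Reorder the metadata rows to follow the abundance columns
toy_meta = toy_meta[colnames(toy_abundance), , drop = FALSE]
aligned_after = all(rownames(toy_meta) == colnames(toy_abundance))
```

Indexing by name rather than by position means the reordering only works when every sample in the abundance data also appears in the metadata, which is worth confirming first with setequal().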
List the variables available in the sample metadata.
[1] "age" "sex" "nationality"
[4] "DNA_extraction_method" "project" "diversity"
[7] "bmi_group" "subject" "time"
[10] "sample"
Data sanity and integrity check.
check_results = data_sanity_check(data = abundance_data,
                                  assay_name = "counts",
                                  tax_level = "Family",
                                  meta_data = meta_data,
                                  fix_formula = "age + sex + bmi_group",
                                  group = "bmi_group",
                                  struc_zero = TRUE,
                                  global = TRUE,
                                  verbose = TRUE)
Checking the input data type ...
The input data is of type: matrix
The imported data is in a generic 'matrix'/'data.frame' format.
PASS
Checking the sample metadata ...
The specified variables in the formula: age, sex, bmi_group
The available variables in the sample metadata: age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample
PASS
Checking other arguments ...
The number of groups of interest is: 6
The sample size per group is: underweight = 21, lean = 484, overweight = 197, obese = 222, severeobese = 99, morbidobese = 22
PASS
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] doRNG_1.8.6 rngtools_1.5.2 foreach_1.5.2 DT_0.33
[5] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[13] ggplot2_3.5.1 tidyverse_2.0.0 ANCOMBC_2.8.0
loaded via a namespace (and not attached):
[1] splines_4.4.1 cellranger_1.1.0
[3] DirichletMultinomial_1.48.0 rpart_4.1.23
[5] lifecycle_1.0.4 Rdpack_2.6.1
[7] doParallel_1.0.17 lattice_0.22-6
[9] MASS_7.3-61 MultiAssayExperiment_1.32.0
[11] crosstalk_1.2.1 backports_1.5.0
[13] magrittr_2.0.3 Hmisc_5.2-0
[15] sass_0.4.9 rmarkdown_2.28
[17] jquerylib_0.1.4 yaml_2.3.10
[19] gld_2.6.6 DBI_1.2.3
[21] minqa_1.2.8 ade4_1.7-22
[23] multcomp_1.4-26 abind_1.4-8
[25] zlibbioc_1.52.0 expm_1.0-0
[27] Rtsne_0.17 GenomicRanges_1.58.0
[29] BiocGenerics_0.52.0 phyloseq_1.50.0
[31] yulab.utils_0.1.7 nnet_7.3-19
[33] TH.data_1.1-2 sandwich_3.1-1
[35] GenomeInfoDbData_1.2.13 IRanges_2.40.0
[37] S4Vectors_0.44.0 ggrepel_0.9.6
[39] irlba_2.3.5.1 tidytree_0.4.6
[41] vegan_2.6-8 rbiom_1.0.3
[43] microbiome_1.28.0 DelayedMatrixStats_1.28.0
[45] permute_0.9-7 codetools_0.2-20
[47] DelayedArray_0.32.0 scuttle_1.16.0
[49] energy_1.7-12 tidyselect_1.2.1
[51] UCSC.utils_1.2.0 farver_2.1.2
[53] viridis_0.6.5 ScaledMatrix_1.14.0
[55] lme4_1.1-35.5 gmp_0.7-5
[57] matrixStats_1.4.1 stats4_4.4.1
[59] base64enc_0.1-3 jsonlite_1.8.9
[61] BiocNeighbors_2.0.0 multtest_2.62.0
[63] e1071_1.7-16 decontam_1.26.0
[65] mia_1.14.0 Formula_1.2-5
[67] survival_3.7-0 scater_1.34.0
[69] iterators_1.0.14 tools_4.4.1
[71] treeio_1.30.0 DescTools_0.99.57
[73] Rcpp_1.0.13 glue_1.8.0
[75] gridExtra_2.3 SparseArray_1.6.0
[77] xfun_0.48 mgcv_1.9-1
[79] MatrixGenerics_1.18.0 GenomeInfoDb_1.42.0
[81] TreeSummarizedExperiment_2.14.0 withr_3.0.2
[83] numDeriv_2016.8-1.1 fastmap_1.2.0
[85] bluster_1.16.0 boot_1.3-31
[87] rhdf5filters_1.18.0 fansi_1.0.6
[89] rsvd_1.0.5 digest_0.6.37
[91] timechange_0.3.0 R6_2.5.1
[93] colorspace_2.1-1 lpSolve_5.6.21
[95] gtools_3.9.5 utf8_1.2.4
[97] generics_0.1.3 DECIPHER_3.2.0
[99] data.table_1.16.2 class_7.3-22
[101] CVXR_1.0-14 httr_1.4.7
[103] htmlwidgets_1.6.4 S4Arrays_1.6.0
[105] pkgconfig_2.0.3 gtable_0.3.6
[107] Exact_3.3 Rmpfr_0.9-5
[109] SingleCellExperiment_1.28.0 XVector_0.46.0
[111] htmltools_0.5.8.1 biomformat_1.34.0
[113] scales_1.3.0 Biobase_2.66.0
[115] lmom_3.2 knitr_1.48
[117] rstudioapi_0.17.1 tzdb_0.4.0
[119] reshape2_1.4.4 checkmate_2.3.2
[121] nlme_3.1-166 nloptr_2.1.1
[123] proxy_0.4-27 cachem_1.1.0
[125] zoo_1.8-12 rhdf5_2.50.0
[127] rootSolve_1.8.2.4 vipor_0.4.7
[129] parallel_4.4.1 foreign_0.8-87
[131] pillar_1.9.0 grid_4.4.1
[133] vctrs_0.6.5 slam_0.1-54
[135] BiocSingular_1.22.0 beachmat_2.22.0
[137] cluster_2.1.6 beeswarm_0.4.0
[139] htmlTable_2.4.3 evaluate_1.0.1
[141] mvtnorm_1.3-1 cli_3.6.3
[143] compiler_4.4.1 rlang_1.1.4
[145] crayon_1.5.3 labeling_0.4.3
[147] mediation_4.5.0 ggbeeswarm_0.7.2
[149] plyr_1.8.9 fs_1.6.4
[151] stringi_1.8.4 viridisLite_0.4.2
[153] BiocParallel_1.40.0 lmerTest_3.1-3
[155] munsell_0.5.1 Biostrings_2.74.0
[157] gsl_2.1-8 lazyeval_0.2.2
[159] Matrix_1.7-1 hms_1.1.3
[161] sparseMatrixStats_1.18.0 bit64_4.5.2
[163] Rhdf5lib_1.28.0 SummarizedExperiment_1.36.0
[165] highr_0.11 rbibutils_2.3
[167] igraph_2.1.1 RcppParallel_5.1.9
[169] bslib_0.8.0 bit_4.5.0
[171] readxl_1.4.3 ape_5.8
Lahti, Leo, Jarkko Salojärvi, Anne Salonen, Marten Scheffer, and Willem M De Vos. 2014. “Tipping Elements in the Human Intestinal Ecosystem.” Nature Communications 5 (1): 1–10.
Lahti, Leo, Sudarshan Shetty, T Blake, J Salojarvi, and others. 2017. “Tools for Microbiome Analysis in R.” Version 1: 10013.
McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PloS One 8 (4): e61217.