This document will go through the analysis of single and multi-omics data using the RFLOMICS package. It will be shown how to set up the statistical framework of the designed protocol: i.e. translate biological condition comparisons into contrasts, perform quality control, data pre-processing and statistical analysis of each omics dataset: i.e. find differentially expressed genes/proteins/metabolites (DEF) per contrast, identify clusters of co-expressed DEF and perform functional annotation (over representation analysis) of the lists of DEF or clusters of co-expressed DEF using GO, KEGG or a custom annotation. Then, the vignette will show how to explore the common variance/correlation between the different omics datasets and simultaneously identify key features (genes/proteins/metabolites) that best correlate with biological factors.
RFLOMICS 0.99.7
The acquisition of multi-omics data with complex experimental designs is a widely adopted practice to identify entities and unravel the biological processes they are involved in.
Exploring each omics layer usually serves as an initial step to investigate and extract relevant biological variability. Statistical integration can then focus on pertinent omics layers and features.
Most existing solutions for the analysis of such data are command-line based, which limits accessibility for many users. Handling many packages, methods and their results can become a burden even for experienced R-users. Interface solutions exist, like ExploreModelMatrix, pcaExplorer MatrixQCvis or ideal, which each allows perform one particular step of the analysis, or MSstatsShiny, which is designed for one particular omics type.
Here we present RFLOMICS, an R package with a Shiny interface, designed to ensure a guided, comprehensive analysis and visualization of transcriptomics, proteomics, or metabolomics data. RFLOMICS covers the entire process from defining the statistical model to multi-omics integration, all within a single application.
RFLOMICS uses extended classes of the MultiAssayExperiment
and SummarizedExperiment
classes to ensure efficient data management of heterogeneous data.
A flexible HTML report can be generated at any stage of the analysis, providing results, methods, and settings to facilitate the analysis reproducibility. Additionally, users can download an archive containing all produced results.
You can install the latest RFLOMICS development from github:
# BiocManager::install("RFLOMICS/RFLOMICS")
Then you can access the interface with the following commands:
library(RFLOMICS)
This part of the vignette will follow the different steps of a multi-omics analysis trough the RFLOMICS interface which are:
They will appear in the left tab menu of the interface along with the analysis progression. Whereas single omics analysis steps will be displayed in tab panel associated to each dataset .
The interface can be accessed using:
# runRFLOMICS()
Data loading starts by setting a project name (it is mandatory), then load the experimental design and finally the different omics datasets.
Data used for this vignette have been provided by Pr. Loic Rajjou and Gwendal Cueff. It is included in the ‘ExamplesFiles’ directory of the package. Briefly, A. thaliana’s transcriptome, metabolome and proteome have been obtained in the context of the study of seed germination and vigor. In particular, the authors were interested in deciphering key entities involved in response to environmental stresses (on the mother plant): influences of temperature (high, medium and low) and imbibition stage (Dry: DI, early imbibition: EI and late imbibition: LI). See(https://www.uibk.ac.at/botany/ecoseed/home/)
The experimental design file must contain experimental information/conditions for each sample in a tab separated values format (see RFLOMICS-input-data.html). As soon as this file is loaded, for each design factor, we have to set up the factor type (Biological factor: Bio; batch factor: batch; metadata information: meta) and a reference level.
It is required to have a minimum of 1 biological factor (maximum 3) and 1 batch factor. Order of the columns does not matter.
It is possible to remove factor’s conditions from the design. Associated samples will be excluded from the analysis even if they are present in the omics dataset matrix.
It is also possible to re-order factor modalities
Experimental data must be loaded as matrix files in a tab separated values format. (for more information see RFLOMICS-input-data.html). For each dataset it is mandatory to specify the type of omics (transcriptomics, proteomics, metabolomics ) as it will adapt the data processing and analysis methods. By default, datasets will be called set1, set2, set3, …, respectively. A more suitable name can be given by changing ‘Dataset name’ in the interface.
To load the 3 examples datasets of the Ecoseed project: RNAseq data (read counts), proteomics data (abundance of proteins), and metabolomics data (intensity of metabolites) (see RFLOMICS-input-data.html), please press the ‘Load Ecoseed Data’.
As soon as the data are loaded the RFLOMICS MultiAssayExperiment object is created. The data overview panel displays the samples overlap across datasets and the number of datasets per combination of biological factors.
The interface provides models written from the simplest one to the most complete one. Only the second order interaction terms between the biological factors appear. Batch factors never appear in interaction terms. In our example, the complete model is chosen.
All ‘a priori’ hypotheses (contrasts) are then calculated from the chosen model and displayed. The user must select those that correspond to the biological questions. Contrasts are numbered. These labels will follow for all analysis.
In our example we are studying the effect of temperature on seed germination (transition from the state of dormancy to the state of germination vigor). For that we select 3 average contrasts (see image):
This statistical framework will be applied on all loaded datasets.
Once the contrasts are chosen, 3 Item menus appear in the side bar menu that correspond to the 3 loaded omics types.
For each dataset, different steps of data analysis are proposed as a tab panel. These analysis steps must be performed sequentially. It is possible to switch between datasets. However, you cannot perform a new task until the previous one is completed. A progress bar is displayed indicating the progress status.
In this tab panel we access a data quality control screen and we perform the appropriate data filtering and processing according to the type of omics.
By default, features with 0 count in all conditions are removed from the data.
Specifically for each data we have the possibility to exclude samples or to select a subset of samples to analyze, provided the completeness condition is met.
The design must be complete and as possible balanced. A Complete design (mandatory) mean that all possible combinations of biological factors’ levels (called groups) are presents. A Balanced design (recommended) has an equal number of observations (replicates) for each group.
These constraints are checked and represented as graph (each square represent 1 group and each color level indicate the number of observations).
The effect of data filtering or sample removing step can be explored thanks to many QC graphs. For each QC, a graph is built on each raw data and filtered data to allow the comparison.
For RNAseq data, low expressed genes are filtered so that only genes whose cpm expression is greater than “CPM cutoff” in x samples are kept. The value of x is given by “Filtering strategy” (NbOfsample_over_cpm <= NbConditions). By default the data are filtered based on NbCondition strategy with CPM cutoff equal to 5. In this example, there are 9 conditions. The genes with a CPM less than 5 in at least 9 samples are removed.
To correct for differences in sequencing depths (library size), RFLOMICS proposes one method for normalization, “TMM method” from the edgeR package.
For RNAseq data the QC graphs are drawn before and after the filtering and normalization steps. The following graphs are displayed:
In this tab panel, we can run the differential expression analysis. The analysis
is run by contrast.
For RNAseq data, the glmFit() function from the edgeR package is used
whereas for proteomics and metabolomics data, the lmFit() function from limma
packages is used.
Results will appear in two components: a scrolling menu will give the results of the analysis by contrast and, at the bottom of the panel, the intersecting sets of the lists of DEG is given thanks to the UpSetR() function.
Each results has to be validated:
You have to to check the Pvalue distribution for each contrast and validate it (ok checkbox).
You can filter differentially expressed genes/proteins/metabolites either by fold change (|FC|) or by False Discovery Rate (BH) cut-off. Graphs will be automatically updated.
You have to validate to save the cut-off thresholds but also the contrasts result selection and pass to the co-expression analysis or to the annotation enrichment step.
In this tab panel, choice is given to merge (union) or intersect
(intersection) the lists of differentially expressed genes/proteins/metabolites
associated to contrasts.
Co-expression analysis is done using to the coseq package with a set of optimal parameters.
The number of technical replicates to perform (iter) for each K which is the number of expressions’ profile into which the DEG have to be cut, can be first set to 2 to precise the range of K and then put to a minimum of 20 replicates per K.
Results description (see CoSeq)
Results can be explored with several plots proposed by coseq. The ICL graph has been slightly modified to show all the range of the ICL values for a given K.
A table of jobs summary is also given. It groups error messages by K.
We select the lists of DE genes or clusters to annotate. All available lists are selected by default.
The enrichment analysis is performed using the clusterProfiler R-package. For more information on the methods and the package, please see the article, Bioconductor Package and the clusterProfiler Vignette.
You can chose between several domains, depending on several parameters:
Once everything is set and the analysis is conducted, several panels are displayed to show the results.
Results of the analysis for multiple lists
This panel summarizes the effects of the filtering steps on the selected datasets and the results of the differential analyses. The first row shows how many samples and entities are left in each table. The integration will be performed on these datasets. The second row shows the number of DE entities for each selected contrast and for each table, with the distinction between up-regulated and down-regulated entities.
Once at least two tables have passed the pre-processing step, the Integration tab will appear in the side bar menu. Two sub menus are available. For each of them, you will have to run a preprocessing before accessing to the integration panel: select the datasets you want to integrate together and the filtering strategy for each data. A filtering is strongly advised when having large data or unbalanced number of features between them with the goals to homogenize the number of features between the datasets.
You can filter according to the DE entities found in the differential analysis, or based on the variation coefficient, taking a certain number of variables that have the largest coefficient. For this example, we chose to keep only 500 features from the RNAseq data, based on the variation coefficient.
RNAseq data will be transformed before the integration. This is done with the voom transformation from the limma R package.
Batch effects are removed before applying the function, using removebatcheffect from limma.
Features are then centered before the integration.
This integration analysis is performed using the MOFA2 R-package. It is an unsupervised method that searches for latent factors to decompose the variance of the data, similarly to a PCA, but applied to a multi-dataset situation. The interpretation of the results is also very similar to those of a PCA: the most interesting factors are the one explaining more variance, whether it’s on several dataset or just one table. It also gives weights to the features of each table, for each factor. This is interesting in case a factor can be a posteriori linked to a covariate (in the case displayed on the figure, the temperature seems linked to the first factor, the features used to build this factor are to be considered very relevant in the analysis).
The user can chose to scale or not each tables, with the scale views argument, then set the number of factors to be computed and finally the maximum iterations for the MOFA run. We recommend that the user do not change the number of iterations. Once everything is set, you can run the analysis by clicking on the run button.
Several figure are provided by the interface once the analysis is done.
This integration analysis is performed using the MixOmics R-package. In RFLOMICS, we focused on the supervised analyses using the functions block.(s)plda (also called DIABLO) from the package. Only complete cases will be used for the analysis. The user can select all biological factors as response variables. The meta and batch factors are not available.
Using Sparse Analysis will set the “s” in the function names. It this case, the tuning ‘cases’ argument will be considered and passed to tune.splsda or tune.block.splsda: it determines the number of features selection to be tested.
In RFLOMICS, we chose to only ask for a number of cases. This number of cases is used to determine the keep.X argument, using a span of nfeatures/number of cases. By example, if the user chose 5 cases and has 2 datasets with respectively 500 and 250 features, the function will test 25 combinations of features per component. For each datasets, features will be picked-up in a cumulative manner ie for the DS1: 100,200,300,400,500 and DS2: 50, 100, 150, 200, 250. An example of combination will be c(DS1=100,DS2=50). The best combination of results by components will be used by mixOmics block.splsda.
Once everything is set, you can run the analysis by clicking on the run button. Several figures are provided by the interface once the analysis is done, for each selected response variable.
Weather the results are obtained as a .RData object downloaded from the interface or directly using the command line functions, the structure of the results is the same and visualization functions (from RFLOMICS, MultiAssayExperiment or SummarizedExperiment) can be used to further investigate the analyses results and produce suitable figures for an article. Most visualization functions from RFLOMICS return a ggplot object and can be modified using the ggplot2 functions.
Each of the analysis step has their own plot and getters methods.
All the pipeline, results and figures produced in the interface can be obtained by running methods directly in the R console.
Ecoseed data can be accessed using the following command:
data("ecoseed.mae")
The first step is to create the object containing all the data that can be used by RFLOMICS. This requires a project name, the datasets to be analyzed, a design matrix and information for each factor: the type (biological, batch or meta-factor) and a reference level, all in the form of data.frames. The reference level must be set with a particular care, as it will be used to define all contrasts in the differential analysis.
factorInfo <- data.frame(
"factorName" = c("Repeat", "temperature", "imbibition"),
"factorType" = c("batch", "Bio", "Bio")
)
testObject <- createRflomicsMAE(
projectName = "VignetteProject",
omicsData = ecoseed.mae,
omicsTypes = c("RNAseq","proteomics","metabolomics"),
factorInfo = factorInfo
)
class(testObject)
#> [1] "RflomicsMAE"
#> attr(,"package")
#> [1] "RFLOMICS"
RFLOMICS uses a structure derived from the MultiAssayExperiment package. Each dataset is stored as a RflomicsSE (derived from SummarizedExperiment) in a RflomicsMAE object. All methods from the MultiAssayExperiment package are also available for RFLOMICS objects.
is(testObject, "MultiAssayExperiment")
#> [1] TRUE
upsetSamples(testObject)
sessionInfo()
#> R Under development (unstable) (2025-03-13 r87965)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] RFLOMICS_0.99.7 coseq_1.31.0
#> [3] knitr_1.50 htmltools_0.5.8.1
#> [5] ggplot2_3.5.1 dplyr_1.1.4
#> [7] shinyBS_0.61.1 MultiAssayExperiment_1.33.9
#> [9] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [11] GenomicRanges_1.59.1 GenomeInfoDb_1.43.4
#> [13] IRanges_2.41.3 S4Vectors_0.45.4
#> [15] BiocGenerics_0.53.6 generics_0.1.3
#> [17] MatrixGenerics_1.19.1 matrixStats_1.5.0
#> [19] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] fs_1.6.5 enrichplot_1.27.4 httr_1.4.7
#> [4] RColorBrewer_1.1-3 doParallel_1.0.17 tools_4.6.0
#> [7] backports_1.5.0 R6_2.6.1 DT_0.33
#> [10] HDF5Array_1.35.16 lazyeval_0.2.2 uwot_0.2.3
#> [13] rhdf5filters_1.19.2 GetoptLong_1.0.5 withr_3.0.2
#> [16] gridExtra_2.3 bayesm_3.1-6 cli_3.6.4
#> [19] Cairo_1.6-2 flashClust_1.01-2 sandwich_3.1-1
#> [22] labeling_0.4.3 sass_0.4.9 mvtnorm_1.3-3
#> [25] robustbase_0.99-4-1 proxy_0.4-27 yulab.utils_0.2.0
#> [28] gson_0.1.0 DOSE_4.1.0 R.utils_2.13.0
#> [31] plotrix_3.8-4 limma_3.63.11 org.At.tair.db_3.21.0
#> [34] RSQLite_2.3.9 gridGraphics_0.5-1 shape_1.4.6.1
#> [37] crosstalk_1.2.1 vroom_1.6.5 car_3.1-3
#> [40] HTSCluster_2.0.11 GO.db_3.21.0 leaps_3.2
#> [43] Matrix_1.7-3 abind_1.4-8 R.methodsS3_1.8.2
#> [46] lifecycle_1.0.4 scatterplot3d_0.3-44 multcomp_1.4-28
#> [49] yaml_2.3.10 edgeR_4.5.10 carData_3.0-5
#> [52] rhdf5_2.51.2 qvalue_2.39.0 SparseArray_1.7.7
#> [55] Rtsne_0.17 grid_4.6.0 blob_1.2.4
#> [58] promises_1.3.2 shinydashboard_0.7.2 crayon_1.5.3
#> [61] dir.expiry_1.15.0 ggtangle_0.0.6 lattice_0.22-6
#> [64] cowplot_1.1.3 KEGGREST_1.47.0 HTSFilter_1.47.0
#> [67] magick_2.8.6 capushe_1.1.2 pillar_1.10.1
#> [70] ComplexHeatmap_2.23.1 fgsea_1.33.4 rjson_0.2.23
#> [73] estimability_1.5.1 corpcor_1.6.10 mixOmics_6.31.4
#> [76] codetools_0.2-20 fastmatch_1.1-6 glue_1.8.0
#> [79] ggfun_0.1.8 data.table_1.17.0 vctrs_0.6.5
#> [82] png_0.1-8 treeio_1.31.0 EnhancedVolcano_1.25.0
#> [85] gtable_0.3.6 cachem_1.1.0 xfun_0.51
#> [88] S4Arrays_1.7.3 mime_0.13 coda_0.19-4.1
#> [91] survival_3.8-3 pheatmap_1.0.12 iterators_1.0.14
#> [94] tinytex_0.56 statmod_1.5.0 TH.data_1.1-3
#> [97] nlme_3.1-168 ggtree_3.15.0 bit64_4.6.0-1
#> [100] filelock_1.0.3 UpSetR_1.4.0 tensorA_0.36.2.1
#> [103] bslib_0.9.0 colorspace_2.1-1 DBI_1.2.3
#> [106] DESeq2_1.47.5 tidyselect_1.2.1 emmeans_1.11.0
#> [109] bit_4.6.0 compiler_4.6.0 Rmixmod_2.1.10
#> [112] compositions_2.0-8 h5mread_0.99.4 basilisk.utils_1.19.1
#> [115] DelayedArray_0.33.6 bookdown_0.42 scales_1.3.0
#> [118] DEoptimR_1.1-3-1 multcompView_0.1-10 stringr_1.5.1
#> [121] digest_0.6.37 MOFA2_1.17.0 rmarkdown_2.29
#> [124] basilisk_1.19.3 XVector_0.47.2 pkgconfig_2.0.3
#> [127] FactoMineR_2.11 fastmap_1.2.0 rlang_1.1.5
#> [130] GlobalOptions_0.1.2 htmlwidgets_1.6.4 UCSC.utils_1.3.1
#> [133] shiny_1.10.0 farver_2.1.2 jquerylib_0.1.4
#> [136] zoo_1.8-13 jsonlite_2.0.0 BiocParallel_1.41.5
#> [139] GOSemSim_2.33.0 R.oo_1.27.0 magrittr_2.0.3
#> [142] Formula_1.2-5 GenomeInfoDbData_1.2.14 ggplotify_0.1.2
#> [145] patchwork_1.3.0 Rhdf5lib_1.29.2 munsell_0.5.1
#> [148] Rcpp_1.0.14 ape_5.8-1 reticulate_1.42.0
#> [151] stringi_1.8.7 MASS_7.3-65 plyr_1.8.9
#> [154] parallel_4.6.0 ggrepel_0.9.6 forcats_1.0.0
#> [157] Biostrings_2.75.4 splines_4.6.0 circlize_0.4.16
#> [160] locfit_1.5-9.12 igraph_2.1.4 ggpubr_0.6.0
#> [163] ggsignif_0.6.4 reshape2_1.4.4 evaluate_1.0.3
#> [166] BiocManager_1.30.25 tzdb_0.5.0 foreach_1.5.2
#> [169] httpuv_1.6.15 tidyr_1.3.1 purrr_1.0.4
#> [172] clue_0.3-66 BiocBaseUtils_1.9.0 broom_1.0.8
#> [175] xtable_1.8-4 e1071_1.7-16 RSpectra_0.16-2
#> [178] tidytree_0.4.6 rstatix_0.7.2 later_1.4.1
#> [181] class_7.3-23 rARPACK_0.11-0 tibble_3.2.1
#> [184] clusterProfiler_4.15.1 aplot_0.2.5 memoise_2.0.1
#> [187] AnnotationDbi_1.69.0 ellipse_0.5.0 cluster_2.1.8.1
#> [190] corrplot_0.95 shinyWidgets_0.9.0