AskoR, A R Package for Easy RNASeq Data Analysis

: For facilitating the process of transcriptomics data, and to guarantee the reproducibility of our analyses, we developed AskoR, which is a R library for performing a suite of statistical analysis and graphical output from gene expression data obtained by sequencing (RNA-Seq). From raw counts, it makes it possible to ﬁlter and normalize the data, to check the consistency of the samples, and to carry out differential expression tests, GO terms enrichments, and clusters of co-expression, with a large number of ﬁgures in the output. AskoR can be downloaded and used in your favorite R environment or directly accessible through a Galaxy portal like the one which is hosted by the BioInformatics Platform for the Agroecosystems Arthropods (BIPAA).


Introduction
The BioInformatics Platform for the Agroecosystems Arthopods (BIPAA) is a bioinformatics platform from the French Research Institute for Agriculture, Food and Environment (INRAE).It is dedicated to support genomics and post-genomics programs developed on insects associated with agroecosystems, and assists cooperation and coordination of multiple communities working on arthopod genomics.The Information System has been created more than 10 years ago to support the International Aphid Genomics Consortium (IAGC), in order to annotate and curate the pea aphid genome [1], and has been continuously improved and extended until the recent achievement of the genomes of phylloxera (Daktulosphaira vitifoliae) [2], various parasitoid wasps (Hyposoter didymator, Campoletis sonorensis [3], Cotesia congregata [4], Aphidius ervi and Lysiphlebus fabarum [5]) or Spodoptera frugiperda [6].Consequently, BIPAA is the home of several public reference databases including AphidBase, LepidoDB and ParWaspDB, each hosting multiple insect genomes.Altogether, 38 genomes are currently available online, and its infrastructure has evolved to support the load of numerous new genomes and to facilitate browsing and navigating.For each species, a collection of web applications allows users to explore reference genomes or transcriptome assemblies and annotations (e.g.genome browser, gene reports), to compare genomics regions (synteny viewer), to analyze these data with multiple tools (e.g.alignment of various sequences, annotation, SNP prediction etc.) through a dedicated Galaxy server [7] or specific web applications (e.g. a blast form), or to correct or add information by curating the genome annotations within Apollo [8]).
RNA-Seq studies are now affordable and widely used in many laboratories.In insect science, this method is still currently use for studying phenomena, such as the molecular responses of whole insects, organs, and tissues to different biotic or abiotic stresses, including exposure to insecticides, infection with microbes, or feeding on different hosts, and improving our knowledge of gene expression changes associated with immunity, detoxification, chemoreception, or reproduction [9][10][11][12][13][14][15].Many statistical methods have been developed, so far, for analyzing RNA-Seq data from raw counts and to identify differentially expressed genes (DEGs) [16][17][18], search for enrichment of functions in genelists [19,20], create Venn diagrams [21], or for searching for clusters of co-expressed genes [22,23].Many of these analysis tools require programming skills in R or paid software subscriptions, such as OmicsBox [24] or CLC Workbenches [25], making it difficult for many researchers to implement reproducible workflows.On the other hand, some reproducible workflows have been designed [26] for managing RNA-Seq data from the mapping on a reference annotation (genome or transcriptome) but these workflows do not include a full set of supplementary analysis for statistically and graphically exploring data by classification or enrichment.
Additionally, DEGs can often be integrated with other data types to improve the robustness of studies, such as QTL analysis, orthology studies, epigenomic landscapes, and proteomics or metabolomic studies.Thereupon, the semantic web technology (consisting in data modeling in structured format such as the Ressource Description Framework (RDF) allowing its querying with the SPARQL language), gives the opportunity to link various information from various sources into graphs of data.Moreover, we are developing AskOmics [27,28] which provides a web interface to upload and integrate heterogeneous data files (GFF, BED, and tabulated formats) into RDF and a visual SPARQL query builder software to allow the experts to compose and execute expressive and semantically-rich queries.
Here, we report AskoR, a R pipeline for the analysis of gene expression data with a simple and reproducible script.It includes several steps (data filtering, normalization, sample validation, differential expression analysis, Gene Ontology enrichment and coexpression) and produces numerous files directly uploadable into an AskOmics instance in order to be linked with other data.

Installation and deployment
AskoR can be directly installed from the git repository (https://github.com/askomics/askoR).An AskoR Docker image, including a preconfigured RStudio instance, is maintained on GitHub (https://github.com/genouest/docker-galaxy-rstudio-askor).This image can be used in a Galaxy server as an interactive tool, allowing to use the preconfigured RStudio-AskoR environment directly from the Galaxy web interface.Instructions on configuring a Galaxy server to make use of this image are available on the GitHub repository.

AskoR implementation, usage and dependencies
AskoR is a R package, it consists in a R library, which has to be loaded in a script.A user guide is available at the web site.As a template or example, we are providing a R script running sequentially each function which can be adapted to the needs of the user.AskoR has many parameters and flags (for example method of normalization, p-values or CPM thresholds) with default values which can be modified before or during the process.A complete list of parameters is available on the wiki.Because AskoR takes advantages of various R packages, it has many dependencies which have to be installed locally on the user R environment, or already available in the docker environment.The main dependencies are edgeR, limma, ggfortify, ggplot2, topGO, UpSetR and coseq.

Differential expression analysis
After a filtering step where the genes with low counts were excluded from further analyses (i.e.keeping genes with CPM values higher than a threshold for a minimal number of samples).AskoR uses the calcNormFactors function of the edgeR library [16], for scaling the data among all libraries and removing the effects of outliers.The default method of this normalization procedure uses a trimmed mean of M-values (TMM) between each pair of samples, but can be changed by the users to Relative Log expression (RLE) or upperquartile.Next, AskoR applies the Cox-Reid profile-adjusted likelihood method of edgeR for estimating the dispersion then a generalized linear model (GLM) for testing the differential expression, the latter is based on a "quasi likelihood F-test" qlf, but can be changed to the likelihood Ratio Test LRT with the glm parameter.

Venn and Upset diagrams
For the automatic production of Venn and Upset diagrams with VennDiagram and UpSetR R packages, we use two parameters indicating which gene-list has to be involved in the graphs.The first parameter compaVD or upset_list (for Venn or Upset diagrams respectively) allows to choose which contrasts will be included in the comparisons (multiple graphs can be produced).The second parameter VD or upset_type allows to select for each contrast the subset of DEGs to compare (up, down or both).

GO-term Enrichment
By default, for each gene list, the Gene-Ontology enrichment is evaluated using a Fisher test adjusted by the Benjamini-Hochberg (BH) method with the weight01 algorithm against the complete list of GO assigned genes given by the user.All statistical tests (fisher, ks, t, globaltest, sum or ks.ties) and/or algorithms (classic, elim, weight, weight01, lea or parentchild) supported by TopGO [29] can be selected with the parameters GO_stats and GO_algo respectively.

Clustering
The clustering of expression profiles is handled with the coseq R package [22].However, the methods can be chosen with the coseq_model parameter as well as the transformation with coseq_transformation and the range of cluster numbers to be evaluated for identifying the best K value coseq_ClustersNb.

Graphics
The graphs allowing the representation of clusters and enrichments are produced with the R package ggplot2, ComplexHeatmap, and circlize allowing the visualization of the intersections of DEG lists between contrasts produced using the VennDiagram and UpSetR R packages.

Production of tabular files for AskOmics
In a so-called "AskoTables" directory, AskoR produces files which could be imported directly into AskOmics.Some are directly derived from the input file, describing 3 entities : 1) Condition : a group of samples with corresponding characteristics (tissue, stress, development stage, etc...); 2) Context : a group of conditions which are compared within a contrast; 3) Contrast : the comparison of 2 contexts.
In addition, AskoR provides for each contrast a tabular file, with all the tested genes in lines with the AskOmics compliant columns : Test_id (a unique id for a test), mea-sured_in@Contrast (the name of the contrast), is@gene (the name of the gene), FC (foldchange) and logFC, PValue and FDR (p-value and adjusted p-value of the test), Expression and Significance.Furthermore AskoR supplies a summarized table including all the genes (and annotation) and their significativity at each contrast.

Results
AskoR requires only a few tabular files including standard raw counts of reads for each gene, descriptions of samples, including biological and technical replicates, and a list of contrasts between the different treatments and conditions.The structure of the table describing the contrasts is rather simple as it includes a line for each condition and each contrast reported to a column containing "+" or "-" for the condition to be compared and 0 for the others.Additionally, the users can provide a gene ontology assignation file for performing the GO enrichment step, and a list of complementary annotations which will be transferred to the final tabular files.
To expedite analysis and improve false discovery rate, genes with low CPM values should be removed prior to performing the differential expression analysis, which would be unlikely to be detected as differentially expressed anyhow.The user can adjust several parameters, such as threshold_cpm and replicate_cpm, and visualize the results in density graphs to determine whether the low expressed genes have been successfully removed.
From the matrix of CPM, AskoR produces Multi-dimensional Scaling (MDS) plot, displaying the coordinates of the samples on 3 axes, a heatmap of the correlation between the samples (with dendograms) and with their respective conditions encoded by a color, and a correlogram (Fig. 1A).This result provides an overview of the samples and allows for the identification of outliers or inconsistencies in biological or technical replicates.
The normalization and differential expression analyses are performed with the popular edgeR package [16], including the functions calcNormFactors for the normalization, estimateGLMCommonDisp, estimateGLMTrendedDisp, estimateGLMTrendedDisp or estimateDisp for the estimation of the dispersion and glmFit and glmLRT or glmQLFit, and glmQLFTest for the DE tests with the Generalized Linear Models (GLM).With a set of default parameters which can be changed by the users, such as the normalization and dispersion methods (normal_method, glm_disp), GLM (glm) or multi-test correction (p_adj_method), the AskoR pipeline performs the adequate functions.
For each contrast, a file is created to report the results of the tests in a format readable by AskOmics to facilitate the integration of the results.It also compiles and summarizes all the results of the tests for a batch of contrasts into a single tabular file.It generates as well numerous graphical outputs such as volcano plots, mean-difference plots and heatmaps of top list of the DEGs.
Venn and Upset diagrams allow users to identify common DEGs shared between multiple contrasts.While Venn diagrams allow to compare up to 4 lists, Upset allows users to make comparisons among more constrats (Fig. 1B).However, when making many contrasts, a complete graph including all gene-lists may be unreadable or meaningless, then with only two parameters (VD and compaVD or upset_list and upset_type) the user can select precisely the lists of genes to be displayed in the graphs.If gene ontology terms are provided, AskoR can perform enrichment analysis in lists of DEGs annotated and produce a table (Table 1) with a statistical test from TopGO R package [29] and dedicated figure (Fig. 2).
It is also often useful to group genes that show a similar expression in several conditions (expression profile), to identify co-regulated genes and to characterize genes with no function having similar expression to candidate genes.AskoR performs a clustering with the Coseq R package [22] that uses two classification algorithms (kmeans, and gaussian mixture) with respective transformations, tests for the best number of clusters (K), and produces statistics and graphics helping to check the quality and robustness of the chosen model.Additionally, AskoR outputs several graphs (Fig. 3) to display the expression profiles for each contrast, and search for GO enrichment in each cluster.
Table 1.Gene-Ontology terms for a contrast, each GO term from the 3 categories (molecular function, biological process or cellular component) is reported in the table.The Annotated column indicates the number of genes assigned to the term in the complete gene set.The Significant column shows the number of genes assigned to the term in the tested list while the Expected column gives the expected value, the Ratio value corresponds to the Significant column divided by Expected column, the statisticTest displays the adjusted p-value of the test, and GO_cat is the GO category of the term ("CC" for cellular component, "BP" for biological process, and "MF" for molecular function).

Discussion
AskoR is highly customizable, and includes many parameters with default values modifiable according to the needs of the analysis.Thus, it is straightforward to run an analysis with a single file and a set of data.The results of the analyses are then stored in multiple files under a named directory.Then, any set of results obtained with a particular set of parameters can be stored, compared and reproduced.
AskoR is working as a standalone library and no web interface has been developed yet (with R Shiny for instance).Nevertheless, it can be run as a single script and easily modified directly on any laptop or server or integrate directly in workflows for an improved reproducibility.Furthermore, the docker image of AskoR can be integrated directly as an interactive environment in a Galaxy instance.As a result, with the help of the online documentation, it is straighforward to explore transcriptomics data and produce publishable figures.Consequently, AskoR has already been used for various analysis [12,[30][31][32].
Additionally, AskoR prepares output files following the AskOmics requirements.Within this tool, complex and expressive queries can be generated via a graphical interface in order to extract the most interesting genes by combining several lists of genes by a logical operator (e.g.DEGs from one or more particular contrast or cluster).For instance, it is possible to extract genes over-expressed in a condition at some selected time-points but or under-expressed (or not differentially expressed) at other time points.Furthermore, DEGs can be supplemented with other data such as genome locations, assignation to metabolic networks, miRNA targets, or epigenomics landscapes to promote selection of candidate genes.
AskoR is open-source available online at a git repository, where anyone can contribute, push an issue or ask for a specific request.For example, we are adding functions to create graphical representation combining the expression levels and the of a set of genes list defined by a user.

Figure 1 .
Figure 1. A. Correlogram plot generated by AskoR to describe sample correlations.Each cell represents a pairwise comparison and each correlation coefficient is represented by an ellipse whose 'diameter', direction, and color depict the accordance for that pair of samples.Highly correlated samples are depicted as thin blue ellipses, while poorly correlated samples are depicted as red ellipses with wide diameters.B. Intersections of DEGs lists between contrasts.DEGs lists in each contrast and their size are shown in the horizontal histograms.The lines connected by dots represent the intersection between these lists.The vertical histograms represent the number of DEGs in each intersection.Colors (green/red) represents UP-and DOWN-regulated DEGs.

Figure 2 .
Figure 2. Plot of GO enrichment for one contrast.Each line displays a significant term grouped by GO category, the position of the dot refers to the significance of the test and the size of the circle corresponds to the number of observed genes in the tested list which are assigned to that terms.

Figure 3 .
Figure 3. Gene clustering based on expression profiles.A. Heatmap of scaled expression.Each line represents a condition and each column a gene.The color indicates the scaled expression.The genes are grouped in clusters.The condition are ordered by the user or grouped by correlation (with a dendogram).B. Boxplots of scaled expression by clusters