Using the RRegrs R package for Automating Predictive Modelling

Cheminformatics and bioinformatics make extensive use of predictive modelling and need standardized methodologies for data splitting, cross-validation, best model criteria and Y-randomization. RRegrs is a new R package, available at https://www.github.com/enanomapper/RRegrs (0.05 release), which provides an integrated framework to assist model selection and speed up predictive model development. The tool implements a fully validated scheme that employs repeated 10-fold and leave-one-out cross-validation for ten linear and non-linear regression methods. Standardized reports are produced to compare the output of the modelling algorithms and to assess cross-validation results for selected models. Here, we demonstrate the performance of RRegrs using five well-established data sets.

A single RRegrs function call is needed to run the entire workflow and obtain the produced validated models in a reproducible format.
RRegrs offers an easy way to explore the search space of linear and non-linear models with specific parameter settings and cross-validation (CV) schemes. Furthermore, model outputs are easily accessible and readable, organized by method, centralized and averaged over multiple reproducible data set splits. Summary files are also produced, giving the user easy access to the results of all methodologies, which can then be prioritized based on various statistics. A main feature of the package is its exhaustive validation scheme, which introduces multiple random data splits. For each algorithm and data split, the model is built using the training and validation sets, whereas the test set is used to select the final best model. Parallel processing is enabled to accelerate the process.
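The split-and-select idea described above can be illustrated with a minimal, language-neutral sketch. It is written in Python rather than R and does not use the RRegrs API; the function names (`fit_mean`, `fit_line`, `averaged_test_r2`), the 75% training fraction and the toy data are assumptions made for illustration only. The inner cross-validation that would tune each method on the training portion is omitted for brevity.

```python
import random
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_mean(xs, ys):
    """Baseline method: always predict the mean of the training responses."""
    m = statistics.fmean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def averaged_test_r2(xs, ys, methods, n_splits=10, train_frac=0.75, seed=0):
    """Average the test-set R^2 of each method over several random splits."""
    scores = {name: [] for name in methods}
    n_train = int(train_frac * len(xs))
    for split in range(n_splits):
        rng = random.Random(seed + split)  # one reproducible seed per split
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        for name, fit in methods.items():
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            preds = [model(xs[i]) for i in test]
            scores[name].append(r2_score([ys[i] for i in test], preds))
    return {name: statistics.fmean(vals) for name, vals in scores.items()}


# Toy data: a noisy straight line (purely illustrative).
data_rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 0.5) for x in xs]

avg = averaged_test_r2(xs, ys, {"mean": fit_mean, "line": fit_line})
best = max(avg, key=avg.get)  # method with the highest averaged test R^2
print(avg, best)
```

Fixing one seed per split makes every split reproducible, so all methods are compared on exactly the same partitions and the averaged statistics can be regenerated later.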

Results and Discussion
Although the primary applications of RRegrs are aimed at finding Quantitative Structure-Activity Relationship (QSAR) models [3] in cheminformatics and nanotoxicology, here we demonstrate its efficiency on five standard data sets from the UC Irvine Machine Learning Repository [4], using the current RRegrs release 0.05. The five data sets considered, which are derived from diverse disciplines such as environmental economics and medical research, are the Housing [5], Computer Hardware, Wine Quality [6], Automobile [7] and Parkinsons Telemonitoring [8] data sets.
In Table 1 we present two statistics for the five data sets, namely the averaged R² and RMSE computed on the test sets.

Materials and Methods
To run RRegrs with full functionality, a call to the RRegrs() function is required. All parameters have default values; a detailed list of parameters and function descriptions is given in the RRegrs package tutorial, available online at https://github.com/enanomapper/RRegrs/blob/master/RRegrs-package-tutorial.pdf. By default, RRegrs sets a location for the output files, executes all modelling steps (removal of NA values, of near-zero-variance features and of correlated features), normalizes the data set, uses ten data splits and ten Y-randomization steps, and runs all ten regression methods. RRegrs function calls can be integrated into complex desktop and web tools for QSAR modelling.
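As an illustration of the Y-randomization step mentioned above, the following minimal sketch shuffles the response values, refits the model and compares the resulting R² with that of the model fitted on the real responses. It is written in Python, not R, and is not the RRegrs implementation; the helper names (`fit_line`, `r2_score`, `y_randomization`) and the toy data are hypothetical. Only the default of ten randomization rounds is taken from the text.

```python
import random
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def y_randomization(xs, ys, n_rounds=10, seed=0):
    """Return the R^2 of the real model and of models fitted to shuffled y."""
    model = fit_line(xs, ys)
    real_r2 = r2_score(ys, [model(x) for x in xs])
    rng = random.Random(seed)
    shuffled_r2 = []
    for _ in range(n_rounds):
        y_perm = ys[:]        # copy, so the original responses stay intact
        rng.shuffle(y_perm)   # break the link between predictors and response
        perm_model = fit_line(xs, y_perm)
        shuffled_r2.append(r2_score(y_perm, [perm_model(x) for x in xs]))
    return real_r2, shuffled_r2


# Toy data: a noisy straight line (purely illustrative).
data_rng = random.Random(7)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 0.5) for x in xs]

real_r2, shuffled = y_randomization(xs, ys)
print(real_r2, max(shuffled))
```

If the models fitted to shuffled responses reach an R² close to that of the real model, the apparent fit is likely due to chance correlation; a large gap between the two supports the real model.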

Conclusions
RRegrs integrates the results of individual models and decides on the best model given the data set and the user-specified parameters. We have demonstrated its performance with five well-established data sets and shown that good performance results are produced in all cases. Its efficiency suggests that RRegrs can be used as a reliable, fully validated and automated predictive modelling framework, and as a baseline for comparable results across various studies.

Table 1. Averaged R² (test) and RMSE (test) values for the five data sets.
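For reference, the two statistics are commonly defined as follows for a test set of n observations with observed values y_i, predictions ŷ_i and mean observed value ȳ. These are the textbook definitions and are given here as an assumption; some modelling tools report the squared correlation between observed and predicted values as R² instead, so the exact formula used by RRegrs should be checked in the package tutorial.

```latex
\mathrm{RMSE}_{\text{test}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2_{\text{test}} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

The averaged values are then the means of these quantities over the random data splits.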