Categorisation of continuous variables in a logistic regression model using the R package CatPredi
* 1, 2 , 3 , 1, 2, 4
1  Departamento de Matemática Aplicada, Estadística e Investigación Operativa, Universidad del País Vasco UPV/EHU, Leioa, Spain.
2  Red de Investigación en Servicios de Salud en Enfermedades Crónicas (REDISSEC), Galdakao, Spain.
3  Departamento de Estadística e Investigación Operativa. Universidade de Vigo, Vigo, Spain.
4  BCAM – Basque Center for Applied Mathematics, Bilbao, Spain.


Prediction models are gaining importance in many areas such as medicine, meteorology, finance and toxicology. In this context, a common distribution for the response variable is the binomial, and hence logistic regression is a commonly used modelling approach. Although it is not recommended from a statistical point of view because of the loss of information and power, the categorisation of continuous variables is common practice in the development of prediction models. However, there are no unified criteria for selecting the cut points in the categorisation process. To provide valid cut points whenever a categorisation is going to be performed, we have developed a methodology to categorise continuous variables in a logistic regression model based on the maximisation of the AUC (area under the ROC curve). This methodology has been implemented in an R package called CatPredi, a set of R functions that allows the user to categorise a continuous predictor variable in a univariate or multiple logistic regression model. It provides the optimal location of the cut points for a chosen number of cut points, fits the prediction model with the categorised predictor variable and returns the estimated and bias-corrected discriminative ability index for this model. Additionally, it allows two categorisation proposals with different numbers of cut points to be compared, and the optimal number of cut points to be selected.
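
The idea of choosing cut points by maximising the AUC can be illustrated with a minimal sketch. It is written in Python rather than R, and it is not the CatPredi implementation: the function names (`auc`, `categorised_auc`, `best_cut_points`) are hypothetical, the search is a plain exhaustive search over observed values, only the univariate case is covered, and no bias correction is applied. It relies on the fact that, in a logistic model with a single categorical predictor, the fitted probability in each category equals the observed event proportion in that category.

```python
from bisect import bisect_left
from itertools import combinations


def auc(scores, labels):
    """Mann-Whitney estimate of the AUC; tied pairs count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def categorised_auc(x, y, cuts):
    """AUC of a univariate logistic model on x categorised at `cuts`.

    Categories are (-inf, c1], (c1, c2], ..., (ck, +inf). The fitted
    probability of each category is its observed event proportion,
    so no iterative model fitting is needed in this simple case.
    """
    cats = [bisect_left(cuts, xi) for xi in x]
    events, totals = {}, {}
    for c, yi in zip(cats, y):
        events[c] = events.get(c, 0) + yi
        totals[c] = totals.get(c, 0) + 1
    fitted = [events[c] / totals[c] for c in cats]
    return auc(fitted, y)


def best_cut_points(x, y, k=1):
    """Exhaustive search (an assumption made for illustration) for the
    k cut points, taken among the observed values, that maximise the AUC."""
    # The largest value is excluded: cutting there leaves an empty category.
    candidates = sorted(set(x))[:-1]
    best = max(combinations(candidates, k),
               key=lambda cuts: categorised_auc(x, y, cuts))
    return list(best), categorised_auc(x, y, best)


# Toy data, invented for this example only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 0, 1, 1]
print(best_cut_points(x, y, k=1))  # ([4], 0.9)
```

The exhaustive search grows combinatorially with the number of cut points and candidate values, so it is only practical for small examples like this one. The AUC reported here is also the apparent (uncorrected) value, computed on the same data that were used to choose the cut points.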

Keywords: categorisation, R package, prediction model
Comments on this paper
Humbert G. Díaz
Software availability and other doubts!
Dear colleagues

Thank you for this excellent communication, which I guess is also suitable for the software section [f].
However, I have a couple of questions.

Is categorization useful when I am using data from different sources with different errors?

Is your procedure valid for, or can it be extended to, classification algorithms such as Linear Discriminant Analysis /
General Discriminant Analysis (LDA / GDA), Support Vector Machines (SVM), or Classification Trees?

Thank you for posting the link for free download of the CatPredi software.

PS: Thank you for your support of mol2net!
You can also participate in the conference by posting scientific questions or comments on other papers.

Irantzu Barrio
Dear Humberto,

Thank you very much for your comments about our contribution.

Regarding your questions, first of all, in my opinion categorization can be useful when you use data from different sources, as long as the variable you want to categorize is available in all of them. Of course, if the way that variable has been collected differs greatly from one source to another, that may have an impact on the selection of the cut points. On the other hand, the categorization of a continuous variable could facilitate the reconciliation of information from different sources.

As for your second question, this methodology is based on the maximisation of the AUC, so as long as we can estimate the AUC I think we could extend it to other methods too. However, we have focused on regression modelling approaches so far, and hence it is not available for LDA or SVM methods. Nevertheless, it is a very interesting proposal which we will consider for further research.

Thank you very much!