Classification of tumors with penalized logistic regression on microarray data

Paul H. C. Eilers

Department of Medical Statistics, Leiden University Medical Center

Classification with microarray data needs a firm statistical basis. In principle, logistic regression can provide it, modelling the probabilities of class membership as inverse logit transforms of linear combinations of explanatory variables. However, classical logistic regression does not work for microarrays, because there will generally be many more variables (the measured expression levels) than observations. One problem is multicollinearity: the estimating equations become singular and thus have no unique and stable solution. A second problem is over-fitting: a model may fit a data set well, but perform badly when used to classify new data. Penalized likelihood is a solution to both problems. The values of the regression coefficients are constrained in a way similar to ridge regression. All variables play an equal role; there is no prior selection.
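
As an illustration of the idea, the following minimal sketch fits a logistic regression with a ridge-type penalty by Newton-Raphson iteration on simulated data with more variables than observations. It is not the implementation used for the results reported here: the function name fit_ridge_logistic, the penalty weight lam, the unpenalized intercept, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def fit_ridge_logistic(X, y, lam, n_iter=100, tol=1e-8):
    """Maximize sum(y*log(p) + (1-y)*log(1-p)) - (lam/2)*sum(beta**2).

    Hypothetical helper: the intercept alpha is left unpenalized (an assumption).
    Returns (alpha, beta).
    """
    n, m = X.shape
    Z = np.hstack([np.ones((n, 1)), X])   # column of ones for the intercept
    P = lam * np.eye(m + 1)               # ridge penalty matrix
    P[0, 0] = 0.0                         # do not penalize the intercept
    theta = np.zeros(m + 1)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ theta)))   # inverse logit
        w = p * (1.0 - p)                        # binomial weights
        grad = Z.T @ (y - p) - P @ theta         # penalized score
        hess = Z.T @ (w[:, None] * Z) + P        # penalized information
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta[0], theta[1:]

# Simulated example: 20 observations, 50 variables (more variables than observations).
rng = np.random.default_rng(0)
n, m = 20, 50
X = rng.standard_normal((n, m))
y = (X[:, 0] > np.median(X[:, 0])).astype(float)   # two classes of 10 each

alpha, beta = fit_ridge_logistic(X, y, lam=1.0)
p_hat = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))  # fitted class probabilities
```

Without the penalty the information matrix in this example would be singular, because it has 51 rows but rank at most 20; adding lam to the diagonal makes the system solvable and shrinks the coefficients towards zero.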

The dimension of the resulting system of equations is equal to the number of variables and so may run into the tens of thousands. This is too large for most computers, but it can be reduced dramatically (to the number of observations) with the singular value decomposition of some matrices. The weight of the penalty is optimized with AIC (Akaike's Information Criterion), which essentially is a measure of prediction performance.
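
The reduction and the choice of the penalty can be sketched as follows. The example assumes that the decomposed matrix is the data matrix X itself, written as X = U D V', so that the fit can be done on the n columns of R = U D and mapped back with V. It also assumes that AIC is computed as the deviance plus twice an effective dimension, taken here as the trace of the hat matrix. The names ridge_logistic, lambdas, and best_lam, the grid of penalty weights, and the simulated data are hypothetical.

```python
import numpy as np

def ridge_logistic(R, y, lam, n_iter=100, tol=1e-10):
    """Newton fit of ridge-penalized logistic regression on regressors R.

    Returns (alpha, coef, deviance, eff_dim); the intercept is unpenalized.
    """
    n, k = R.shape
    Z = np.hstack([np.ones((n, 1)), R])
    P = lam * np.eye(k + 1)
    P[0, 0] = 0.0
    theta = np.zeros(k + 1)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ theta)))
        G = Z.T @ ((p * (1.0 - p))[:, None] * Z)   # weighted cross-products
        step = np.linalg.solve(G + P, Z.T @ (y - p) - P @ theta)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(Z @ theta))), 1e-12, 1.0 - 1e-12)
    G = Z.T @ ((p * (1.0 - p))[:, None] * Z)
    deviance = -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    eff_dim = np.trace(np.linalg.solve(G + P, G))  # assumed effective dimension
    return theta[0], theta[1:], deviance, eff_dim

# Simulated example: 30 observations, 500 variables.
rng = np.random.default_rng(1)
n, m = 30, 500
X = rng.standard_normal((n, m))
y = (X[:, 0] > np.median(X[:, 0])).astype(float)

# Thin SVD: X = U @ diag(d) @ Vt, so R = U * d has only n columns.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
R = U * d

# Grid search over the penalty weight, scoring each fit with AIC.
lambdas = 10.0 ** np.arange(0, 5)
aic = []
for lam in lambdas:
    _, _, dev, dim = ridge_logistic(R, y, lam)
    aic.append(dev + 2.0 * dim)
best_lam = lambdas[int(np.argmin(aic))]

# Fit in the reduced space, then map back to one coefficient per variable.
alpha, gamma, _, _ = ridge_logistic(R, y, best_lam)
beta = Vt.T @ gamma
```

In this sketch each Newton step solves a system of order n + 1 instead of m + 1. With a ridge penalty nothing is lost, because the penalized solution lies in the row space of X and the penalty on beta equals the penalty on gamma when V has orthonormal columns.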

Penalized logistic regression has been applied successfully to a public data set (the MIT AML-ALL data, as published on the web). Performance was better than that of a method published by Slonim et al. Experiments with other data sets are presently under way, and their results will also be reported.