All Leave-One-Out Models (ALOOM)

Author: Damjan Krstajic

Available from CRAN as the R package aloom(1).


Instead of creating one binary classifier from a training set containing N samples, ALOOM(2)(3)(4) creates N binary classifiers in exactly the same way, i.e. using the same hyper-parameters, but each on a sample of size (N-1) obtained by leaving out one training sample. The model trained on all N samples is here referred to as the original model.

For a single test sample, ALOOM produces N predicted probabilities, and thus one may create an ALOOM individual prediction interval(4), i.e. (min of the ALOOM probabilities, max of the ALOOM probabilities), for the test sample.
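The idea can be sketched in base R without the aloom package. The following is a minimal illustration only, assuming a simulated one-variable dataset and logistic regression via glm(); the data and all variable names are hypothetical and are not part of the package.

```r
# Toy illustration of the ALOOM idea with base R only (simulated data).
set.seed(1)
n       <- 40
train   <- data.frame(x = rnorm(n))
train$y <- rbinom(n, 1, plogis(1.5 * train$x))
test    <- data.frame(x = c(-2, 0.1, 2))

# Original model: trained on all n samples.
original      <- glm(y ~ x, data = train, family = binomial)
original.prob <- predict(original, newdata = test, type = "response")

# n leave-one-out models: same specification, each trained on n-1 samples.
# Rows of loo.probs are test samples, columns are leave-one-out models.
loo.probs <- sapply(seq_len(n), function(i) {
  fit.i <- glm(y ~ x, data = train[-i, , drop = FALSE], family = binomial)
  predict(fit.i, newdata = test, type = "response")
})

# ALOOM individual prediction interval for each test sample.
interval <- cbind(min = apply(loo.probs, 1, min),
                  max = apply(loo.probs, 1, max))
print(cbind(original = original.prob, interval))
```

Each row of the printed matrix shows the original model's probability next to the (min, max) interval for one test sample.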

ALOOM provides a solution for assessing the reliability(5) of a single binary prediction. As shown below, the widths of ALOOM individual prediction intervals vary between test samples. Therefore, the width of the ALOOM individual prediction interval may be used as a measure of the reliability of the original model's single predicted probability.
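One way to check whether interval width tracks reliability on a given dataset is to compare the misclassification rate of narrow-interval and wide-interval test samples. The sketch below uses made-up values for eight test samples purely to show the mechanics; the numbers are hypothetical and the median split is an assumption, not a recommendation from the package.

```r
# Hypothetical values for 8 test samples (made up for illustration).
width       <- c(0.02, 0.03, 0.05, 0.04, 0.30, 0.45, 0.25, 0.60)
predicted.y <- c(1, 0, 1, 1, 0, 1, 0, 1)
test.y      <- c(1, 0, 1, 1, 1, 1, 1, 0)

# Split samples into a narrow and a wide group at the median width.
group <- ifelse(width <= median(width), "narrow", "wide")

# Misclassification rate within each group.
error.by.group <- tapply(predicted.y != test.y, group, mean)
print(error.by.group)
```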

ALOOM also provides a solution for assessing the decidability(5) of a single binary prediction. If the All Leave-One-Out Models do not all agree on the predicted category for the test sample, then we would suggest returning NotAvailable.
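The agreement rule can be sketched as follows, assuming a decision threshold of 0.5 and a small hypothetical matrix of leave-one-out probabilities (rows are test samples, columns are models). The matrix and variable names are invented for illustration.

```r
# Hypothetical probabilities: 3 test samples (rows) x 4 leave-one-out models (columns).
toy.probs <- rbind(c(0.81, 0.79, 0.85, 0.80),   # all models predict category 1
                   c(0.45, 0.55, 0.48, 0.52),   # models disagree
                   c(0.10, 0.12, 0.08, 0.11))   # all models predict category 0
threshold <- 0.5   # assumed decision threshold

all.above <- apply(toy.probs, 1, function(p) all(p > threshold))
all.below <- apply(toy.probs, 1, function(p) all(p < threshold))

# Category where the models are unanimous, NA (NotAvailable) otherwise.
aloom.pred <- ifelse(all.above, 1L, ifelse(all.below, 0L, NA_integer_))
print(aloom.pred)
```

Only the second sample, whose probabilities fall on both sides of the threshold, is returned as NA.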

ALOOM is a non-parametric approach in which the binary model and the data define the NotAvailable predictions.

In our experience, ALOOM predictions which are available, i.e. not NotAvailable, have on average higher accuracy than the original model's predictions. Otherwise, there would be no point in using the ALOOM approach.

An initial simulation study(3) has shown that ALOOM may be very useful in active learning. This means that if one has a binary model and wants to update the training set with new samples, then we would suggest updating it with samples that ALOOM currently predicts as NotAvailable.
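A minimal sketch of this selection step, assuming a hypothetical pool of five unlabelled candidates with leave-one-out probabilities already computed. Ordering the selected candidates by interval width is an additional assumption made here for illustration, not something prescribed by the source.

```r
# Hypothetical pool: 5 unlabelled candidates (rows) x 4 leave-one-out models (columns).
pool.probs <- rbind(a = c(0.91, 0.93, 0.90, 0.92),
                    b = c(0.47, 0.53, 0.49, 0.51),
                    c = c(0.20, 0.18, 0.22, 0.21),
                    d = c(0.35, 0.62, 0.41, 0.58),
                    e = c(0.70, 0.66, 0.72, 0.69))

# NotAvailable: the probabilities fall on both sides of 0.5.
not.available <- apply(pool.probs, 1, function(p) min(p) < 0.5 & max(p) > 0.5)
width         <- apply(pool.probs, 1, max) - apply(pool.probs, 1, min)

# Candidates to label and add to the training set, widest interval first.
to.label <- names(sort(width[not.available], decreasing = TRUE))
print(to.label)
```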

ALOOM is not meaningful for model-building algorithms whose results are affected by the value of the random seed. It is not suitable for deep learning models. However, for random forests it works fine with a large-ish number of trees.

ALOOM is a simple idea, but its application is very computationally intensive and thus not really suitable for personal computers.

Example of using aloom and some interesting results

We use the publicly available mutagenicity dataset from Kazius et al. (2005)(6). It contains 4335 compounds, 2400 categorised as “mutagen” and the remaining 1935 compounds as “nonmutagen”. The dataset is available from the QSARdata(7) R package and each compound comes with 1579 descriptors. Half of the dataset is used for training and the remaining half for testing.

1. Create train and test datasets


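library(QSARdata)   # provides the Mutagen data
library(caret)      # provides createFolds(), see caret(8)
data(Mutagen)       # loads Mutagen_Dragon and Mutagen_Outcome

x           <- as.matrix(Mutagen_Dragon)
rownames(x) <- rownames(Mutagen_Dragon)
colnames(x) <- colnames(Mutagen_Dragon)
y           <- Mutagen_Outcome

# Split into two stratified folds: fold 1 is the test set, the rest is the training set
lvFolds <- createFolds(y,k=2)

train.x <- x[-lvFolds[[1]],]
train.y <- y[-lvFolds[[1]]]
test.x  <- x[lvFolds[[1]],]
test.y  <- y[lvFolds[[1]]]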

The training set consists of 2168 samples, while the test set has 2167. They are available here as csv files: train_x.csv, train_y.csv, test_x.csv, test_y.csv.

2. Create aloom object with randomForest(9)

NOTE 1: On a machine with 48 AMD Opteron 6168 CPUs and 15 GB of RAM, this takes 23 hours when using all 48 CPUs.
NOTE 2: On a machine with 24 Intel Xeon 6240@2.60GHz CPUs and 4 GB of RAM, this takes 8 hours when using all 24 CPUs.


library(aloom)
library(parallel)   # provides detectCores()

ntree     <- 1000
num.cores <- detectCores()

fit <- aloom(train.x, train.y, test.x, method="rf",list(ntree=ntree),mc.cores=num.cores)

2. Create aloom object with glmnet(10)

NOTE 1: On a machine with 48 AMD Opteron 6168 CPUs and 15 GB of RAM, this takes 4 hours when using all 48 CPUs.
NOTE 2: On a machine with 24 Intel Xeon 6240@2.60GHz CPUs and 4 GB of RAM, this takes 2 hours when using all 24 CPUs.

Prior to calling aloom(), we execute cv.glmnet() to find the optimal lambda.

library(glmnet)
library(aloom)
library(parallel)   # provides detectCores()

cv.fit          <- cv.glmnet(train.x,train.y,family="binomial",type.measure="auc")
selected.lambda <- cv.fit$lambda.1se
lambda          <- cv.fit$lambda
model.params    <- list(lambda=lambda, alpha=1, selected.lambda=selected.lambda)

num.cores <- detectCores()

fit <- aloom(train.x, train.y, test.x, method="glmnet",model.params,mc.cores=num.cores)

3. Examine aloom object

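All Leave-One-Out Models, as well as the original model, are created during the execution of aloom(). Their predictions for the test samples are returned.
An aloom object is a list containing the original model's predicted categories (predicted.y), the original model's predicted probabilities (predicted.prob.y), and the predicted probabilities from all leave-one-out models (aloom.probs, one row per test sample):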

predicted.y      <- fit$predicted.y
predicted.prob.y <- fit$predicted.prob.y
aloom.probs      <- fit$aloom.probs

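Calculate the original model's misclassification error, ALOOM's proportion of NotAvailable predictions, and ALOOM's misclassification error.

original.misclassification <- sum(predicted.y!=test.y)/length(test.y)

# A prediction is NotAvailable when the ALOOM probabilities fall on both sides of 0.5
is.not.available <- function(x){if ((min(x) < 0.5) & (max(x) > 0.5)) TRUE else FALSE}
aloom.na         <- apply(aloom.probs,1,is.not.available)
prop.na          <- sum(aloom.na)/length(aloom.na)

aloom.misclassification <- sum(predicted.y[!aloom.na]!=test.y[!aloom.na])/length(test.y[!aloom.na])

The results for both methods are:

               Original misclassification   % NotAvailable   ALOOM misclassification
glmnet         0.196                        7.24             0.17
randomForest   0.182                        15.32            0.152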

4. Calculate ALOOM individual prediction intervals, examine their width and show the sample with the maximum width

min.aloom <- apply(aloom.probs,1,min)
max.aloom <- apply(aloom.probs,1,max)

Calculate the width of every ALOOM individual prediction interval and examine its distribution.

width <- max.aloom - min.aloom
summary(width)

ALOOM individual prediction interval width stats are:

               Min     Q1      Median   Mean    Q3      Max
glmnet         0       0.019   0.034    0.053   0.062   0.766
randomForest   0.009   0.085   0.104    0.109   0.119   0.571

id.with.max.width <- rownames(test.x)[which.max(width)]
max.width.range   <- c(min.aloom[which.max(width)],max.aloom[which.max(width)])

               Compound ID   Min ALOOM interval   Max ALOOM interval
glmnet         '3224'        0.135                0.901
randomForest   '782'         0.261                0.832


1. Krstajic, D. (2023). aloom: All Leave-One-Out Models. R package version 0.1.1.
2. Krstajic, D., Buturovic, L., Thomas, S., & Leahy, D. E. (2017). Binary classification models with "Uncertain" predictions. arXiv preprint arXiv:1711.09677.
3. Krstajic, D. (2020). Non-applicability Domain. The Benefits of Defining “I Don't Know” in Artificial Intelligence. Artificial Intelligence in Drug Discovery, 75, 102.
4. Krstajic, D. (2021). The Costs and Potential Benefits of Introducing the “I Don’t Know” Answer in Binary Classification Settings. Preprints.
5. Hanser, T., Barber, C., Marchaland, J. F., & Werner, S. (2016). Applicability domain: towards a more formal definition. SAR and QSAR in Environmental Research, 27(11), 865-881.
6. Kazius, J., McGuire, R., & Bursi, R. (2005). Derivation and validation of toxicophores for mutagenicity prediction. Journal of medicinal chemistry, 48(1), 312-320.
7. Kuhn, M. (2013). QSARdata: Quantitative Structure Activity Relationship (QSAR) Data Sets. R package version 1.3.
8. Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1–26.
9. Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3), 18-22.
10. Friedman, J., Tibshirani, R., & Hastie, T. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. doi: 10.18637/jss.v033.i01 (URL: