Author: Damjan Krstajic

Available as the CRAN R package aloom^{(1)}.

Instead of creating one binary classifier from a training set containing N samples,
ALOOM^{(2)}^{(3)}^{(4)} creates N binary classifiers in exactly the same way, i.e. using the same hyper-parameters,
but on samples of size (N-1). The model trained on all N samples is referred to here as the *original* model.

For a single test sample ALOOM produces N predicted probabilities, and thus one may create
an *ALOOM individual prediction interval*^{(4)} (minimum ALOOM probability, maximum ALOOM probability) for the test sample.
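The idea can be sketched in a few lines of base R. Purely for illustration, the sketch below uses `glm()` logistic regression on a small simulated dataset; the data and names here are our own and not part of the aloom package, which implements this for random forests and glmnet:

```r
set.seed(1)
# toy training data: 30 samples, 2 predictors, binary outcome
n <- 30
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- rbinom(n, 1, plogis(train$x1))
test <- data.frame(x1 = 0.5, x2 = -0.2)   # a single test sample

# N leave-one-out models, each trained on the (N-1) remaining samples
aloom.probs <- sapply(seq_len(n), function(i) {
  fit <- glm(y ~ x1 + x2, family = binomial, data = train[-i, ])
  predict(fit, newdata = test, type = "response")
})

# ALOOM individual prediction interval for the test sample
interval <- c(min(aloom.probs), max(aloom.probs))
```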

ALOOM provides a solution for assessing the reliability^{(5)} of a single binary prediction. As shown below,
the widths of ALOOM individual prediction intervals vary between test samples.
Therefore, the width of the ALOOM individual prediction interval may be used as a measure of
the reliability of the original model's single predicted probability.

ALOOM also provides a solution for assessing the decidability^{(5)} of a single binary prediction.
If the leave-one-out models do not *all* agree on the predicted category for the test sample, then we would suggest
returning NotAvailable.
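As a toy illustration of this rule (the function name, class labels and 0.5 cut-off below are our own choices, not part of the aloom package):

```r
# predicted category from the ALOOM probabilities of one test sample,
# or NA (NotAvailable) when the leave-one-out models disagree
aloom.predict <- function(probs, cutoff = 0.5) {
  if (min(probs) < cutoff && max(probs) > cutoff) return(NA)  # disagreement
  if (min(probs) >= cutoff) "mutagen" else "nonmutagen"
}

aloom.predict(c(0.61, 0.72, 0.58))   # all models agree -> "mutagen"
aloom.predict(c(0.41, 0.55, 0.48))   # models disagree -> NA
```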

ALOOM is a non-parametric approach in which the binary model and the data define the NotAvailable predictions.

In our experience, ALOOM predictions that are available, i.e. not NotAvailable, have on average higher accuracy; otherwise there would be no point in using the ALOOM approach.

An initial simulation study^{(3)} has shown that ALOOM may be very useful in
active learning.
This means that if one has a binary model and wants to update the training set with new samples, then we would suggest
updating it with samples that ALOOM currently predicts as NotAvailable.
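Given a matrix of ALOOM probabilities for a pool of candidate samples (rows = candidates, columns = leave-one-out models), the candidates to label next could be picked as follows. The matrix here is made-up toy data, and `find.na` is the same disagreement test used further down this page:

```r
# toy matrix of ALOOM probabilities: 4 candidate samples x 5 models
aloom.probs <- rbind(c(0.90, 0.80, 0.85, 0.90, 0.88),  # confident positive
                     c(0.40, 0.60, 0.55, 0.45, 0.50),  # models disagree
                     c(0.10, 0.20, 0.15, 0.10, 0.12),  # confident negative
                     c(0.49, 0.52, 0.50, 0.48, 0.51))  # models disagree
rownames(aloom.probs) <- paste0("cand", 1:4)

# a candidate is NotAvailable when the models straddle the 0.5 cut-off
find.na <- function(x) min(x) < 0.5 & max(x) > 0.5
to.label <- rownames(aloom.probs)[apply(aloom.probs, 1, find.na)]
to.label   # candidates to add to the training set next
```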

ALOOM is not meaningful for model-building algorithms whose results depend on the value of a random seed, so it is not suitable for deep learning models. However, it works fine for random forests with a large-ish number of trees.

ALOOM is a simple idea, but its application is very compute-intensive and thus not really suitable for personal computers.

We use the publicly available mutagenicity dataset from Kazius et al. (2005)^{(6)}.
It contains 4335 compounds, 2400 categorised as “mutagen” and the remaining 1935 compounds as “nonmutagen”.
The dataset is available from the QSARdata^{(7)} R package
and each compound comes with 1579 descriptors. Half of the dataset is used for training and the remaining half for testing.

```r
library(QSARdata)
library(caret)
data(Mutagen)
x <- as.matrix(Mutagen_Dragon)
rownames(x) <- rownames(Mutagen_Dragon)
colnames(x) <- colnames(Mutagen_Dragon)
y <- Mutagen_Outcome
set.seed(1)
lvFolds <- createFolds(y, k = 2)
train.x <- x[-lvFolds[[1]], ]
train.y <- y[-lvFolds[[1]]]
test.x <- x[lvFolds[[1]], ]
test.y <- y[lvFolds[[1]]]
```

The training dataset consists of 2168 samples, while the test dataset has 2167. They are also available here as csv files: train_x.csv, train_y.csv, test_x.csv, test_y.csv.

NOTE 1: On a machine with 48 AMD Opteron 6168 CPUs and 15 GB RAM, using all
48 CPUs, this takes **23 hours**.

NOTE 2: On a machine with 24 Intel Xeon 6240 @ 2.60 GHz CPUs and 4 GB RAM, using all
24 CPUs, this takes **8 hours**.

```r
library(aloom)
library(parallel)
library(randomForest)
ntree <- 1000
num.cores <- detectCores()
fit <- aloom(train.x, train.y, test.x, method = "rf", list(ntree = ntree), mc.cores = num.cores)
```

NOTE 1: On a machine with 48 AMD Opteron 6168 CPUs and 15 GB RAM, using all
48 CPUs, this takes **4 hours**.

NOTE 2: On a machine with 24 Intel Xeon 6240 @ 2.60 GHz CPUs and 4 GB RAM, using all
24 CPUs, this takes **2 hours**.

Prior to calling aloom() we execute cv.glmnet() to find the optimal lambda.

```r
library(aloom)
library(parallel)
library(glmnet)
cv.fit <- cv.glmnet(train.x, train.y, family = "binomial", type.measure = "auc")
selected.lambda <- cv.fit$lambda.1se
lambda <- cv.fit$lambda
model.params <- list(lambda = lambda, alpha = 1, selected.lambda = selected.lambda)
num.cores <- detectCores()
fit <- aloom(train.x, train.y, test.x, method = "glmnet", model.params, mc.cores = num.cores)
```

All Leave-One-Out Models, as well as the original model, are created during the execution of aloom().
Their predictions for the test samples are the returned results.

An aloom object is a list containing:

- predicted.y - predicted categories of test samples produced by the original model
- predicted.prob.y - predicted probabilities of test samples produced by the original model
- aloom.probs - ALOOM probabilities of test samples. It is a matrix whose rownames are the names of the test samples and whose colnames are the names of the training samples.

```r
predicted.y <- fit$predicted.y
predicted.prob.y <- fit$predicted.prob.y
aloom.probs <- fit$aloom.probs
```

Calculate the original model's misclassification error, ALOOM's proportion of NotAvailable predictions and ALOOM's misclassification error.

```r
original.misclassification <- sum(predicted.y != test.y) / length(test.y)
find.na <- function(x) min(x) < 0.5 & max(x) > 0.5
predicted.na <- apply(aloom.probs, 1, find.na)
aloom.proportion.na <- sum(predicted.na) / length(predicted.na)
aloom.misclassification <- sum(predicted.y[!predicted.na] != test.y[!predicted.na]) / sum(!predicted.na)
```

| | Original misclassification | % NotAvailable | ALOOM misclassification |
|---|---|---|---|
| glmnet | 0.196 | 7.24% | 0.17 |
| randomForest | 0.182 | 15.32% | 0.152 |

```r
min.aloom <- apply(aloom.probs, 1, min)
max.aloom <- apply(aloom.probs, 1, max)
```

Calculate the width of every ALOOM individual prediction interval and examine its distribution.

```r
width <- max.aloom - min.aloom
summary(width)
```

ALOOM individual prediction interval width stats are:

| | Min | Q1 | Median | Mean | Q3 | Max |
|---|---|---|---|---|---|---|
| glmnet | 0 | 0.019 | 0.034 | 0.053 | 0.062 | 0.766 |
| randomForest | 0.009 | 0.085 | 0.104 | 0.109 | 0.119 | 0.571 |

```r
id.with.max.width <- rownames(test.x)[which.max(width)]
max.width.range <- c(min.aloom[which.max(width)], max.aloom[which.max(width)])
```

| | Compound ID | Min ALOOM interval | Max ALOOM interval |
|---|---|---|---|
| glmnet | '3224' | 0.135 | 0.901 |
| randomForest | '782' | 0.261 | 0.832 |

1. Damjan Krstajic (2023). aloom: All Leave-One-Out Models. R package version 0.1.1 https://CRAN.R-project.org/package=aloom

2. Krstajic, D., Buturovic, L., Thomas, S., & Leahy, D. E. (2017). Binary classification models with "Uncertain" predictions. arXiv preprint arXiv:1711.09677. https://doi.org/10.48550/arXiv.1711.09677

3. Krstajic, D. (2020). Non-applicability Domain. The Benefits of Defining “I Don't Know” in Artificial Intelligence. Artificial Intelligence in Drug Discovery, 75, 102. https://doi.org/10.1039/9781788016841-00102

4. Krstajic, D. (2021). The Costs and Potential Benefits of Introducing the “I Don’t Know” Answer in Binary Classification Settings. Preprints. https://doi.org/10.20944/preprints202108.0521.v1

5. Hanser, T., Barber, C., Marchaland, J. F., & Werner, S. (2016). Applicability domain: towards a more formal definition. SAR and QSAR in Environmental Research, 27(11), 865-881.

6. Kazius, J., McGuire, R., & Bursi, R. (2005). Derivation and validation of toxicophores for mutagenicity prediction. Journal of medicinal chemistry, 48(1), 312-320.

7. Max Kuhn (2013). QSARdata: Quantitative Structure Activity Relationship (QSAR) Data Sets. R package version 1.3. https://CRAN.R-project.org/package=QSARdata

8. Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05

9. A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18--22.

10. Friedman, J., Tibshirani, R., & Hastie, T. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. https://doi.org/10.18637/jss.v033.i01