Image credit: Nelo Hotsuma

Day 36-37: Concerned DALEX

I was working on a longer post continuing the metaprogramming series, and realised I wasn’t going to get it done this evening. But it’s been a couple of days since I tried out something new, so I resorted to the twitters to find inspiration. As always, the wonderful twitter rstats folks rose to the occasion:

Ooh. What is this DALEX package? I am curious. I hope it’s as kind and lovely as the concerned dalek:

A very brief investigation!

I don’t have a lot of time this evening, but upon checking out the homepage, I discover that DALEX is short for Descriptive mAchine Learning EXplanations. This sounds very lovely, and I think Concerned Dalek would be very concerned to know that DALEX is working hard to help the humans understand what the machine learners are doing. Concerned Dalek would not want us to be worried about these things. All I have time for is a quick run through of two of the examples, but they are nice!

library("breakDown")
library("DALEX")
## Welcome to DALEX (version: 0.2.2).

First we run a linear regression model predicting wine quality as a function of pH, sugar, sulphates and alcohol. Then we imagine seeing a new bottle of wine with known properties, and want to use this regression model (imaginatively named mod) to make a prediction about whether this new wine will be any good:

mod <- lm(quality ~ pH + residual.sugar + sulphates + alcohol, data = wine)

new <- data.frame(
  citric.acid = .35,
  sulphates = .6,
  alcohol = 12.5,
  pH = 3.36,
  residual.sugar = 4.8
)

predict(object = mod, newdata = new)
##        1 
## 6.648226

That’s nice, but it isn’t immediately obvious why the model has made that prediction. To help with this, the explain function in DALEX lets me create an explainer object ex, which I can then use to tell me something about the prediction:

ex <- explain(model = mod, data = wine)
pre <- prediction_breakdown(explainer = ex, observation = new)
print(pre)
plot(pre)

##                            variable contribution  variable_name
## alcohol              alcohol = 12.5   0.70174103        alcohol
## pH                        pH = 3.36   0.05780936             pH
## sulphates           sulphates = 0.6   0.04865885      sulphates
## residual.sugar residual.sugar = 4.8  -0.03789283 residual.sugar
## 1                   final_prognosis   0.77031640               
##                variable_value cummulative sign position label
## alcohol                  12.5   0.7017410    1        1    lm
## pH                       3.36   0.7595504    1        2    lm
## sulphates                 0.6   0.8082092    1        3    lm
## residual.sugar            4.8   0.7703164   -1        4    lm
## 1                               0.7703164    X        5    lm

If I’ve understood this correctly, what the figure is showing me is what happens to the model prediction as I add the predictors in one by one. With just alcohol included, the prediction is just under 6.6. It goes up a little when pH is included, a little more when sulphates are added, and then goes down when residual sugar is considered.
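Incidentally, for a plain linear model I believe these contributions are just each coefficient times how far the new value sits from that predictor’s training mean, stacked on top of the average prediction (which is why the final_prognosis of 0.77 is the prediction minus the mean wine quality). Here’s a little base-R sketch of that idea using made-up data, since I don’t want to hard-code the wine variables here:

```r
# Sketch: reconstruct breakdown-style contributions for an lm by hand.
# The data and variable names are invented for illustration only.
set.seed(1)
dat <- data.frame(
  x1 = rnorm(100, mean = 10),
  x2 = rnorm(100, mean = 3)
)
dat$y <- 2 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(100, sd = 0.1)

fit <- lm(y ~ x1 + x2, data = dat)
new_obs <- data.frame(x1 = 12, x2 = 2.5)

# Baseline: the average prediction over the training data
baseline <- mean(predict(fit))

# Contribution of each predictor:
# coefficient * (new value - training mean of that predictor)
contrib <- coef(fit)[c("x1", "x2")] *
  (unlist(new_obs) - colMeans(dat[c("x1", "x2")]))
contrib

# The contributions stack up to the model's actual prediction
baseline + sum(contrib)
predict(fit, newdata = new_obs)  # should match
```

For an additive model like lm the order you add the predictors in doesn’t change the individual contributions, which is (I think) why the stacked-bar picture is so easy to read in this case.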

It is kind of nice

For linear regression models, I could probably have done this myself with only slightly more effort, but most machine learning models are harder to interpret and I always feel very wary of trusting models whose behaviour is opaque. It’s helpful, then, that DALEX also lets you do this for things like random forest models:

library("randomForest", quietly = TRUE)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
wine_rf_model4 <- randomForest(quality ~ pH + residual.sugar + sulphates + alcohol, data = wine)
wine_rf_explainer4 <- explain(wine_rf_model4, data = wine, label = "model_rf")
wine_rf_predict4 <- prediction_breakdown(wine_rf_explainer4, observation = new)

print(wine_rf_predict4)
plot(wine_rf_predict4)

##                              variable contribution  variable_name
## 1                         (Intercept)    0.0000000      Intercept
## alcohol              + alcohol = 12.5    0.5948835        alcohol
## sulphates           + sulphates = 0.6    0.2246142      sulphates
## pH                        + pH = 3.36    0.2490506             pH
## residual.sugar + residual.sugar = 4.8    0.2843780 residual.sugar
## citric.acid      + citric.acid = 0.35    0.0000000    citric.acid
## 11                    final_prognosis    1.3529263               
##                variable_value cummulative sign position    label
## 1                           1   0.0000000    0        1 model_rf
## alcohol                  12.5   0.5948835    1        2 model_rf
## sulphates                 0.6   0.8194977    1        3 model_rf
## pH                       3.36   1.0685483    1        4 model_rf
## residual.sugar            4.8   1.3529263    1        5 model_rf
## citric.acid              0.35   1.3529263    0        6 model_rf
## 11                              1.3529263    X        7 model_rf

Yay. I like this. At some stage I’d like to have the time to look into this properly, but for now that will have to do.

I know, kind dalek. I know. ❤️

Danielle Navarro
Associate Professor of Cognitive Science