A much-delayed addendum, written while I’ve been trying to get the following paper out,
which deals with ensuring the stability of model distillation using trees.
This actually touches on both the global approach to understanding machine learning that I have mostly explored, and the local approach to explaining a particular prediction (see http://blogs.cornell.edu/modelmeanings/2014/10/17/local-interpretation/).
I’ve previously discussed the issue of uncertainty in machine learning in terms of global interpretation — are you trying to understand how a particular function arrives at a prediction, or are you trying to say something about the underlying causes of that prediction?
Here I’ll ask the same question about local explanations. This is particularly relevant in the light of the European General Data Protection Regulation (GDPR) which can be read to impose a “right to an explanation” for decisions that are made about individuals. Exactly what such a right entails (and whether it really is implied in the GDPR) is not currently well defined; interpretations vary from a requirement to provide a formula, to identifying actions that could be taken to change a decision, to requiring something closer to causal reasoning identifying why that decision was reasonable.
To be concrete, let’s take a bank that develops a tool to automatically decide whether or not to give a mortgage to an applicant. If they don’t approve the loan, they can be asked “Why not?” with the implication that “because my neural network said so” would not really cut it.
In fact, this is not really so new an idea: I recall hearing someone from Fair Isaac discussing just this problem in the early 2000s, and their solution was pretty natural: use a decision tree. I discussed the basic ideas of decision trees earlier in my blog, where we see that a tree has a pretty easy-to-follow glyph. There is also a natural explanation for why you didn’t get your loan: the decision for the last node on the tree. ****
I also noted that trees don’t perform very well. Fair Isaac’s solution was to first train a neural network (if I remember correctly) and then use this to generate a huge amount of data to create a tree that mimics the neural network. This idea has since been given the term “model distillation”: first generate a complex model (a teacher), then produce an interpretable model (a student) that mimics the predictions of the original model. A few years ago I repeated Fair Isaac’s ideas for an application that involved shortening psychiatric questionnaires (you can traverse the tree asking questions only as you need them). I couldn’t find a citation to the Fair Isaac work, but it does seem that the idea goes back to at least 1995.
The paper I shamelessly plugged above asks “how much data do we need to stabilize the tree structure?” If we characterize our procedure as
1. Train some ML model F(X) to predict Y from X.
2. Generate a huge set of new example X’s and give each example the response F(X).
3. Use this large new data set to train a Tree that mimics F.
we (well, mostly my students Yichen and Zhengze Zhou) asked “How many new X’s do you need so that repeating Steps 2 and 3 gets you the same tree?” The answer turns out to be a couple of orders of magnitude more than seems to be regularly used. And we developed some tools to assess the stability of this type of approximation tree and to look at how deep that tree should be (more on that later).
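To make the repeat-and-compare question concrete, here is a rough sketch in plain Python. It is not the procedure from the paper: the teacher function, the uniform sampling of new X’s, the fixed grid of thresholds, and the restriction to a single root split are all assumptions made to keep the example short. It repeats Steps 2 and 3 several times and reports how often the root split lands on the same feature.

```python
import random

def teacher(x):
    # Hypothetical stand-in for a fitted model F(X); x[0] matters slightly more than x[1].
    return 1.0 * x[0] + 0.9 * x[1]

def sample_inputs(n, rng):
    # Step 2: draw new X's (uniform on the unit square, an assumption of this sketch).
    return [(rng.random(), rng.random()) for _ in range(n)]

def sse(values):
    # Sum of squared deviations from the mean; zero for an empty group.
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_root_split(X, y, n_thresholds=9):
    # Step 3, truncated to depth one: pick the (feature, threshold) pair
    # that minimises the summed squared error of the two child nodes.
    best = None
    for j in range(len(X[0])):
        for k in range(1, n_thresholds + 1):
            t = k / (n_thresholds + 1)
            left = [yi for xi, yi in zip(X, y) if xi[j] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best[1], best[2]

def root_feature_agreement(f, n_points, n_repeats, seed=0):
    # Repeat Steps 2 and 3, then report the share of runs whose root split
    # uses the most frequently chosen feature (1.0 means fully stable).
    rng = random.Random(seed)
    roots = []
    for _ in range(n_repeats):
        X = sample_inputs(n_points, rng)
        y = [f(x) for x in X]
        roots.append(best_root_split(X, y)[0])
    most_common = max(set(roots), key=roots.count)
    return roots.count(most_common) / n_repeats
```

Calling `root_feature_agreement(teacher, n, 20)` for increasing `n` (say 50, 500, 5000) gives a crude picture of how many generated points it takes before the first split stops flipping between the two features. A full tree would need the same check at every node, which is where the cost grows.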
Now a reasonable question might be “why should I care?” It actually turns out that the huge amounts of data we ended up using weren’t necessary to produce a tree that gave accurate predictions — a tree could split on X1 first and then X2, or vice versa and end up with fairly similar predictions. We needed this much data solely to ensure that the structure of the tree remained stable.
So again, why should I care? This is an iteration on uncertainty in ML. If I am just seeking to explain how a prediction was arrived at, I could obtain the tree and actually use it to make predictions (as opposed to explaining predictions). I can then readily claim that I am providing a description of how the tree I happen to use arrives at its predictions. This is ok, even if the structure of my tree is chosen by random chance.
However, if I don’t replace my neural network or random forest with the tree, and just use it to explain a different model, we might start getting dubious. If you are told you didn’t receive a mortgage because you are over the age of 37, but there is a parallel universe in which exactly the same decision is explained by your income being below $50,000, you might start questioning the worth of that explanation.
Trees, of course, are only one example of this. LIME, currently the most popular source of local explanations and based on approximating the gradient of F(X), also uses a random sample whose stability we might query. I have to admit to not yet having gotten around to looking at that.
The point of this is that you can have gotten your original F(X) any way you like; we are only looking at Monte Carlo error, the variability due to the random samples you generate in order to arrive at your explanation. There are already plenty of debates about what a “right to an explanation” entails, but I’ve yet to hear of “statistical stability” being added as a criterion. Perhaps it should be, or at least it should be made explicit that various others (such as some form of causal explanation) assume such stability.
Of course, we can ask about stability more generally: does our explanation have to be non-random, or is it ok to pick up on random features of the data? I.e., does the fact that people over 180cm tall and born on October 19 happen to have an unusually high default rate in our data set allow us to use that as an explanation, even if another data set would be very unlikely to replicate it? Alternatively, what standard of scientific replicability of explanations is required in the GDPR?
We couldn’t do much in our paper on that. We can assess whether the tree is capturing signal that is distinguishable from noise when it makes a split. We could potentially ask whether the same split would be chosen, not just with another randomly generated sample from F(X), but if we re-learned F(X) using a new data set. But changes in one split cascade down the rest of the tree and it’s then very hard to represent variability between different trees, or to know what to do with it, even if you could.
So this is imperfect: there is a form of explanation where “this is what we do, you don’t get to query why” means you don’t have to worry about uncertainty. But I think that disappears as soon as you use one model to predict and a second as an explanation. It seems that full-on scientific replicability might be too strong a requirement (that would probably render all of ML useless), but what should the standard be?
There’s not much work on assessing such stability. I don’t know how stable LIME is, and we were only able to assess whether our trees were simply modeling random variation when random forests were used to produce F(X), since there we know something about their stability. But the question is barely being asked, and I think it will become practically relevant fairly quickly.
**** Of course, changing other covariates could affect a decision higher up in the tree, which also changes the decision about your mortgage. So this is fairly imperfect in itself.