So far, the types of interpretation of machine learning models that I have discussed have been global in scope. That is, we might ask, “Does x1 make a difference anywhere?”. This is very much a scientific viewpoint — we want to establish some universal laws about what factors make a difference in outcome.
There is, however, an alternative need for explanation that is much more local in scope — what makes a difference close to a particular point? This was brought home to me at the start of this year when I spent a couple of months visiting IBM’s Healthcare Analytics working group (a wonderful, smart set of people who very generously hosted me). Without giving away too many industrial secrets, the group is developing systems based on electronic medical records to support medical decision making: we might want to forecast whether a patient has a high risk of developing some particular condition, for example. There are great machine learning tools for doing this, and the folks at IBM achieve remarkably good performance with them.
The question, from a doctor’s point of view, is now “What do I do with this information?” It’s not much good knowing that someone will develop diabetes *** in six months without some idea about what you can do to help. We might therefore be interested in understanding what changes could be made to the patient’s covariates to reduce their risk. This isn’t a global interpretation — we’re not going to be able to change their weight or blood sugar dramatically (at least not in short order), but we might be able to nudge them in a positive direction. The question is then “What should we nudge, and what should we prioritize nudging?”
To do this, you don’t need a global understanding of the way the model functions, you only need to know how the prediction changes locally around a patient. One way to do this is to experiment with changing a covariate a little (“How much does this prediction change if we reduce weight by 10lb?”). 10lb is rather arbitrary, so we might be interested in something like a derivative: how fast does risk increase with weight at these covariates?
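To make this concrete, here is a minimal sketch of that perturb-and-compare idea. Everything in it is invented for illustration — the `risk` function is a made-up logistic-style stand-in for a fitted model, and the patient covariates are fabricated — but the finite-difference recipe is the same whatever the model:

```python
import numpy as np

def local_sensitivity(predict, x, j, h):
    """Central finite difference: how fast does the prediction
    change with covariate j near the point x? `predict` is any
    fitted model's prediction function (a stand-in here)."""
    x_up, x_dn = x.copy(), x.copy()
    x_up[j] += h
    x_dn[j] -= h
    return (predict(x_up) - predict(x_dn)) / (2 * h)

# A made-up risk function standing in for a fitted ML model:
risk = lambda x: 1.0 / (1.0 + np.exp(-(0.03 * x[0] + 0.5 * x[1] - 8.0)))

# Hypothetical covariates: weight (lb) and blood sugar (invented units)
patient = np.array([200.0, 6.0])
print(local_sensitivity(risk, patient, j=0, h=1.0))  # risk change per lb
```

The step size h = 1 lb is as arbitrary as the 10 lb in the question above, which is exactly why a derivative-like quantity is the more natural object.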
Unfortunately, not all ML models are differentiable, or easy to differentiate, but we could imagine some approximation to this. One way to explicitly look at local structure is by local models. These are parametric models — of the sort I have just been decrying — but estimated at each point based only on the points close by.
The easiest way to think about this is with a very simple example: the Nadaraya-Watson smoother in 1 dimension. I’ve sketched an example of this below, but the idea is simple — we take a weighted average of the points around the place in which we are interested, where the weights decrease the further the observations are from our prediction point.
This, by itself, does not generalize easily, but we can instead think of it as minimizing squared error with weights:
μ(x) = argmin_μ ∑i K(x, Xi)(Yi – μ)^2
(to unpack this notation, argmin_μ means the value of μ at which the expression is minimized; K is a “kernel” — think of it as a Gaussian bump centered at x). This can, in fact, be very nicely generalized to something like
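As a sketch, the Nadaraya-Watson estimator is only a few lines of code. The data and bandwidth here are made up purely for illustration; the kernel is the Gaussian bump just mentioned:

```python
import numpy as np

def nadaraya_watson(x0, X, Y, bandwidth):
    """Kernel-weighted average of Y near x0: the value of mu that
    minimizes sum_i K(x0, X_i) * (Y_i - mu)^2."""
    w = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)  # Gaussian bump at x0
    return np.sum(w * Y) / np.sum(w)

# Toy 1-d data (purely illustrative): noisy observations of sin(x)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 200))
Y = np.sin(X) + rng.normal(0, 0.2, 200)
print(nadaraya_watson(5.0, X, Y, bandwidth=0.5))  # should be near sin(5)
```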
β(x) = argmin_β ∑i K(x, Xi)(Yi – g(Xi, β))^2
where g(Xi,β) is a linear regression (with coefficients beta), for example. There is nothing in here about x being univariate anymore. Now we have a linear regression at x, but it will change over x. The nice thing about this is that we can apply all the standard interpretation for linear regression, but this interpretation is specific to x. So we can look at β(x) to see how quickly risk changes with weight, for example, at this value of the covariates. We can also explore structure — interaction terms etc — that can give us a more detailed view of the response, but at a local scale.
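Here is a minimal sketch of that local linear regression — just the kernel-weighted least squares problem above, solved at a point. The data, bandwidth, and the name `local_linear_beta` are all invented for illustration:

```python
import numpy as np

def local_linear_beta(x0, X, Y, bandwidth):
    """Solve argmin_beta sum_i K(x0, X_i) (Y_i - [1, X_i] @ beta)^2.
    Returns beta(x0); beta[1:] are the local slopes, i.e. how fast
    the prediction changes with each covariate near x0."""
    d2 = np.sum((X - x0) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)        # Gaussian kernel weights
    A = np.hstack([np.ones((len(X), 1)), X])      # intercept + covariates
    sw = np.sqrt(w)[:, None]                      # weighted least squares
    beta, *_ = np.linalg.lstsq(A * sw, Y * sw.ravel(), rcond=None)
    return beta

# Toy data where the slope in x1 genuinely varies over x:
# Y = x1^2 + x2 + noise, so at x1 = 1 the local x1-slope is about 2
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))
Y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 500)
b = local_linear_beta(np.array([1.0, 0.0]), X, Y, bandwidth=0.3)
print(b[1], b[2])  # local slopes: roughly 2 and 1 at this point
```

Note how the fitted slope b[1] recovers the local derivative of a curved function — the global linear fit would have averaged that curvature away.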
The crucial aspect of this is that it is the local scale that matters when one is asking hypotheticals about specific prediction points — “What would happen if we just changed this, a little bit?” And in fact, this potential-intervention model is really what is relevant.
A last question: can we get back from here to a global interpretation? Well, in some sense, yes. If we think of ourselves as asking “Is the derivative of F(x) with respect to x1 zero everywhere?”, we can express all the results about additive structure discussed two posts ago in terms of some derivative being zero. For example, we could see if we could reduce a three-dimensional function to two two-dimensional ones:
F(x1,x2,x3) = f1,3(x1,x3) + f2,3(x2,x3)
(i.e., there is no interaction between x1 and x2), either by looking at f1,2(x1,x2) as well as f1,2,3, or we could examine whether
∂²F / ∂x1 ∂x2 = 0
Alternatively, we could look at the size of these derivatives, since none are likely to be exactly zero.
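As a sketch of what that check might look like in practice, here is a central-difference estimate of the mixed partial. The two toy functions are invented, and the step size h has to be chosen by hand:

```python
import numpy as np

def mixed_partial(F, x, i, j, h):
    """Finite-difference estimate of d^2 F / dx_i dx_j at x.
    Zero everywhere would indicate no interaction between
    covariates i and j."""
    def shift(di, dj):
        z = x.copy()
        z[i] += di * h
        z[j] += dj * h
        return F(z)
    return (shift(1, 1) - shift(1, -1)
            - shift(-1, 1) + shift(-1, -1)) / (4 * h ** 2)

# Additive in x1 and x2: no x1-x2 interaction, mixed partial ~0
F_add = lambda x: x[0] * x[2] + x[1] * x[2]
print(mixed_partial(F_add, np.array([1.0, 2.0, 3.0]), i=0, j=1, h=0.1))

# Contains an x1*x2 term: mixed partial ~1
F_int = lambda x: x[0] * x[1] + x[2]
print(mixed_partial(F_int, np.array([1.0, 2.0, 3.0]), i=0, j=1, h=0.1))
```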
In many ways this is appealing — it appears to remove the problems we had about where we integrate, and it should be easier to assess. I think there are three issues with it, however:
- Most importantly, derivatives are not measured in the same units as each other, or as the original outcome. They’re rates, and thus there are real problems in deciding how to compare them; when is a derivative large relative to the others?
- We haven’t actually escaped the range-of-integration problem. We still have to assess the size of ∂²F / ∂x1 ∂x2, presumably integrated over some set of covariates. It does, however, make it easier to localize this measure of size to the covariates we have observed.
- In many models, derivatives are not easy to evaluate. In some they are, but in trees, for example, derivatives are zero nearly everywhere (the model proceeds in steps), so you need to specify some finite-difference method that looks a certain distance away. But how far? Despite having avoided this approach for a while, though, I’m seeing more merit in it, at least insofar as it provides a way to link local to global interpretation.
*** I chose diabetes because it was not one of the diseases I was involved in forecasting, so I don’t know how, or whether, IBM is looking at this.