On Interpretation In Multiple Models

No, dear reader, it has not taken me this long to come back to this blog because I am reluctant to give up talking about my own work. Rather, the discipline imposed on me by official service commitments has sapped my discipline in other areas.

In this post, I will leave off writing about interpretability and interpretation in machine learning methods and instead focus on the details of statistical interpretation when conducting model selection. This has become particularly relevant given the recent rise of the field of “selective inference”, led by Jonathan Taylor (http://statweb.stanford.edu/~jtaylo/) and colleagues.

While the work to be discussed is conducted in the context of formal model selection (ie, done automatically, by a computer), I want to step back and look at an older problem — the statistician effect. This was brought up by a paper which conducted the following experiment: the authors collected data on red cards given to soccer (sorry Europeans, I’m unapologetically colonial in vocabulary) players along with a measure of how “black” they were and some other covariates. They then asked 29 teams of statisticians to use these data to determine whether darker players were given more red cards. The result was that the statisticians couldn’t agree — about 60% said there was evidence to support this proposition, another 30% said there wasn’t. They used a remarkable variety of models and techniques.

I wrote a commentary on this for STATS.org. I won’t go into the details here, but the summary is that there is a random effect for statistician (ask a different statistician, you’ll get a different answer) but the problem in question exaggerated the effects of this by focussing on statistical significance — all statisticians had p-values near 0.05 but fell on one side or the other. Nonetheless, it does lead one to the question of how you could quantify the “statistician variance”.

Enter model selection. Declaring automatic model selection procedures (either old-school stepwise methods or more modern LASSO-based methods) a solution to statisticians using their judgement is a pretty long bow (and doesn’t account for various other modeling choices or outlier removal etc), but it will allow me to make a philosophical connection later. Modern data sets have induced considerable research into model selection, mostly under the guise of “What covariates do we include in the model?”

Without going into details of these techniques, they all share two problems for statistical inference:

1. Traditional statistical procedures such as hypothesis tests, p-values and confidence intervals are no longer valid. This is because the math behind these assumes you fixed your model before you saw the data. It doesn’t account for the fact that you used your data to choose only those covariates which had the strongest relationship to the response, meaning that the signals tend to appear stronger in the data than they are in reality. (For those uninitiated in statistical thinking: if we re-obtained the data 1,000 times and repeated the model-selection exercise for each data set, covariates with small effects would only be included in those data sets that over-estimate their effects).

2. We have no quantification of the selection probability. That is, under a hypothetical “re-collect the data 1,000 times” set-up, we don’t know how often we would select the particular model that we obtained from the real data. Worse, there are currently no good ways to estimate this probability. A small simulation, sketched just after this list, illustrates both problems.
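To make both problems concrete, here is a minimal simulation of my own devising (it uses a crude marginal-screening rule, not any method from the paper or from the selective inference literature). It repeatedly generates data from a sparse linear model, selects covariates, and records which model was chosen and how large the weak effect looks when it survives selection.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p, n_rep = 100, 10, 1000
beta = np.zeros(p)
beta[0], beta[1] = 0.5, 0.15          # one strong and one weak true signal

selected_models = {}
weak_estimates_when_selected = []

for _ in range(n_rep):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    # crude "selection": keep covariates whose marginal statistic clears a threshold
    keep = np.abs(X.T @ y) / n > 0.15
    model = tuple(np.flatnonzero(keep))
    selected_models[model] = selected_models.get(model, 0) + 1
    if keep[1]:                        # the weak covariate made it in
        fit = LinearRegression().fit(X[:, keep], y)
        weak_estimates_when_selected.append(fit.coef_[list(model).index(1)])

# Problem 1: conditional on being selected, the weak effect looks inflated.
print("true weak coefficient:         ", beta[1])
print("mean estimate when selected:   ", np.mean(weak_estimates_when_selected))
# Problem 2: the selected model itself varies a lot across replications.
print("distinct models selected:      ", len(selected_models))
print("share of the most common model:", max(selected_models.values()) / n_rep)
```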

Jonathan Taylor (in collaboration with many others) has produced a beautiful framework for solving problem 1.*** This goes by the name of selective inference. Taylor describes this as “condition on the model you have selected”; in layman’s terms — among the 1,000 data sets that we examined above, throw away all those that don’t give us the model we’ve already selected. Then use the remaining data sets to examine the statistical properties of our parameter estimation. Taylor has shown that in so doing, he can provide the usual quantification of statistical uncertainty for the selected parameters.
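Taylor’s machinery works this conditioning out analytically; purely as a conceptual sketch, though, the “throw away the data sets that select a different model” description can be simulated by brute force (the selection rule and the numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_rep, threshold = 50, 20000, 0.2
beta = 0.1                              # a weak true effect

estimates_all, estimates_selected = [], []
for _ in range(n_rep):
    x = rng.standard_normal(n)
    y = beta * x + rng.standard_normal(n)
    bhat = x @ y / (x @ x)              # least-squares slope
    estimates_all.append(bhat)
    if abs(bhat) > threshold:           # the selection event: "x is in the model"
        estimates_selected.append(bhat)

# Unconditionally the estimator is centred on the truth...
print("mean over all data sets:          ", np.mean(estimates_all))
# ...but conditional on selection it is not; selective inference calibrates its
# tests and intervals against this conditional distribution instead.
print("mean over selected data sets:     ", np.mean(estimates_selected))
print("fraction of data sets selecting x:", len(estimates_selected) / n_rep)
```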

The (a?) critique of this is that we are conditioning on the model that we get. That is, we ignore all the data sets that don’t give us this model. In doing this, we throw out the variability associated with performing model selection. That is, we still haven’t solved Problem 2. above.

Taylor’s response to this is that the model you select changes the definition of the parameters in it. I’ll illustrate this with an example I use in an introductory statistics class: if you survey students and ask their height and the height of their ideal partner, you see a negative relationship — it looks like the taller you are, the shorter your ideal partner. However, if you include gender in the model, the relationship switches. So the coefficient of your height in predicting the height of your ideal partner is interpreted one way when you don’t know gender and in another way if you do. If we perform automatic model selection, we’ve allowed the data to choose the meaning of the coefficient, and if we repeated the experiment we might get a different model and hence give the coefficient a different meaning. Taylor would say “Yes, the hypotheses that we choose to test are chosen randomly, but this is of necessity since they mean different things for different models. In any case, this is just what you were doing informally when you let the statistician choose the model by hand.”

I see the point here, but nonetheless it makes me uneasy. In the paper I began with, the researchers asked an intelligible question about racism in football and I don’t think they were looking for an “if you include these covariates” answer. Now to some extent, the statisticians’ analyses of this question support Taylor’s arguments — one team had simply correlated the number of red cards with the player’s skin tone and left it at that; others felt that they needed to worry about whether more strikers got red cards and perhaps more black-skinned players took that position, and other sorts of indirect effects. In fact, with the exception of the one team that looked at correlations, the question was universally interpreted as “Can you find an effect for skin tone, once you have also reasonably accounted for the other information that we have?” I generally think this is what statisticians, and the general public, understand us to be doing when we try to present these types of relationships. (Notably, there is a pseudo-causal interpretation here; the data do not support an explicit causal link, but by accounting for other possible relationships we can try to get as close as we can — or at least rule out some correlated factor).

I think this is the key to my concerns. I see model selection as part of the “reasonably accounting for all the information we have” part of conducting inference. In particular, we hope that in performing model selection, the covariates that we remove have small enough effects that removing them doesn’t matter. (Or at least that we don’t have enough information to work out that they do matter). Essentially, my response to Taylor is that “The interpretation of a coefficient doesn’t change between two models when they differ only in parameters that are very close to zero”. That is, model selection might be better termed “model simplification” — it doesn’t change the model (much), it just makes it shorter to write down. The inference I want includes both 1. and 2. — my target is the model that makes use of all covariates as well as possible (if I had enough data to do this) and I want my measure of uncertainty to account for the fact that I have done variable selection with the aim of obtaining a simpler model.

Of course, this just leaves us back in the old hole — I care about both Problems 1 and 2 and I have no good way to deal with that. There has been a school of work on “stability selection”: essentially bootstrapping to look at how frequently each covariate is used in the model. This has to be done quite carefully to have valid statistical properties, which is a first problem. I have to confess a further problem, though: from stability selection you will get a probability of including each covariate in the model, and then a notion of how important it is, if included. If I have to look through all covariates, why do selection at all? There are plenty of regularization techniques (e.g. ridge regression) that can be employed with a large number of covariates and that are much easier to understand than model selection. Why not simply use them and rank the covariates by how much they contribute? I’m not convinced this won’t do just as well in the long run.
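For concreteness, here is a bare-bones sketch of the stability-selection idea (subsample repeatedly, run the lasso, and count how often each covariate is selected); the careful versions come with tuning rules and error-control arguments that I am ignoring here.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 20
beta = np.zeros(p)
beta[:3] = [1.0, 0.5, 0.25]             # three real signals, the rest are noise
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

n_sub = 200
counts = np.zeros(p)
for _ in range(n_sub):
    idx = rng.choice(n, size=n // 2, replace=False)   # subsample half the data
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    counts += fit.coef_ != 0

selection_frequency = counts / n_sub
for j in np.argsort(-selection_frequency)[:5]:
    print(f"x{j + 1}: selected in {selection_frequency[j]:.0%} of subsamples")
```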

 

 

*** I call this a framework because the specifics of carrying it out for various statistical procedures still require a lot of work.

Statistical Inference for Ensemble Methods

It’s taken me a while to get back to this blog, so thanks to those of you who are still following. The break was partly because I’ve been back from sabbatical and facing the ever-increasing administration burden associated with recently-tenured faculty (there’s a window between being promoted and learning to say no to everything that I think the administration is very good at exploiting).

It’s also partly because I didn’t want to keep harping on about my own research. I will finish off from where I left my last blog post (on producing uncertainties about the predictions from random forests) but I want to then get back to philosophical musings, particularly about some recent developments in statistics.

So, back to self indulgence. Lucas and I managed to produce a central limit theorem for the predictions of random forests (paper now accepted at the Journal of Machine Learning Research after much debate with referees). Great! Now what do we do with it?

Well, one thing is to simply use them as a means of assessing specific predictions. As a potential application, Cornell’s Laboratory of Ornithology has a wonderful citizen science program called eBird, which collects data from amateur birdwatchers all over the world. They use a random forest-like procedure to build models to predict the presence of birds throughout the US, and one of the uses of these is to advise the Nature Conservancy about particular land areas to target for lease or purchase. Clearly, they would like to obtain high-quality habitat, in this case for migratory birds, and can do so based on the models that the Lab of O produces. These currently do not provide statements of uncertainty, but we might reasonably think about posing the question “Would a plot with 90% +/- 10% chance of bird use really be better than 87% +/- 3%?” We know the second spot is pretty good, the first might be very good, but we’re much less sure of that.

More importantly, we can start trying to use the values of the predictions to learn about the structure of the random forest. A first approach to this is simply to ask

Do forests that are allowed to use covariate x1 give (statistically) different predictions to those that are not?

This can be expressed simply as a hypothesis: if we divide the covariates into x1 and x2, say, we can ask whether the relationship

f(x1,x2) = g(x2)

holds everywhere. Formally, we state this in statistics as a null hypothesis and ask if there is evidence that it isn’t true.

We can assess this by simply looking at the differences between predictions at a representative set of covariate points. We of course need to assess the variability of this difference, taking into account the covariance between predictions at nearby points. It actually turns out that the theoretical machinery we developed for a single prediction from a random forest is fairly easily extensible to the difference between random forests and to looking at a vector of predictions. To formally produce a hypothesis test, we have to look at a Mahalanobis distance between predictions, but with this we can conduct tests pretty reasonably.
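As a sketch of the mechanics only (the covariance estimate itself is the hard part, and I am simply assuming it has been supplied, e.g. from the CLT machinery), the resulting test is a quadratic form in the vector of prediction differences, compared against a chi-squared distribution:

```python
import numpy as np
from scipy import stats

def prediction_difference_test(pred_full, pred_reduced, cov_diff):
    """Test H0: the two forests give the same predictions at all k test points.

    pred_full, pred_reduced : predictions from the two forests at k test points
    cov_diff : (k, k) estimated covariance of the difference vector
               (assumed to be supplied, e.g. from a CLT-based estimate)
    """
    d = np.asarray(pred_full) - np.asarray(pred_reduced)
    stat = d @ np.linalg.solve(cov_diff, d)      # Mahalanobis-type quadratic form
    pval = stats.chi2.sf(stat, df=len(d))
    return stat, pval

# hypothetical usage, with made-up numbers:
cov = 0.01 * np.eye(4)
print(prediction_difference_test([0.90, 0.70, 0.40, 0.20],
                                 [0.85, 0.72, 0.41, 0.18], cov))
```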

In fact, when we did this, we found that our tests gave odd results. Covariates that we knew were not important (because we scrambled the values) were appearing as significant. This was odd, but a possible explanation is that this works just like random forests: a little extra randomness helps. More comforting was the fact that if we compared predictions from two random forests with two different scramblings of a covariate, the predictions were not statistically different. This led us to suggest that a test of whether x1 makes a difference be conducted by comparing a forest with the original values of x1, and one with these values scrambled and that seemed to work fairly well.

Tests between different forests conducted in this way are problematic, however, in two ways. First, the need to scramble a covariate is rather unsatisfying; second, it limits the sort of questions we can ask. We cannot, for example, propose a test of additivity between groups of variables:

f(x1,x2) = g(x1) + h(x2)                 (*)

or, in a more complicated form:

f(x1,x2,x3) = g(x1,x3) + h(x2,x3)

Here x1, x2, and x3 are intended to be non-overlapping groups of columns in the data.

What we decided to do in this case goes back to old ideas of mine, which I think I mentioned in an earlier post. That is, we can assess these quantities if we have a grid of values at which we make the predictions. That is, if we have a collection of x1’s and of x2’s and we look at every combination of them, we can ask how far away from (*) f(x1,x2) is. I’ve illustrated this in the figure below (rather than drawing out a mathematical calculation here), but this comes out to just be a standard ANOVA on the grid.

[Figure: diagram of the additivity test carried out on a grid of prediction points]

Now in order to test this statistically, we again need to know the covariation between different predictions, but our central limit theorem does this for us. ***  These tests turn out to be conservative, but they still have a fair amount of power.
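In code, the grid calculation is nothing more exotic than a two-way ANOVA decomposition of the predictions. The sketch below just measures what fraction of the variation over the grid sits in the interaction term; the formal test additionally needs the covariance of the predictions, which I am not attempting here.

```python
import numpy as np

def interaction_share(predict, x1_grid, x2_grid):
    """Fraction of variance over the grid not explained by an additive fit.

    predict : function taking scalars (x1, x2) -> prediction (e.g. a fitted forest)
    """
    G = np.array([[predict(a, b) for b in x2_grid] for a in x1_grid])
    grand = G.mean()
    row = G.mean(axis=1, keepdims=True) - grand    # main effect of x1
    col = G.mean(axis=0, keepdims=True) - grand    # main effect of x2
    resid = G - grand - row - col                  # interaction residual
    return (resid ** 2).sum() / ((G - grand) ** 2).sum()

# hypothetical usage with toy functions: additive versus non-additive
g1 = np.linspace(0, 1, 20)
g2 = np.linspace(0, 1, 20)
print(interaction_share(lambda a, b: 2 * a + np.sin(3 * b), g1, g2))  # essentially 0
print(interaction_share(lambda a, b: a * b, g1, g2))                  # clearly > 0
```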

The point here is a shift in viewpoint: statistical interpretation is derived from the predictions that are made rather than from the internal structure of a functional relationship. By deriving our notion of structure and inference with respect to predictions we can be flexible about the models we fit, we can allow high-dimensional data to enter as nuisance parameters (without a lot of model checking) and we build a bridge to machine learning methods. Leo Breiman wrote a paper in 2001 in which he outlined the distinction between statistics and what he called “algorithmic data analysis” +++. I like to think we’ve at least started a bridge.

Finally, a point on the hypothesis testing paradigm that I have pursued here. Hypothesis tests have been rather unpopular recently, and not without reason — over-reliance on low p-values at the expense of meaningful differences is a real scourge of much of the literature. What they do do here, however, is ask “Is there evidence to support the added complexity in the model?” Even better for me is that I noted in a previous post that which prediction points you look at makes a big difference to your variable importance scores. On the other hand, if one of the hypotheses above is true, you can look at any prediction points you like and all you screw up is how easy it would be to detect it not being true. That said, the biases in ML methods, particularly when making predictions where you do not have much data, mean that it’s still best to focus on making predictions where you do have a fair amount of data.

Next time: parametric models and the statistician as a source of randomness.

*** As a further note, these grids can get pretty big and estimating a large covariance matrix in our CLT is pretty unstable. In this case we turned to the technology of random projections that helped improve things a lot.
+++ And largely (and correctly) lamented statisticians’ unwillingness to consider the latter.

Statistical Inference for Ensemble Methods

The last post allowed me to get a pitch in for some of my recent work and I decided that I liked that so much I’d go all-out this post and go into further detail about our results, how we got them, and (just to spice things up) a race for priority.

The work that Lucas did was on methods for bagged trees and random forests. I talked a bit about this in an earlier post, but to avoid sending you there, here’s the recap.

1. Bagged trees methods are based (unsurprisingly) on building trees. We do this by dividing the data in two based on one of the covariates. We choose the division by working out how to make the responses in each part as similar as possible. We then do the same thing to each part and keep going until we run out of data.

2. Bagging just means taking a bootstrap sample of the data and building a tree on it. Then taking another, then another until you have a collection of several hundred, or several thousand, trees. To use these, you make a prediction with each tree and then average the results.

3. Random forests are exactly the same, but they use a slightly different tree building mechanism in which you only consider a random subset of covariates each time you split the data. This is supposed to make the trees less aggressive in fitting the data.

The basic thought that we had with this is that “There’s a bootstrap structure going on. This is often used for inference, why don’t we do so here?”

Fair enough, but that leaves two problems: Inference about what? And the small matter that the bootstrap doesn’t work in this case. For the first problem, we return to diagnostics for black box functions, where we saw that if there is no parametric structure in the model, the relevant thing to look at is the model’s predictions. Just as we could try to use these predictions as a means of assessing variable importance etc, we can now ask whether those measures are statistically interesting, eg, do they differ from zero?

Now to the technicalities. It would be nice to use, say, the variance between the predictions of different trees, as a means of obtaining the variance of their average. Unfortunately when we use trees built with bootstrap samples, their predictions are highly correlated, so the usual central limit theorem doesn’t apply.

This had us stumped for a while until we hit on using U-statistics. These are a fairly classical tool that doesn’t necessarily get a lot of attention these days, but basically they’re a particular form of estimator. Suppose you have a function (which we will, with admirable ingenuity, call a “kernel”) which takes any two values from your data set; the U-statistic corresponding to this kernel is the average of the kernel evaluated at every pair of values in the data. You can do this with kernels that take three values, etc.
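A concrete, classical example before getting to trees: with the kernel h(x1, x2) = (x1 − x2)^2 / 2, the U-statistic over all pairs is exactly the usual (unbiased) sample variance.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
x = rng.standard_normal(30)

# U-statistic with kernel h(a, b) = (a - b)^2 / 2, averaged over all pairs
u_stat = np.mean([(x[i] - x[j]) ** 2 / 2 for i, j in combinations(range(len(x)), 2)])

print(u_stat, np.var(x, ddof=1))   # identical: this kernel recovers the sample variance
```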

For us, the kernel is “take these data, build a tree and make a prediction at these covariates”. It doesn’t have a nice algebraic representation, but it can be used nonetheless. For this, rather than taking bootstrap samples, we merely choose a subsample of the data at random, use it to build a tree and make a prediction. This is a fairly minor modification and in fact reduces the computational cost of building the tree.
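In code, the subsampled ensemble looks something like the sketch below: each kernel evaluation takes a random subsample, builds a tree on it and predicts at the point of interest. This is a sketch of the estimator only, not of the variance calculations in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def subsampled_forest_predict(X, y, x_new, subsample_size=100, n_trees=500, seed=0):
    """Average of tree predictions, each tree built on a random subsample.

    This is the incomplete-U-statistic form: the kernel is "build a tree on
    these subsample_size points and predict at x_new".
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=subsample_size, replace=False)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(tree.predict(np.atleast_2d(x_new))[0])
    return np.mean(preds), np.array(preds)

# hypothetical usage on simulated data
rng = np.random.default_rng(4)
X = rng.uniform(size=(1000, 5))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + 0.3 * rng.standard_normal(1000)
mean_pred, tree_preds = subsampled_forest_predict(X, y, x_new=[0.5, 0.5, 0.5, 0.5, 0.5])
print(mean_pred)
```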

The nice thing about U-statistics is that they have a central limit theorem. The variance is somewhat different to “take the variance and divide by n” in the classical CLT, but nonetheless it is something we can use. I’ll talk about that further down.

Of course, if we have a data set of size 1,000 and took all subsamples of size 100, we would run out of computing power very quickly. So we’ll only take subsamples at random (an “incomplete U-statistic”). To do the asymptotics, we also ought to expect that the number of points in each subsample should grow as the overall size of the data grows (“infinite-order U-statistics”), and random forests also use some randomization in building the trees (we had to invent a term for this; we’ll call it “random kernel U-statistics”). There wasn’t a result for incomplete infinite-order U-statistics with random kernels, so we provided one. As noted, the variance is different from what you would normally expect, but by being a bit careful about how we chose subsamples we have a nice way of calculating that as well.

This represents the very first formal distributional result for the predictions of random forests, although there were a few heuristic precursors.

But as cool as that is, it’s made better by a knuckle-biting race. The context for this is that Lucas and I started working on this in 2011/2012. I was at the Joint Statistical Meetings in 2012 and ran into Trevor Hastie and quite excitedly told him about the results that we were developing. It was a bit of a shock, then, to be told that he had a student working on the same problem. As it happened, Stefan Wager came out with a paper producing confidence intervals for predictions of random forests a couple of months later, although without any theory. It’s actually a rather nice application of the infinitesimal bootstrap, which is a somewhat different means of estimating the variance of a random forest than we were using.

Lucas and I didn’t manage to put our paper on arXiv until April, followed a week later by another by Stefan Wager which developed a different central limit theorem. Arguably, we won the theory race, just; but as is often the case in statistics, priority really isn’t that clear. Stefan’s result contains some tighter bounds than ours, but we cover a somewhat more general class of estimators. In any case, Stefan and Lucas, along with Gerard Biau (who has been studying the consistency of random forests for many years) and Shane Jensen (who has studied a Bayesian alternative), will all present at a session at ENAR this spring. If you happen to be in Miami in March, I think it will be really exciting.

On second thoughts, I’ll leave uses of these results for statistical inference for the next post.

Inference for this Model or for this Algorithm?

In this post I want to discuss an important but subtle distinction in statistical inference about predictability that is too often glossed over. It is this: “Is the relevant inferential question about the particular prediction function that you have derived, or about the process of deriving it?” This question has been an issue both in Statistics and in Machine Learning. It also allows me to both relate an anecdote **** and to skite about some of my own (and, more importantly, my student’s) work.

I’ll lead into this with the anecdote. Several years ago John Langford visited Cornell (at the same time as he caused the anecdote in Post 3) and I described a particular problem to him:

I want to be able to tell whether x affects an outcome y, controlling for other covariates z without making parametric assumptions. To do this, I want to build a bagged decision tree and develop theory to test if it uses x more than we would expect by chance.

Langford’s response to this was to say,

Why not build any sort of model using x and z and one using only z and then use PAC bounds on a test set to see if the errors are different?

The reader might, at this point, need some background. PAC (probably approximately correct) bounds are mathematical bounds on how well we can estimate a quantity. They’re of the form “The probability that we make an error of more than δ is less than ε”. There is a complex mathematical background to these that I won’t go into, but the relevant point here is that the bounds are usually provided for our ability to estimate the error rate of an ML algorithm, usually from the training data. So comparing error rates might be useful.
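For two prediction functions that are completely fixed before the test set is seen, the simplest bound of this flavour is Hoeffding’s inequality applied to each test-set error rate; the sketch below is that generic bound, not anything specific that Langford proposed.

```python
import numpy as np

def hoeffding_tail(n_test, epsilon):
    """P(|empirical error - true error| > epsilon) <= this bound,
    for a classifier that was FIXED before seeing the n_test test points."""
    return 2 * np.exp(-2 * n_test * epsilon ** 2)

def epsilon_for_confidence(n_test, delta):
    """Half-width of a (1 - delta) confidence band for the true error rate."""
    return np.sqrt(np.log(2 / delta) / (2 * n_test))

# With 1,000 test points each error rate is pinned down to within about 4.3%
# at 95% confidence, so two fixed models need a fairly large gap in empirical
# error before this style of bound declares them different.
print(epsilon_for_confidence(1000, 0.05))
```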

Although this is a natural suggestion for someone from a PAC background, I was not particularly convinced that this was a useful direction to go in, and I remain that way for two reasons:

  1. PAC bounds are very conservative. They hold for our estimate of the error rate of EVERY function in a very large class, not just the individual function you are currently looking at. This uniformity can be very powerful, but tends to produce deltas or epsilons much larger than you actually see in practice. In practical terms for statistical inference, it means that such a method would have low power — ie, we’d need an awful lot of data to detect a difference.
  2. This is more fundamental and is the impetus for this blog: it misses the scientific question of interest. The proposed method would have told me that one particular prediction function had a lower error rate than another particular prediction function. That is, if we fixed the two models and decided that these exact models were what we would employ, we could examine whether one is likely to perform better than the other.  But that isn’t really what I wanted. I wanted to understand whether the difference between the two models is reliable if we generated new data and repeated the exercise. +++

This second distinction is what I really want to get at. Are we publishing a particular prediction function (all parameters etc fixed from here on in) or are we examining a method of obtaining one? If the former, we need only examine empirical error rates for this function. If the latter, we need to understand the variability of this function under repeated sampling.

In fact, this distinction is germane to the use of DeLong’s test. This is a method designed to test exactly a parametric version of this hypothesis through the use of the AUC statistic. However, as Demler et al. point out, it does so assuming that the particular prediction functions you use are fixed — you will use these particular functions forever more. If you also want to account for the fact that you have estimated your functions, it is no longer valid.

These, of course, are examples of the traditional statistical concerns: we are asking “If you re-collected your data and re-fit the model, would you get the same answers?” But just as computer scientists are too ready to fix their model, statisticians are too ready to re-estimate it. If we are publishing a psychological test, for example, the specific values of parameters in that test will remain fixed — there is no re-collection and re-fitting — we just want to know how well it will do in the future (or that it will do better than some other, fixed, procedure). In this case, DeLong’s test is valid (as is using PAC bounds in the way that Langford suggests) and the more traditional statistical concerns are overly conservative (because they try to account for more variation than necessary).

However, more often than not, the statistician’s question is the relevant one. In my situation, “Does lead pollution affect tree swallow numbers?” really implies “You saw a relationship in this data; could you have just gotten it by chance?” ie, exactly the sort of question I want to ask. This hasn’t been possible in Machine Learning up until the last year and I’m really pleased that my student, Lucas Mentch, has produced the first ever central limit theorem for ensembles of decision trees. ### This really does give us a handle on “if we re-collected the data, how different might our predictions be” and we can back my questions above out of this. All this is early days, though, and I’ll post a little more about how this works — and a friendly race of priority — in the next post.

A final note on all of this. I’ve largely stayed out of the Bayesian/Frequentist debate, but will here lay down my cards as the latter. To me, statistics is really about assessing “Is this reproducible?” Isn’t that the standard for scientific evidence, anyway? More realistically: “To what extent is this reproducible?” %%% There is a lot of analysis that says that Bayesian methods often approximately provide an answer to this — and I will more than happily make use of them under that framework — but they don’t automatically come with such guarantees. I don’t particularly like the idea of subjectivist probability, but rather than arguing against everyone having their own science, I simply think that Bayes methods answer the wrong questions, in much the same way that PAC bounds do — they’re both conditioned on the current data.

Interestingly, to some extent this view would appear to support an ML viewpoint — I only care about quantifiable predictive power (in this case of inferential statements over imaginary replications). To some extent that’s right — I think interpretation (and statistical inference) is hugely important for humans building models. I don’t think it should be the means by which we judge a model’s correctness.

 

**** If it is not already obvious, I enjoy these.

+++ To some extent the uniformity of PAC bounds would allow us to examine that, if we looked at estimators that explicitly (and exactly) minimized a loss. That is, if my function with just z minimized the error rate, and I could show that the error for the function using both x and z was smaller than the error using z by more than the bound, I could say that this has to be true of ANY function that only uses z. Unfortunately, the error rate we have bounds for is often not the error rate that we minimize, and in the case of things like decision trees these bounds really aren’t available.

### I’m nearly as tickled to have a paper coming out that can be cited as Mentch and Hooker.

%%% This gives us scope to play with changing the data generating model, the sorts of assumptions we make, etc.

Local Interpretation

So far the types of interpretation that I have discussed in machine learning models have been global in scope. That is, we might ask, “Does x1 make a difference anywhere?” This is very much a scientific viewpoint — we want to establish some universal laws about what factors make a difference in outcome.

There is, however, an alternative need for explanation that is much more local in scope — what makes a difference close to a particular point? This was brought home to me at the start of this year when I spent a couple of months visiting IBM’s Healthcare Analytics working group (a wonderful, smart set of people who very generously hosted me). Without giving away too many industrial secrets, the group is developing systems based on electronic medical records to support medical decision making: we might want to forecast whether a patient has a high risk of developing some particular condition, for example. There are great machine learning tools for doing this and the folks at IBM get remarkably good performance out of them.

The question, from a doctor’s point of view, is now “What do I do with this information?” It’s not much good knowing that someone will develop diabetes *** in six months without some idea about what you can do to help. We might therefore be interested in understanding what changes could be made to the patient’s covariates to reduce their risk. This isn’t a global interpretation — we’re not going to be able to change their weight or blood sugar dramatically (at least not in short order) but we might be able to nudge them in a positive direction. The question is then “What should we nudge, and what should we prioritize nudging?”

To do this, you don’t need a global understanding of the way the model functions, you only need to know how the prediction changes locally around a patient. One way to do this is to experiment with changing a covariate a little (“How much does this prediction change if we reduce weight by 10lb?”). 10lb is rather arbitrary, so we might be interested in something like a derivative: how fast does risk increase with weight at these covariates?

Unfortunately, not all ML functions are differentiable, or are easy to differentiate, but we could imagine some approximation to this. One way to explicitly look at local structure is by local models. These are parametric models — of the sort I have just been decrying — but estimated for each point based only on the points close by.

The easiest way to think about this is with a very simple example: the Nadaraya-Watson smoother in 1 dimension. I’ve sketched an example of this below, but the idea is simple — we take a weighted average of the points around the place in which we are interested, where the weights decrease the further the observations are from our prediction point.

[Figure: a Nadaraya-Watson smoother in one dimension — a weighted average of nearby observations]

 

This, by itself, does not generalize easily, but we can instead think of it as minimizing squared error with weights

μ(x) = argmin_μ ∑ K(x,Xi)(Yi – μ)^2

(to unpack this notation, argmin means the value of mu at which the expression is minimized. K is a “kernel” — think of it as a Gaussian bump centered at x). This can, in fact, be very nicely generalized to something like

β(x) = argmin_β ∑ K(x,Xi)(Yi – g(Xi,β))^2

where g(Xi,β) is a linear regression (with coefficients β), for example. There is nothing in here about x being univariate anymore. Now we have a linear regression at x, but it will change over x. The nice thing about this is that we can apply all the standard interpretation for linear regression, but this interpretation is specific to x. So we can look at β(x) to see how quickly risk changes with weight, for example, at this value of the covariates. We can also explore structure — interaction terms etc — that can give us a more detailed view of the response, but at a local scale.
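A sketch of the local-linear version in code: at each query point x we solve a kernel-weighted least squares problem, and the local slopes β(x) play the role of regression coefficients that are only claimed to hold near x (the Nadaraya-Watson smoother is the special case where g is just a constant). The data and bandwidth below are invented for illustration.

```python
import numpy as np

def local_linear_fit(X, y, x0, bandwidth=1.0):
    """Kernel-weighted least squares around x0; returns (local prediction, beta(x0))."""
    X, x0 = np.asarray(X, dtype=float), np.asarray(x0, dtype=float)
    d2 = ((X - x0) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)          # Gaussian kernel weights
    Z = np.column_stack([np.ones(len(X)), X - x0])  # local design, centred at x0
    W = np.diag(w)
    coef = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    return coef[0], coef[1:]                        # intercept = prediction at x0

# hypothetical usage: local slope of "risk" with respect to "weight" at one patient
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2))                       # e.g. columns: weight, blood sugar
risk = 0.3 * X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(500)
pred, beta_local = local_linear_fit(X, risk, x0=[1.0, 0.0], bandwidth=0.5)
print(pred, beta_local)      # the local slope for the first covariate is about 0.6 here
```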

The crucial aspect of this is that it is the local scale that matters when one is asking hypotheticals about specific prediction points — “What would happen if we just changed this, a little bit?” And in fact, this potential-intervention model is really what is relevant.

A last question: can we get back from here to a global interpretation? Well, in some sense, yes. If we think of ourselves as asking “Is the derivative of F(x) with respect to x1 zero everywhere?” we can express all the results about additive structure discussed two posts ago in terms of some derivative being zero. For example, we could see if we could reduce a three-dimensional function to two two-dimensional ones:

F(x1,x2,x3) = f1,3(x1,x3) + f2,3(x2,x3)

(ie, there is no interaction between x1 and x2) either by looking at f1,2(x1,x2) as well as f1,2,3, or we could examine if

d^2F / dx1 dx2 = 0

Alternatively, we could look at the size of these derivatives, since none are likely to be exactly zero.
In many ways this is appealing — it appears to remove the problems we had about where we integrate and should be easier to assess. I think there are three issues with it, however:

  1. Most importantly, derivatives are not measured in the same units as each other, or as the original outcome. They’re rates, and thus there are real problems in deciding how to compare them; when is a derivative large relative to the others?
  2. We haven’t actually escaped the range of integration problem. We still have to assess the size of d^2F / dx1 dx2, presumably integrated over some set of covariates. It does, however, make it easier to localize this measure of size to the covariates we have observed.
  3. In many models, derivatives are not easy to evaluate. In some they are, but in trees, for example, derivatives are zero nearly everywhere (the model proceeds in steps) and you need to specify some finite difference method to look a certain distance away; how far? But, despite having avoided this approach for a while, I’m seeing more merit in it, at least in so far as it provides a way to link local to global interpretation. A minimal finite-difference sketch follows this list.
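Here is the sort of finite-difference check I have in mind, as a rough sketch; the step size h is exactly the awkward choice flagged in point 3, and the toy functions are invented for illustration.

```python
import numpy as np

def mixed_partial(F, x, i, j, h=0.1):
    """Finite-difference estimate of d^2 F / dx_i dx_j at the point x."""
    x = np.asarray(x, dtype=float)
    e_i, e_j = np.zeros_like(x), np.zeros_like(x)
    e_i[i], e_j[j] = h, h
    return (F(x + e_i + e_j) - F(x + e_i - e_j)
            - F(x - e_i + e_j) + F(x - e_i - e_j)) / (4 * h ** 2)

# hypothetical usage: an interaction shows up, an additive function does not
F_int = lambda x: x[0] * x[1] + x[2]
F_add = lambda x: np.sin(x[0]) + x[1] ** 2 + x[2]
x0 = np.array([0.3, -0.2, 1.0])
print(mixed_partial(F_int, x0, 0, 1))   # about 1
print(mixed_partial(F_add, x0, 0, 1))   # about 0
```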

Next up: from interpretation to inference and a plug for some recent work and especially for one of my students.

 

*** I chose diabetes because it was not one of the diseases I was involved in forecasting, so I don’t know how, or if, IBM is looking at this.

 

Weasel Words

One common thread in the diagnostics presented so far is the interpretation of the effect of a covariate “with all other terms held fixed”. This formulation is taught rather ritualistically in introductory statistics classes when multiple linear regression is introduced. *** So what does it really mean? In a hypothetical linear regression of weight on height and age

weight = b0 + b1 * height + b2 * age

the interpretation of b2 is “suppose we could just change age”. Or more explicitly, “If we found two people who were exactly the same height and differed by 1 year in age, they would differ by b2 kg, on average”.

Fundamentally, the diagnostic methods for machine learning I introduced in the previous post — and generalized additive models more generally — also rely on this. We can understand

F(x1,x2) = g1(x1) + g2(x2)

because we can understand what g1(x1) does, WITH x2 FIXED.

So what? The issue here is that all other variables generally aren’t fixed. Alright, in nicely designed experiments we can actually fix the levels of one variable and vary the other variables. But for observational data — and almost all ML is conducted on observational data — there are relationships among the covariates, so the notion of “other terms held fixed” really doesn’t apply. Even in designed experiments, the “with all other covariates fixed” won’t happen in the non-experimental world that the experiments are presumably meant to partially explain.

To make this concrete, if you age a year, your height is likely to change as well. If you have the fortune of youth, you’ll get taller as your bones grow and later as your posture improves ***; if you’re older, your height will decrease as the disks in your spine slowly compact. Moreover, this will happen at a population level, too — people aged 37 do not have the same distribution of heights as those aged 38. So there is no real physical reality to holding all other covariates fixed.

This blog is by no means the first place to make this observation. The discipline of path analysis has been around for nearly as long as the modern discipline of statistics. Basically, it boils down to a series of regressions:

weight = b0 + b1 * height + b2 * age + epsilon

height = c0 + c1 * age + eta

that is, there is a regression relationship between weight and age and height, but height is also dependent on age (yes, linearly for now, but just to show this up). Both these can be estimated directly from the data and we can start partitioning out blame for our spare tires by saying

The DIRECT effect of age on weight is b2, but there is also an INDIRECT effect of age through height, to give a TOTAL effect of b2 + c1 * b1.
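In code, the partitioning is just two regressions and a product. The sketch below uses simulated data with invented coefficients, and checks that the total effect is what a regression of weight on age alone recovers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n = 2000
age = rng.uniform(20, 60, n)
height = 150 + 0.1 * age + rng.normal(0, 7, n)                 # c1 = 0.1
weight = 5 + 0.4 * height + 0.3 * age + rng.normal(0, 5, n)    # b1 = 0.4, b2 = 0.3

# weight = b0 + b1*height + b2*age
b1, b2 = LinearRegression().fit(np.column_stack([height, age]), weight).coef_
# height = c0 + c1*age
c1 = LinearRegression().fit(age.reshape(-1, 1), height).coef_[0]

print("direct effect of age:  ", b2)
print("indirect (via height): ", c1 * b1)
print("total effect:          ", b2 + c1 * b1)
# the total effect is what a regression of weight on age alone recovers:
print("weight ~ age slope:    ", LinearRegression().fit(age.reshape(-1, 1), weight).coef_[0])
```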

In fact, this is now a special case of a Bayesian Network, in which we can skip some of the insistence on linearity (although Bayesian Networks often restrict to categorical-valued data). It can also be couched in the language of structural equation models.

The problem with these, in the simplest case, is that you need to know the structure of the dependences (we can’t have height depend on age AND age depend on height) before you do anything else. This is usually guided by subject matter knowledge — a kind of pseudo-mechanistic model ***. There are ways to attempt to determine the structure of dependencies from the data, but these are not always used and are subject to considerable uncertainty.

Now, there is a weaselly retort to all these objections that “we are only trying to understand the geometry of the function”. That is, we are trying to get a visual picture of what the function looks like if we graphed it over the input space. This isn’t unreasonable, except in so far as it is often used because we can do it easily, rather than because that is what is actually desired. And this is reasonable if you actually think you have a generalized additive model (or, reality forgive you, a linear regression), in which case you can always look at dependences between covariates post-hoc (but you really ought to).

However, in the diagnostic methods for black box functions explored in the previous post, we explicitly based the representation that we looked at on averaging over covariates independently — ie by breaking up any relationships between them. This affects all sorts of things, particularly the measures of variable and interaction importance that we used to decide what components to plot as an approximation to the function. This concern took up a considerable amount of my PhD thesis and I’ll return to these ideas in a later post, although even then the solutions I proposed resulted in additive model approximations which again are subject to that “all other covariates held fixed” caveat.

So what can we do? We could simply try to regress the response on each covariate alone, graph that relationship and measure its predictive strength. The problem is that this doesn’t then let you combine these relationships to get relationships between pairs of covariates (it doesn’t account for the relationships among even pairs of covariates). Or we could resort to a non-linear form of path analysis, requiring either the specification of causal dependences between covariates or some model selection to work out what the correct diagram should be (subject to a whole lot of unquantified uncertainty). I’ll post about some of my very early work in a few posts, but frankly I still don’t think we have any particularly good answers to all this.

There are two answers, though. One is the classical ML answer — why are you trying to interpret at all? The other is to pursue a local interpretation; we’ll look at this in the next post.

*** There are an awful lot of ritualistic statements that we teach to students in introductory statistics classes — I’m not sure how much of this really gets through to student understanding, or how often these statements are really parsed. The old “If the null hypothesis was true, there would be less than 0.05 probability of obtaining the observed test statistic or something more extreme” is almost legalistic in being exactly correct and largely endarkening on first encounter. Nonetheless, we (myself included) continue to grade students based on their ability to parrot back such phrases.

*** Only after stern commands from my physiotherapist in my case.

*** Pseudo because linear regression really is a particularly blunt instrument for mathematical modeling, but we do have to give way to mathematical convenience at some point.

X-raying the Black Box

Here we get to the original purpose behind the blog and the reason I started thinking about these issues in the first place. It’s also rather self-serving, since it really gets to be about my own work. (Hence the reason it’s three months late? Not really — I had thought of it as being background material and simply got distracted).

At least one aspect of my work has centered around taking otherwise-unintelligible models and trying to come up with the means of allowing humans to understand, at least approximately, what they are doing. This was the focus of my PhD research (which now seems painfully naive in retrospect, but we’ll get to some of it, anyway), but the ideas have, in fact, a considerably longer history. I recall hearing about neural networks in the early ’90s when they were in their first wave of popularity and thinking “Yes, but how do you understand what it’s doing? And is it science if you don’t?” *** These questions largely stayed in the background until Jerry Friedman suggested machine learning diagnostics as a useful topic to investigate (among a number of others) and I found that I had more ideas about this than about building yet more seat-of-the-pants heuristics for making prediction functions. ***

Most of this post will be a run-down of machine learning diagnostics, rather aptly described as “X-raying the Black Box”, *** although Art Owen probably more accurately likens what is done to tomography. However, in doing this, I hope to highlight the key question for all of this blog — is the effort put into developing the tools below of any real scientific value? Or is it all simply cultural affectation that panders to our desire to feel in control of something that we really ought to let run automatically? Models and software that have some of the tools below built in to them certainly sell better, but that doesn’t mean that they’re more useful. And if they are useful for something besides making humans feel better about themselves, what is that, and can we keep it in mind when designing these tools better? It is, as I think I have foreshadowed, difficult to come up with real, practical examples to justify the time I’ve spent on the problem — although the tools I developed work quite well. I do have the beginnings of some notions about this — most of them have a huge roadblock in the form of statistical/machine learning reluctance to examine mechanistic models (see Post 6) — but we’ll need to get to what is actually done by way of machine learning diagnostics first.

So let’s look at the basic problem. We have a function F(X) where X = (x1,…,xp) is a vector and we can very cheaply obtain the value of F(X) at any X, but we can’t write down F(X) in any nice algebraic form (think of a neural network or a bagged decision tree from previous posts). What would we like to understand about F? Most of the work on this has been in terms of global questions:

– Which elements of X make a difference to the value of F?
– How does F change with these elements?

(there are also local means of interpreting F — I’ll post about these later***). Most of these center around the problem of “What is the importance and the effect of x3?” Or, in more immediately understandable terms, “What happens if we change x3?” Fortunately, F(X) is cheap to evaluate, so we can try this and look at

f3(z) = F(x1,x2,z,x4,…,xp)

as we change z (note that this is a direct analogue of the linear regression interpretation “all other terms held constant”). This can then be represented graphically as a function of z (see the last blog post on the intelligibility of visualizable relationships even when they aren’t algebraically nice). I’ve done this in the left plot below for x2 in the function from the previous post. We get a picture of what changing that covariate does to the relationship, and if the plot is close to flat, we can decide that it really isn’t all that important.

This is fairly reasonable, but presents the problem that the plot you get depends on the values of x1, x2 and all the other elements of X. The right hand plot below provides the relationship graphed at 10 other values of X.

 

[Figures: left — the prediction plotted against one covariate with the others held at a single value; right — the same relationship at 10 other values of X, with the average (partial dependence) curve overlaid]

So what to do? Well, the obvious thing to do is to average. There are a bunch of ways to do this; Jerry Friedman defined the notion of partial dependence in one of his seminal papers, in which you average the relationship over the values of x that appear in your data set. Leo Breiman defined a notion of variable importance in a similar manner by saying “mix up the column of x3 values (ie randomly permute these values with respect to the rest of the data set) and measure how much the value of F(X) changes.” There have been a number of variations on this theme. In the right-hand plot above I’ve approximated Jerry’s partial dependence by averaging the 10 curves to produce the thick partial dependence line.
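Both of these averaging ideas take only a few lines once you have a prediction function F in hand. The sketch below is a rough rendering of each (not the exact definitions in those papers; Breiman’s version, in particular, is usually measured against predictive accuracy rather than raw predictions):

```python
import numpy as np

def partial_dependence(F, X, j, grid):
    """Average prediction as column j is set to each grid value, the other
    columns left at their observed values (Friedman-style partial dependence)."""
    out = []
    for z in grid:
        Xz = X.copy()
        Xz[:, j] = z
        out.append(F(Xz).mean())
    return np.array(out)

def permutation_change(F, X, j, seed=0):
    """Mean squared change in the predictions when column j is randomly permuted
    (a prediction-based variant of Breiman's permutation importance)."""
    rng = np.random.default_rng(seed)
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return np.mean((F(X) - F(Xp)) ** 2)

# hypothetical usage with a toy "black box"
F = lambda X: np.sin(2 * X[:, 0]) + 0.1 * X[:, 1]
X = np.random.default_rng(7).uniform(-1, 1, size=(500, 3))
print(partial_dependence(F, X, 0, np.linspace(-1, 1, 5)))
print([permutation_change(F, X, j) for j in range(3)])   # x2 barely matters, x3 not at all
```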

These can all be viewed as some variation on the theme of what is called the functional ANOVA. For those of you used to thinking in rather stodgy statistical terms, fANOVA is just the logical extension of a multi-way ANOVA when you let every possible value of each xi be its own factor level (we can do this because we have a response F(X) at each factor level combination). For those of you for whom the last sentence was so much gibberish, we replace the average above by integrals. So we can define

f3(z) = ∫ … ∫ F(x1,x2,z,x4,…,xp) dx1 dx2 dx4 … dxp

The point of this is that it also allows us to examine how much difference there is due to pairs of variables, after the “main effects” have been taken out

f2,3(z2,z3) = ∫ … ∫ F(x1,z2,z3,x4,…,xp) dx1 dx4 … dxp – f2(z2) – f3(z3)

this can be extended to combinations of three variables and so on; it gives us a representation of the form

F(X) = f0 + ∑i fi(xi) + ∑i,j fi,j(xi,xj) + …

There are lots of nice properties of this framework; most relevant here is that we can ascribe an amount of variance explained to each of these effects and we can parse out “How important is x3?” in terms of the variance due to all components that include x3. We can also plot f3(x3) as a notion of effects.
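Numerically, this variance bookkeeping can be sketched by Monte Carlo, integrating over independently drawn covariates (a Sobol'-style estimator of the main-effect shares; the variance due to all components involving a given variable needs a companion "total effect" estimator, which I omit). The toy function below is invented for illustration.

```python
import numpy as np

def main_effect_shares(F, p, n_mc=20000, seed=8):
    """Monte Carlo estimate of the main-effect variance shares of F, integrating
    over independent uniform(0, 1) covariates (Sobol'-style first-order indices)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(size=(n_mc, p))
    B = rng.uniform(size=(n_mc, p))
    fA, fB = F(A), F(B)
    total_var = np.var(np.concatenate([fA, fB]))
    shares = []
    for j in range(p):
        AB = B.copy()
        AB[:, j] = A[:, j]                       # B with column j taken from A
        # Saltelli-style estimator of Var(E[F | xj]) / Var(F)
        shares.append(np.mean(fA * (F(AB) - fB)) / total_var)
    return np.array(shares)

# hypothetical usage: x1 and x2 matter, x3 does not, and there is an interaction,
# so the main-effect shares do not add up to one
F = lambda X: X[:, 0] + X[:, 1] ** 2 + X[:, 0] * X[:, 1]
print(main_effect_shares(F, p=3))
```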

This framework was a large part of my PhD thesis (although, to be fair, Charles Roosen, an earlier student of Jerry Friedman, laid a lot of the groundwork). You might think of plotting all the individual level effects, and all the pair-wise effects, in a grid. If these explain most of the changes in the predictions F(X) over your input space, then you can think of F(X) as being nearly like a generalized additive model (exactly as written above, stopping at functions of two variables) in which all the terms are visually interpretable. One of the things I did was to look at combinations of three variables and ask “Are there things that this interpretation is missing, and which variables do they involve?”

Of course there is a question of “Integrate over what range?” Or alternatively with respect to what distribution. In fact, this can have a profound effect on which variables you think are important, or how well you can reconstruct F(X) based on adding up functions of one dimension. This was another concern of my thesis — particularly when the values for X that we had left “holes” in space that F(X) filled in without much guidance as to what the real relationship was. I’ll come back to some of this later.

For the moment, however, we have that almost all tools for “understanding” machine learning functions come down to representing them approximately in terms of low-dimensional visualizable components. The methods that have come with such tools have been very popular, partly because of them. However, I wonder whether they are really doing anything useful and I’ve rarely seen such visualization tools then used in genuine scientific analysis. They can be (and have been) employed to develop more algebraically tractable representations for F(X). This sometimes improves predictive performance by reducing the variance of the estimated relationship, but mostly comes down to having a relationship that humans can get their heads around, and this still doesn’t answer the question of “Why do we need to?”

 

*** As an aside I don’t know how common it is for academics to trace their research interests to questions and ideas far earlier than their conceptualization of their work as a discipline. In my first numerical analysis class, we at one point set up linear regression (I had no statistical training at all) and I recall thinking that I wanted to develop methods to decide if it should instead be quadratic or cubic etc. Some five years or so later I discovered statistical inference. Of course, we always pick out important sign-posts in retrospect.

*** Jerry has an amazing intuition for heuristics and the way that algorithms react to randomness — I quickly decided that competing with him was not really going to be a viable option, but still lacked the patience for long mathematical exercises; hence a penchant for the picking-up of unconsidered problems, even if they lead into Esoterika (that’s got to be a journal name) on occasion.

*** I wish this was my term — I first came across it as the heading of a NISS program.

*** Yes, I do have a list of all the things I’ve said I would post about.

One More Intelligible Model

In my most recent entry, I attempted (apparently not particularly successfully) a distinction between interpretable and mechanistic models. The term interpretable appears to be in common use in statistics and machine learning, but it may be that intelligible would be more appropriate. By this I mean that humans can make sense of the specific mathematical forms employed in order to “understand” how function outputs are related to function inputs.

So what can we find intelligible? In an older post I essentially brought this down to small combinations of simple algebraic operations. Of course what “small” and “simple” mean here will depend on the individual in question — I’m pretty decent at understanding how exponentiating affects a quantity, but that certainly isn’t true for my introductory statistics students — but with the possible exception of a very small number of prodigies, I know nobody who can keep even tens of algebraic manipulations in their head at any one time.

There is an important extension of this, which is that some interpretation is still possible if a relatively simple algebraic expression is embedded in a more complex model and retains its interpretation in this context. For example, a linear regression with hundreds of covariates is not particularly interpretable — there are far too many terms for a human to keep track of — but each individual term can be understood in terms of its effect on the prediction. (This is as a function, ie “with the other terms held constant” — I’ll post something on this weasel formulation later). It is, of course, possible to embed simple terms within complex models in which case the relatively easy interpretation of a linear model within a neural network, for example, is lost when its effects are obscured by the more complex manipulation that is then applied to its output.

For the purposes of describing some means of machine learning diagnostics, there is, however, one further class of mathematical functions that I think humans can get a handle on — those we can visualize. Here

[Figures: one-dimensional functions without simple algebraic forms]

I have plotted some one and two dimensional functions (I’ll come back to what these are in a bit) that do not have “simple” algebraic structures. Nonetheless, understanding them is easy — just look! We can even read numbers off them. We also know how to plot two-dimensional functions and are pretty good at understanding contour plots, heatmaps, and three-dimensional renderings.

[Figures: a two-dimensional function shown as a contour plot, a heatmap, and a three-dimensional rendering]

If we wanted a function of three inputs it might be possible to stack some of these, or at least lay them out somehow:

[Figures: slices of a three-input function laid out side by side]

and nominally we could try to extend this further, but my brain is already starting to dribble when I actually want to look through these and come up with some sense of what is going on.

None of the functions that I have just presented have algebraically simple expressions. The first is given by a combination of three normal densities (apparently I don’t have any sort of equation editor in this tool so I can’t do square roots) which really isn’t nice, but we can examine it visually. Is this more than a special case? Only to some extent — these can be extended into more complex contexts in the same way that the interpretation of linear terms can be: so long as their effects are the same when put within that context. In fact, statisticians have long used generalized additive models of the form

y = g1(x1) + g2(x2) + g3(x3) + g4(x4)

precisely because of their intelligibility (and because estimating such models is more statistically stable). Even in machine learning, this is gaining some traction — see my paper with Yin Lou (who just completed his PhD in Computer Science at Cornell) explicitly looking at estimating these types of prediction functions because of their intelligibility. ***
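To make the estimation side concrete: additive models of this kind can be fit by backfitting, cycling through the components and smoothing the partial residuals. Below is a bare-bones sketch using simple kernel smoothers (real GAM software handles smoothing parameters, identifiability and inference far more carefully); each fitted g[j] can then be plotted against its covariate and read directly.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth):
    """Nadaraya-Watson smooth of the values r against the single covariate x."""
    d = x[:, None] - x[None, :]
    W = np.exp(-0.5 * (d / bandwidth) ** 2)
    return (W @ r) / W.sum(axis=1)

def backfit_additive(X, y, bandwidth=0.3, n_iter=20):
    """Estimate y = intercept + g1(x1) + ... + gp(xp) by backfitting."""
    n, p = X.shape
    g = np.zeros((p, n))                 # fitted component values at the data points
    intercept = y.mean()
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - intercept - g.sum(axis=0) + g[j]
            g[j] = kernel_smooth(X[:, j], partial_resid, bandwidth)
            g[j] -= g[j].mean()          # keep each component centred
    return intercept, g

# hypothetical usage: plot g[j] against X[:, j] to read each component directly
rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, size=(300, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(300)
intercept, g = backfit_additive(X, y)
```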

By way of example to tie in many themes from the last few posts, we might examine Newton’s law of gravity as it applies to an object near the ground on earth. In the classical form, the vertical height z of an object from the surface is described by the differential equation

D^2 z = – g/(z+c)^2

where c is the distance from earth’s center of mass to its surface and D^2 z means its acceleration. This, of course, is an approximation for many reasons, but partly because earth’s gravity changes over space, and is affected (in very minor ways) by other celestial bodies, so perhaps we should write

D^2 z = – g(x,y,z,t)/(z+c)^2

where x and y provide some representation of latitude and longitude and t is, of course, time. Here the mechanistic interpretation remains — the dynamics of z are governed by acceleration due to gravity — but g(x,y,z,t), unless given in an algebraically nice form, is not particularly intelligible. My larger question in this blog is “does that matter?” Of course, for most practical purposes, the first form of these dynamics is sufficient to predict the trajectory of the object quite well — it’s also a very handy means of producing a simpler, intelligible approximation to the actual underlying dynamics (g is very close to constant) that humans can make use of.+++

This idea of producing a simpler model, along with additive models, really makes up most of the tools used — usually informally — to understand high dimensional prediction functions, and that’s something that I’ll get to in the next post.

 

*** I must also thank Yin for pointing out in his thesis that “intelligible” might be a less ambiguous term than “interpretable”, although there is no alternative verb corresponding to “interpret”.

+++ Now I know that a physicist will object that the general law of gravitation applies to any collection of bodies if you know enough. Besides the fact that you never know enough to account for everything (and even if you did, not everything behaves even approximately according to Newtonian dynamics), I could still ask: what if the inverse square law were a more complicated function of distance, and does the fact that it has a nice algebraic form matter?

Interpretation and Mechanistic Models

I want to devote this post to a very different modeling style, one to which neither statisticians nor ML types devote much attention: what I will refer to as mechanistic models. I think these are worth discussing for a number of reasons.

  1. In one sense, they represent one of the best arguments against the ML viewpoint, in terms of identifying where human intelligence and understanding become important to science.
  2. I want to distinguish mechanistic from interpretable in this context. In particular, my concerns are not really about the benefits of mechanistic models (although that is also an interesting topic), and I want to clarify this.
  3. Statisticians rarely think of modeling in these terms and I think this represents one of the discipline’s greatest deficiencies.

The sense in which I use mechanistic is somewhat broader than is sometimes employed (ie, it encompasses more than simply physical mechanics). The distinction I am making is between these and what I would describe as data-descriptive models; it also roughly distinguishes the models employed by applied mathematicians from those used by statisticians.

To make it clear for the physicists: I use the word interpretable as a property of the mathematical form that a model takes, not of its real-world meaning. Ie, I am asking “Should we worry about whether we can understand what the mathematics does?” I am aware of the vagueness of the term “understand” — that’s a large part of the reason for this blog.

Essentially, mechanistic models are dynamic models built around a description of processes that we believe are happening in a system, even if we cannot observe these particularly well; ie, they provide a mechanism that generates the patterns we see. They are often given by ordinary differential equations, but this is mostly because ODEs are easy to analyze, and we can be broader than that. ***

The simplest example that I can think of is the SIR model for describing epidemics, and I think it will make a good exposition. We want to describe how a disease spreads through a population. To do so, we’ll divide the population into susceptible individuals (S) who have not been exposed to the disease, infectious individuals (I) who are currently sick, and recovered individuals (R) who are now immune***. Any individual progresses through these stages S -> I -> R; we now need to describe how that progression comes about.

I -> R is the easiest of these to model — people get sick and stay sick for a while before recovering. Since each individual is different, we can expect the length of time that an individual stays sick to be random. For convenience, an exponential distribution is often used (say with parameter m), although the realism of this is debatable.

S -> I is more tricky. In order to become sick you must get infected, presumably by contact with someone in the I group. This means that we must describe both how often you come in contact with an I, and the chances of becoming infected if you do. The simplest models envision that the more I’s there are around, the sooner an S will bump into one and become infected. If we model this waiting time (for each S) by an exponential distribution, we give it parameter bI, so that the more I’s there are, the sooner you get infected.

If you turn this individual-level model into aggregate numbers (assuming exponential distributions again because of their memoryless property), you get I -> R at rate mI (since we’re talking about the whole I population) and S -> I at rate bSI. You can simulate the model for individuals, or in terms of aggregate quantities, or, if the population is large enough (and you re-scale so that we track proportions rather than individuals), you can approximate it all by an ODE:

DS = – bSI
DI = bSI – mI
DR = mI

where DS means the time-derivative of S. Doing this turns the model into a deterministic system which can be a reasonable approximation, especially for mathematical analysis, although in real data the noise from individual variability is often evident.
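For concreteness, here is a small sketch of my own (with made-up parameter values) that simulates both versions: the aggregate ODE with scipy, and the individual-level stochastic model, with the same rates, via a simple Gillespie-style simulation.

```python
import numpy as np
from scipy.integrate import solve_ivp

b, m = 0.5, 0.1                          # infection and recovery rates (made-up values)

# Aggregate ODE: S, I, R as proportions of the population.
def sir_ode(t, y):
    S, I, R = y
    return [-b * S * I, b * S * I - m * I, m * I]

ode_sol = solve_ivp(sir_ode, (0, 100), [0.99, 0.01, 0.0],
                    t_eval=np.linspace(0, 100, 500))

# Individual-level stochastic version: exponential waiting times to the next event.
def sir_gillespie(S, I, R, t_max, seed=0):
    rng = np.random.default_rng(seed)
    N = S + I + R
    t, path = 0.0, [(0.0, S, I, R)]
    while I > 0 and t < t_max:
        rate_infect = b * S * I / N      # S -> I (b scaled by N so proportions match the ODE)
        rate_recover = m * I             # I -> R
        total = rate_infect + rate_recover
        t += rng.exponential(1.0 / total)
        if rng.random() < rate_infect / total:
            S, I = S - 1, I + 1
        else:
            I, R = I - 1, R + 1
        path.append((t, S, I, R))
    return np.array(path)

path = sir_gillespie(S=990, I=10, R=0, t_max=100)
```

Running the stochastic version a few times gives a feel for how much individual variability shows up around the deterministic curve.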

There are obviously many ways to make this model more complicated — stages of disease progression, sub-populations that mix with each other more than others, geographic spread, visitors, immunization, loss of immunity and a whole bunch of others. The epidemiological literature is littered with these types of elaboration.

The point of this model is that it tells a coherent story about what is happening in the system and how it is happening, hence the moniker “mechanistic”. This is in contrast to most statistical and ML models that seek to describe static relationships without concern as to how they came about — even time-series models are usually explanation-free. I have also avoided the term “causal” — although it would be quite appropriate here — in order not to confuse it with the statistical notions of causal modeling as studied by Judea Pearl, which are similarly static.

Having gone through all this, there are some observations that I now want to make:

1. I think we can distinguish mechanistic versus interpretable here. My father would be inclined to view this type of model as the only type worth interpreting — he sniffily dismissed the models I examined earlier as all being “correlational”, and would presumably say the same thing of causal models in Pearl’s sense.

I’m not sure he’s wrong in that (see below), but it’s not quite the problem that I want to examine in this blog, and I think I can make some distinctions here: while the structure of the SIR model above is clearly motivated by mechanisms, a substantial part of it is dictated by mathematical convenience rather than realism. The exponential distribution, and the assumption that an S is as likely to run into one I as any other, are cases in point. Moreover, there is no particular reason why the description of some of these mechanisms should have algebraically elegant forms. Newton’s law of gravity, for example, would still be a mechanistic description if the force decayed according to some algebraically-complicated function of the distance between objects rather than the inverse square (even if this would be less mathematically elegant).

Indeed, one might imagine employing ML to obtain some term in a mechanistic model if the relationship were complex and there were data that allowed ML to be used. For example, the bSI term in the SIR model is an over-simplification and is often made more complex — it’s not clear that using some black-box model here would really remove much in the way of interpretation. My central concern — esoteric though it may be — is with regard to the algebraic (or, more generally, cognitive) simplicity of the mathematical functions that we use.
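As a sketch of what I mean (my own toy example), the transmission term can simply be passed in as an arbitrary function of S and I, whether it comes from a fitted black box or from an algebraic assumption; the mechanistic skeleton of the model is unchanged.

```python
import numpy as np
from scipy.integrate import solve_ivp

# SIR dynamics in which the transmission term is an arbitrary function phi(S, I),
# which could be a fitted black-box model rather than the mass-action form bSI.
def sir_flexible(t, y, phi, m):
    S, I, R = y
    new_infections = phi(S, I)
    return [-new_infections, new_infections - m * I, m * I]

# Purely illustrative choice: a saturating contact function instead of bSI.
phi = lambda S, I: 0.5 * S * I / (1.0 + 2.0 * I)

sol = solve_ivp(sir_flexible, (0, 100), [0.99, 0.01, 0.0], args=(phi, 0.1),
                t_eval=np.linspace(0, 100, 500))
```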

2. Mechanistic models do, however, provide some more compelling responses to the ML philosophy. A mechanistic understanding of a system is more suggestive of which additional measurements will allow for better prediction, and therefore of what we might want to target. In work I do with colleagues in ecology, we believe that some dynamics are driven by sub-species structure, and this suggests we will be able to parse this out better after genotyping individuals. Similarly, it allows us to conceive of interventions in the system that we might hope will either test our hypotheses, or pin down certain system parameters more accurately.

An ML philosophy might retort that we can, of course, predict the future with a black-box model if you just give us some data; that mechanistic interpretation is mostly developed post hoc, and humans have repeatedly been shown to be very good at making up stories to explain whatever data they see (more on that in another post); and that active learning already looks at which new observations would be most helpful, so you could pose this problem in that context, too. Of course, this does rather rely on the circular argument “interpretation is bullshit, therefore interpretation is bullshit”.

3. As a statistician who has spent a considerable amount of time working on these types of models, I am distressed at how foreign this type of modeling is to most of my colleagues. Almost all models taught (and used) in statistics are some variant on linear regression, and basically none attempt to explain how the relationships we see in data come about — even the various time series models (ARIMA, GARCH etc) take a punt on this. These frameworks are foreign to statisticians, I suspect, because they make up no part of the statistical curriculum (when faced with a particularly idiotic referee report, I’m somewhat inclined to say it’s because statisticians just aren’t that good at math, myself included), and I think this is the case for three reasons:

a) On the positive side, statisticians have had a healthy skepticism of made-up models (and ODEs really do tend not to fit data well at all). Much of the original statistical modeling categorized the world in terms of levels of an experiment so that exact relationships did not have to be pinned down: your model described plant growth at 0.5kg of fertilizer and 1kg of fertilizer separately and didn’t worry about what happened at 0.75kg. I’m fairly sure many statisticians would be as skeptical about SIR models as an ML proponent, particularly given all the details they leave out.
b) More neutrally, in many disciplines such mechanistic explanations simply aren’t available, or are too remote from the observations to be useful. To return to the agricultural example above, we know something about plant biochemistry, but there is a long chain of connections between fertilizing soil, wash-out with water, nutrient uptake and physical growth. When the desire is largely to assess evidence for the effectiveness of the fertilizer, something more straightforward is probably useful.
Of course, statisticians have chosen these fields, and have not attempted to generate mechanistic models of the processes involved. I sometimes feel that this is due to an inclination to work with colleagues who are less mathematically sophisticated than the statistician and hence cannot question their expertise. I sometimes also think it’s due to a lack of interest in the science, or at least the very generalist approach that statisticians take which means that they don’t know enough of any science to attempt a mechanistic explanation. Both of these may be unfair — see uncharitable parenthetical comments above.
c) Most damningly, it isn’t particularly easy to conduct the sort of theoretical analysis that statisticians like to engage in for these models. And it makes this type of work difficult to publish in journals that have a theory fetish. There are plenty of screeds out there condemning this aspect of statistics and I won’t add another here: it’s not as bad as it used to be (in fact, it never was) and theory can be of enormous practical import. However, convenient statistical theory does tend to drive methodology more than it ought, and it does drive the models that statisticians like to examine.

Of course, everyone thinks that all other researchers should do only what they do. *** A case in point was Session 80 at ENAR 2014, which confirmed my cynical view that “Big Data” did indeed have a precise definition: it’s whatever the statistician in question finds interesting. I’m not an exception to this, but then blogs are a vehicle for espousing opinions that couldn’t get published in more thoughtful journals, so….

In any case, mechanistic modeling might be an answer to ML (see Nate Silver for practical corroboration), and I might explore that in more detail. Mechanistic models are distinct from interpretable models: although they generally employ interpretable mathematical forms, they need not do so. Up next: what can we understand besides simple algebra?
*** Anyone who has examined data from outside the physical sciences should find the idea that an ODE generated it to be laughable, although the ODE can be a useful first approximation.

*** Alternatively, R can mean “removed”, ie dead.

*** This is foolish: who wants all that competition?

 

On Approximate Interpretation

Another seminar that I went to in the McGill Psychology department in 2005 was given by Iris van Rooij, a young researcher in cognitive science with a background in computer science. Her talk focused on the issue of computational complexity within cognitive science, and her thesis went something like this:

When psychologists describe humans as performing some task, they need to bear in mind that humans must have the cognitive resources to do so.

This is not particularly controversial. My earlier posts argued that humans DON’T have the cognitive resources to compute or understand the implications of the average of 800 large decision trees.

However, the example she gave was quite different. Her example was the categorization problem. That is, one of the things cognitive scientists think we do is automatically categorize the world — to decide that some things are chairs and others plants and yet others pacifiers-with-mustaches-drawn-on. Moreover, we don’t just classify the world, we also work out what the classes (and possibly subclasses) are, and we do so at a very young age. There is, after all, no evolutionary reason that we should be born knowing what a chair is, or a pacifier-with-mustaches, either.

van Rooij’s problem with this was that the categorization problem is NP-hard. This takes a bit of unpacking. Imagine that we have a set of objects, along with some measure that quantifies how similar each pair of objects is. We now want to sort them into a set of classes where the elements of any class are closer to each other than they are to elements of any other class. It turns out that this problem, if you want to get it exactly right, takes computational effort that grows very quickly as the number of objects increases. For even a few hundred objects, the amount of time required to produce a categorization on the sort of laptop that I run would end up measured in years, and humans are certainly not much faster at most computation.** Thus, said van Rooij, we cannot reasonably say that humans are solving the categorization problem.
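To get a feel for the combinatorial explosion behind this, here is a quick sketch of my own: counting the number of ways to partition n objects into groups (the Bell numbers), which is the search space an exact, brute-force categorization would implicitly contend with, before any similarity criterion even enters.

```python
# Number of ways to partition n objects into (unlabeled) groups: the Bell numbers,
# computed with the Bell-triangle recurrence.
def bell(n):
    row = [1]                    # first row of the Bell triangle
    for _ in range(n - 1):
        new_row = [row[-1]]      # each row starts with the last entry of the previous one
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]               # the last entry of row n is B(n)

for n in (5, 10, 20, 50):
    print(n, bell(n))
# B(50) already runs to dozens of digits, so exhaustively checking every grouping is hopeless.
```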

Now the natural response is that “Well, obviously we’re not carrying out this form of mathematical idealization.” In fact, when computers are required to do something like this, they use a set of heuristic approaches that don’t exactly solve the problem, but hopefully come somewhere close. van Rooij’s reply would be (actually was) “Then you should describe what humans are actually doing.” Now this is fair enough as far as it goes, but I still thought “Surely the description that this is the sort of problem we’re trying to solve still has value.”

This is a specific case of saying “the world behaves approximately like this”, or even “my model behaves approximately like this”. From a scientific perspective, the initial proposition “humans carry out categorization” opens the way to exploring how we do so, or try to do so. So dismissing this approximate description, because it isn’t computationally feasible for us to exactly solve a mathematical idealization, just prevents psychologists from using a good launching pad. With any such claim, they will almost certainly discover that the statements are naive and humans are more error-prone than the claim implies.

But it also raises the question of what description of what humans do would suffice. We could certainly go down to voltages traveling between neurons in the brain, but this is unlikely to be particularly helpful for us in “understanding” what is going on (even if that level of detail were experimentally, or computationally, feasible). After all, most of the experiments involving this task use visual stimuli at least, so various visual processing systems are involved, as well as memory, spatial processing (since we mostly think of grouping objects into piles) and who knows what else. It’s also not clear how specific all of this will be to the individual human. However, any other description is likely to be only approximate as well, even if it is now computationally feasible in a technical sense.

I think the higher level description of “they’re sorting the world into categories” is valuable, even knowing that it’s not exactly right, because it allows scientists to conceptualize the questions they’re asking, or to employ this task for other experiments. Of course, this is a very “science by interpretation” framework; a devotee of the ML viewpoint would presumably say that you should just predict what they will do and plug that into whatever you need it for.

By the same token, an approximate description of what the 800 bagged decision trees are doing is often enough to provide humans with some notion of what they need to think about, at least until we have computers to plan our experiments for us as well. Of course, any approximation has to come with caveats about where it works and when the narrative it gives you leads you in the wrong direction. It’s perfectly reasonable to say “humans categorize the world” if you are interested in using this task as some form of distraction for subjects while studying something else. It may be too simplistic if the categories they come up with are part of what you are going to look at. Cognitive scientists are forced to start from the broad, over-simplified statement and work out experimentally how it needs to be made more complicated. When looking at an ML model, we can see all of it directly, and I’ll spend a post (in a little bit) on how that gets done.

Next, however, are models for dynamics and the distinction between being mechanistic and being interpretable.

** Quantum computers don’t face the same hurdles, but while there are quantum effects in biology, I don’t think we can claim them in the brain.