No, dear reader, it has not taken me this long to come back to this blog because I am reluctant to give up talking about my own work. Simply the discipline imposed on me by official service commitments has rather sapped my discipline in other areas.
In this post, I will leave off writing about interpretability and interpretation in machine learning methods and instead focus on the details of statistical interpretation when conducting model selection. This has become particularly relevant given the recent rise of the field of “selective inference”, led by Jonathan Taylor (http://statweb.stanford.edu/~jtaylo/) and colleagues.
While the work to be discussed is conducted in the context of formal model selection (i.e., done automatically, by a computer), I want to step back and look at an older problem: the statistician effect. This was brought up by a paper which conducted the following experiment: the authors collected data on red cards given to soccer (sorry Europeans, I’m unapologetically colonial in vocabulary) players, along with a measure of how “black” they were and some other covariates. They then asked 29 teams of statisticians to use these data to determine whether darker players were given more red cards. The result was that the statisticians couldn’t agree: about 60% said there was evidence to support this proposition, another 30% said there wasn’t. They used a remarkable variety of models and techniques.
I wrote a commentary on this for STATS.org. I won’t go into the details here, but the summary is that there is a random effect for statistician (ask a different statistician and you’ll get a different answer), and the problem in question exaggerated this effect by focussing on statistical significance: all the statisticians had p-values near 0.05, but fell on one side or the other of it. Nonetheless, it does lead one to the question of how you could quantify the “statistician variance”.
Enter model selection. Declaring automatic model selection procedures (either old-school stepwise methods or more modern LASSO-based methods) a solution to statisticians using their judgement is a pretty long bow (and doesn’t account for various other modeling choices or outlier removal etc), but it will allow me to make a philosophical connection later. Modern data sets have induced considerable research into model selection, mostly under the guise of “What covariates do we include in the model?”
Without going into details of these techniques, they all share two problems for statistical inference:
1. Traditional statistical procedures such as hypothesis tests, p-values and confidence intervals are no longer valid. This is because the math behind these assumes you fixed your model before you saw the data. It doesn’t account for the fact that you used your data to choose only those covariates which had the strongest relationship to the response, meaning that the signals tend to appear stronger in the data than they are in reality. (For those uninitiated in statistical thinking: if we re-obtained the data 1,000 times and repeated the model-selection exercise for each data set, covariates with small effects would only be included for those data sets that over-estimate their effects.)
2. We have no quantification of the selection probability. That is, under a hypothetical “re-collect the data 1,000 times” set-up, we don’t know how often we would select the particular model that we obtained from the real data. Worse, there are currently no good ways to estimate this probability.
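Problem 1 is easy to demonstrate by simulation. The sketch below uses entirely invented numbers: every covariate is given the same weak true effect, a toy selection rule keeps whichever covariate looks strongest in the data, and we naively re-estimate that covariate’s coefficient. The selected estimates land well above the true value, which is exactly the over-estimation the parenthetical above describes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, true_beta = 50, 20, 0.1   # every covariate has the same weak true effect
estimates = []
for _ in range(1000):
    X = rng.standard_normal((n, p))
    y = X @ np.full(p, true_beta) + rng.standard_normal(n)
    # Toy "model selection": keep only the covariate that looks strongest
    j = np.argmax(np.abs(X.T @ y))
    # Naive re-fit of the selected one-covariate model
    estimates.append((X[:, j] @ y) / (X[:, j] @ X[:, j]))

# The average selected estimate sits well above the true value of 0.1,
# because selection favours data sets that over-estimate the effect.
print(round(np.mean(estimates), 2))
```

This is only a cartoon of the problem (real selection procedures are more elaborate than "largest correlation"), but the same bias mechanism operates in stepwise and LASSO-based selection.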
Jonathan Taylor (in collaboration with many others) has produced a beautiful framework for solving problem 1.*** This goes by the name of selective inference. Taylor describes this as “condition on the model you have selected”; in layman’s terms: among the 1,000 data sets that we examined above, throw away all those that don’t give us the model we’ve already selected. Then use the remaining data sets to examine the statistical properties of our parameter estimation. Taylor has shown that, in so doing, he can provide the usual quantification of statistical uncertainty for the selected parameters.
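A toy version of the conditioning idea, with a data-generating process invented purely for illustration: simulate many data sets, discard every one in which a crude selection rule picks a different covariate than the one we "selected", and look at the estimates only among the survivors. That conditional distribution is the object selective inference calibrates its tests and intervals against (the actual machinery works out this distribution analytically rather than by simulation).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
beta = np.array([0.3, 0.0, 0.0, 0.0, 0.0])   # only covariate 0 matters

kept = []
for _ in range(2000):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    j = np.argmax(np.abs(X.T @ y))           # toy selection rule
    if j == 0:                               # condition on selecting this model
        kept.append((X[:, 0] @ y) / (X[:, 0] @ X[:, 0]))

# `kept` is the distribution of the estimate *given* that this model
# was selected -- the data sets that picked another model are thrown away.
print(len(kept), round(np.mean(kept), 2))
```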
The (a?) critique of this is that we are conditioning on the model that we get. That is, we ignore all the data sets that don’t give us this model. In doing this, we throw out the variability associated with performing model selection. In other words, we still haven’t solved Problem 2 above.
Taylor’s response to this is that the model you select changes the definition of the parameters in it. I’ll illustrate this with an example I use in an introductory statistics class: if you survey students and ask their height and the height of their ideal partner, you see a negative relationship; it looks like the taller you are, the shorter your ideal partner. However, if you include gender in the model, the relationship switches. So the coefficient of your height in predicting the height of your ideal partner is interpreted one way when you don’t know gender and another way when you do. If we perform automatic model selection, we’ve allowed the data to choose the meaning of the coefficient, and if we repeated the experiment we might get a different model and hence give the coefficient a different meaning. Taylor would say “Yes, the hypotheses that we choose to test are chosen randomly, but this is of necessity since they mean different things for different models. In any case, this is just what you were doing informally when you let the statistician choose the model by hand.”
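The height example can be mocked up directly. The numbers below are invented for illustration, not survey data: men average 178 cm, women 165 cm, each group’s ideal partner sits near the other group’s average, and within each gender taller people prefer slightly taller partners. The marginal slope on own height then comes out negative, while the gender-adjusted slope is positive, so the same coefficient genuinely means different things in the two models.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# Invented population: men average 178 cm, women 165 cm, and each
# group's ideal partner is near the *other* group's average height.
male = rng.integers(0, 2, n)
own_mean = np.where(male == 1, 178.0, 165.0)
partner_mean = np.where(male == 1, 165.0, 178.0)
height = own_mean + rng.normal(0, 7, n)
# Within each gender, taller people prefer slightly taller partners (+0.3)
ideal = partner_mean + 0.3 * (height - own_mean) + rng.normal(0, 5, n)

# Model without gender: the slope on own height comes out negative
b_no_gender = np.linalg.lstsq(
    np.column_stack([np.ones(n), height]), ideal, rcond=None)[0]

# Model with gender: the slope on own height is now positive
b_gender = np.linalg.lstsq(
    np.column_stack([np.ones(n), height, male]), ideal, rcond=None)[0]

print(round(b_no_gender[1], 2), round(b_gender[1], 2))
```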
I see the point here, but nonetheless it makes me uneasy. In the paper I began with, the researchers asked an intelligible question about racism in football, and I don’t think they were looking for an “if you include these covariates” answer. Now to some extent, the statisticians’ analyses of this question support Taylor’s arguments: one team had simply correlated the number of red cards with the players’ skin tone and left it at that; others felt that they needed to worry about whether more strikers got red cards and perhaps more black-skinned players took that position, and other sorts of indirect effects. In fact, with the exception of the one team that looked at correlations, the question was universally interpreted as “Can you find an effect for skin tone, once you have also reasonably accounted for the other information that we have?” I generally think this is what statisticians, and the general public, understand us to be doing when we try to present these types of relationships. (Notably, there is a pseudo-causal interpretation here; the data do not support an explicit causal link, but by accounting for other possible relationships we can try to get as close as we can, or at least rule out some correlated factor.)
I think this is the key to my concerns. I see model selection as part of the “reasonably accounting for all the information we have” part of conducting inference. In particular, we hope that in performing model selection, the covariates that we remove have small enough effects that removing them doesn’t matter. (Or at least that we don’t have enough information to work out that they do matter). Essentially, my response to Taylor is that “The interpretation of a co-efficient doesn’t change between two models, when they differ only in parameters that are very close to zero”. That is, model selection might be better termed “model simplification” — it doesn’t change the model (much), it just makes it shorter to write down. The inference I want includes both 1. and 2. — my target is the model that makes use of all covariates as well as possible (if I had enough data to do this) and I want my measure of uncertainty to account for the fact that I have done variable selection with the aim of obtaining a simpler model.
Of course, this just leaves us back in the old hole: I care about both Problems 1 and 2, and I have no good way to deal with that. There has been a school of work on “stability selection”, essentially bootstrapping to look at how frequently each covariate is used in the model. This has to be done quite carefully to have valid statistical properties, which is a first problem. But I have to confess a further problem: from stability selection you get a probability of including each covariate in the model, and then a notion of how important it is, if included. If I have to look through all the covariates anyway, why do selection at all? There are plenty of regularization techniques (e.g. ridge regression) that can be employed with a large number of covariates and that are much easier to understand than model selection. Why not simply use one of them and rank the covariates by how much they contribute? I’m not convinced this won’t do just as well in the long run.
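A minimal sketch of the two alternatives, with an invented data-generating process (two real effects, eight nulls): a stability-selection-style bootstrap that records how often a crude top-2-by-correlation rule includes each covariate, next to a ridge fit that keeps everything and simply ranks the shrunken coefficients. In this toy setting both point at the same covariates, which is the sense in which I’m not convinced selection buys much over regularization plus ranking.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 10
beta = np.zeros(p)
beta[0], beta[1] = 1.0, 0.5                  # two real effects, eight nulls
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Stability-selection-style bootstrap: how often does a crude
# top-2-by-correlation rule include each covariate?
B, k = 500, 2
counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)         # bootstrap resample
    Xb, yb = X[idx], y[idx]
    counts[np.argsort(-np.abs(Xb.T @ yb))[:k]] += 1
freq = counts / B                            # inclusion frequency per covariate

# Ridge alternative: keep every covariate, rank by shrunken coefficient
lam = 10.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
ranking = np.argsort(-np.abs(b_ridge))       # covariates by ridge importance

print(freq.round(2))
print(ranking[:3])
```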
*** I call this a framework because the specifics of carrying it out for various statistical procedures still requires a lot of work.