Author Archives: Giles Hooker

About Giles Hooker

Giles Hooker is Associate Professor in the Department of Biological Statistics and Computational Biology. He works on Machine Learning, Nonlinear Dynamics, Functional Data Analysis and Dynamic Systems. He spends a lot of time thinking about what models mean and how they are interpreted, and what it means to interpret a model.

Thoughts on the Statistics Debate

On October 16, Jim Berger, Deborah Mayo and David Trafimow took part in a debate about the use of hypothesis tests and p-values in scientific studies, hosted by NISS. I’m delighted to see Statistics once again grappling with its philosophical basis, and at least some philosophers coming to help out. I think it’s worth watching if you haven’t.

Here are some of my own thoughts, having had a few days to digest.

First, the participants had agreed not to prepare beforehand, and while I understand the motivation I think that I would support Mayo’s tweet wondering if that was, in fact, the best idea. I think that the arguments would have been more cogently stated and perhaps more directly engaged with some more preparation. Trafimow, in particular, took some time to warm up (he may have taken the injunction against preparation most seriously) and that is a shame; I don’t agree with his position (more later), but this is all the more reason to want it to be persuasively argued — there might be something I’ve missed! In retrospect, I might have actually gone in the other direction and had each participant write out a statement of principles to be shared ahead of time and read at the start, and from which the discussion could proceed.

I took there to be really two debates: Mayo vs Berger on p-values versus Bayes factors, and Mayo and Berger versus Trafimow on testing at all. I’ll readily admit to finding it hard to concentrate in online fora the same way I would at an in-person event (no-one can see me check my e-mail), and I had to leave and teach before the discussion period, so I may have missed some details, but these were the most salient discussions that struck me.

For the first of these, it seemed to me that Berger and Mayo came from very different perspectives and largely talked past each other. Berger supported Bayes factors with “this is how scientists want to interpret p-values”, countered by Mayo’s “but that’s not how they ought to interpret them”. As a philosophy of science, I’m in Mayo’s camp here, but have to acknowledge that nearly a century of statistical education still seems not to have found a way to reliably get the point across. We can certainly say that correct science does not need to account for human cognitive failings — it’s ok that it’s hard — but it does open the question of whether there are more readily-understood but equally rigorous frameworks; although a Popperian description of science certainly points to something like classical statistical methods.

But I was sorry to see neither really address the concerns of the other. Berger, of course, would say that Bayesian frameworks provide a coherent alternative for scientific inference and argues that p-value evidence maps poorly onto Bayes factors. Mayo would counter that you can’t equate the scales used by the two systems. I’d agree with Mayo, but the technical argument misses the difference of intent: do we work from how scientists actually think and try to improve that, or formulate how they ought to think? The latter is certainly appealing, but does run into human failings.

This is evident in the reproducibility crisis (only part of which is based in bad statistical practice), where Mayo is again technically correct in observing that p-hacking is an indication that statistical significance is actually a pretty challenging requirement, if achieved honestly. However, the neatness of the philosophical system doesn’t account for the crisis very neatly demonstrating Goodhart’s law as generalized by Marilyn Strathern:

When a measure becomes a target, it ceases to be a good measure.

One can, of course, build in guard-rails: pre-registration, or developing methods of post-selection inference, and I would guess that Mayo would be fine with either of these so long as they preserve her severity requirements. It might be better to find ways to reward scientists not for publishing papers, but for publishing papers that get replicated (not that I have brilliant ideas about how to do this), thereby incentivizing scientists to be honest (with themselves) about their statistical practice. That idea isn’t original, and quite likely neither Mayo nor Berger would disagree with either of these statements. In fact, there may be little to say about their disagreement over a starting point besides “Let me acknowledge the opposing concern, but say that I think we have to just push past it”, but even that statement is useful.

The debate about using hypotheses at all was somewhat shorter. I had initially understood Trafimow’s editorial decision to ban significance tests as something of a sociological response to p-hacking — we will get less distorted models and conclusions if we take away the incentives. I’d agree with Mayo that, on balance, removing checks against randomness is counter-productive, but the proposition is not crazy. However, Trafimow staked out a more philosophical position: initially against basing conclusions on wrong models (the reductio of which would make progress impossible), but then against basing dichotomous decisions on incorrect models. I presume this comes from the discontinuity: I can be slightly off in my picture of the world, but still fairly close in my estimates of effects, until I turn it into an either-or proposition about something (assuming my estimates are somehow continuous in my model space). I’m not sure this is so much of a concern: the p-values Trafimow objects to (or at least their distribution) would still be continuous in the model space, and the potential for disagreement in dichotomous conclusions reduces as the models converge, as effectively argued by Mayo. I will happily buy into confidence intervals being a more informative summary of statistical evidence, when they are available, but we really shouldn’t pretend that they are anything other than a different summary of statistical tests.

I did think it was a shame here not to have a performance interpretation of hypothesis tests articulated, since “Using this model, I would not frequently be wrong” is an easy counter to Trafimow’s statements. This interpretation does leave you working with models rather than with the world they purport to describe, which is one of Trafimow’s objections, but tests do in fact talk about the projection of the world onto that model. (Inference also requires more assumptions than just obtaining parameter estimates does, but that goes for any sort of uncertainty quantification.) The other concern is that it starts veering a bit Bayesian. Nonetheless, we do work with models, and I think within-model replicability is still at least a minimal-and-non-trivial requirement. The replication crisis, too, is explicitly given in performance terms, albeit outside any particular model, so some statement that the performance criteria are at least entailed by severe testing would be helpful.

One statement of Trafimow’s that I would thoroughly get behind, however, is that replications should not just be about hypothesis tests, but need to also include effect sizes. Indeed, I’ve often worried that the narrow focus of statistical methods on single experiments is damaging: in making evidence for specific effects challenging, we make scientific conclusions contingent on the specific experimental setting they were obtained in. This disincentivizes science from developing knowledge that is transferable across situations. Biology and psychology are much more complex than physics, of course, making generalized quantitative effects much more difficult, but I’m not sure that we couldn’t do better — I have a rant about linear models being deleterious for most fields that statisticians have interacted with, but that’s for another day.

In any case; thanks to all participants for having a go. For all I may seem to complain, the debate clearly stimulated my thought processes and if the same is the case for even a fraction of the audience, that’s the optimal a priori outcome it could have had. Well done.

On The Book of Why

With the semester over, I am finally able to get back to a bit of less-academic writing, in this case about my bedtime reading: a review of The Book of Why by Judea Pearl.

I was put onto this book by the Nature Podcast, which reviewed it in their books segment. Pearl’s ideas had been known on the fringes of the statistics community for years and largely dismissed as “Well, if you call part of a model causal, the inference is rather circular”. But enough else was happening in causality in statistics that it seemed a good juncture to actually look at how badly wrong that statement was.

As it turns out, it’s not far off. But that doesn’t mean that Pearl’s ideas are devoid of content (or of no relevance to statistics). In fact, I find myself fairly violently conflicted about his project, which might make for a more-interesting-than-usual review.

The first thing to say is that this is excellent bed-time reading for a statistician. In fact it’s hard to work out who else might be the intended audience. It requires far too much statistical background for a general audience (even, I expect, for most computer scientists) but has the right level of informal discussion for me to read when too tired for technical work. I fear that may have limited its sales, but it certainly worked well for me.

The second is that it is worth slogging through the opening chapters. These are prototypical examples of CS salesmanship (“the new science of causation”???), both over-promising the remainder of the book and under-rating the contributions of others, particularly in statistics. I spent a good deal of time wanting to throw the book across the room as Pearl airily dismissed Fisher as being completely uninterested in causation (What did he think randomized trials were for?) and dismissed the entire field of statistics as following from that tradition (Had he not heard of Don Rubin or Jamie Robins? How did he think he was going to get inferences out of data, whether he called them causal or not? What about structural equation models?). As it turns out later, Pearl does in fact discuss RCTs as a gold standard (though not as the only standard), and both Rubin and Robins play large roles. Indeed, it’s hard to find a non-historical figure in the book who isn’t either Pearl’s student (and he’s admirably supportive of these) or a statistician! I rather suspect Pearl should have been a statistician, though I don’t know that he’d be prepared to admit that.

But, after plowing through the aggravation, we get down to business. I’ll divide the book into two broad topics. The first of these is, to me, relatively uncontroversial (at least in parts): causality had (surprisingly) not been formalized in terms of probabilistic mathematical models. The do-calculus (I find this name awkward, but at least it’s descriptive) provides this and, along the way, some insight about what relationships one should examine in data if you are looking for downstream causal effects.

“Don’t control for mediators” is, for example, a statistically counter-intuitive statement, though it makes causal sense. (Except the statistician in me is more inclined to condition, and then reconstruct the causal effect — I haven’t yet worked out if that leads to a loss of efficiency or the other way around.) He also shows how to view a set of not-obviously-causal problems through this lens, which is certainly interesting and useful. I’ll leave it to others who have looked at the do-calculus more carefully to assess the philosophical contribution (although things are slipperier for continuous quantities than for the discrete values that Pearl — clearly a computer scientist here — finds more comfortable), but I’ll readily admit to surprise at the realisation that it hadn’t been formalized a century or so ago. Statistical application or not, that is an important contribution within philosophy. Some of this really is slippery, though; I’d like to see the Rubin potential-outcomes framework re-written in do-calculus (I think this is possible), or to see how this relates to the much weaker notion of Granger causality. I’m less convinced that Pearl’s “ladder of causation” (Maybe step-ladder? It has but three rungs) is really as clear as all that, but I’ll accept it as a means of introducing the lay person to thinking about things.

Along the way, he skewers statisticians’ traditional interpretation of their own linear models. See my complaints on this at Weasel Words, where it’s not only causal relationships that are ignored (contrary to Pearl, I think a comparison of individuals is still interpretable even though not causal), but those are, too.

The second issue is much more controversial: that scientists are far too afraid of causal language (and in particular, that this is statisticians’ fault). The argument for this claim goes something like

  1. Scientists do, in fact, think that they are finding causal relationships.
  2. They are hampered in discussing this by depressing party-pooper statisticians forever warning about hidden confounders.

And in fact I agree with both of these and recall thinking 1. on many occasions myself! Indeed, I also find myself in emphatic agreement with Pearl’s frustration at statisticians’ refusal to go outside of their own (fairly narrow) toolbox and incorporate an understanding of the domain that they work with. See, for example, my comments in Interpretation and Mechanistic Models, although I will come back to that (and I have recently been guilty myself of having too little time to develop enough depth to really work with a problem).

BUT: these claims, and the benefits of causal understanding, are easy to make with the rather pat models that Pearl produces. It’s much harder in the real world where, as in nutrition or public health, the unlicensed causal interpretation of findings and the constant attenuation of cautionary language (e.g. recently by Andrew Gelman) produces fads and fashions, and much more malicious effects, too. My sister’s reaction to the set of ideas was that encouraging more unsupported causal interpretation would be disastrous in public health.

And in fact, this really is the nub: Pearl’s analysis works beautifully once you have agreed on what the causal relationships are. But almost everywhere, that agreement doesn’t exist (for a nice interaction with the fairness debate, see this paper). Even in The Book of Why, I kept looking at the already-overly-simple models and saying “I’m not sure that’s right.” How on earth are we to agree on causal relations in real scenarios? Every example in Chapter 9 illustrates this for me.

Now I’m pretty sure that Pearl’s response would be that the solution to this isn’t to avoid discussing causation, but to make it explicit. That is to say “A causal claim is being made here, let’s explicitly discuss that and the evidence for it.” He even (although only once, and in passing) countenances establishing tiers of evidence for causal effects: randomized trials, from observational data, etc.

And I’m generally sympathetic to “Let’s be honest about what we think is going on”, even after accounting for “But humans have a horrible tendency to run away with an idea.” And it might be worthwhile to produce a somewhat more formalized framework for discussing causal claims. But the discussion sections of papers do, in fact, often make clear what the authors think the causal relations are. These are not stated elsewhere precisely because the evidence for them is weak.

I think that what Pearl perceives as hostility to causation on the part of statistics can be reasonably attributed to caution. Statistics suffers from an over-abundance of this, partly due to statisticians’ unwillingness (or lack of time) to really get involved with a subject. This leads us to be rather scared of writing down anything but the weakest models, and it certainly leads us to be highly skeptical of causal statements where the do-calculus (i.e., performing an intervention) hasn’t been physically instantiated in an experiment. That attitude is a hindrance, but it’s also born of a century of experience of poor replicability and bad scientific consequences. Even the more-often accepted Rubin analysis has a “no hidden confounders” assumption that many (myself included) find hard to swallow.

Now I agree with Pearl that statisticians are far too scared of writing down a model. One doesn’t have to be Bayesian to say “Let’s start with writing down what I think I know and see how that agrees with the data” — we do that in sample size calculations already. But I’m really not persuaded that sticking with linear models or categorical relationships is the way to do this. One thing that Pearl leaves out of this book is mechanism: how, physically, does this effect take place? (Ok, this comes up as mediation, but in a quite different context.) More importantly, can I transfer this understanding to a different system? Without this sort of understanding — and that’s very difficult — causality has relatively limited uses. But Pearl’s observation-based and rather pat models don’t really get us anything like that.

So do read The Book of Why; it is a set of ideas that you should know about, at least informally, and it is useful to think about more than just phenomenological correlation. But then also go and read some physics, or some mathematical biology as well. Statisticians need a really good dose of both.

Uncertain Explanations

A much-delayed addendum while I’ve attempted to get the following paper out

which deals with ensuring the stability of model distillation using trees.

This actually touches on both the global approach to understanding machine learning that I have mostly explored, and the local approach to explaining a particular prediction. I’ve previously discussed the issue of uncertainty in machine learning in terms of global interpretation — are you trying to understand how a particular function arrives at a prediction, or are you trying to say something about the underlying causes of that prediction?

Here I’ll ask the same question about local explanations. This is particularly relevant in the light of the European General Data Protection Regulation (GDPR) which can be read to impose a “right to an explanation” for decisions that are made about individuals. Exactly what such a right entails (and whether it really is implied in the GDPR) is not currently well defined; interpretations vary from a requirement to provide a formula, to identifying actions that could be taken to change a decision, to requiring something closer to causal reasoning identifying why that decision was reasonable.

To be concrete, let’s take a bank that develops a tool to automatically decide whether or not to give a mortgage to an applicant. If they don’t approve the loan, they can be asked “Why not?” with the implication that “because my neural network said so” would not really cut it.

In fact, this is not really so new an idea: I recall hearing someone from Fair Isaac discussing just this problem in the early 2000s, and their solution was pretty natural: use a decision tree. I discussed the basic ideas of decision trees earlier in my blog, where we saw that a tree has a pretty easy-to-follow glyph. There is also a natural explanation for why you didn’t get your loan — the decision at the last node on the tree. ****
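As a minimal sketch of reading an explanation off a tree (the toy “mortgage” data and variable names here are my own invention, not Fair Isaac’s system), scikit-learn’s `decision_path` recovers the chain of splits behind a single prediction, with the last internal node giving the “final reason”:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy applicant data: columns stand in for, say, age, income and debt.
X = rng.normal(size=(300, 3))
y = (X[:, 1] - X[:, 2] > 0).astype(int)  # "approve" when income exceeds debt
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

applicant = np.array([[0.1, -0.5, 0.8]])
# Node ids visited from root to leaf; sklearn numbers children after parents,
# so ascending node ids give the path in order.
path = tree.decision_path(applicant).indices
last_internal = path[-2]                     # the final split before the leaf
feature = tree.tree_.feature[last_internal]  # which variable decided it
threshold = tree.tree_.threshold[last_internal]
decision = int(tree.predict(applicant)[0])
```

The pair `(feature, threshold)` at `last_internal` is the tree’s “because your income was below $50,000”-style answer, though as the footnote notes, earlier splits on the path matter too.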

I also noted that trees don’t perform very well. Fair Isaac’s solution was to first train a neural network (if I remember correctly) and then use this to generate a huge amount of data to create a tree that mimics the neural network. This idea has since been given the term “model distillation” — first fit a complex model (a teacher), then produce an interpretable model (a student) that mimics the predictions of the original. I repeated Fair Isaac’s idea a few years ago for an application that involved shortening psychiatric questionnaires (you can traverse the tree, asking questions only as you need them). I couldn’t find a citation to the Fair Isaac work, but the idea does seem to go back to at least 1995.

The paper I shamelessly plugged above asks “how much data do we need to stabilize the tree structure?” If we characterize our procedure as

1. Train some ML model F(X) to predict Y from X.

2. Generate a huge set of new example X’s and give each example the response F(X)

3. Use this large new data set to train a Tree that mimics F.

we (well, mostly my students Yichen and Zhengze Zhou) asked “How many new X’s do you need so that repeating Steps 2 and 3 gets you the same tree?” The answer turns out to be a couple of orders of magnitude more than seems to be regularly used. And we developed some tools to assess the stability of this type of approximation tree and to look at how deep that tree should be (more on that later).
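The question above can be sketched directly (this is a toy illustration of mine, not the paper’s code — the teacher, sample sizes and function names are all assumed): repeat Steps 2 and 3 with independent pseudo-samples and check whether the tree’s structure — here just its root split — comes out the same each time:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Step 1: a teacher F(X) whose signal is dominated by one feature.
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)
teacher = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def distilled_root(n_pseudo, seed):
    # Steps 2 and 3: a fresh pseudo-sample labelled by the teacher,
    # then a small student tree fit to mimic it.
    r = np.random.default_rng(seed)
    Xp = r.normal(size=(n_pseudo, 5))
    student = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
        Xp, teacher.predict(Xp))
    return student.tree_.feature[0]  # feature used at the root split

# Distinct root-split features over 10 replications, small vs large samples:
roots_small = {distilled_root(100, s) for s in range(10)}
roots_large = {distilled_root(20_000, s) for s in range(10)}
```

With large pseudo-samples the root split locks onto the dominant feature on every replication; with small samples it can wander, which is exactly the structural instability in question.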

Now a reasonable question might be “why should I care?” It actually turns out that the huge amounts of data we ended up using weren’t necessary to produce a tree that gave accurate predictions — a tree could split on X1 first and then X2, or vice versa and end up with fairly similar predictions. We needed this much data solely to ensure that the structure of the tree remained stable.
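To make that concrete (again a toy sketch under my own assumed setup, not the paper’s experiments): two independently distilled students can be checked for agreement in their predictions, which stays high at sample sizes far below those needed to pin down structure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.8 * X[:, 1] > 0).astype(int)  # two nearly-as-good split variables
teacher = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def distill(seed, n_pseudo=5000):
    # One run of the distillation loop with its own pseudo-sample.
    r = np.random.default_rng(seed)
    Xp = r.normal(size=(n_pseudo, 5))
    return DecisionTreeClassifier(max_depth=5, random_state=0).fit(
        Xp, teacher.predict(Xp))

s1, s2 = distill(0), distill(1)
X_test = rng.normal(size=(5000, 5))
# Fraction of fresh points on which the two replicate trees agree:
agreement = (s1.predict(X_test) == s2.predict(X_test)).mean()
```

High `agreement` between replicate trees is compatible with their splitting in different orders internally — which is why predictive accuracy alone says little about the stability of the explanation.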

So again, why should I care? This is an iteration on uncertainty in ML. If I am just seeking to explain how a prediction was arrived at, I could obtain the tree and actually use it to make predictions (as opposed to explaining predictions). I can then readily claim that I am providing a description of how the tree I happen to use arrives at its predictions. This is ok, even if the structure of my tree is chosen by random chance.

However, if I don’t replace my neural network or random forest with the tree, and just use it to explain a different model, we might start getting dubious. If you are told you didn’t receive a mortgage because you are over the age of 37, but there is a parallel universe in which exactly the same decision is explained by your income being below $50,000, you might start questioning the worth of that explanation.

Trees, of course, are only one example of this. LIME — currently the most popular source of local explanations, based on approximating the gradient of F(X) — also uses a random sample whose stability we might query. I have to admit to not yet having gotten around to looking at that.
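A hand-rolled sketch of the same stability question for LIME-style local explanations (this is my own simplified stand-in for LIME — a weighted-free local linear fit around one point — with all data and names assumed): repeat the perturbation sample and see how much the local coefficients move.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + X[:, 1]
F = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x0 = np.zeros(3)  # the point whose prediction we want to explain

def local_coefs(n_samples, seed, scale=0.5):
    # Perturb around x0, label with F, and fit a local linear surrogate;
    # its coefficients play the role of the "explanation".
    r = np.random.default_rng(seed)
    Xp = x0 + scale * r.normal(size=(n_samples, 3))
    return LinearRegression().fit(Xp, F.predict(Xp)).coef_

# Monte Carlo spread of the explanation across 20 replications:
coefs_small = np.array([local_coefs(50, s) for s in range(20)])
coefs_large = np.array([local_coefs(5000, s) for s in range(20)])
spread_small = coefs_small.std(axis=0).max()
spread_large = coefs_large.std(axis=0).max()
```

Here `spread_*` is purely Monte Carlo variability — the same quantity discussed below — and shrinks as the perturbation sample grows, even though F itself is fixed.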

The point of this is that you can have gotten your original F(X) any way you like; we are only looking at Monte Carlo error — the variability due to the random samples you generate in order to arrive at your explanation. There are already plenty of debates about what a “right to an explanation” entails, but I’ve yet to hear of “statistical stability” being added as a criterion. Perhaps it should be — or it should at least be made explicit that various other criteria (such as some form of causal explanation) assume such stability.

Of course, we can ask about stability more generally — does our explanation have to be non-random, or is it ok to pick up on random features of the data? I.e., does the fact that people over 180cm tall and born on October 19 happen to have an unusually high default rate in our data set allow us to use that as an explanation, even if another data set would be very unlikely to find that this replicates? Alternatively, what standard of scientific replicability of explanations is required by the GDPR?

We couldn’t do much in our paper on that. We can assess whether the tree is capturing signal that is distinguishable from noise when it makes a split. We could potentially ask whether the same split would be chosen, not just with another randomly generated sample from F(X), but if we re-learned F(X) using a new data set. But changes in one split cascade down the rest of the tree and it’s then very hard to represent variability between different trees, or to know what to do with it, even if you could.

So this is imperfect: there is a form of explanation where “this is what we do, you don’t get to query why” means you don’t have to worry about uncertainty. But I think that disappears as soon as you use one model to predict and a second to explain. It seems that full-on scientific replicability might be too strong a requirement — that would probably render all of ML useless — but what should the standard be?

There’s not much work on assessing such stability. I don’t know how stable LIME is; and we were only able to assess whether our trees were simply modeling random variation when we used random forests to produce F(X), where we know something about their stability. But the question is barely being asked, and I think it will become practically relevant fairly quickly.


**** Of course, changing other covariates could affect a decision higher up in the tree, which also changes the decision about your mortgage. So this is fairly imperfect in itself.


Interventions, Interpretation and Ethics

Herein are a set of musings prompted by Rich Caruana’s wonderful talk at the workshop I boasted about last post. I don’t think that this is a knock-out for ML interpretation (as I’ll say at the end), but the ethics of ML derived from observational data do start to give a real case for interpretation. That’s not without complications or difficulties, something I want to work through here.

The essential issue here is that the principal use of ML is for interventions: what search results will we show you? Will you repay a loan? Will a prisoner re-offend? Will a job candidate perform well? Is there a high probability of crime on this corner? Is that a truck on the road ahead, or a street sign? What direction is this bicycle moving? These predictive questions are all relevant because they imply actions that change the outcomes of events.

What this means is that to the extent that those actions are directed at changing outcome, any relationship between covariates and outcomes has to be thought of as causal. And in particular, when we employ ML trained on observational data, the potential for causal confounders must always be considered.

Rich’s talk highlighted this beautifully. I encourage you all to watch it, but for those without an hour’s spare time, he presented a wonderful study in all of the confounding effects you might find if you’re not careful. The highlight, as I discussed last time, was the finding that asthma appears to substantially reduce your mortality risk from pneumonia. Rich’s explanation was that this was confounded with either time of onset (asthmatics notice symptoms earlier) or the aggressiveness of care. The key issue being that if an ML tool was naively employed to triage patients, it would send all the asthmatics home — the opposite of good medical practice.

The point here being that if you simply threw a deep network at the data (or a random forest) and didn’t look at how it was using it, you’d miss the effect. Of course, you’d put the tool through some trials first and hopefully notice that your improvement in outcomes is not what you’d estimate from the data (or more realistically, that doctors would start saying “this isn’t what I would have done”, first). At that point, however, one does have the question of what to do: you need to do some interpretation to work out that the asthmatics are all being sent home, so a die-hard non-interpretation advocate can’t do much but try training more and then give the whole project over when it doesn’t work out.

Of course, what should be done to fix up a tool once you’ve discovered this effect is an issue, but that’s for another time.

The real issue is that Rich found more and more of these effects, which often got increasingly subtle. There are round-number effects of age that might affect treatment: over 85 and you’ve had a good life anyway — your risk jumps — over 100 and you’ve become special and your risk goes down again. Plus a whole bunch of other things. And here’s the ethical problem: how hard do we have to explore for these effects before giving the tool a try?

Let’s get back to a clinical trials analogy. If you have a new drug, you run trials to check both for safety (side effects) and for efficacy. You also control for things you think might be a risk: you don’t give a new sulfa drug to someone who is allergic, in the same way that we might exclude asthmatics from our automatic triage tool. Once you’ve excluded people on these and similar grounds, you test to see if there’s anything else that you (and reviewers of your trial) haven’t thought of. There is no way to comprehensively enumerate all the things that might go wrong.

So far, so much analogy, but I think there are a couple of important distinctions here. First, the concept of a safety study doesn’t really apply — side effects are the result of actions recommended by our auto-triage, not of the use of the triage itself. You can’t apply our auto-triage to healthy patients and see if they have an adverse outcome just because of it. Second, the involvement of social factors creates a much more complex set of problems. If I look at a molecular compound that I want to use to treat some disease, we have some reasonable chemical knowledge that would indicate a relatively small set of things to worry about. When we consider all the things that could be confounding a data set like Rich’s, there is a much larger body of possible negative effects, some of which we might find hints of in data, some of which we might not. Given time, we might start cataloguing common confounders in a particular setting (so long as we look for them), but given that we know these effects are likely, but don’t know what they might be, how hard should we search for them?

I don’t have a great standard for this, and I think we also have to ask “What can we do to fix our auto-triager, now we know it’s biased against asthmatics?” before we really develop one. There are some forms of safety trial that one can do: have a doctor also triage cases and flag where she differs from our model (we still have to work out why), but how many are enough? More automatically, we probably can make guarantees that suspicious effects that _might_ occur in the data are smaller than some level (but a cut-off has to be agreed), but these are tools that need to be developed. We also need to be confident that our model exploration is catching suspicious effects (more on that later).

Outside of clinical settings, though, the same questions arise. A large part of the fairness literature in ML is about having the wrong data for exactly these reasons, and in contexts that may be harder to understand. How do we establish that facial recognition software performs worse on African Americans because its training corpus is white? What about sexist automatic translators? These are all interventions where confounding relationships have made an algorithm behave poorly. In both cases, they aren’t problems you can pick up pre-deployment, because even your test data has the same confounding effect.

Now of course, it may become more obvious in the real world when, for example, black women have a harder time getting their iPhone 10s to unlock, but even here it’s fairly reasonable to assume there will be some confounding — and how, and how much, should we check before deployment? What about monitoring afterwards?

So does this mean that ML is never exempt from interpretation? Not really: all these examples involve an intervention that changes the observed outcome. There’s lots of prediction problems that don’t: astronomers want to classify galaxies, we might want to predict sales volume for the sake of inventory management (this won’t affect demand particularly). As Rich notes, an insurance company might want to forecast the outcome of a pneumonia patient, but not actually change the treatment.

Even when we are going to intervene, I might know that there is a random process that breaks a confounder. I might take a clinical trial, for example, and then try to predict a treatment effect in finer detail: some particular set of patients respond well to this particular drug. Since the drug assignment was randomized, I can be confident it isn’t confounded with other patient characteristics.
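A small simulation of this point (entirely synthetic; the “biomarker” covariate and effect sizes are made up for illustration): because treatment is randomized, a simple difference in means within each subgroup recovers that subgroup’s treatment effect, with no confounding to worry about.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# A patient covariate (a made-up biomarker) and a *randomized* treatment.
biomarker = rng.normal(size=n)
treated = rng.integers(0, 2, size=n)              # coin-flip assignment

# Suppose, purely for illustration, the true effect is larger for
# high-biomarker patients (1.0 vs 3.0).
effect = 1.0 + 2.0 * (biomarker > 0)
outcome = 0.5 * biomarker + effect * treated + rng.normal(size=n)

# Because assignment is randomized, difference-in-means within each
# subgroup estimates that subgroup's treatment effect directly.
estimates = {}
for name, mask in [("low", biomarker <= 0), ("high", biomarker > 0)]:
    estimates[name] = (outcome[mask & (treated == 1)].mean()
                       - outcome[mask & (treated == 0)].mean())
print(estimates)
```

With observational data, by contrast, `treated` could depend on `biomarker` (or on covariates we never measured) and the same difference in means would be biased.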

So, bad data is a reason to want to interpret ML, but that doesn’t mean you always need to. We do still need to work out how hard you should work at that interpretation, and what you should do if you think you’ve found a problem.

A Brief and Undisciplined Rant Against Refereed Conferences

This post is off-topic, but striking while hot and ill-considered.

I get the arXiv to send me a digest of the previous day’s submissions in Statistics and in Machine Learning. It sends out the titles and abstracts of new submissions (from two days previously) and the titles of anything that’s been updated. It’s usually a good way of keeping track of developments, and I’ve even started trying to get through the five-year backlog of papers I intended to read.

This morning, however, the e-mail was novelette-length, and I was getting somewhat exasperated by all the people suddenly doing random forests until I realised

Of course! The KDD deadline was this weekend!

I should have known about that since I was kept online by one of my students trying to make the deadline.

So I’m going to be hypocritical (given that I’ve got a paper going into KDD) and declare that I think Computer Science-style conference publications are a disservice to research. This was made evident both by the experience of putting the paper together and by this morning’s digest.

I’ve sometimes heard the opinion that conference publications let you get work out faster and make the field move quicker.

There’s some truth in that, but frankly it also means that you don’t think as carefully about what you write, and I’m sure that referees don’t think as carefully about what they read. This means there’s more junk and a worse filter. It’s certainly true that the paper I was involved in ended up glossing over a bunch of unresolved issues, and while I think it has good content, it really could have been thought through more carefully, its analysis better targeted and its experiments more fully explored. I suspect this is also true of the small mountain of papers I felt I should take a look at this morning.

That is, there is a point to slow scholarship; to papers that are published because they say something well thought out and have a good reason for saying it. There are benefits to a referee process that improves the paper and takes the time to do so.

Given the current hype around CS/ML and the explosion of interest in these conferences, I expect that the process of selecting papers will simply get noisier.

But there are CS journals! And I’m prepared to acknowledge that. What bugs me here is the double counting. I hadn’t realised until recently, but apparently it’s de rigueur to publish in KDD or ICML or NIPS and then turn around and put the same material, somewhat expanded, into a journal paper. I’m not crying intellectual dishonesty here; but I do feel that CS conference publications ought to be viewed as different from, and lesser than, journal papers. I’ve previously been sold on a 1:1 conversion between a NIPS paper and a stats journal paper, and I’ve tended to see them discussed in those terms. I think that’s worrying.

Now anyone who has been to a Statistics conference knows the nightmare of 60 parallel sessions, with everyone and their pet tapir giving 12-minute talks in rooms containing only the other presenters. I could readily see an advantage in having a randomized selection process that hopefully improves the average quality of the presented papers and cuts down on the size of the program book; but that’s about how a conference publication should be valued, and someone will have to tell me why I should take the time to read one.


The Interpretation Debate

It’s been some time since I’ve written anything here, although in that time the intelligibility debate within the machine learning community has grown substantially ***. This has included my own contributions (with thanks to Dad), but developments like the European Union’s General Data Protection Regulation, with its “right to an explanation” and the question of exactly what that means, have helped. In the last two years, NIPS has featured large events (2016, 2017: what a great URL!) on the topic (co-organized by my fantastic Cornell colleague Andrew Wilson, and my fantastic collaborator Rich Caruana).

Alongside these, the Fairness, Accountability and Transparency in Machine Learning (FATML) movement has started raising concerns about the use of ML and the ways it can perpetuate or exacerbate social biases. Some of this was catalyzed by a book by Cathy O’Neil and by ProPublica investigations of the use of risk scores in parole hearings.

I’ve also just returned to teaching after running a workshop on ML and Inference. I wouldn’t normally be quite so self-promoting, but really all I did was get interesting people together in one place; the talks are all online, some more technical than others, but they were all interesting, innovative and well worth watching.

So I’m inspired to write again; in particular, I think there are topics in explanations and uncertainty quantification that are unexplored, and there are important relationships between causal inference and fairness that I don’t yet understand and want to chew over here.

But first: a debate! The most recent Workshop on Interpretable Machine Learning featured such a debate, with the proposition that “Interpretation is Necessary in Machine Learning”, argued by some of the best in the field. Having gotten around to watching it, I can’t help but pass commentary.

The first thing to say is that I think the debate ended up being pretty conciliatory. One of the standard debating tactics is to make the debate a disagreement about definitions and to some extent this debate could be summarized as:

Affirmative: There are some occasions when you need to understand what your ML model has learned.

Negative: There are some occasions where you don’t.

Not that this isn’t, in itself, illuminating. See Rich’s talk at my workshop for why one might want to understand what you’re doing. I want to expand on this in a later post, but the essence is that when using ML for an intervention, you need to view your covariates causally.

In particular, Rich had a data set that he wanted to use to predict mortality risk from pneumonia; this could then be used to triage patients between a high risk group to hospitalize and a low-risk group to send home. The thing was that he found that having asthma meant you were predicted to be at lower risk. This is almost certainly not because asthma protects you, but because asthmatics get more aggressive treatment. Thus, employing this tool would cause a spike in asthmatic mortality. Crucially, since the test set will have the same properties, out of sample performance won’t help uncover this before it is deployed.
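To make the asthma story concrete, here is a purely synthetic sketch (hypothetical numbers and variable names; assuming scikit-learn is available; this is not Rich’s actual data or model): in the training data asthmatics get aggressive care, so a model fit to these data learns that asthma *lowers* mortality risk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000

asthma = rng.integers(0, 2, size=n)
severity = rng.normal(size=n)

# Confounding baked into the observational data: asthmatics always get
# aggressive care, which strongly lowers mortality; asthma itself raises it.
aggressive_care = asthma == 1
logit = severity - 2.0 * aggressive_care + 1.0 * asthma
death = rng.random(n) < 1 / (1 + np.exp(-logit))

# Fit a risk model that sees the condition but not the treatment.
X = np.column_stack([asthma, severity])
model = LogisticRegression().fit(X, death)

# The learned asthma coefficient is negative: asthma looks protective.
asthma_coef = model.coef_[0][0]
print("asthma coefficient:", asthma_coef)
```

A held-out test set drawn from the same observational process would rate this model highly, which is exactly the problem: the confounding travels with the data.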

The counter to this is the statement “Well, you would always run a randomized trial to test this sort of intervention” and it’s fair to say that you don’t necessarily know that some new drug, say, won’t have counterproductive effects in some subset of the population. (This was effectively Yann LeCun’s response in the debate). Ethically, of course, we couldn’t do that in this case since we have reason to believe it could be harmful, but that potentially applies to any intervention based on an ML tool that’s trained with observational data. That is, since we know that observational data is prone to confounding effects, can we ever employ it without seeking to check whether there might be some reason to believe they might be deleterious?

I think the answer to that is “It depends” and will come back to this, as I will to the notion of performance evaluation. At one point during the debate Kilian Weinberger asked whether one would trust a doctor or a robot to perform surgery where the latter had much better outcomes than the former. Everyone said yes; but Rich’s point would be “Not if the robot got only the easy cases”.

This will all take me too long for one post.

Some other thoughts from the debate:

1) There seemed to be an assumption that performance was incompatible with intelligibility. I’m not sure that that is really the case. I’d agree that I’m less sure about what it means to be intelligible in image processing, but when the individual covariates are interpretable, one can often use a black box model as a guide to produce something fairly understandable that does just as well. At least within the span of additive models (something that Rich has spent a good deal of time working on), I’ve never found a model that really needed more than a three-way (non-parametric) interaction.

Along those lines, Andrew would even challenge the notion that deep learning is unintelligible. I’m less convinced about this. Yes one could think of the bottom layer of a network as learning concepts (in image recognition this can be compelling) and then try to work out the more complex concepts in the next layer and so on; but I’ve never really seen that carried through convincingly. (Admittedly, I’ve also never really looked).

2) Some of what I would call interpretation was dismissed as not being interpretation. In particular, local interpretation, referred to as “sensitivity analysis” in the debate (how does changing one covariate affect the prediction?), was characterized as something other than interpretation; presumably this would apply to the “right to an explanation” too. I think you’re learning something human-understandable from that, and as such it counts as interpreting.

Similarly, model distillation (i.e., building a simpler model to mimic a black box; a term that wasn’t around when I wrote about it in the mid-noughties ***) wasn’t regarded as interpretation. Here you do need to worry about accurately capturing the model you’re mimicking, but so long as you can do that to within acceptable bounds, I’d call it interpretation.
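As an illustration of distillation in this sense (a sketch on synthetic data, assuming scikit-learn; not any particular published method): fit a black box, then train a small, readable tree to mimic the black box’s *predictions* rather than the raw labels, and measure how faithfully the mimic tracks it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(2000, 4))
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=2000)

# The black box: a largish random forest.
black_box = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Distillation: a depth-3 tree trained on the black box's predictions.
mimic = DecisionTreeRegressor(max_depth=3).fit(X, black_box.predict(X))

# Fidelity: R^2 of the mimic against the black box (not against y).
fidelity_r2 = mimic.score(X, black_box.predict(X))
print("fidelity R^2:", round(fidelity_r2, 3))
```

The fidelity score is exactly the “within acceptable bounds” check: a high value means conclusions drawn from the small tree are conclusions about the black box.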

Similar remarks can be made about Yann LeCun’s employing background knowledge (such as that asthma is likely to make pneumonia worse). In some cases — such as structuring convolutional neural networks for image processing — you aren’t encoding a lot of explicit knowledge about the particular task, just what pictures are like and what works well for processing them. But in health care, the act of saying “Hang on, we really ought not to send all our asthmatics home with aspirin” really ought to count as interpretation.

So here I am arguing over a different definition in the debate. If you say that only algebraically simple models (say, those from which a human could manually produce an output in under 10 minutes) are interpretable, then yes, you likely have to sacrifice accuracy. If you ask only about local interpretation (sensitivity analysis), there is nothing you need to give up. And if you ask about models that can be interrogated to reveal global-level patterns, I think there’s a lot more flexibility than the negative team suggested.

Is this an interesting shift from the ML “human interpretation is just ego-boosting” view I looked at a few years ago? Well, a lot of this is really not about “we need algebraically simple models” but rather “we don’t have the right data”, something I’ll get into next.



*** This is clearly not due to me, or I’d be getting more citations.

*** One of those papers that isn’t getting more citations — even by my own students! 

On Interpretation In Multiple Models

No, dear reader, it has not taken me this long to come back to this blog because I am reluctant to give up talking about my own work. Simply the discipline imposed on me by official service commitments has rather sapped my discipline in other areas.

In this post, I will leave off writing about interpretability and interpretation in machine learning methods and instead focus on the details of statistical interpretation when conducting model selection. This has become particularly relevant given the recent rise of the field of “selective inference”, led by Jonathan Taylor and colleagues.

While the work to be discussed is conducted in the context of formal model selection (i.e., done automatically, by a computer), I want to step back and look at an older problem: the statistician effect. This was brought up by a paper that conducted the following experiment: the authors collected data on red cards given to soccer (sorry, Europeans; I’m unapologetically colonial in vocabulary) players, along with a measure of how “black” they were and some other covariates. They then asked 29 teams of statisticians to use these data to determine whether darker-skinned players were given more red cards. The result was that the statisticians couldn’t agree: about 60% said there was evidence to support this proposition, another 30% said there wasn’t. They used a remarkable variety of models and techniques.

I wrote a commentary on this; I won’t go into the details here, but the summary is that there is a random effect for statistician (ask a different statistician and you’ll get a different answer), and the problem in question exaggerated its effects by focusing on statistical significance: all the statisticians had p-values near 0.05 but fell on one side or the other of it. Nonetheless, it does lead one to ask how you could quantify the “statistician variance”.

Enter model selection. Declaring automatic model selection procedures (either old-school stepwise methods or more modern LASSO-based methods) a solution to statisticians using their judgement is drawing a pretty long bow (and doesn’t account for various other modeling choices, outlier removal, etc.), but it will allow me to make a philosophical connection later. Modern data sets have induced considerable research into model selection, mostly under the guise of “Which covariates do we include in the model?”

Without going into details of these techniques, they all share two problems for statistical inference:

1. Traditional statistical procedures such as hypothesis tests, p-values and confidence intervals are no longer valid. This is because the math behind these assumes you fixed your model before you saw the data. It doesn’t account for the fact that you used your data to choose only those covariates which had the strongest relationship to the response, meaning that the signals tend to appear stronger in the data than they are in reality. (For those uninitiated in statistical thinking: if we re-obtained the data 1,000 times, repeating the model-selection exercise for each data set, covariates with small effects would only be included for those data sets that over-estimate their effects.)

2. We have no quantification of the selection probability. That is, under a hypothetical “re-collect the data 1,000 times” set-up, we don’t know how often we would select the particular model that we obtained from the real data. Worse, there are currently no good ways to estimate this probability.
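Problem 1 is easy to demonstrate by simulation (pure-noise data; the sample sizes and the number of covariates are arbitrary choices): select the covariate that looks strongest, then compute its p-value as if it had been chosen in advance, and watch the “significant” findings pile up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p, reps = 100, 20, 500
selected_pvals = []

for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)           # pure noise: no covariate matters
    # "Model selection": keep the covariate most associated with y...
    j = np.argmax(np.abs(X.T @ y / n))
    # ...then compute its naive two-sided p-value as if it were pre-chosen.
    r = np.corrcoef(X[:, j], y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r**2))
    selected_pvals.append(2 * stats.t.sf(abs(t), df=n - 2))

# A valid procedure would flag ~5% of these null data sets at level 0.05.
frac_sig = np.mean(np.array(selected_pvals) < 0.05)
print(f"fraction 'significant' at 0.05: {frac_sig:.2f}")
```

Because the selected p-value is effectively the minimum of twenty, far more than 5% of these pure-noise data sets come out “significant”.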

Jonathan Taylor (in collaboration with many others) has produced a beautiful framework for solving problem 1.*** This goes by the name of selective inference. Taylor describes it as “conditioning on the model you have selected”; in more layman’s terms: among the 1,000 data sets that we examined above, throw away all those that don’t give us the model we’ve already selected, then use the remaining data sets to examine the statistical properties of our parameter estimates. Taylor has shown that, in so doing, he can provide the usual quantification of statistical uncertainty for the selected parameters.

The (a?) critique of this is that we are conditioning on the model that we get. That is, we ignore all the data sets that don’t give us this model. In doing this, we throw out the variability associated with performing model selection. That is, we still haven’t solved Problem 2. above.

Taylor’s response to this is that the model you select changes the definition of the parameters in it. I’ll illustrate this with an example I use in an introductory statistics class: if you survey students and ask their height and the height of their ideal partner, you see a negative relationship — it looks like the taller you are, the shorter your ideal partner. However, if you include gender in the model, the relationship switches. So the coefficient of your height in predicting the height of your ideal partner is interpreted one way when you don’t know gender and another way when you do. If we perform automatic model selection, we’ve allowed the data to choose the meaning of the coefficient, and if we repeated the experiment we might get a different model and hence give the coefficient a different meaning. Taylor would say “Yes, the hypotheses that we choose to test are chosen randomly, but this is of necessity since they mean different things for different models. In any case, this is just what you were doing informally when you let the statistician choose the model by hand.”
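The height example can be simulated directly (completely made-up numbers, for illustration only): marginally the slope of partner height on own height is negative, but within each sex it is positive.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Hypothetical population: women ~165cm, men ~178cm (sd 6cm), and each
# person's ideal partner is near the opposite sex's mean, nudged by
# their own height.
female = rng.integers(0, 2, size=n)
own_mean = np.where(female, 165, 178)
height = own_mean + rng.normal(0, 6, size=n)
partner = (np.where(female, 178, 165)
           + 0.3 * (height - own_mean)
           + rng.normal(0, 4, size=n))

# Marginal regression: taller people (mostly men) "want" shorter partners.
slope_marginal = np.polyfit(height, partner, 1)[0]

# Conditioning on sex flips the sign: within each group the slope is +0.3.
slope_female = np.polyfit(height[female == 1], partner[female == 1], 1)[0]
slope_male = np.polyfit(height[female == 0], partner[female == 0], 1)[0]
print(slope_marginal, slope_female, slope_male)
```

Same covariate, two genuinely different parameters: the marginal slope mixes the within-sex preference with the between-sex height gap, which is exactly Taylor’s point about the meaning of a coefficient depending on the model.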

I see the point here, but nonetheless it makes me uneasy. In the paper I began with, the researchers asked an intelligible question about racism in football and I don’t think they were looking for an “if you include these covariates” answer. Now to some extent, the statisticians’ analyses of this question support Taylor’s arguments — one team had simply correlated the number of red cards with players’ skin tone and left it at that; others felt that they needed to worry about whether more strikers got red cards and perhaps more black-skinned players took that position, and other sorts of indirect effects. In fact, with the exception of the one team that looked at correlations, the question was universally interpreted as “Can you find an effect for skin tone, once you have also reasonably accounted for the other information that we have?” I generally think this is what statisticians, and the general public, understand us to be doing when we try and present these types of relationships. (Notably, there is a pseudo-causal interpretation here; the data do not support an explicit causal link, but by accounting for other possible relationships we can try to get as close as we can — or at least rule out some correlated factor.)

I think this is the key to my concerns. I see model selection as part of the “reasonably accounting for all the information we have” part of conducting inference. In particular, we hope that in performing model selection, the covariates that we remove have small enough effects that removing them doesn’t matter. (Or at least that we don’t have enough information to work out that they do matter). Essentially, my response to Taylor is that “The interpretation of a co-efficient doesn’t change between two models, when they differ only in parameters that are very close to zero”. That is, model selection might be better termed “model simplification” — it doesn’t change the model (much), it just makes it shorter to write down. The inference I want includes both 1. and 2. — my target is the model that makes use of all covariates as well as possible (if I had enough data to do this) and I want my measure of uncertainty to account for the fact that I have done variable selection with the aim of obtaining a simpler model.
Of course, this just leaves us back in the old hole: I care about both Problems 1 and 2, and I have no good way to deal with that. There has been a line of work on “stability selection”: essentially bootstrapping to look at how frequently each covariate is used in the model. This has to be done quite carefully to have valid statistical properties, which is a first problem. However, I have to confess a further worry: from stability selection you get a probability of including each covariate in the model, and then a notion of how important it is, if included. If I have to look through all covariates anyway, why do selection at all? There are plenty of regularization techniques (e.g. ridge regression) that can be employed with a large number of covariates and are much easier to understand than model selection. Why not simply use one of them and rank covariates by how much they contribute? I’m not convinced this won’t do just as well in the long run.
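For concreteness, here is a bare-bones version of the bootstrap selection-frequency idea (not the carefully calibrated stability selection procedure itself; synthetic data, assuming scikit-learn’s Lasso): resample, refit, and count how often each covariate survives.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first two covariates matter; the rest are noise.
y = 2 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)

B = 100
freq = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # bootstrap resample
    coef = Lasso(alpha=0.2).fit(X[idx], y[idx]).coef_
    freq += coef != 0                                 # count selections
freq /= B

print("selection frequencies:", np.round(freq, 2))
```

The true covariates come out with frequency near one and the noise covariates near zero; the worry in the text is that once you have this table for *every* covariate, you might as well have ranked ridge coefficients instead.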



*** I call this a framework because the specifics of carrying it out for various statistical procedures still requires a lot of work.

Statistical Inference for Ensemble Methods

It’s taken me a while to get back to this blog, so thanks to those of you who are still following. The break was partly because I’ve been back from sabbatical and facing the ever-increasing administration burden associated with recently-tenured faculty (there’s a window between being promoted and learning to say no to everything that I think the administration is very good at exploiting).

It’s also partly because I didn’t want to keep harping on about my own research. I will finish off from where I left my last blog post (on producing uncertainties about the predictions from random forests) but I want to then get back to philosophical musings, particularly about some recent developments in statistics.

So, back to self-indulgence. Lucas and I managed to produce a central limit theorem for the predictions of random forests (the paper is now accepted at the Journal of Machine Learning Research, after much debate with referees). Great! Now what do we do with it?

Well, one thing is to simply use them as a means of assessing specific predictions. As a potential application, Cornell’s Laboratory of Ornithology has a wonderful citizen-science program called eBird, which collects data from amateur birdwatchers all over the world. They use a random-forest-like procedure to build models predicting the presence of birds throughout the US, and one use of these is to advise the Nature Conservancy about particular land areas to target for lease or purchase. Clearly, they would like to obtain high-quality habitat, in this case for migratory birds, and can do so off the models that the Lab of O produces. These currently do not provide statements of uncertainty, but we might reasonably pose the question “Would a plot with a 90% +/- 10% chance of bird use really be better than one with 87% +/- 3%?” We know the second spot is pretty good; the first might be very good, but we’re much less sure of that.

More importantly, we can start trying to use the values of the predictions to learn about the structure of the random forest. A first approach to this is simply to ask

Do forests that are allowed to use covariate x1 give (statistically) different predictions to those that are not?

This is expressed simply as a hypothesis that if we divide the covariates into x1 and x2, say, we can ask whether the relationship

f(x1,x2) = g(x2)

holds everywhere. Formally, we state this in statistics as a null hypothesis and ask if there is evidence that it isn’t true.

We can assess this by simply looking at the differences between predictions at a representative set of covariate points. We of course need to assess the variability of this difference, taking into account the covariance between predictions at nearby points. It actually turns out that the theoretical machinery we developed for a single prediction from a random forest is fairly easily extensible to the difference between random forests and to looking at a vector of predictions. To formally produce a hypothesis test, we have to look at a Mahalanobis distance between predictions, but with this we can conduct tests pretty reasonably.

In fact, when we did this, we found that our tests gave odd results: covariates that we knew were not important (because we had scrambled their values) were appearing as significant. This was odd, but a possible explanation is that this works just like random forests themselves: a little extra randomness helps. More comforting was the fact that if we compared predictions from two random forests with two different scramblings of a covariate, the predictions were not statistically different. This led us to suggest that a test of whether x1 makes a difference be conducted by comparing a forest given the original values of x1 with one given scrambled values, and that seemed to work fairly well.
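A toy version of the scrambling comparison (synthetic data, assuming scikit-learn; this is a sketch of the idea, not our actual test statistic, which also needs the covariance machinery described above): compare a forest given the real x1 to one given a scrambled x1, and calibrate against two independently scrambled forests.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
n = 2000
X = rng.uniform(-1, 1, size=(n, 3))
y = 2 * X[:, 0] + rng.normal(0, 0.5, size=n)   # only column 0 matters

def forest_preds(X_train, grid):
    """Fit a forest to (X_train, y) and predict at the grid points."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    return rf.fit(X_train, y).predict(grid)

grid = rng.uniform(-1, 1, size=(50, 3))        # representative covariate points

# One scrambling of column 0 ...
X_s1 = X.copy()
X_s1[:, 0] = rng.permutation(X_s1[:, 0])
diff_signal = np.abs(forest_preds(X, grid) - forest_preds(X_s1, grid))

# ... and a second, independent scrambling as the null comparison.
X_s2 = X.copy()
X_s2[:, 0] = rng.permutation(X_s2[:, 0])
diff_null = np.abs(forest_preds(X_s1, grid) - forest_preds(X_s2, grid))

print(diff_signal.mean(), diff_null.mean())
```

The original-vs-scrambled differences dwarf the scrambled-vs-scrambled ones, which is the signal the formal test quantifies.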

Tests between different forests conducted in this way are problematic, however, in two ways. First, the need to scramble a covariate is rather unsatisfying, but it also limits the sort of questions we can ask. We cannot, for example, propose a test of additivity between groups of variables:

f(x1,x2) = g(x1) + h(x2)                 (*)

or, in a more complicated form:

f(x1,x2,x3) = g(x1,x3) + h(x2,x3)

Here x1, x2, and x3 are intended to be non-overlapping groups of columns in the data.

What we decided to do in this case goes back to old ideas of mine, which I think I mentioned in an earlier post. That is, we can assess these quantities if we have a grid of values at which we make the predictions. That is, if we have a collection of x1’s and of x2’s and we look at every combination of them, we can ask how far f(x1,x2) is from (*). I’ve illustrated this in the figures below (rather than writing out a mathematical calculation here), but it comes out to just be a standard ANOVA on the grid.
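On a grid, the calculation is just the classical two-way ANOVA decomposition of the prediction surface: remove the grand mean and both main effects, and what is left over is the departure from additivity. A minimal sketch on synthetic data (assuming scikit-learn’s random forest as the fitted f; grid size and signal are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 3000
X = rng.uniform(-1, 1, size=(n, 2))
# A genuinely non-additive signal: main effects plus an x1*x2 interaction.
y = X[:, 0] + X[:, 1] + 1.5 * X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Predict on the full grid of (x1, x2) combinations.
g = np.linspace(-0.9, 0.9, 10)
x1g, x2g = np.meshgrid(g, g)
F = rf.predict(np.column_stack([x1g.ravel(), x2g.ravel()])).reshape(10, 10)

# Two-way ANOVA on the grid: grand mean, row and column main effects,
# and the interaction residual.
grand = F.mean()
row = F.mean(axis=1, keepdims=True) - grand
col = F.mean(axis=0, keepdims=True) - grand
interaction = F - grand - row - col

print("interaction RMS:", interaction.std())
```

For an additive f the interaction residual would be near zero; here it is clearly not, and the central limit theorem is what turns its size into a formal test.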


Now in order to test this statistically, we again need to know the covariation between different predictions, but our central limit theorem does this for us. ***  These tests turn out to be conservative, but they still have a fair amount of power.

The point here is a shift in viewpoint for statistical interpretation: from the internal structure of a functional relationship to being derived from the predictions that are made. By deriving our notions of structure and inference with respect to predictions, we can be flexible about the models we fit, we can allow high-dimensional data to enter as nuisance parameters (without a lot of model checking), and we build a bridge to machine learning methods. Leo Breiman wrote a paper in 2001 in which he outlined the distinction between statistics and what he called “algorithmic data analysis” +++. I like to think we’ve at least started a bridge.

Finally, a point on the hypothesis-testing paradigm that I have pursued here. Hypothesis tests have been rather unpopular recently, and not without reason: over-reliance on low p-values at the expense of meaningful differences is a real scourge of much of the literature. What they do do here, however, is ask “Is there evidence to support the added complexity in the model?” Even better for me: I noted in a previous post that which prediction points you look at makes a big difference to your variable importance scores. On the other hand, if one of the hypotheses above is true, you can look at any prediction points you like, and all you affect is how easy it would be to detect that it isn’t true. That said, the biases in ML methods, particularly when making predictions where you do not have much data, mean that it’s still best to focus on making predictions where you do have a fair amount of data.

Next time: parametric models and the statistician as a source of randomness.

*** As a further note, these grids can get pretty big and estimating a large covariance matrix in our CLT is pretty unstable. In this case we turned to the technology of random projections that helped improve things a lot.
+++ And largely (and correctly) lamented statistician’s unwillingness to consider the latter

Statistical Inference for Ensemble Methods

The last post allowed me to get a pitch in for some of my recent work, and I liked that so much I’ve decided to go all-out in this post: further detail about our results, how we got them, and (just to spice things up) a race for priority.

The work that Lucas did was on methods for bagged trees and random forests. I talked a bit about this in an earlier post, but to avoid sending you there, here’s the recap.

1. Bagged trees methods are based (unsurprisingly) on building trees. We do this by dividing the data in two based on one of the covariates. We choose the division by working out how to make the responses in each part as similar as possible. We then do the same thing to each part and keep going until we run out of data.

2. Bagging just means taking a bootstrap sample of the data and building a tree on it. Then taking another, then another until you have a collection of several hundred, or several thousand, trees. To use these, you make a prediction with each tree and then average the results.

3. Random forests are exactly the same, but they use a slightly different tree building mechanism in which you only consider a random subset of covariates each time you split the data. This is supposed to make the trees less aggressive in fitting the data.
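The recap above can be sketched in a few lines (synthetic data; assuming scikit-learn’s `DecisionTreeRegressor` as the base tree learner — the tree-building details are handled for us):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
n = 500
x = rng.uniform(0, 1, size=(n, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.3 * rng.normal(size=n)

# Bagging by hand: bootstrap resample, fit a deep tree, repeat, average.
B = 200
trees = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)        # bootstrap sample, with replacement
    trees.append(DecisionTreeRegressor().fit(x[idx], y[idx]))

# Prediction = average over all the trees' predictions.
x_test = np.array([[0.25]])
preds = np.array([t.predict(x_test)[0] for t in trees])
bagged = preds.mean()
print("bagged prediction at x=0.25:", round(bagged, 2))  # truth is sin(pi/2) = 1
```

A random forest would differ only inside the tree-building step (restricting each split to a random subset of covariates); the subsampled variant used for the U-statistic theory replaces the bootstrap resample with a smaller without-replacement subsample.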

The basic thought that we had with this is that “There’s a bootstrap structure going on. This is often used for inference, why don’t we do so here?”

Fair enough, but that leaves two problems: inference about what? And the small matter that the bootstrap doesn’t work in this case. For the first problem, we return to diagnostics for black-box functions, where we saw that if there is no parametric structure in the model, the relevant things to look at are the model’s predictions. Just as we could use these predictions to assess variable importance etc., we can now ask whether those measures are statistically interesting, e.g., do they differ from zero?

Now to the technicalities. It would be nice to use, say, the variance between the predictions of different trees, as a means of obtaining the variance of their average. Unfortunately when we use trees built with bootstrap samples, their predictions are highly correlated, so the usual central limit theorem doesn’t apply.

This had us stumped for a while until we hit on using U-statistics. These are a fairly classical tool that doesn’t get a lot of attention these days, but basically they’re a particular form of estimator. Suppose you have a function (which we will, with admirable ingenuity, call a “kernel”) that takes any two values from your data set; the U-statistic corresponding to this kernel is the average of the function evaluated at every pair of values in the data. You can do the same with kernels that take three or more values.
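A minimal worked example of a U-statistic, using the classic order-2 kernel h(x, y) = (x − y)²/2; averaging this kernel over all pairs recovers exactly the unbiased sample variance:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
data = rng.normal(loc=0, scale=2, size=30)

# Order-2 kernel: h(x, y) = (x - y)^2 / 2.
def h(x, y):
    return (x - y) ** 2 / 2

# The U-statistic: average the kernel over every unordered pair.
u_stat = np.mean([h(a, b) for a, b in combinations(data, 2)])

print(u_stat, data.var(ddof=1))   # the two agree exactly
```

The forest case simply swaps this algebraic kernel for “take these data points, build a tree, and predict at a fixed covariate value”: there is no closed form, but the U-statistic machinery doesn’t care.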

For us, the kernel is “take these data, build a tree and make a prediction at these covariates”. It doesn’t have a nice algebraic representation, but it can be used nonetheless. For this, rather than taking bootstrap samples, we merely choose a subsample of the data at random, use it to build a tree and make a prediction. This is a fairly minor modification and in fact reduces the computational cost of building the tree.

The nice thing about U-statistics is that they have a central limit theorem. The variance is somewhat different to “take the variance and divide by n” in the classical CLT, but nonetheless it is something we can use. I’ll talk about that further down.

Of course, if we had a data set of size 1,000 and took all subsamples of size 100, we would run out of computing power very quickly. So we only take subsamples at random (an “incomplete U-statistic”). To do the asymptotics, we also ought to let the number of points in each subsample grow as the overall size of the data grows (“infinite-order U-statistics”), and random forests also use some randomization in building the trees (we had to invent a term for this; we call them “random kernel U-statistics”). There wasn’t a result for incomplete infinite-order U-statistics with random kernels, so we provided one. As noted, the variance is different from what you would normally expect, but by being a bit careful about how we choose subsamples we have a nice way of calculating that as well.

This represents the very first formal distributional result for the predictions of random forests, although there were a few heuristic precursors.

But as cool as that is, it’s made better by a nail-biting race. The context is that Lucas and I started working on this in 2011/2012. I was at the Joint Statistical Meetings in 2012 and ran into Trevor Hastie and quite excitedly told him about the results we were developing. It was a bit of a shock, then, to be told that he had a student working on the same problem. As it happened, Stefan Wager came out with a paper producing confidence intervals for predictions of random forests a couple of months later, although without any distributional theory. It’s actually a rather nice application of the infinitesimal jackknife, a somewhat different means of estimating the variance of a random forest than the one we were using.

Lucas and I didn’t manage to put our paper on arXiv until April, followed a week later by another by Stefan Wager which developed a different central limit theorem. Arguably we won the theory race, just; but as is often the case in statistics, priority really isn’t clear-cut. Stefan’s result contains some tighter bounds than ours, but we cover a somewhat more general class of estimators. In any case, Stefan and Lucas, along with Gerard Biau (who has been studying the consistency of random forests for many years) and Shane Jensen (who has studied a Bayesian alternative), will all present at a session at ENAR this spring. If you happen to be in Miami in March, I think it will be really exciting.

On second thoughts, I’ll leave uses of these results for statistical inference for the next post.

Inference for this Model or for this Algorithm?

In this post I want to discuss an important but subtle distinction in statistical inference about predictability that is too often glossed over. It is this: is the relevant inferential question about the particular prediction function that you have derived, or about the process of deriving it? This question has been an issue both in Statistics and in Machine Learning. It also allows me both to relate an anecdote **** and to skite about some of my own (and, more importantly, my student’s) work.

I’ll lead into this with the anecdote. Several years ago John Langford visited Cornell (at the same time as he prompted the anecdote in Post 3) and I described a particular problem to him:

I want to be able to tell whether x affects an outcome y, controlling for other covariates z without making parametric assumptions. To do this, I want to build a bagged decision tree and develop theory to test if it uses x more than we would expect by chance.

Langford’s response to this was to say,

Why not build any sort of model using x and z and one using only z and then use PAC bounds on a test set to see if the errors are different?

The reader might, at this point, need some background. PAC (probably approximately correct) bounds are mathematical bounds on how well we can estimate a quantity. They’re of the form “the probability that we make an error of more than ε is less than δ”. There is a complex mathematical theory behind these that I won’t go into, but the relevant point here is that the bounds are usually provided for our ability to estimate the error rate of an ML algorithm, usually from the training data. So comparing error rates might be useful.
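To make this concrete with the simplest member of the family (Hoeffding’s inequality for a bounded 0/1 loss; the function names and numbers below are mine, chosen for illustration): with probability at least 1 − δ, the empirical error rate on n held-out examples is within ε = √(ln(2/δ)/(2n)) of the true error.

```python
import math

def hoeffding_epsilon(n, delta):
    """Half-width of the two-sided Hoeffding bound for an empirical
    error rate computed from n i.i.d. 0/1 losses: with probability
    >= 1 - delta, the true error is within epsilon of the observed one."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

def samples_needed(epsilon, delta):
    """Smallest n for which the Hoeffding half-width is <= epsilon."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

eps = hoeffding_epsilon(n=1000, delta=0.05)   # about 0.043
n = samples_needed(epsilon=0.01, delta=0.05)  # about 18,445
```

Note how fast this grows: pinning the error down to within one percentage point at 95% confidence already demands over eighteen thousand test points, which foreshadows the low-power complaint below.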

Although this is a natural suggestion for someone from a PAC background, I was not particularly convinced that this was a useful direction to go in, and I remain that way for two reasons:

  1. PAC bounds are very conservative. They hold for our estimate of the error rate of *every* function in a very large class, not just the individual function you are currently looking at. This uniformity can be very powerful, but it tends to produce deltas or epsilons much larger than you actually see in practice. In practical terms for statistical inference, it means that such a method would have low power; i.e., we’d need an awful lot of data to detect a difference.
  2. This point is more fundamental and is the impetus for this blog: the proposal misses the scientific question of interest. The proposed method would have told me that one particular prediction function had a lower error rate than another particular prediction function. That is, if we fixed the two models and decided that these exact models were what we would employ, we could examine whether one is likely to perform better than the other. But that isn’t really what I wanted. I wanted to understand whether the difference between the two models would be reliable if we generated new data and repeated the exercise. +++

This second distinction is what I really want to get at. Are we publishing a particular prediction function (all parameters etc. fixed from here on in) or are we examining a method of obtaining one? If the former, we need only examine empirical error rates for that function. If the latter, we need to understand the variability of that function under repeated sampling.

In fact, this distinction is germane to the use of DeLong’s test. This is a method designed to test exactly a parametric version of this hypothesis through the AUC statistic. However, as Demler et al. point out, it does so assuming that the particular prediction functions you use are fixed — that you will use these particular functions forever more. If you also want to account for the fact that you have estimated your functions, the test is no longer valid.

These, of course, are examples of the traditional statistical concerns: we are asking “if you re-collected your data and re-fit the model, would you get the same answers?” But just as computer scientists are too ready to fix their model, statisticians are too ready to re-estimate it. If we are publishing a psychological test, for example, the specific values of the parameters in that test will remain fixed — there is no re-collection and re-fitting — we just want to know how well it will do in the future (or that it will do better than some other, fixed, procedure). In this case, DeLong’s test is valid (as is using PAC bounds in the way that Langford suggests) and the more traditional statistical concerns are overly conservative (because they try to account for more variation than necessary).
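In the fixed-function regime, even a very simple procedure suffices (this is my own illustrative sketch, not DeLong’s test, and the per-example losses are fabricated): once both models are frozen, the paired differences of their 0/1 losses on a fresh test set are i.i.d., so an ordinary paired z-test on the mean difference answers “which of these two exact functions has the lower error?”

```python
import math
import statistics

def paired_error_z(losses_a, losses_b):
    """Paired z-test on per-example 0/1 losses of two *fixed* prediction
    functions evaluated on the same held-out test set. Large |z| means
    the two functions' error rates differ on future data."""
    diffs = [a - b for a, b in zip(losses_a, losses_b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Hypothetical per-example losses (1 = error) for two frozen models
# on a test set of 200 points.
losses_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 20
losses_b = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0] * 20
z = paired_error_z(losses_a, losses_b)
```

Nothing here accounts for the variability of the fitting procedure itself; that is exactly what makes it valid only in the “parameters fixed forever more” setting described above.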

However, more often than not, the statistician’s question is the relevant one. In my situation, “Does lead pollution affect tree swallow numbers?” really implies “You saw a relationship in this data; could you have gotten it just by chance?” — exactly the sort of question I want to ask. This hasn’t been possible in Machine Learning until the last year, and I’m really pleased that my student, Lucas Mentch, has produced the first ever central limit theorem for ensembles of decision trees. ### This really does give us a handle on “if we re-collected the data, how different might our predictions be?”, and we can recover my questions above from it. All this is early days, though, and I’ll post a little more about how this works — and a friendly race for priority — in the next post.

A final note on all of this. I’ve largely stayed out of the Bayesian/frequentist debate, but will here lay my cards down as a frequentist. To me, statistics is really about assessing “Is this reproducible?” Isn’t that the standard for scientific evidence, anyway? More realistically: “To what extent is this reproducible?” %%% There is a lot of analysis showing that Bayesian methods often approximately provide an answer to this — and I will more than happily make use of them under that framework — but they don’t automatically come with such guarantees. I don’t particularly like the idea of subjectivist probability, but rather than arguing against everyone having their own science, I simply think that Bayes methods answer the wrong questions, in much the same way that PAC bounds do — they’re both conditioned on the current data.

Interestingly, to some extent this view would appear to support an ML viewpoint — that I only care about quantifiable predictive power (in this case, of inferential statements over imaginary replications). To some extent that’s right, but I think interpretation (and statistical inference) is hugely important for humans building models; I just don’t think it should be the means by which we judge a model’s correctness.


****  If it is not already obvious, I enjoy these.

+++  To some extent the uniformity of PAC bounds would allow us to examine that, if we looked at estimators that explicitly (and exactly) minimized a loss. That is, if my function using just z minimized the error rate, and I could show that the error for the function using both x and z was smaller than the error using z alone by more than the bound, I could say that this has to be true of ANY function that only uses z. Unfortunately, the error rate we have bounds for is often not the error rate that we minimize, and for things like decision trees these bounds really aren’t available.

### I’m nearly as tickled to have a paper coming out that can be cited as Mentch and Hooker.

%%% This gives us scope to play with changing the data generating model, the sorts of assumptions we make, etc.