Uncertain Explanations

A much-delayed addendum, written while I’ve been attempting to get the following paper out:

https://arxiv.org/abs/1808.07573

which deals with ensuring the stability of model distillation using trees.

This actually touches on both the global approach to understanding machine learning that I have mostly explored, and the local approach to explaining a particular prediction (see http://blogs.cornell.edu/modelmeanings/2014/10/17/local-interpretation/).
I’ve previously discussed the issue of uncertainty in machine learning in terms of global interpretation — are you trying to understand how a particular function arrives at a prediction, or are you trying to say something about the underlying causes of that prediction?

Here I’ll ask the same question about local explanations. This is particularly relevant in the light of the European General Data Protection Regulation (GDPR) which can be read to impose a “right to an explanation” for decisions that are made about individuals. Exactly what such a right entails (and whether it really is implied in the GDPR) is not currently well defined; interpretations vary from a requirement to provide a formula, to identifying actions that could be taken to change a decision, to requiring something closer to causal reasoning identifying why that decision was reasonable.

To be concrete, let’s take a bank that develops a tool to automatically decide whether or not to give a mortgage to an applicant. If they don’t approve the loan, they can be asked “Why not?” with the implication that “because my neural network said so” would not really cut it.

In fact, this is not really so new an idea: I recall hearing someone from Fair Isaac discussing just this problem in the early 2000s, and their solution was pretty natural: use a decision tree. I discussed the basic ideas of decision trees earlier in this blog, where we saw that a tree has a pretty easy-to-follow glyph. There is also a natural explanation for why you didn’t get your loan — the decision at the last node of the tree. ****

I also noted that trees don’t perform very well. Fair Isaac’s solution was to first train a neural network (if I remember correctly) and then use this to generate a huge amount of data to create a tree that mimics the neural network. This idea has since been given the term “model distillation” — first generate a complex model (a teacher), then produce an interpretable model (student) that mimics the predictions of the original model. I repeated Fair Isaac’s ideas a few years ago for an application that involved shortening psychiatric questionnaires (you can traverse the tree, asking questions only as you need them). I couldn’t find a citation to the Fair Isaac work, but it does seem that the idea goes back to at least 1995.

The paper I shamelessly plugged above asks “How much data do we need to stabilize the tree structure?” If we characterize our procedure as

1. Train some ML model F(X) to predict Y from X.

2. Generate a huge set of new example X’s and give each example the response F(X)

3. Use this large new data set to train a tree that mimics F.

we (well, mostly my students Yichen and Zhengze Zhou) asked “How many new X’s do you need so that repeating Steps 2 and 3 gets you the same tree?” The answer turns out to be a couple of orders of magnitude more than seems to be regularly used. And we developed some tools to assess the stability of this type of approximation tree and to look at how deep that tree should be (more on that later).
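To make Steps 1-3 concrete, here is a minimal sketch in Python. The teacher model, the Gaussian pseudo-data generator and the “does the root split change?” check are placeholder choices of mine, not the setup or the stability criteria used in the paper.

```python
# A minimal sketch of Steps 1-3 and of the stability question; the teacher,
# the pseudo-data generator and the "same root split" criterion are all
# placeholder choices, not those from the paper.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=2000)

# Step 1: train some ML model F(X) to predict y from X.
teacher = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000).fit(X, y)

def distill(n_new, depth=3, seed=None):
    """Steps 2-3: generate pseudo-data, label it with F, fit a mimicking tree."""
    r = np.random.default_rng(seed)
    X_new = r.normal(size=(n_new, 5))   # generate a set of new example X's
    y_new = teacher.predict(X_new)      # give each example the response F(X)
    return DecisionTreeRegressor(max_depth=depth).fit(X_new, y_new)

# How many new X's before repeating Steps 2 and 3 gives the same tree structure?
for n_new in [1_000, 10_000, 100_000]:
    roots = [int(distill(n_new, seed=s).tree_.feature[0]) for s in range(10)]
    print(n_new, "root split variable over 10 repeats:", roots)
```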

Now a reasonable question might be “why should I care?” It actually turns out that the huge amounts of data we ended up using weren’t necessary to produce a tree that gave accurate predictions — a tree could split on X1 first and then X2, or vice versa and end up with fairly similar predictions. We needed this much data solely to ensure that the structure of the tree remained stable.

So again, why should I care? This is an iteration on uncertainty in ML. If I am just seeking to explain how a prediction was arrived at, I could obtain the tree and actually use it to make predictions (as opposed to explaining predictions). I can then readily claim that I am providing a description of how the tree I happen to use arrives at its predictions. This is ok, even if the structure of my tree is chosen by random chance.

However, if I don’t replace my neural network or random forest with the tree, and just use it to explain a different model, we might start getting dubious. If you are told you didn’t receive a mortgage because you are over the age of 37, but there is a parallel universe in which exactly the same decision is explained by your income being below $50,000, you might start questioning the worth of that explanation.

Trees, of course, are only one example of this. LIME, currently the most popular source of local explanations (based on fitting a local, roughly gradient-like, approximation to F(X)), also uses a random sample whose stability we might query. I have to admit to not yet having gotten around to looking at that.

The point of this is that you can have gotten your original F(X) any way you like; we are only looking at Monte Carlo error — the variability due to the random samples you generate in order to arrive at your explanation. There are already plenty of debates about what a “right to an explanation” entails, but I’ve yet to hear of “statistical stability” being added as a criterion. Perhaps it should be — or at least be made explicit that various others (such as some form of causal explanation) assume such stability.

Of course, we can ask about stability more generally — does our explanation have to be non-random, or is it ok to pick up on random features of the data? Ie, does the fact that people over 180cm tall who were born on October 19 happen to have an unusually high default rate in our data set allow us to use that as an explanation, even if another data set would be very unlikely to replicate it? Alternatively, what standard of scientific replicability of explanations is required in the GDPR?

We couldn’t do much in our paper on that. We can assess whether the tree is capturing signal that is distinguishable from noise when it makes a split. We could potentially ask whether the same split would be chosen, not just with another randomly generated sample from F(X), but if we re-learned F(X) using a new data set. But changes in one split cascade down the rest of the tree and it’s then very hard to represent variability between different trees, or to know what to do with it, even if you could.

So this is imperfect: there is a form of explanation where “this is what we do, you don’t get to query why” means you don’t have to worry about uncertainty. But I think that disappears as soon as you use one model to predict and a second as an explanation. It seems that full-on scientific replicability might be too strong a requirement (that would probably render all of ML useless), but what should the standard be?

There’s not much work on assessing such stability. I don’t know how stable LIME is; and we were only able to assess whether our trees were simply modeling random variation when we used random forests to produce F(X), because we know something about their stability. But the question is barely being asked, and I think it will become practically relevant fairly quickly.

 

**** Of course, changing other covariates could affect a decision higher up in the tree, which also changes the decision about your mortgage. So this is fairly imperfect in itself.

 

Interventions, Interpretation and Ethics

Herein are a set of musings prompted by Rich Caruana’s wonderful talk at the workshop I boasted about in my last post. I don’t think that this is a knock-out argument for ML interpretation (as I’ll say at the end), but the ethics of ML derived from observational data do start to make a real case for interpretation. That’s not without complications or difficulties, something I want to work through here.

The essential issue here is that the principal use of ML is for interventions: what search results will we show you? Will you repay a loan? Will a prisoner re-offend? Will a job candidate perform well? Is there a high probability of crime on this corner? Is that a truck on the road ahead, or a street sign? What direction is this bicycle moving? These predictive questions are all relevant because they imply actions that change the outcomes of events.

What this means is that, to the extent that those actions are directed at changing outcomes, any relationship between covariates and outcomes has to be thought of as causal. And in particular, when we employ ML trained on observational data, the potential for causal confounders must always be considered.

Rich’s talk highlighted this beautifully. I encourage you all to watch it, but for those without an hour’s spare time, he presented a wonderful study in all of the confounding effects you might find if you’re not careful. The highlight, as I discussed last time, was the finding that asthma appears to substantially reduce your mortality risk from pneumonia. Rich’s explanation was that the effect was confounded with either time of onset (asthmatics notice symptoms earlier) or the aggressiveness of care. The key issue is that if an ML tool were naively employed to triage patients, it would send all the asthmatics home — the opposite of good medical practice.

The point here is that if you simply threw a deep network (or a random forest) at the data and didn’t look at how it was using the data, you’d miss the effect. Of course, you’d put the tool through some trials first and hopefully notice that your improvement in outcomes is not what you’d estimate from the data (or, more realistically, that doctors would start saying “this isn’t what I would have done” first). At that point, however, one does have the question of what to do: you need to do some interpretation to work out that the asthmatics are all being sent home, so a die-hard non-interpretation advocate can’t do much but try training more and then give the whole project up when it doesn’t work out.

Of course, what should be done to fix up a tool once you’ve discovered this effect is an issue, but that’s for another time.

The real issue is that Rich found more and more of these effects, which often got increasingly subtle. There are round-number effects of age that might affect treatment: over 85 and you’ve had a good life anyway — your risk jumps; over 100 and you’ve become special and your risk goes down again. Plus a whole bunch of other things. And here’s the ethical problem: how hard do we have to explore for these effects before giving the tool a try?

Let’s get back to a clinical trials analogy. If you have a new drug, you run trials to check both for safety (side effects) and for efficacy. You also control for things you think might be a risk: you don’t give a new Sulfa drug to someone who is allergic, in the same way that we might exclude asthmatics from our automatic triage tool. Once you’ve excluded people on these and similar grounds, you test to see if there’s anything else that you (and reviewers of your trial) haven’t thought of. There is no way to comprehensively enumerate all the things that might go wrong.

So far, so much analogy, but I think there are a couple of important distinctions here. First, the concept of a safety study doesn’t really apply — side effects are the result of actions recommended by our auto-triage, not the use of the triage itself. You can’t apply our auto-triage to healthy patients and see if they have an adverse outcome just because of it. Second, the involvement of social factors creates a much more complex set of problems. If I look at a molecular compound that I want to use to treat some disease, we do have some reasonable chemical knowledge that would indicate a relatively small set of things to worry about. When we consider all the things that could be confounding a data set like Rich’s, there is a much larger body of possible negative effects, some of which we might find hints of in data, some of which we might not. Given time, we might start cataloguing common confounders in a particular setting (so long as we look for them), but given that we know these effects are likely, but don’t know what they might be, how hard should we search for them?

I don’t have a great standard for this, and I think we also have to ask “What can we do to fix our auto-triager, now that we know it’s biased against asthmatics?” before we can really develop one. There are some forms of safety trial that one can do: have a doctor also triage cases and flag where she differs from our model (we still have to work out why), but how many are enough? More automatically, we can probably make guarantees that suspicious effects that _might_ occur in the data are smaller than some level (but a cut-off has to be agreed), but these are tools that need to be developed. We also need to be confident that our model exploration is catching suspicious effects (more on that later).

Outside of clinical settings, though, the same questions arise. A large part of the fairness literature in ML is about having the wrong data for exactly these reasons, and in contexts that may be harder to understand. How do we establish that facial recognition software performs worse on African Americans because its training corpus is white? What about sexist automatic translators?
These are all interventions where confounding relationships have made an algorithm behave poorly. In both cases, they aren’t problems you can pick up pre-deployment because even your test data has the same confounding effect.

Now of course, it may become more obvious in the real world when, for example, black women have a harder time getting their iPhone 10’s to unlock. But even here it’s fairly reasonable to assume there will be some confounding: how, and how much, should we check before deployment? What about monitoring afterwards?

So does this mean that ML is never exempt from interpretation? Not really: all these examples involve an intervention that changes the observed outcome. There are lots of prediction problems that don’t: astronomers want to classify galaxies; we might want to predict sales volume for the sake of inventory management (this won’t affect demand particularly). As Rich notes, an insurance company might want to forecast the outcome of a pneumonia patient, but not actually change the treatment.

Even when we are going to intervene, I might know that there is a random process that breaks a confounder. I might take a clinical trial, for example, and then try and predict a treatment effect in finer detail: some particular set of patients respond well to this particular drug. Since the drug assignment is randomized, I’m confident that it’s not associated with an effect.

So, bad data is a reason to want to interpret ML, but that doesn’t mean you always need to. We do still need to work out how hard you should work at that interpretation, and what you should do if you think you’ve found a problem.

A Brief and Undisciplined Rant Against Refereed Conferences

This post is off-topic, but I’m striking while the iron is hot, so it is ill-considered.

I get ArXiv.org to send me a digest of the previous day’s submissions in Statistics and in Machine Learning. It sends out the titles and abstracts of submissions (from two days previously) and the titles of anything that’s been updated. It’s usually a good way of keeping track of developments, and I’ve even started trying to get through the five-year backlog of papers I intended to read.

This morning, however, the e-mail was novelette-length, and I was getting somewhat exasperated by all the people suddenly doing random forests until I realised:

Of course! The KDD deadline was this weekend!

I should have known about that since I was kept online by one of my students trying to make the deadline.

So I’m going to be hypocritical — given that I’ve got a paper going into KDD — but declare that I think that Computer Science-style conference publications are a disservice to research.  This was made evident both by the experience of putting the paper together and this morning’s digest.

I’ve sometimes heard the opinion that conference publications let you get work out faster and make the field move quicker.

There’s some truth in that, but frankly it also means that you don’t think as carefully about what you write, and I’m sure that referees don’t think as carefully about what they read. This means that there’s more junk and a worse filter. It’s certainly true that the paper I was involved in ended up glossing over a bunch of unresolved issues, and while I think it has good content, it really could have been thought through more carefully, its analysis better targeted and its experiments more fully explored. I suspect that this is also true of the small mountain of papers I’ve felt that I should take a look at this morning.

That is, there is a point to slow scholarship; to papers that are published because they say something well thought out and have a good reason for saying it. There are benefits to a referee process that improves the paper and takes the time to do so.

Given the current hype around CS/ML and the explosion of interest in these conferences, I expect that the process of selecting papers will simply get noisier.

But there are CS journals! And I’m prepared to acknowledge that. What bugs me here is the double counting. I hadn’t realised until recently, but apparently it’s de rigueur to publish in KDD or ICML or NIPS and then turn around and put the same material — somewhat expanded — into a journal paper. I’m not completely aghast (I wouldn’t call it intellectual dishonesty), but I do feel that CS conference publications ought to be viewed as different from, and lesser than, journal papers. I’ve previously been sold on a 1:1 conversion between a NIPS paper and a stats journal paper, and I’ve tended to see them discussed in these terms. I think that’s worrying.

Now anyone who has been to a Statistics conference knows the nightmare of 60 parallel sessions with everyone and their pet tapir giving 12-minute talks in rooms containing only the other presenters. I could readily see an advantage in having a randomized selection process that hopefully improves the average quality of the presented papers and cuts down on the size of the program book; but that’s about all a conference publication should be valued for, and can someone tell me why I should take the time to read one?

 

The Interpretation Debate

It’s been some time since I’ve written anything here, although in that time the intelligibility debate within the machine learning community has grown substantially.*** This has included my own contributions (with thanks to Dad), but developments like the European General Data Protection Regulation, with its “right to an explanation” and the question of exactly what that means, have helped. In the last two years, NIPS has featured large events on the topic (2016, 2017: what a great URL!), co-organized by my fantastic Cornell colleague Andrew Wilson and my fantastic collaborator Rich Caruana.

Alongside these, the Fairness, Accountability and Transparency (FATML) movement in Machine Learning has started raising concerns about the use of ML and the ways it can perpetuate or exacerbate social biases. Some of this was catalyzed by a book by Cathy O’Neil and by ProPublica’s investigations of the use of risk scores in parole hearings.

I’ve also just returned to teaching after running a workshop on ML and Inference. I wouldn’t normally be quite so self-promoting, but really all I did was get really interesting people together in one place; the talks are all online, some more technical than others, but they were all interesting, innovative and well worth watching.

So I’m inspired to write again; in particular, I think there are topics in explanations and uncertainty quantification that are unexplored, and there are important relationships between causal inference and fairness that I don’t yet understand and want to chew over here.

But first: a debate! The most recent Workshop on Interpretable Machine Learning featured such a debate, with the proposition that “Interpretation is Necessary in Machine Learning”, argued by some of the best in the field. Having gotten around to watching it, I can’t help but pass commentary.

The first thing to say is that I think the debate ended up being pretty conciliatory. One of the standard debating tactics is to make the debate a disagreement about definitions and to some extent this debate could be summarized as:

Affirmative: There are some occasions when you need to understand what your ML model has learned.

Negative: There are some occasions where you don’t.

Not that this isn’t, in itself, illuminating. See Rich’s talk at my workshop for why one might want to understand what you’re doing. I want to expand on this in a later post, but the essence is that when using ML for an intervention, you need to view your covariates causally.

In particular, Rich had a data set that he wanted to use to predict mortality risk from pneumonia; this could then be used to triage patients between a high risk group to hospitalize and a low-risk group to send home. The thing was that he found that having asthma meant you were predicted to be at lower risk. This is almost certainly not because asthma protects you, but because asthmatics get more aggressive treatment. Thus, employing this tool would cause a spike in asthmatic mortality. Crucially, since the test set will have the same properties, out of sample performance won’t help uncover this before it is deployed.

The counter to this is the statement “Well, you would always run a randomized trial to test this sort of intervention,” and it’s fair to say that you don’t necessarily know that some new drug, say, won’t have counterproductive effects in some subset of the population. (This was effectively Yann LeCun’s response in the debate.) Ethically, of course, we couldn’t do that in this case, since we have reason to believe it could be harmful, but that potentially applies to any intervention based on an ML tool trained on observational data. That is, since we know that observational data is prone to confounding effects, can we ever employ such a tool without seeking to check whether there might be some reason to believe its recommendations might be deleterious?

I think the answer to that is “It depends” and will come back to this, as I will to the notion of performance evaluation. At one point during the debate, Kilian Weinberger asked whether one would trust a doctor or a robot to perform surgery where the latter had much better outcomes than the former. Everyone said yes; but Rich’s point would be “Not if the robot got only the easy cases”.

This will all take me too long for one post.

Some other thoughts from the debate:

1) There seemed to be an assumption that performance is incompatible with intelligibility. I’m not sure that that is really the case. I’d agree that I’m less sure about what it means to be intelligible in image processing, but when the individual covariates are interpretable, one can often use a black-box model as a guide to produce something fairly understandable that does just as well. At least within the span of additive models (something that Rich has spent a good deal of time working on), I’ve never found a model that really needed more than a three-way (non-parametric) interaction.

Along those lines, Andrew would even challenge the notion that deep learning is unintelligible. I’m less convinced about this. Yes, one could think of the bottom layer of a network as learning concepts (in image recognition this can be compelling) and then try to work out the more complex concepts in the next layer and so on; but I’ve never really seen that carried through convincingly. (Admittedly, I’ve also never really looked.)

2) Some of what I would call interpretation was dismissed as not being interpretation. In particular, local interpretation (how does changing one covariate affect the prediction?), referred to as “sensitivity analysis” in the debate, was characterized as something other than interpretation; presumably this would apply to the “right to an explanation” too. I think you’re learning something human-understandable from that, and as such it counts as interpreting.

Similarly, model distillation (ie, building a simpler model to mimic a black box: a term that wasn’t around when I wrote about it in the mid-noughties ***) wasn’t regarded as interpretation. Here you do need to worry about accurately capturing the model you’re mimicking, but so long as you can do that to within acceptable bounds, I’d call it interpretation.

Similar remarks can be made about Yann LeCun’s point about employing background knowledge (such as the fact that asthma is likely to make pneumonia worse). In some cases — such as structuring convolutional neural networks for image processing — you aren’t encoding a lot of explicit knowledge about the particular task, just what pictures are like and what works well for processing them. But in health care, the act of saying “Hang on, we really ought not to send all our asthmatics home with aspirin” really ought to count as interpretation.

So here I am arguing over a different definition in the debate. If you say that only algebraically simple models (say, ones from which a human could manually produce an output in under 10 minutes) are interpretable, then yes, you likely have to sacrifice accuracy. If you ask about local interpretation (sensitivity analysis), there is no trade-off to make at all. On the other hand, if you ask about models that can be interrogated to reveal global-level patterns, I think there’s a lot more flexibility than the negative team suggested.

Is this an interesting shift from the ML “human interpretation is just ego-boosting” view I looked at a few years ago? Well, a lot of this is really not about “we need algebraically simple models” but rather “we don’t have the right data”, something I’ll get into next.

 

 

*** This is clearly not due to me, or I’d be getting more citations.

*** One of those papers that isn’t getting more citations — even by my own students! 

On Interpretation In Multiple Models

No, dear reader, it has not taken me this long to come back to this blog because I am reluctant to give up talking about my own work. Simply the discipline imposed on me by official service commitments has rather sapped my discipline in other areas.

In this post, I will leave off writing about interpretability and interpretation in machine learning methods and instead focus on the details of statistical interpretation when conducting model selection. This has become particularly relevant given the recent rise of the field of “selective inference”, led by Jonathan Taylor (http://statweb.stanford.edu/~jtaylo/) and colleagues.

While the work to be discussed is conducted in the context of formal model selection (ie, done automatically, by a computer), I want to step back and look at an older problem — the statistician effect. This was brought up by a paper which conducted the following experiment: the authors collected data on red cards given to soccer (sorry Europeans, I’m unapologetically colonial in vocabulary) players, along with a measure of how “black” they were and some other covariates. They then asked 29 teams of statisticians to use these data to determine whether darker players were given more red cards. The result was that the statisticians couldn’t agree — about 60% said there was evidence to support this proposition, another 30% said there wasn’t. They used a remarkable variety of models and techniques.

I wrote a commentary on this for STATS.org. I won’t go into the details here, but the summary is that there is a random effect for statistician (ask a different statistician, you’ll get a different answer), but the problem in question exaggerated its effects by focussing on statistical significance — all the statisticians had p-values near 0.05 but fell on one side or the other. Nonetheless, it does lead one to the question of how you could quantify the “statistician variance”.

Enter model selection. Declaring automatic model selection procedures (either old-school stepwise methods or more modern LASSO-based methods) a solution to statisticians using their judgement is drawing a pretty long bow (and doesn’t account for various other modeling choices, outlier removal, etc.), but it will allow me to make a philosophical connection later. Modern data sets have induced considerable research into model selection, mostly under the guise of “What covariates do we include in the model?”

Without going into details of these techniques, they all share two problems for statistical inference:

1. Traditional statistical procedures such as hypothesis tests, p-values and confidence intervals are no longer valid. This is because the math behind these assumes you fixed your model before you saw the data. It doesn’t account for the fact that you used your data to choose only those covariates which had the strongest relationship to the response, meaning that the signals tend to appear stronger in the data than they are in reality. (For those uninitiated in statistical thinking: if we re-obtained the data 1,000 times and repeated the model-selection exercise for each data set, covariates with small effects would only be included for those data sets that over-estimate their effects. See the small simulation sketch after this list.)

2. We have no quantification of the selection probability. That is, under a hypothetical “re-collect the data 1,000 times” set-up, we don’t know how often we would select the particular model that we obtained from the real data. Worse, there are currently no good ways to estimate this probability.
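Here is the small simulation promised above; it is my own toy example rather than anything from the papers discussed. Even when no covariate is related to the response, selecting the strongest-looking covariate and then testing it naively “rejects” far more often than the nominal 5%.

```python
# Toy illustration of Problem 1: y is pure noise, yet naive inference after
# selecting the best-looking covariate rejects at roughly a 60% rate, not 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, reps = 100, 20, 2000
rejections = 0
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)                                  # no covariate matters
    cors = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)]
    j = int(np.argmax(cors))                                # "model selection"
    pval = stats.linregress(X[:, j], y).pvalue              # naive test of the winner
    rejections += (pval < 0.05)
print("naive rejection rate after selection:", rejections / reps)   # around 0.6
```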

Jonathan Taylor (in collaboration with many others) has produced a beautiful framework for solving Problem 1.*** This goes by the name of selective inference. Taylor describes this as “condition on the model you have selected”; in layman’s terms: among the 1,000 data sets that we examined above, throw away all those that don’t give us the model we’ve already selected, then use the remaining data sets to examine the statistical properties of our parameter estimates. Taylor has shown that, in doing so, he can provide the usual quantification of statistical uncertainty for the selected parameters.
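To illustrate just the conditioning idea, here is a brute-force Monte Carlo cartoon of mine; it assumes we can simulate responses under the null (standard normal noise here) and is nothing like the efficient calculations that make selective inference practical. Among simulated data sets, we keep only those in which the same covariate wins the selection step, and we judge the observed effect against that conditional reference distribution.

```python
# "Condition on the model you have selected", done by brute force: keep only
# simulated null responses for which the same covariate is selected, and
# compare the observed statistic to that conditional distribution.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))                        # fixed design

def strongest(y):
    cors = X.T @ (y - y.mean())
    j = int(np.argmax(np.abs(cors)))
    return j, cors[j]

y_obs = rng.normal(size=n)                         # "observed" data: pure noise
j_obs, c_obs = strongest(y_obs)

cond = []                                          # conditional null distribution
while len(cond) < 2000:
    j, c = strongest(rng.normal(size=n))           # simulate a data set under the null
    if j == j_obs:                                 # same model selected: keep it
        cond.append(c)
p_selective = np.mean(np.abs(cond) >= abs(c_obs))
print("selected covariate:", j_obs, " selective p-value:", p_selective)
```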

The (a?) critique of this is that we are conditioning on the model that we get. That is, we ignore all the data sets that don’t give us this model. In doing this, we throw out the variability associated with performing model selection. That is, we still haven’t solved Problem 2. above.

Taylor’s response to this is that the model you select changes the definition of the parameters in it. I’ll illustrate this with an example I use in an introductory statistics class: if you survey students and ask their height and the height of their ideal partner, you see a negative relationship — it looks like the taller you are, the shorter your ideal partner. However, if you include gender in the model, the relationship switches. So the coefficient of your height in predicting the height of your ideal partner is interpreted one way when you don’t know gender and another way when you do. If we perform automatic model selection, we’ve allowed the data to choose the meaning of the coefficient, and if we repeated the experiment we might get a different model and hence give the coefficient a different meaning. Taylor would say “Yes, the hypotheses that we choose to test are chosen randomly, but this is of necessity since they mean different things for different models. In any case, this is just what you were doing informally when you let the statistician choose the model by hand.”

I see the point here, but nonetheless it makes me uneasy. In the paper I began with, the researchers asked an intelligible question about racism in football, and I don’t think they were looking for an “if you include these covariates” answer. Now, to some extent, the statisticians’ analyses of this question support Taylor’s arguments — one team had simply correlated the number of red cards with the player’s skin tone and left it at that; others felt that they needed to worry about whether more strikers got red cards and perhaps more black-skinned players took that position, and other sorts of indirect effects. In fact, with the exception of the one team that looked at correlations, the question was universally interpreted as “Can you find an effect for skin tone, once you have also reasonably accounted for the other information that we have?” I generally think this is what statisticians, and the general public, understand us to be doing when we try to present these types of relationships. (Notably, there is a pseudo-causal interpretation here; the data do not support an explicit causal link, but by accounting for other possible relationships we can try to get as close as we can — or at least rule out some correlated factor.)

I think this is the key to my concerns. I see model selection as part of the “reasonably accounting for all the information we have” part of conducting inference. In particular, we hope that in performing model selection, the covariates that we remove have small enough effects that removing them doesn’t matter. (Or at least that we don’t have enough information to work out that they do matter.) Essentially, my response to Taylor is that “The interpretation of a coefficient doesn’t change between two models when they differ only in parameters that are very close to zero”. That is, model selection might be better termed “model simplification” — it doesn’t change the model (much), it just makes it shorter to write down. The inference I want includes both 1. and 2. — my target is the model that makes use of all covariates as well as possible (if I had enough data to do this), and I want my measure of uncertainty to account for the fact that I have done variable selection with the aim of obtaining a simpler model.
Of course, this just leaves us back in the old hole — I care about both Problems 1 and 2 and I have no good way to deal with that. There has been a school of work on “stability selection”; essentially bootstrapping to look at “how frequently is this covariate used in the model?” This has to be done quite carefully to have valid statistical properties, which is a first problem. However, I have to confess a further problem: from stability selection you will get a probability of including each covariate in the model, and then a notion of how important it is, if included. If I have to look through all covariates, why do selection at all? There are plenty of regularization techniques (e.g. ridge regression) that can be employed with a large number of covariates and that are much easier to understand than model selection. Why not simply use them and rank the covariates by how much they contribute? I’m not convinced this won’t do just as well in the long run.

 

 

*** I call this a framework because the specifics of carrying it out for various statistical procedures still requires a lot of work.

Statistical Inference for Ensemble Methods

It’s taken me a while to get back to this blog, so thanks to those of you who are still following. The break was partly because I’ve been back from sabbatical and facing the ever-increasing administrative burden associated with recently-tenured faculty (there’s a window between being promoted and learning to say no to everything, which I think the administration is very good at exploiting).

It’s also partly because I didn’t want to keep harping on about my own research. I will finish off from where I left my last blog post (on producing uncertainties about the predictions from random forests) but I want to then get back to philosophical musings, particularly about some recent developments in statistics.

So, back to self-indulgence. Lucas and I managed to produce a central limit theorem for the predictions of random forests (paper now accepted at the Journal of Machine Learning Research after much debate with referees). Great! Now what do we do with it?

Well, one thing is to simply use them as a means of assessing specific predictions. As a potential application, Cornell’s Laboratory of Ornithology has a wonderful citizen science program called eBird, which collects data from amateur birdwatchers all over the world. They use a random forest-like procedure to build models to predict the presence of birds throughout the US, and one of the uses of these is to advise the Nature Conservancy about particular land areas to target for lease or purchase. Clearly, they would like to obtain high-quality habitat, in this case for migratory birds, and can do so off the models that the Lab of O produces. These currently do not provide statements of uncertainty, but we might reasonably think about posing the question “Would a plot with a 90% +/- 10% chance of bird use really be better than one with 87% +/- 3%?” We know the second spot is pretty good; the first might be very good, but we’re much less sure of that.

More importantly, we can start trying to use the values of the predictions to learn about the structure of the random forest. A first approach to this is simply to ask

Do forests that are allowed to use covariate x1 give (statistically) different predictions to those that are not?

This is expressed simply as a hypothesis that if we divide the covariates into x1 and x2, say, we can ask whether the relationship

f(x1,x2) = g(x2)

holds everywhere. Formally, we state this in statistics as a null hypothesis and ask if there is evidence that it isn’t true.

We can assess this by simply looking at the differences between predictions at a representative set of covariate points. We of course need to assess the variability of this difference, taking into account the covariance between predictions at nearby points. It actually turns out that the theoretical machinery we developed for a single prediction from a random forest is fairly easily extensible to the difference between random forests and to looking at a vector of predictions. To formally produce a hypothesis test, we have to look at a Mahalanobis distance between predictions, but with this we can conduct tests pretty reasonably.

In fact, when we did this, we found that our tests gave odd results. Covariates that we knew were not important (because we scrambled the values) were appearing as significant. This was odd, but a possible explanation is that this works just like random forests: a little extra randomness helps. More comforting was the fact that if we compared predictions from two random forests with two different scramblings of a covariate, the predictions were not statistically different. This led us to suggest that a test of whether x1 makes a difference be conducted by comparing a forest with the original values of x1, and one with these values scrambled and that seemed to work fairly well.
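A rough sketch of that scrambling comparison, with my own simulated data and scikit-learn’s random forests standing in for the subsampled version the theory actually covers; it also sidesteps the Mahalanobis-distance test and simply compares the original-versus-scrambled difference in predictions against scrambled-versus-scrambled differences.

```python
# Compare a forest that sees x1 with forests in which x1 has been scrambled,
# calibrating against scramble-vs-scramble differences (a crude stand-in for
# the formal test based on the central limit theorem).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n, p = 1000, 4
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)     # x1 (column 0) really matters
grid = rng.normal(size=(200, p))                        # representative prediction points

def forest_preds(Xtr, seed):
    return RandomForestRegressor(n_estimators=300, random_state=seed).fit(Xtr, y).predict(grid)

def scrambled_preds(seed):
    Xs = X.copy()
    Xs[:, 0] = rng.permutation(Xs[:, 0])                # break x1's link to y
    return forest_preds(Xs, seed)

stat = np.mean((forest_preds(X, 0) - scrambled_preds(0)) ** 2)
null = [np.mean((scrambled_preds(2 * s) - scrambled_preds(2 * s + 1)) ** 2) for s in range(1, 8)]
print("original vs scrambled:", round(float(stat), 3),
      " scrambled vs scrambled:", [round(float(v), 3) for v in null])
```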

Tests between different forests conducted in this way are problematic, however, in two ways. First, the need to scramble a covariate is rather unsatisfying; second, it limits the sort of questions we can ask. We cannot, for example, propose a test of additivity between groups of variables:

f(x1,x2) = g(x1) + h(x2)                 (*)

or, in a more complicated form:

f(x1,x2,x3) = g(x1,x3) + h(x2,x3)

Here x1, x2, and x3 are intended to be non-overlapping groups of columns in the data.

What we decided to do in this case goes back to old ideas of mine, which I think I mentioned in an earlier post. That is, we can assess these quantities if we have a grid of values at which we make the predictions. That is, if we have a collection of x1’s and of x2’s and we look at every combination of them, we can ask how far f(x1,x2) is from (*). I’ve illustrated this in the figure below (rather than writing out a mathematical calculation here), but this comes out to be just a standard ANOVA on the grid.

[Figure: interaction test diagram (IntTestDiagram2) — predictions laid out on a grid of x1 and x2 values.]
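As a sketch of the calculation behind the figure (with a made-up prediction function of mine, and without the formal calibration that uses the covariance of the predictions): lay the predictions out on the grid and compute the standard two-way ANOVA interaction term, which is exactly zero when (*) holds.

```python
# ANOVA on a grid: predict at every combination of x1 and x2 values and see
# how far the grid of predictions is from row effects + column effects.
import numpy as np

def f(x1, x2):                    # stand-in for a fitted black-box prediction function
    return np.sin(x1) + x2 ** 2 + 0.5 * x1 * x2

x1_vals = np.linspace(-2, 2, 25)
x2_vals = np.linspace(-2, 2, 25)
F = f(x1_vals[:, None], x2_vals[None, :])             # predictions over the whole grid

grand = F.mean()
rows = F.mean(axis=1, keepdims=True) - grand          # main effect of x1
cols = F.mean(axis=0, keepdims=True) - grand          # main effect of x2
interaction = F - grand - rows - cols                 # zero everywhere iff f(x1,x2) = g(x1) + h(x2)
print("interaction sum of squares:", float((interaction ** 2).sum()))
```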

Now in order to test this statistically, we again need to know the covariation between different predictions, but our central limit theorem does this for us. ***  These tests turn out to be conservative, but they still have a fair amount of power.

The point here is a shift in viewpoint for statistical interpretation: from the internal structure of a functional relationship to quantities derived from the predictions that are made. By deriving our notion of structure and inference with respect to predictions, we can be flexible about the models we fit, we can allow high-dimensional data to enter as nuisance parameters (without a lot of model checking), and we build a bridge to machine learning methods. Leo Breiman wrote a paper in 2001 in which he outlined the distinction between statistics and what he called “algorithmic data analysis”.+++ I like to think we’ve at least started a bridge.

Finally, a point on the hypothesis testing paradigm that I have pursued here. Hypothesis tests have been rather unpopular recently, and not without reason — over-reliance on low p-values at the expense of meaningful differences is a real scourge of much of the literature. What they do do here, however, is ask “Is there evidence to support the added complexity in the model?” Even better for me is that I noted in a previous post that which prediction points you look at makes a big difference to your variable importance scores. On the other hand, if one of the hypotheses above is true, you can look at any prediction points you like, and all you screw up is how easy it would be to detect it not being true. That said, the biases in ML methods, particularly when making predictions where you do not have much data, mean that it’s still best to focus on making predictions where you do have a fair amount of data.

Next time: parametric models and the statistician as a source of randomness.

*** As a further note, these grids can get pretty big and estimating a large covariance matrix in our CLT is pretty unstable. In this case we turned to the technology of random projections that helped improve things a lot.
+++ And largely (and correctly) lamented statisticians’ unwillingness to consider the latter.

Statistical Inference for Ensemble Methods

The last post allowed me to get a pitch in for some of my recent work, and I decided that I liked that so much I’d go all-out this post and go into further detail about our results, how we got them, and (just to spice things up) a race for priority.

The work that Lucas did was on methods for bagged trees and random forests. I talked a bit about this in an earlier post, but to avoid sending you there, here’s the recap (with a quick code sketch after the list).

1. Bagged tree methods are based (unsurprisingly) on building trees. We do this by dividing the data in two based on one of the covariates. We choose the division by working out how to make the responses in each part as similar as possible. We then do the same thing to each part and keep going until we run out of data.

2. Bagging just means taking a bootstrap sample of the data and building a tree on it. Then taking another, then another until you have a collection of several hundred, or several thousand, trees. To use these, you make a prediction with each tree and then average the results.

3. Random forests are exactly the same, but they use a slightly different tree building mechanism in which you only consider a random subset of covariates each time you split the data. This is supposed to make the trees less aggressive in fitting the data.
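For anyone who prefers code to prose, here is the recap using scikit-learn’s stock implementations; the example data are mine and have nothing to do with any application in the post.

```python
# Bagged trees: trees fit to bootstrap samples, predictions averaged.
# Random forest: the same, but each split only considers a random subset
# of the covariates (max_features).
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(size=500)

bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=500).fit(X, y)
forest = RandomForestRegressor(n_estimators=500, max_features=2).fit(X, y)   # 2 of 6 covariates per split

x_new = np.zeros((1, 6))
print("bagged:", bagged.predict(x_new), " forest:", forest.predict(x_new))
```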

The basic thought that we had was: “There’s a bootstrap structure going on here. That’s often used for inference; why don’t we do so here?”

Fair enough, but that leaves two problems: Inference about what? And the small matter that the bootstrap doesn’t work in this case. For the first problem, we return to diagnostics for black box functions where we saw that if there is no parametric structure in the model, the relevant thing to look at are the model’s predictions. Just as we could try and use these predictions as a means of assessing variable importance etc, we can now ask whether those measures are statistically interesting, eg, do they differ from zero?

Now to the technicalities. It would be nice to use, say, the variance between the predictions of different trees, as a means of obtaining the variance of their average. Unfortunately when we use trees built with bootstrap samples, their predictions are highly correlated, so the usual central limit theorem doesn’t apply.

This had us stumped for a while until we hit on using U-statistics. These are a fairly classical tool that doesn’t necessarily get a lot of attention these days, but basically they’re a particular form of estimator. Suppose you have a function (which we will, with admirable ingenuity, call a “kernel”) which takes any two values from your data set; the U-statistic corresponding to this kernel is the average of the function over every pair of values in the data. You can do this with kernels that take three values, etc.
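A tiny concrete example (mine, not from the post): with the order-two kernel h(x1, x2) = (x1 − x2)²/2, averaging the kernel over every pair of observations recovers exactly the usual unbiased sample variance.

```python
# A U-statistic in miniature: the kernel h(a, b) = (a - b)^2 / 2, averaged
# over every pair of observations, is exactly the sample variance.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
data = rng.normal(size=30)

def h(a, b):                     # the "kernel": a symmetric function of two observations
    return (a - b) ** 2 / 2

u_stat = np.mean([h(data[i], data[j]) for i, j in combinations(range(len(data)), 2)])
print(u_stat, data.var(ddof=1))  # identical up to floating point
```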

For us, the kernel is “take these data, build a tree and make a prediction at these covariates”. It doesn’t have a nice algebraic representation, but it can be used nonetheless. For this, rather than taking bootstrap samples, we merely choose a subsample of the data at random, use it to build a tree and make a prediction. This is a fairly minor modification and in fact reduces the computational cost of building the tree.

The nice thing about U-statistics is that they have a central limit theorem. The variance is somewhat different to “take the variance and divide by n” in the classical CLT, but nonetheless it is something we can use. I’ll talk about that further down.

Of course, if we have a data set of size 1000 and took all subsamples of size 100, we would run out of computing power very quickly. So we only take subsamples at random (an “incomplete U-statistic”). To do the asymptotics, we also ought to let the number of points in each subsample grow as the overall size of the data grows (“infinite-order U-statistics”), and random forests also use some randomization in building the trees (we had to invent a term for this; we’ll call them “random kernel U-statistics”). There wasn’t a result for incomplete infinite-order U-statistics with random kernels, so we provided one. As noted, the variance is different from what you would normally expect, but by being a bit careful about how we choose subsamples we have a nice way of calculating that as well.
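Here is the estimator itself in sketch form. This is my own minimal implementation of the kind of subsampled, randomized ensemble the theory covers, and it deliberately stops short of the variance calculation, which needs the careful subsample bookkeeping described in the paper.

```python
# An "incomplete, infinite-order U-statistic with a random kernel" is just an
# average of randomized trees, each built on a random subsample of the data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
n, p = 1000, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(size=n)

k = 200             # subsample size: "infinite-order" lets k grow with n
m = 500             # number of subsamples used: "incomplete" means m << (n choose k)
x_star = np.zeros((1, p))

preds = []
for s in range(m):
    idx = rng.choice(n, size=k, replace=False)                      # a subsample, not a bootstrap sample
    tree = DecisionTreeRegressor(max_features=2, random_state=s)    # extra split randomness: the "random kernel"
    tree.fit(X[idx], y[idx])
    preds.append(tree.predict(x_star)[0])                           # the kernel: build a tree, predict at x*

print("ensemble prediction at x*:", float(np.mean(preds)))
```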

This represents the very first formal distributional result for the predictions of random forests; although there were a few heuristic precursors.

But as cool as that is, it’s made better by a knuckle-biting race. The context for this is that Lucas and I started working on this in 2011/2012. I was at the Joint Statistical Meetings in 2012 and ran into Trevor Hastie and quite excitedly told him about the results that we were developing. It was a bit of a shock, then, to be told that he had a student working on the same problem. As it happened, Stefan Wager came out with a paper producing confidence intervals for predictions of random forests a couple of months later, although without any theory. It’s actually a rather nice application of the infinitesimal jackknife, which is a somewhat different means of estimating the variance of a random forest than we were using.

Lucas and I didn’t manage to put our paper on ArXiv until April, followed a week later by another by Stefan Wager which developed a different central limit theorem. Arguably we won the theory race, just; but as is often the case in statistics, priority really isn’t that clear. Stefan’s result contains some tighter bounds than ours, but we cover a somewhat more general class of estimators. In any case, Stefan and Lucas, along with Gerard Biau (who has been studying the consistency of random forests for many years) and Shane Jensen (who has studied a Bayesian alternative), will all present at a session at ENAR this spring. If you happen to be in Miami in March, I think it will be really exciting.

On second thoughts, I’ll leave uses of these results for statistical inference for the next post.

Inference for this Model or for this Algorithm?

In this post I want to discuss an important but subtle distinction in statistical inference about predictability that is too often glossed over. It is this: is the relevant inferential question about the particular prediction function that you have derived, or about the process of deriving it? This question has been an issue both in Statistics and in Machine Learning. It also allows me both to relate an anecdote **** and to skite about some of my own (and, more importantly, my student’s) work.

I’ll lead into this with the anecdote. Several years ago John Langford visited Cornell (at the same time as he caused the anecdote in Post 3) and I described a particular problem to him:

I want to be able to tell whether x affects an outcome y, controlling for other covariates z without making parametric assumptions. To do this, I want to build a bagged decision tree and develop theory to test if it uses x more than we would expect by chance.

Langford’s response to this was to say,

Why not build any sort of model using x and z and one using only z and then use PAC bounds on a test set to see if the errors are different?

The reader might, at this point, need some background. PAC (probably approximately correct) bounds are mathematical bounds on how well we can estimate a quantity. They’re of the form “The probability that we make an error of more than δ is less than ε”. There is a complex mathematical background to these that I won’t go into, but the relevant point here is that the bounds are usually provided for our ability to estimate the error rate of an ML algorithm, usually from the training data. So comparing error rates might be useful.

Although this is a natural suggestion for someone from a PAC background, I was not particularly convinced that this was a useful direction to go in, and I remain that way for two reasons:

  1. PAC bounds are very conservative. They hold for our estimate of the error rate of _every_ function in a very large class, not just the individual function you are currently looking at. This uniformity can be very powerful, but tends to produce deltas or epsilons much larger than you actually see in practice. In practical terms for statistical inference, it means that such a method would have low power — ie, we’d need an awful lot of data to detect a difference.
  2. This is more fundamental and is the impetus for this blog: it misses the scientific question of interest. The proposed method would have told me that one particular prediction function had a lower error rate than another particular prediction function. That is, if we fixed the two models and decided that these exact models were what we would employ, we could examine whether one is likely to perform better than the other.  But that isn’t really what I wanted. I wanted to understand whether the difference between the two models is reliable if we generated new data and repeated the exercise. +++

This second distinction is what I really want to get at. Are we publishing a particular prediction function (all parameters etc fixed from here on in) or are we examining a method of obtaining one? If the former, we need only examine empirical error rates for this function. If the latter, we need to understand the variability of this function under repeated sampling.

In fact, this distinction is germane to the use of DeLong’s test. This is a method designed to test exactly a parametric version of this hypothesis through the use of the AUC statistic. However, as Demler et al. point out, it does so assuming that the particular prediction functions you use are fixed — you will use these particular functions forever more. If you also want to account for the fact that you have estimated your functions, it is no longer valid.

These, of course, are examples of the traditional statistical concerns: we are asking “If you re-collected your data and re-fit the model, would you get the same answers?” But just as computer scientists are too ready to fix their model, statisticians are too ready to re-estimate it. If we are publishing a psychological test, for example, the specific values of parameters in that test will remain fixed — there is no re-collection and re-fitting — we just want to know how well it will do in the future (or that it will do better than some other, fixed, procedure). In this case, DeLong’s test is valid (as is using PAC bounds in the way that Langford suggests) and the more traditional statistical concerns are overly conservative (because they try to account for more variation than necessary).

However, more often than not, the statistician’s question is the relevant one. In my situation, “Does lead pollution affect tree swallow numbers?” really implies “You saw a relationship in this data; could you have just gotten it by chance?”, ie, exactly the sort of question I want to ask. This hasn’t been possible in Machine Learning up until the last year, and I’m really pleased that my student, Lucas Mentch, has produced the first ever central limit theorem for ensembles of decision trees. ### This really does give us a handle on “if we re-collected the data, how different might our predictions be?”, and we can back my questions above out of this. All this is early days, though, and I’ll post a little more about how this works — and a friendly race for priority — in the next post.

A final note on all of this. I’ve largely stayed out of the Bayesian/Frequentist debate, but will here lay down my cards as the latter. To me, statistics is really about assessing “Is this
reproducible?” Isn’t that the standard for scientific evidence, anyway? More realistically; “To what extent is this reproducible?” %%% There is a lot of analysis that says that Bayesian methods often approximately provide an answer to this — and I will more than happily make use of them under that framework — but they don’t automatically come with such guarantees. I don’t particularly like the idea of subjectivist probability but rather than arguing against everyone having their own science, I simply think that Bayes methods answer the wrong questions, in much the same way that PAC bounds do — they’re both conditioned on the current data.

Interestingly, to some extent this view would appear to support an ML viewpoint — I only care about quantifiable predictive power (in this case of inferential statements over imaginary replications). To some extent that’s right — I think interpretation (and statistical inference) is hugely important for humans building models. I don’t think it should be the means by which we judge a model’s correctness.

 

****  If it is not already obvious, I enjoy these

+++  To some extent the uniformity of PAC bounds would allow us to examine that, if we looked at estimators that explicitly (and exactly) minimized a loss. That is, if my function with just z minimized the error rate, and I could show that the error for the function using both x and z was smaller than the error using z by more than the bound, I could say that this has to be true of ANY function that only uses z. Unfortunately, the error rate we have bounds for is often not the error rate that we minimize, and in the case of things like decision trees these bounds really aren’t available.

### I’m nearly as tickled to have a paper coming out that can be cited as Mentch and Hooker.

%%% This gives us scope to play with changing the data generating model, the sorts of assumptions we make, etc.

Local Interpretation

So far the types of interpretation that I have discussed in machine learning models have been global  in scope. That is, we might ask, “Does x1 make a difference anywhere?”. This is very much a scientific viewpoint — we want to establish some universal laws about what factors make a difference in outcome.

There is, however, an alternative need for explanation that is much more local in scope — what makes a difference close to a particular point? This was brought home to me at the start of this year when I spent a couple of months visiting IBM’s Healthcare Analytics working group (a wonderful, smart set of people who very generously hosted me). Without giving away too many industrial secrets, the group is developing systems based on electronic medical records to support medical decision making: we might want to forecast whether a patient has a high risk of developing some particular condition, for example. There are great machine learning tools for doing this, and the folks at IBM have remarkably good performance at this task.

The question, from a doctor’s point of view, is now “What do I do with this information?” It’s not much good knowing that someone will develop diabetes*** in six months without some idea about what you can do to help. We might therefore be interested in understanding what changes could be made to the patient’s covariates to reduce their risk. This isn’t a global interpretation — we’re not going to be able to change their weight or blood sugar dramatically (at least not in short order), but we might be able to nudge them in a positive direction. The question is then “What should we nudge, and what should we prioritize nudging?”

To do this, you don’t need a global understanding of the way the model functions, you only need to know how the prediction changes locally around a patient. One way to do this is to experiment with changing a covariate a little (“How much does this prediction change if we reduce weight by 10lb?”). 10lb is rather arbitrary, so we might be interested in something like a derivative: how fast does risk increase with weight at these covariates?
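
To make this concrete, here is a minimal sketch in Python. It assumes only that the fitted model is exposed as a predict function taking a 2-d array of covariates and returning predicted risks; the names model_predict and patient_x, the column index for weight, and the step sizes are all hypothetical.

```python
import numpy as np

def nudge_effect(predict, x, j, delta):
    """Change in the prediction for a single example x when covariate j
    is shifted by delta and everything else is left as observed."""
    x = np.asarray(x, dtype=float)
    x_new = x.copy()
    x_new[j] += delta
    return predict(x_new.reshape(1, -1))[0] - predict(x.reshape(1, -1))[0]

def local_slope(predict, x, j, h=1.0):
    """Central finite-difference approximation to how fast the prediction
    changes with covariate j at x (only meaningful if the surface is smooth)."""
    return (nudge_effect(predict, x, j, +h) - nudge_effect(predict, x, j, -h)) / (2 * h)

# e.g. "How much does predicted risk change if we reduce weight (column 3) by 10lb?"
# change = nudge_effect(model_predict, patient_x, j=3, delta=-10)
```

The derivative version only makes sense when the prediction surface is reasonably smooth, which is exactly the caveat taken up next.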

Unfortunately, not all ML functions are differentiable, or are easy to differentiate, but we could imagine some approximation to this. One way to explicitly look at local structure is by local models. These are parametric models — of the sort I have just been decrying — but estimated for each point based only on the points close by.

The easiest way to think about this is with a very simple example: the Nadaraya-Watson smoother in 1 dimension. I’ve sketched an example of this below, but the idea is simple — we take a weighted average of the points around the place in which we are interested, where the weights decrease the further the observations are from our prediction point.

[Figure: sketch of a Nadaraya-Watson smoother, a kernel-weighted average of nearby observations]
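
For anyone who wants to poke at this numerically, here is a minimal sketch of the one-dimensional smoother with a Gaussian kernel; the noisy sine-curve data are purely illustrative.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, bandwidth=0.5):
    """Kernel-weighted average of Y, with weights that shrink as the
    observations X move away from the prediction point x0."""
    weights = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)   # Gaussian bump centered at x0
    return np.sum(weights * Y) / np.sum(weights)

# Illustrative data: noisy observations of a sine curve, smoothed on a grid.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
Y = np.sin(X) + rng.normal(scale=0.3, size=200)
grid = np.linspace(0, 10, 50)
smoothed = np.array([nadaraya_watson(x0, X, Y) for x0 in grid])
```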

 

This, by itself, does not generalize easily, but we can instead think of it as minimizing a weighted squared error:

μ(x) = argmin_μ Σ_i K(x,Xi)(Yi – μ)^2

(to unpack this notation, argmin means the value of mu at which the expression is minimized. K is a “kernel” — think of it as a Gaussian bump centered at x). This can, in fact, be very nicely generalized to something like

β(x) = argmin_β Σ_i K(x,Xi)(Yi – g(Xi,β))^2

where g(Xi,β) is a linear regression (with coefficients beta), for example. There is nothing in here about x being univariate anymore. Now we have a linear regression at x, but it will change over x. The nice thing about this is that we can apply all the standard interpretation for linear regression, but this interpretation is specific to x. So we can look at β(x) to see how quickly risk changes with weight, for example, at this value of the covariates. We can also explore structure — interaction terms etc — that can give us a more detailed view of the response, but at a local scale.
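
Here is a minimal sketch of this local-regression idea, assuming the covariates arrive as rows of a numpy array and reusing a Gaussian kernel; the slope entries of the returned β(x0) are the local "how fast does the prediction change with this covariate" quantities.

```python
import numpy as np

def local_linear_coefficients(x0, X, Y, bandwidth=1.0):
    """Weighted least squares centered at x0: returns beta(x0), an intercept
    followed by one local slope per covariate. X has one row per observation."""
    x0 = np.asarray(x0, dtype=float)
    weights = np.exp(-0.5 * np.sum(((X - x0) / bandwidth) ** 2, axis=1))
    design = np.column_stack([np.ones(len(X)), X])
    WD = design * weights[:, None]                 # kernel weight applied to each row
    # Solve the weighted normal equations (D' W D) beta = D' W Y
    return np.linalg.solve(WD.T @ design, WD.T @ Y)
```

In practice one would typically center the covariates at x0 before fitting (so the intercept becomes the local prediction) and choose the bandwidth by cross-validation; neither refinement changes the basic idea.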

The crucial aspect of this is that it is the local scale that matters when one is asking hypotheticals about specific prediction points — “What would happen if we just changed this, a little bit?” And in fact, this potential-intervention model is really what is relevant.

A last question: can we get back from here to a global interpretation? Well, in some sense, yes. If we think of ourselves as asking “Is the derivative of F(x) with respect to x1 zero everywhere?” we can express all the results about additive structure discussed two posts ago in terms of some derivative being zero. For example, we could see whether a three-dimensional function can be reduced to two two-dimensional ones:

F(x1,x2,x3) = f1,3(x1,x3) + f2,3(x2,x3)

(i.e., there is no interaction between x1 and x2), either by estimating f1,2(x1,x2) along with the full component f1,2,3 and checking that they are negligible, or by examining whether

∂²F / ∂x1 ∂x2 = 0

Alternatively, we could look at the size of these derivatives, since none are likely to be exactly zero.
In many ways this is appealing — it appears to remove the problems we had about where we integrate and should be easier to assess. I think there are three issues with it, however:

  1. Most importantly, derivatives are not measured in the same units as each other, or as the original outcome. They’re rates, and thus there are real problems in deciding how to compare them; when is a derivative large relative to the others?
  2. We haven’t actually escaped the range-of-integration problem. We still have to assess the size of ∂²F / ∂x1 ∂x2, presumably integrated over some set of covariates. It does, however, make it easier to localize this measure of size to the covariates we have observed.
  3. In many models, derivatives are not easy to evaluate. In some they are, but in trees, for example, derivatives are zero nearly everywhere (the model proceeds in steps) and you need to specify some finite-difference scheme that looks a certain distance away: how far? (A rough sketch of such a check appears below.) But despite having avoided this approach for a while, I’m seeing more merit in it, at least insofar as it provides a way to link local to global interpretation.
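
To make the finite-difference option concrete, here is a minimal sketch of estimating the mixed partial of a black-box prediction function; F, the point x, and the step size h are placeholders, and for step-like models such as trees h has to be large enough to actually cross some splits.

```python
import numpy as np

def mixed_partial(F, x, i, j, h=0.1):
    """Central finite-difference estimate of d^2 F / dx_i dx_j at x.
    F takes a 1-d covariate vector and returns a scalar prediction."""
    x = np.asarray(x, dtype=float)
    def shifted(di, dj):
        z = x.copy()
        z[i] += di
        z[j] += dj
        return F(z)
    return (shifted(+h, +h) - shifted(+h, -h) - shifted(-h, +h) + shifted(-h, -h)) / (4 * h * h)

# Averaging |mixed_partial(F, x, 0, 1)| over the observed covariate vectors gives a
# crude, data-localized screen for whether x1 and x2 interact anywhere we have data.
```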

Next up: from interpretation to inference and a plug for some recent work and especially for one of my students.

 

*** I chose diabetes because it was not one of the diseases I was involved in forecasting, so I don’t know how, or if, IBM is looking at this.

 

Weasel Words

One common thread running through the diagnostics presented so far is the interpretation of the effect of a covariate “with all other terms held fixed”. This formulation is taught rather ritualistically in introductory statistics classes when multiple linear regression is introduced. ***  So what does it really mean? In a hypothetical linear regression of weight on height and age

weight = b0 + b1 * height + b2 * age

the interpretation of b2 is “suppose we could just change age”. Or more explicitly, “If we found two people who were exactly the same height and differed by 1 year in age, they would differ by b2 kg, on average”.

Fundamentally, the diagnostic methods for machine learning I introduced in the previous post — and generalized additive models more generally — also rely on this. We can understand

F(x1,x2) = g1(x1) + g2(x2)

because we can understand what g1(x1) does, WITH x2 FIXED.

So what? The issue here is that all other variables generally aren’t fixed. Alright, in nicely designed experiments we can actually vary one variable while holding the levels of the others fixed. But for observational data — and almost all ML is conducted on observational data — there are relationships among the covariates, so the notion of “other terms held fixed” really doesn’t apply. Even in designed experiments, the “all other covariates fixed” situation won’t occur in the non-experimental world that the experiments are presumably meant to partially explain.

To make this concrete, if you age a year, your height is likely to change as well. If you have the fortune of youth, you’ll get taller as your bones grow and later as your posture improves ***; if you’re older, your height will decrease as the disks in your spine slowly compact. Moreover, this will happen at a population level, too — people aged 37 do not have the same distribution of heights as those aged 38. So there is no real physical reality to holding all other covariates fixed.

This blog is by no means the first place to make this observation. The discipline of path analysis  has been around for nearly as long as the modern discipline of statistics. Basically, it boils down to a series of regressions:

weight = b0 + b1 * height + b2 * age + epsilon

height = c0 + c1 * age + eta

that is, there is a regression relationship between weight and age and height, but height is also dependent on age (yes, linearly for now, but just to illustrate the idea). Both of these can be estimated directly from the data and we can start partitioning out blame for our spare tires by saying

The DIRECT effect of age on weight is b2, but there is also an INDIRECT effect of age through height, to give a TOTAL effect of b2 + c1 * b1.
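
As a small worked sketch of this bookkeeping, here is a simulation with made-up coefficients; the only point is that the two regressions can be fit separately and the direct, indirect and total effects recombined as above.

```python
import numpy as np

# Made-up data-generating model: height depends on age, weight depends on both.
rng = np.random.default_rng(1)
n = 500
age = rng.uniform(20, 70, n)
height = 150 + 0.1 * age + rng.normal(scale=5, size=n)
weight = 20 + 0.4 * height + 0.2 * age + rng.normal(scale=4, size=n)

def ols(y, *covariates):
    """Least-squares coefficients: intercept first, then one slope per covariate."""
    design = np.column_stack([np.ones(len(y))] + list(covariates))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

b0, b1, b2 = ols(weight, height, age)   # weight = b0 + b1 * height + b2 * age
c0, c1 = ols(height, age)               # height = c0 + c1 * age

direct = b2                  # effect of age with height held fixed
indirect = c1 * b1           # effect of age operating through height
total = direct + indirect    # equals the slope from regressing weight on age alone
```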

In fact, this is now a special case of a Bayesian Network  in which we can skip some of the insistence on linearity (although Bayesian Networks often restrict to categorical-valued data). It can also be couched in the language of structural equation models.

The problem with these, in the simplest case, is that you need to know the structure of the dependences (we can’t have height depending on age AND age depending on height) before you do anything else. This is usually guided by subject matter knowledge — a kind of pseudo-mechanistic model ***. There are also ways to attempt to determine the structure of dependencies from the data, but these are not always used and are subject to considerable uncertainty.

Now, there is a weaselly retort to all these objections that “we are only trying to understand the geometry of the function”. That is, we are trying to get a visual picture of what the function looks like if we graphed it over the input space. This isn’t unreasonable, except insofar as it is often used because we can do it easily, rather than because it is what is actually desired. And it is reasonable if you genuinely think you have a generalized additive model (or, reality forgive you, a linear regression), in which case you can always look at dependences between covariates post hoc (but you really ought to).

However, in the diagnostic methods for black-box functions explored in the previous post, we explicitly based the representation that we looked at on averaging over covariates independently — i.e., by breaking up any relationships between them. This affects all sorts of things, particularly the measures of variable and interaction importance that we used to decide what components to plot as an approximation to the function. This concern took up a considerable amount of my PhD thesis and I’ll return to these ideas in a later post, although even then the solutions I proposed resulted in additive model approximations, which again are subject to that “all other covariates held fixed” caveat.

So what can we do? We could simply try to regress the response on each covariate alone, graph that relationship and measure its predictive strength. The problem is that this doesn’t then let you combine these relationships to say anything about pairs of covariates (it doesn’t account for the relationships even between pairs of covariates). Or we could resort to a non-linear form of path analysis, requiring either the specification of causal dependences between covariates or some model selection to work out what the correct diagram should be (subject to a whole lot of unquantified uncertainty). I’ll post about some of my very early work in a few posts, but frankly I still don’t think we have any particularly good answers to all this.

There are two answers, though. One is the classical ML answer — why are you trying to interpret at all? The other is to pursue a local interpretation; we’ll look at this in the next post.

*** There’s an awful lot of ritualistic statements that we teach to students in introductory statistics classes — I’m not sure how much of this really gets through to student understanding, or how often these statements are really parsed. The old “If the null hypothesis was true, there would be less than 0.05 probability of obtaining the observed test statistic or something more extreme” is almost legalistic in being exactly correct and largely endarkening on first encounter. Nonetheless, we (myself included) continue to grade students based on their ability to parrot back such phrases.

*** Only after stern commands from my physiotherapist in my case.

*** Pseudo because linear regression really is a particularly blunt instrument for mathematical modeling, but we do have to give way to mathematical convenience at some point.