One common current to the diagnostics presented so far is the interpretation of the effect of a covariate “with all other terms held fixed”. This formulation is taught rather ritualistically in introductory statistics classes when multiple linear regression is introduced. *** So what does it really mean? In a hypothetical linear regression of weight on height and age
weight = b0 + b1 * height + b2 * age
the interpretation of b2 is “suppose we could just change age”. Or more explicitly, “If we found two people who were exactly the same height and differed by 1 year in age, they would differ by b2 kg, on average”.
Fundamentally, the diagnostic methods for machine learning I introduced in the previous post — and generalized additive models more generally — also rely on this. We can understand
F(x1,x2) = g1(x1) + g2(x2)
because we can understand what g1(x1) does, WITH x2 FIXED.
So what? The issue here is that all other variables generally aren’t fixed. Alright, in nicely designed experiments we can actually fix the levels of one variable and vary the other variables. But for observational data — and almost all ML is conducted on observational data — there are relationships among the covariates, so the notion of “other terms held fixed” really doesn’t apply. Even in designed experiments, the “with all other covariates fixed” won’t happen in the non-experimental world that the experiments are presumably meant to partially explain.
To make this concrete, if you age a year, your height is likely to change as well. If you have the fortune of youth, you’ll get taller as your bones grow and later as your posture improves ***; if you’re older, your height will decrease as the disks in your spine slowly compact. Moreover, this will happen at a population level, too — people aged 37 do not have the same distribution of heights as those aged 38. So there is no real physical reality to holding all other covariates fixed.
This blog is by no means the first place to make this observation. The discipline of path analysis has been around for nearly as long as the modern discipline of statistics. Basically, it boils down to a series of regressions:
weight = b0 + b1 * height + b2 * age + epsilon
height = c0 + c1 * age + eta
that is, there is a regression relationship between weight and both age and height, but height also depends on age (yes, linearly for now, just to illustrate the idea). Both of these can be estimated directly from the data and we can start partitioning out blame for our spare tires by saying
The DIRECT effect of age on weight is b2, but there is also an INDIRECT effect of age through height, to give a TOTAL effect of b2 + c1 * b1.
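The decomposition above can be sketched on simulated data. Everything here is an assumption for illustration: the generating coefficients, the noise levels, the sample size, and the small hypothetical `ols` helper are made up and do not come from any real data set. The sketch fits the two regressions, forms the total effect b2 + c1 * b1, and compares it with the slope from regressing weight on age alone.

```python
import random

def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given columns.

    Hypothetical helper: centres the data, solves the normal equations
    by Gauss-Jordan elimination, returns [intercept, slope_1, ...].
    """
    n = len(y)
    means = [sum(c) / n for c in columns]
    ybar = sum(y) / n
    Z = [[v - m for v in c] for c, m in zip(columns, means)]  # centred columns
    r = [v - ybar for v in y]                                 # centred response
    p = len(Z)
    # Augmented normal equations [Z'Z | Z'r]
    A = [[sum(a * b for a, b in zip(Z[i], Z[j])) for j in range(p)]
         + [sum(a * b for a, b in zip(Z[i], r))] for i in range(p)]
    for k in range(p):
        piv = max(range(k, p), key=lambda i: abs(A[i][k]))    # partial pivoting
        A[k], A[piv] = A[piv], A[k]
        for i in range(p):
            if i != k:
                f = A[i][k] / A[k][k]
                A[i] = [a - f * b for a, b in zip(A[i], A[k])]
    slopes = [A[i][p] / A[i][i] for i in range(p)]
    intercept = ybar - sum(s * m for s, m in zip(slopes, means))
    return [intercept] + slopes

# Made-up adult population: height (cm) drifts down slightly with age.
random.seed(0)
n = 2000
age = [random.uniform(20, 60) for _ in range(n)]
height = [180 - 0.1 * a + random.gauss(0, 7) for a in age]
weight = [-60 + 0.75 * h + 0.2 * a + random.gauss(0, 8)
          for h, a in zip(height, age)]

b0, b1, b2 = ols([height, age], weight)  # weight = b0 + b1*height + b2*age
c0, c1 = ols([age], height)              # height = c0 + c1*age
d0, d1 = ols([age], weight)              # weight on age alone

direct = b2
indirect = c1 * b1
total = direct + indirect

print("direct effect of age:  ", round(direct, 3))
print("indirect via height:   ", round(indirect, 3))
print("total effect:          ", round(total, 3))
print("slope of weight on age:", round(d1, 3))
```

For least-squares fits on the same sample, the total effect and the slope from the age-only regression agree up to rounding. In this linear setting, the total effect is simply what you see when you do not hold height fixed.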
In fact, this is now a special case of a Bayesian Network in which we can skip some of the insistence on linearity (although Bayesian Networks often restrict to categorical-valued data). It can also be couched in the language of structural equation models.
The problem with these, in the simplest case, is that you need to know the structure of the dependences (we can’t have height depending on age AND age depending on height) before you do anything else. This is usually guided by subject matter knowledge, a kind of pseudo-mechanistic model ***. There are also ways to attempt to determine the structure of the dependences, but these are not always used and are subject to considerable uncertainty.
Now, there is a weaselly retort to all these objections: “we are only trying to understand the geometry of the function”. That is, we are trying to get a visual picture of what the function looks like if we graphed it over the input space. This isn’t unreasonable, except in so far as it is often used because we can do it easily, rather than because that is what is actually desired. And it is reasonable if you actually think you have a generalized additive model (or, reality forgive you, a linear regression), in which case you can always look at dependences between covariates post hoc (and you really ought to).
However, in the diagnostic methods for black box functions explored in the previous post, we explicitly based the representation that we looked at on averaging over covariates independently, i.e. by breaking up any relationships between them. This affects all sorts of things, particularly the measures of variable and interaction importance that we used to decide what components to plot as an approximation to the function. This concern took up a considerable amount of my PhD thesis and I’ll return to these ideas in a later post, although even then the solutions I proposed resulted in additive model approximations which again are subject to that “all other covariates held fixed” caveat.
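To see what “averaging over covariates independently” does in practice, here is a minimal sketch of a partial-dependence-style average. It is assumed as a stand-in for the kind of representation discussed, not a reproduction of the exact method from the previous post. The data, the toy `black_box` function and the `partial_dependence` helper are all hypothetical.

```python
import random

def partial_dependence(f, data, j, value):
    """Average f over the data with coordinate j overwritten by `value`.

    Each observed row keeps its other coordinates, so `value` is paired
    with all of them, whether or not that pairing occurs in the data.
    """
    total = 0.0
    for row in data:
        probe = list(row)
        probe[j] = value
        total += f(probe)
    return total / len(data)

# Made-up sample of children: height (cm) rises steeply with age.
random.seed(1)
ages = [random.uniform(2, 18) for _ in range(1000)]
data = [(a, 80 + 5.5 * a + random.gauss(0, 6)) for a in ages]

def black_box(x):
    """Toy stand-in for a fitted model taking (age, height)."""
    age, height = x
    return 0.4 * height + 0.5 * age - 20

# Partial dependence of the prediction on age, evaluated at age 3.
pd_at_3 = partial_dependence(black_box, data, 0, 3.0)

# How many of the probe points used above pair age 3 with a height
# taller than anything observed among children aged 2 to 4?
near_3 = [h for a, h in data if abs(a - 3) < 1]
tallest_near_3 = max(near_3)
off_support = sum(h > tallest_near_3 for _, h in data) / len(data)

print("partial dependence at age 3:", round(pd_at_3, 2))
print("fraction of probe points off the data:", round(off_support, 2))
```

In this made-up sample, most of the points that go into the average at age 3 are three-year-olds with teenage heights. A black box fitted to real data has never seen such people, so its behaviour there is pure extrapolation.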
So what can we do? We could simply try to regress the response on each covariate alone, graph that relationship and measure its predictive strength. The problem is that this doesn’t then let you combine these relationships to get relationships between pairs of covariates (it doesn’t account for the relationship between even pairs of covariates). Or we could resort to a non-linear form of path analysis, requiring either the specification of causal dependences between covariates or some model selection to work out what the correct diagram should be (subject to a whole lot of unquantified uncertainty). I’ll post about some of my very early work in a few posts, but frankly I still don’t think we have any particularly good answers to all this.
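A small simulation shows why the one-covariate-at-a-time regressions cannot simply be combined. The generating model, its coefficients and the `slope` helper below are assumptions for illustration only. Each marginal slope already absorbs the effect of the correlated covariate, so adding the marginal effects counts the shared part twice.

```python
import random

def slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Made-up sample of children: height depends strongly on age,
# and weight depends on both.
random.seed(2)
n = 1000
age = [random.uniform(2, 18) for _ in range(n)]
height = [80 + 5.5 * a + random.gauss(0, 6) for a in age]
weight = [-20 + 0.4 * h + 0.5 * a + random.gauss(0, 4)
          for h, a in zip(height, age)]

m_age = slope(age, weight)        # weight on age alone
m_height = slope(height, weight)  # weight on height alone
h_per_year = slope(age, height)   # typical height gained per year

# Naive combination: one more year of age plus the height that
# typically comes with it, using the two marginal slopes.
naive = m_age + m_height * h_per_year

print("marginal slope on age:      ", round(m_age, 2))
print("marginal slope on height:   ", round(m_height, 2))
print("naive combined yearly change:", round(naive, 2))
```

The marginal slope on age already includes the weight that comes with a year's growth in height, so the naive combination comes out at roughly double the yearly change in weight actually built into the simulation.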
There are two answers, though. One is the classical ML answer — why are you trying to interpret at all? The other is to pursue a local interpretation; we’ll look at this next post.
*** There’s an awful lot of ritualistic statements that we teach to students in introductory statistics classes — I’m not sure how much of this really gets through to student understanding, or how often these statements are really parsed. The old “If the null hypothesis was true, there would be less than 0.05 probability of obtaining the observed test statistic or something more extreme” is almost legalistic in being exactly correct and largely endarkening on first encounter. Nonetheless, we (myself included) continue to grade students based on their ability to parrot back such phrases.
*** Only after stern commands from my physiotherapist in my case.
*** Pseudo because linear regression really is a particularly blunt instrument for mathematical modeling, but we do have to give way to mathematical convenience at some point.