On October 16, Jim Berger, Deborah Mayo and David Trafimow took part in a debate, hosted by NISS, about the use of hypothesis tests and p-values in scientific studies. I’m delighted to see Statistics once again grappling with its philosophical basis, and at least some philosophers coming to help out. I think it’s worth watching if you haven’t seen it.
Here are some of my own thoughts, having had a few days to digest.
First, the participants had agreed not to prepare beforehand, and while I understand the motivation, I think I would support Mayo’s tweet wondering whether that was, in fact, the best idea. The arguments would have been more cogently stated, and perhaps more directly engaged, with some more preparation. Trafimow, in particular, took some time to warm up (he may have taken the injunction against preparation most seriously), and that is a shame; I don’t agree with his position (more on that later), but that is all the more reason to want it persuasively argued — there might be something I’ve missed! In retrospect, I might actually have gone in the other direction and had each participant write out a statement of principles to be shared ahead of time and read at the start, from which the discussion could proceed.
I saw this as really two debates: Mayo versus Berger on p-values versus Bayes factors, and Mayo and Berger versus Trafimow on testing at all. I’ll readily admit to finding it hard to concentrate in online fora the way I would at an in-person event (no one can see me check my e-mail), and I had to leave and teach before the discussion period, so I may have missed some details, but these were the discussions that struck me as most salient.
For the first of these, it seemed to me that Berger and Mayo came from very different perspectives and largely talked past each other: Berger supported Bayes factors with “this is how scientists want to interpret p-values”, countered by Mayo’s “but that’s not how they ought to interpret them”. As a philosophy of science, I’m in Mayo’s camp here, but I have to acknowledge that nearly a century of statistical education still seems not to have found a way to reliably get the point across. We can certainly say that correct science does not need to account for human cognitive failings — it’s ok that it’s hard — but it does open the question of whether there are more readily understood but equally rigorous frameworks; although a Popperian description of science certainly points to something like classical statistical methods.
But I was sorry to see neither really address the concerns of the other. Berger, of course, would say that Bayesian frameworks provide a coherent alternative basis for scientific inference, and argues that p-value evidence maps poorly onto Bayes factors. Mayo would counter that you can’t equate the scales used by the two systems. I’d agree with Mayo, but the technical argument misses the difference of intent: do we work from how scientists actually think and try to improve that, or formulate how they ought to think? The latter is certainly appealing, but does run into human failings.
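To make the “maps poorly” claim concrete, here is a minimal sketch of the calibration Berger and colleagues have advocated (Sellke, Bayarri and Berger, 2001): for p < 1/e, the Bayes factor in favour of the null is bounded below by −e·p·ln(p), so p = 0.05 corresponds at best to about 2.5-to-1 evidence against the null. The code and numbers are my own illustration, not anything presented in the debate.

```python
import math

def bayes_factor_bound(p):
    """Sellke-Bayarri-Berger lower bound on BF(H0 : H1) implied by a p-value."""
    if p >= 1 / math.e:
        return 1.0  # the bound is uninformative for large p
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.005):
    bf = bayes_factor_bound(p)
    post_h0 = bf / (1 + bf)  # minimum P(H0 | data) at equal prior odds
    print(f"p = {p:<6} BF(H0:H1) >= {bf:.3f}   min P(H0 | data) = {post_h0:.3f}")
```

At p = 0.05 the null retains at least roughly a 29% posterior probability under equal prior odds, which is the mismatch Berger points to, and exactly the kind of cross-scale translation Mayo objects to.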
This is evident in the reproducibility crisis (only part of which is rooted in bad statistical practice), where Mayo is again technically correct in observing that p-hacking is an indication that statistical significance is actually a pretty challenging requirement, if achieved honestly. However, the neatness of the philosophical system doesn’t account for how neatly the crisis demonstrates Goodhart’s law, as generalized by Marilyn Strathern:
When a measure becomes a target, it ceases to be a good measure.
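A toy simulation makes Strathern’s point concrete. Suppose an analyst working under a true null runs a test after every batch of observations and stops as soon as p < 0.05; the nominal 5% error rate is then no longer what it claims to be. This is my own minimal sketch of the simplest kind of p-hacking (optional stopping), not anything shown in the debate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant(peeking, n_batches=10, batch_size=10):
    """One simulated study under a true null; returns True if it ends up 'significant'."""
    data = np.empty(0)
    for _ in range(n_batches):
        data = np.concatenate([data, rng.standard_normal(batch_size)])
        p = stats.ttest_1samp(data, 0.0).pvalue
        if peeking and p < 0.05:
            return True   # stop and report as soon as the test 'works'
    return p < 0.05       # the honest analyst only looks at the pre-planned final test

n_sims = 2000
honest = np.mean([significant(peeking=False) for _ in range(n_sims)])
hacked = np.mean([significant(peeking=True) for _ in range(n_sims)])
print(f"false-positive rate, single pre-planned test:  {honest:.3f}")  # close to 0.05
print(f"false-positive rate, stopping at first p<.05:  {hacked:.3f}")  # well above 0.05
```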
One can, of course, build in guard-rails: pre-registration, or methods of post-selection inference, and I would guess that Mayo would be fine with either so long as it preserves her severity requirements. It might be better still to find ways to reward scientists not for publishing papers, but for publishing papers that get replicated (not that I have brilliant ideas about how to do this), thereby incentivizing scientists to be honest (with themselves) about their statistical practice. That idea isn’t original, and quite likely neither Mayo nor Berger would disagree with either of these statements. In fact, there may be little to say about their disagreement over a starting point besides “Let me acknowledge the opposing concern, but say that I think we have to just push past it”, but even that statement is useful.
The debate about using hypothesis tests at all was somewhat shorter. I had initially understood Trafimow’s editorial decision to ban significance tests as something of a sociological response to p-hacking — we will get less distorted models and conclusions if we take away the incentives. I’d agree with Mayo that, on balance, removing checks against randomness is counter-productive, but the proposition is not crazy. However, Trafimow staked out a more philosophical position; initially against basing conclusions on wrong models (the reductio of which would make progress impossible), but then against basing dichotomous decisions on incorrect models. I presume this comes from the discontinuity: I can be slightly off in my picture of the world, but still fairly close in my estimates of effects, until I turn it into an either-or proposition about something (assuming my estimates are somehow continuous in my model space). I’m not sure this is so much of a concern: the p-values Trafimow objects to (or at least their distribution) would still be continuous in the model space, and the potential for disagreement in dichotomous conclusions shrinks as the models converge, as Mayo effectively argued. I will happily buy into confidence intervals being a more informative summary of statistical evidence, when they are available, but we really shouldn’t pretend that they are anything other than a different summary of statistical tests.
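That last point is just the familiar duality between intervals and tests: a 95% confidence interval is exactly the set of null values that a 5% two-sided test would fail to reject. A quick sketch, assuming an ordinary one-sample t setting of my own choosing:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=40)

# The standard 95% t-interval for the mean
mean, sem = x.mean(), stats.sem(x)
ci_low, ci_high = stats.t.interval(0.95, len(x) - 1, loc=mean, scale=sem)

# Invert the test: which hypothesised means mu0 are *not* rejected at the 5% level?
grid = np.linspace(mean - 1, mean + 1, 2001)
not_rejected = [mu0 for mu0 in grid if stats.ttest_1samp(x, mu0).pvalue >= 0.05]

print(f"95% CI:                     ({ci_low:.3f}, {ci_high:.3f})")
print(f"range of unrejected values: ({min(not_rejected):.3f}, {max(not_rejected):.3f})")
# The two intervals agree, up to the resolution of the grid.
```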
I did think it was a shame not to have a performance interpretation of hypothesis tests articulated here, since “Using this model, I would not frequently be wrong” is an easy counter to Trafimow’s statements. This interpretation does leave you working with models rather than with the world they purport to describe, which is one of Trafimow’s objections, but such statements do in fact speak to the projection of the world onto that model. (Inference also requires more assumptions than just obtaining parameter estimates does, but that goes for any sort of uncertainty quantification.) The other concern is that it starts veering a bit Bayesian. Nonetheless, we do work with models, and I think within-model replicability is still at least a minimal-and-non-trivial requirement. The replication crisis, too, is explicitly stated in performance terms, albeit terms external to any particular model, so some statement that the performance criteria are at least entailed by severe testing would be helpful.
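For what it is worth, here is a hedged sketch of what that performance reading looks like inside a toy model: when the null is true the test errs at roughly its advertised rate, and when a fixed effect is present the within-model replication rate is just the power of the test. The model, effect size and sample size are my own arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, alpha, n_sims = 30, 0.05, 5000

def rejection_rate(true_effect):
    """Fraction of simulated studies, under the assumed normal model, with p < alpha."""
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(loc=true_effect, scale=1.0, size=n)
        if stats.ttest_1samp(x, 0.0).pvalue < alpha:
            hits += 1
    return hits / n_sims

print(f"null true:        rejection rate = {rejection_rate(0.0):.3f}")  # close to alpha
print(f"effect of 0.5 sd: rejection rate = {rejection_rate(0.5):.3f}")  # the within-model replication rate
```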
One statement of Trafimow’s that I would thoroughly get behind, however, is that replications should not just be about hypothesis tests, but need also to include effect sizes. Indeed, I’ve often worried that the narrow focus of statistical methods on single experiments is damaging: by making evidence for specific effects challenging, we make scientific conclusions contingent on the specific experimental setting they were obtained in. This disincentivizes science from developing knowledge that is transferable across situations. Biology and psychology are much more complex than physics, of course, making generalized quantitative effects much more difficult, but I’m not sure that we couldn’t do better — I have a rant about linear models being deleterious for most fields that statisticians have interacted with, but that’s for another day.
In any case, thanks to all participants for having a go. For all I may seem to complain, the debate clearly stimulated my own thinking, and if the same is true for even a fraction of the audience, that’s the best outcome it could have hoped for a priori. Well done.