This is the Weight and Healthcare newsletter! If you like what you are reading, please consider subscribing and/or sharing!
Let’s start today with what p-hacking is, and then we can talk about why we needed to talk about it in the first place. Please note that this is an overview; there are many more layers and complications to this, but these are the basics.
Quick terms list:
Hypothesis – The assumption that the researchers are testing (ie: medication A will lower blood sugar more than a placebo)
Null hypothesis – The assumption that any difference found in the outcome between the treatment group and the placebo group is due to chance, sampling, or experimental error and not because of the intervention
P-value – The result of a statistical test performed on the results of an experiment to see whether the difference between the groups is likely due to the intervention (which supports the hypothesis and suggests that the treatment is effective) or due to chance/error (which supports the null hypothesis and suggests that the treatment is ineffective.) The lower the p-value, the less likely it is that the difference is due to chance alone, and the stronger the evidence against the null hypothesis (taught to research/statistics students through the ages as “if the P is low, the null must go.”)
Statistically significant – Typically, a p-value <.05 is considered “statistically significant.” When the p-value is <.05, the null hypothesis is rejected and the result is taken to support the hypothesis (ie: the p-value was .04, so the impact of Medication A on blood sugar was statistically significant.)
I want to note that statistically significant is NOT the same thing as clinically significant or significant in the way that we would use it colloquially. For example, let’s say an expensive, restrictive, year-long weight loss intervention results in an average weight loss of two pounds. When tested, the p-value is .01. That two-pound loss is statistically significant, but in terms of whether the intervention is worth the cost/risk/difficulty etc. and should be prescribed by doctors - thus, clinically significant - the answer would be no (there’s a rough sketch of this kind of result in the code just after this list.)
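For anyone who likes to see the numbers, here is a minimal sketch of that mismatch. Everything in it is invented by me for illustration (the sample sizes, the two-pound difference, the amount of variation); it is not from any real trial. It just shows how a tiny average difference can come out “statistically significant” simply because the groups are large:

```python
# Minimal sketch with invented data: with large groups, even a ~2 pound
# average difference produces a very small p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 5000  # hypothetical participants per group
# Control group: weight change over a year, in pounds (average 0, lots of variation)
control = rng.normal(loc=0, scale=20, size=n)
# Intervention group: an average loss of just 2 pounds, same variation
intervention = rng.normal(loc=-2, scale=20, size=n)

t_stat, p_value = stats.ttest_ind(intervention, control)
print(f"average difference: {intervention.mean() - control.mean():.1f} pounds")
print(f"p-value: {p_value:.2g}")  # far below .05, so "statistically significant"
```

The p-value here is tiny, but a roughly two-pound difference after a year of an expensive, restrictive intervention is not a clinically meaningful result.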
So what’s p-hacking?
P-hacking, at its most basic level (and it can get pretty complicated!) is using statistical analysis, on purpose or accidentally, to label results statistically significant when they are not.
How do you do it?
P-hacking can be done in a lot of ways. One of the simplest is naming a bunch of primary and/or secondary endpoints for the trial. Typically a researcher is only testing one, or at most a small few, hypotheses. But what if the researcher just names a bunch of different variables and starts testing them all? The problem is that the more variables you test, the more likely you are to get some that pop as statistically significant, even though they aren’t. This type of p-hacking works on probability – the more hypotheses that are tested, the more likely it is that a false positive will occur. This effect can be magnified by having a small number of subjects and a large number of variables.
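If you want to see that probability at work, here’s a rough simulation. The setup is mine, not from any particular study: a “treatment” that does absolutely nothing, a small group per arm, and 18 unrelated variables tested. A surprising share of these fake trials still produce at least one “statistically significant” result:

```python
# Rough simulation (invented setup): the treatment does nothing, but testing
# 18 endpoints still produces "significant" results by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_subjects = 15    # a deliberately small group per arm
n_variables = 18   # number of endpoints tested
n_trials = 2000    # how many fake "studies" to simulate

trials_with_a_hit = 0
for _ in range(n_trials):
    # No real effect: both arms are drawn from the same distribution
    treatment = rng.normal(size=(n_subjects, n_variables))
    placebo = rng.normal(size=(n_subjects, n_variables))
    p_values = stats.ttest_ind(treatment, placebo).pvalue  # one p-value per variable
    if (p_values < 0.05).any():
        trials_with_a_hit += 1

print(f"Fake studies with at least one p < .05: {trials_with_a_hit / n_trials:.0%}")
# With 18 independent tests, roughly 1 - 0.95**18, or about 60%, of studies
# will show at least one "significant" endpoint purely by chance.
```

That’s the probability problem in a nutshell: run enough tests and “significance” shows up whether the treatment works or not.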
This can be paired with selective reporting or publication bias. Pulling back to a wider lens for a moment, one of the things a lot of people aren’t aware of is that tons of research happens and never gets published. This can be because the researchers (and sometimes the funders) choose not to publish the results because they aren’t what they hoped for. It can also be because those who gatekeep what gets published in peer-reviewed journals prefer to publish studies that claim statistically significant results rather than studies that show that an intervention’s effect was not statistically significant. Then there are pay-to-play journals that are not peer-reviewed but will get you published for a fee.
This leads to a different type of bias – if you only publish successes and not studies where the hypothesis was wrong, anyone who is trying to use the existing research to make decisions (like, for example, doctors) is at a disadvantage since they can’t possibly know the full story the research would tell if it was all published.
Zooming back in on p-hacking and selective reporting, sometimes the researchers will run their stats a number of different ways (for example, with and without outliers), but only report the analyses that came out statistically significant. Or they will report the variables that were statistically significant, but not the ones that weren’t. Also, if the statistics don’t fall the way they thought they would/wanted them to, they might frame the extra variables as unexamined or untested, rather than being clear that they were tested and were not statistically significant.
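Here’s a toy sketch of that “run it several ways, report the best one” move. The menu of analyses is invented by me (real menus include subgroups, covariates, transformations, and more), and there is no real effect in the simulated data, yet shopping for the smallest p-value produces “significant” findings well above the 5% you’d expect:

```python
# Toy sketch (invented analysis menu): with no real effect, reporting only
# the most favorable of several analyses inflates the false-positive rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def analysis_menu(treatment, placebo):
    """p-values from several analysis choices an analyst might 'try'."""
    return [
        stats.ttest_ind(treatment, placebo).pvalue,                # everyone
        stats.ttest_ind(treatment[:10], placebo[:10]).pvalue,      # a "subgroup"
        stats.ttest_ind(treatment[10:], placebo[10:]).pvalue,      # another "subgroup"
        stats.mannwhitneyu(treatment, placebo,
                           alternative="two-sided").pvalue,        # a different test
    ]

n_studies = 2000
honest_hits = 0
shopped_hits = 0
for _ in range(n_studies):
    treatment = rng.normal(size=20)   # no real effect in either group
    placebo = rng.normal(size=20)
    p_values = analysis_menu(treatment, placebo)
    honest_hits += p_values[0] < 0.05     # the one pre-specified analysis
    shopped_hits += min(p_values) < 0.05  # report whichever looks best

print(f"'Significant' with the pre-specified analysis: {honest_hits / n_studies:.0%}")
print(f"'Significant' if you shop the menu:            {shopped_hits / n_studies:.0%}")
```

The first number hovers around the advertised 5%; the second comes out noticeably higher, and the bigger the menu of analyses, the worse it gets.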
Another method is known as data trimming. Here, researchers make their own definitions about which data points are considered “outliers” after the data is gathered, and then they simply eliminate the “outliers” from their statistical analysis. It’s kind of like if, instead of throwing out the high and low scores on a figure skating judging panel to eliminate bias, the competitor got to see all the judges’ scores, run all the statistics, and then decide which scores they wanted to include. This gets even worse when researchers fail to report, or fail to report clearly, how they did this or even IF they did it.
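To make the skating analogy concrete, here’s a small simulation (the trimming rule and the numbers are mine) where there is no true effect at all, but the analyst gets to throw out the two most inconvenient values from each group as “outliers” after looking at the data:

```python
# Small simulation (invented trimming rule): deciding what counts as an
# "outlier" after seeing the data makes chance results look significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n_studies = 2000
honest_hits = 0
trimmed_hits = 0
for _ in range(n_studies):
    # No true effect: both groups come from the same distribution
    treatment = rng.normal(size=20)
    control = rng.normal(size=20)

    honest_hits += stats.ttest_ind(treatment, control).pvalue < 0.05

    # Post-hoc "outlier" rule chosen to favor the hypothesis that the
    # treatment lowers the outcome: drop the 2 highest treatment values
    # and the 2 lowest control values, then re-run the test.
    trimmed_treatment = np.sort(treatment)[:-2]
    trimmed_control = np.sort(control)[2:]
    trimmed_hits += stats.ttest_ind(trimmed_treatment, trimmed_control).pvalue < 0.05

print(f"'Significant' with all the data:        {honest_hits / n_studies:.0%}")
print(f"'Significant' after 'outlier' trimming: {trimmed_hits / n_studies:.0%}")
```

With everything included, about 5% of these no-effect studies come out “significant” (that’s what p < .05 means); with self-serving trimming, it’s several times that.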
Data peeking is another common p-hacking technique. In this case the researchers continuously test the data as they are gathered, and slam the brakes on data collection as soon as the results test as statistically significant.
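Here’s what that looks like in a quick simulation (again, the setup is mine): the “treatment” does nothing, but the researchers check the p-value after every 10 subjects per group and stop the moment it dips under .05:

```python
# Quick simulation (invented setup): peeking at the p-value as data comes in
# and stopping at the first "significant" result inflates false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

n_studies = 2000
max_n = 100        # cap on subjects per group
peek_every = 10    # run the test after every 10 subjects per group
false_positives = 0

for _ in range(n_studies):
    treatment = rng.normal(size=max_n)   # no real effect in either group
    control = rng.normal(size=max_n)
    for n in range(peek_every, max_n + 1, peek_every):
        if stats.ttest_ind(treatment[:n], control[:n]).pvalue < 0.05:
            false_positives += 1         # "significant"! stop collecting data
            break

print(f"No-effect studies declared 'significant': {false_positives / n_studies:.0%}")
# Testing once at a pre-specified sample size would give about 5%; peeking
# ten times along the way pushes the false-positive rate far above that.
```

Each extra peek is another chance for noise to cross the .05 line.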
What is the impact of p-hacking?
In general, it means that false information is peer-reviewed and published. It could, for example, make a medication seem to be much more effective at many more things than it actually is.
It’s happened before. In 2015, John Bohannon ran an actual human trial testing whether or not bitter chocolate would cause weight loss. The trial and data were real; he then had a statistician friend use p-hacking to interpret the results. The study used a small group of people and a large number of variables – 18 to be exact! In their case the science was so transparently bad that they avoided peer review and submitted to pay-to-play journals. They got accepted by multiple journals within 24 hours, and for the low price of 600 Euros they were published within two weeks (without re-writing one word.)
They utilized a PR person to learn how to get media attention:
“The key is to exploit journalists’ incredible laziness. If you lay out the information just right, you can shape the story that emerges in the media almost like you were writing those stories yourself. In fact, that’s literally what you’re doing, since many reporters just copied and pasted our text… Rather than tricking journalists, the goal was to lure them with a completely typical press release about a research paper.”
Gosh, this sounds so familiar for some reason… They sent their release to media outlets in multiple countries and they got print and television coverage all over the world. You can read the full story, as told by Bohannon himself, here (content warning for weight loss and diet talk.)
I want to point out that it’s not necessarily laziness (or, at least, not just laziness.) As there are fewer and fewer jobs for journalists, they are being asked to do a lot more with a lot less, including reporting on areas (like science) that they don’t have a deep understanding of. They are also subject to their own implicit and explicit biases, but that’s a subject for another time.
Why p-hack?
There are certainly people out there doing this on purpose because it benefits them, or the people paying them, in some way, but that’s far from the only explanation.
Some people are caught in a “publish or perish” situation in their field where they are under pressure to get research published to keep their position or to advance, so they cut corners.
Then there’s confirmation bias. Researchers are so sure that they know what the outcome should be that they just keep running the numbers until they come out “right.” I wrote about this one in detail here.
How to spot p-hacking
Here are some red flags (note, these don’t mean that p-hacking has happened for sure, but they are, to me at least, red flags.) If there are others that you use, please feel free to put them in the comments!
If a trial has a bunch of primary or secondary endpoints
If a trial runs for an indeterminate rather than a specific time (ie: this trial will run from 2-5 years), especially if it stops at an odd time
If a drug is claimed to work for a bunch of conditions based on a few studies and/or short-term studies (especially if that drug is having trouble getting insurance coverage for its primary use)
If the number of subjects is low, but the number of variables tested is high
If the number of subjects and the number of variables aren’t immediately clear
If you want to go full nerd (and who doesn’t?!) the Berkeley Initiative for Transparency in the Social Sciences has an interesting tool here.
As always, thanks for reading! P-hacking is one of many research manipulation methods. If you’re interested in me writing about more of these, feel free to leave a note in the comments (and if you’d rather have root canal than read more about stuff like this, you can let me know that too!)
Did you find this post helpful? You can subscribe for free to get future posts delivered direct to your inbox, or choose a paid subscription to support the newsletter (and the work that goes into it!) and get special benefits! Click the Subscribe button below for details:
Liked the piece? Share the piece!
More research and resources:
https://haeshealthsheets.com/resources/
*Note on language: I use “fat” as a neutral descriptor as used by the fat activist community, I use “ob*se” and “overw*ight” to acknowledge that these are terms that were created to medicalize and pathologize fat bodies, with roots in racism and specifically anti-Blackness. Please read Sabrina Strings’ Fearing the Black Body – the Racial Origins of Fat Phobia and Da’Shaun Harrison’s Belly of the Beast: The Politics of Anti-Fatness as Anti-Blackness for more on this.