Epidemiology should be designed to measure; epidemiologists usually do it wrong
Epidemiology is only good if it can be put to practical use; practical use requires meaningful effect estimates
[Welcome new readers. I know most of you subscribed due to your interest in tobacco harm reduction, based on my last post. More on that topic for sure, though I will be more often covering a broader collection of methodology and philosophy of science issues than my old THR blog did. In most cases, I think the material will be of interest to those who wish to understand the science related to THR, even if that single topic is the main reason you subscribe.]
Epidemiology research (the quantitative study of health causes and outcomes) is only useful when it measures effects, not just identifies that they exist. We want to know how much exposure E affects outcome D (D is for “disease”, even though the outcome is often not a disease per se). Someone might point out that this overstates the case, and that there are cases where merely learning that E causes D (to an unknown degree) proved genuinely useful. But these are fringe cases, such a tiny portion of all the research that is done, that I will stick with the absolute phrasing.
If you have delved even a little beyond the headlines of epidemiology results, you will have noticed that almost every reported result is presented with an emphasis on “statistical significance” or a sufficiently low “p-value”. In case you are not familiar, these test statistics are a rough heuristic metric for how confident we should be -- based on getting some study result that suggests E causes D -- that the reality of the universe is genuinely not “E does not cause D” (aka “the null hypothesis is correct”), with us having merely had bad luck with the random set of data we drew, which erroneously suggested that the null hypothesis is wrong. You can ignore the actual technical definitions of the test statistics because, despite the impression you might have been given, they have no practical meaning. I meant it when I said it is a rough heuristic metric for the probability that random sampling error is messing with you, not the meaningful precise statistic you might have been led to believe.
So, for example, these test statistics would help you conclude this: If you flip a coin 10 times and get 6 heads, you should not be confident that you are flipping a coin that comes up heads more often than tails, so you would be unwise to reject the null hypothesis of equal probability. But if you get 94 heads from 100 flips, that should make you fairly confident you are not working with a normal coin. (For readers intrigued by the “rough heuristic” observation, and especially those who are making observations like “but those statistics assume your methods are legitimate and perfect, such as that you are not cheating by using your skill at controlling a coin flip”, I hope to circle back to those topics. For now, I am setting aside the huge issues of non-random imperfections in the study methods.)
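For readers who want to see that intuition made concrete, here is a toy calculation of my own (plain Python, standard library only; the function name and numbers are just my sketch of the two scenarios above) showing how surprising each coin-flip result would be if the coin were actually fair:

```python
# A toy calculation of the coin-flip intuition above, using only the standard
# library. For each scenario it computes the chance of getting at least that
# many heads if the coin were actually fair (roughly, a one-sided p-value).
from math import comb

def prob_at_least(heads: int, flips: int) -> float:
    """P(X >= heads) when a fair coin is flipped `flips` times."""
    return sum(comb(flips, k) for k in range(heads, flips + 1)) / 2 ** flips

for heads, flips in [(6, 10), (94, 100)]:
    p = prob_at_least(heads, flips)
    print(f"{heads}/{flips} heads: chance of a result at least this lopsided, "
          f"given a fair coin: {p:.3g}")

# 6 heads out of 10 happens about 38% of the time with a fair coin, so it tells
# you nearly nothing; 94 out of 100 is astronomically unlikely under the null.
```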
It is not a huge problem that these test statistics are not what most people think they are, but there is real harm from focusing on them. Consumers of epidemiology results -- regulators, medical decision makers, individuals seeking to benefit their own health -- approximately never want to know (or at least should almost never want to know) the mere fact that the relationship between E and D is not null. They want to know how much E affects D -- measurement (or more completely, accurate measurement, with accurate assessment of how certain the measure is). So, for example, should you want to take a particular prophylactic medicine because the science shows we should be quite confident it reduces your chance of getting a particular disease? There is a good chance your answer would differ if the effect estimate is that it reduces the risk by 5% versus reducing it by 75%.
This is basic stuff. I would guess that the first time an epidemiology result was ever presented in terms of statistical significance, someone mentioned that the effect estimate was far more useful than just knowing there is an effect. It is certainly the case that about half a century ago, Ken Rothman and Sander Greenland (the founders of the yet-unfulfilled movement called “modern epidemiology”) started making this point with gusto, and every good epidemiology teacher (all 37 of them) hammers on this point in their classes. Yet -- due to a combination of bad teaching, bad learning from example, and bad incentives -- this lesson does not make it into practice. Authors emphasize the fact that a test statistic provides confidence that we should reject the null hypothesis, rather than attending to the implications of the effect estimate; sometimes the effect estimate is not even calculated, just the test statistic. Authors whose effect estimate is 5% declare that, because their result is also statistically significant, they have provided further support for a previous study that had an effect estimate of 75%, even though these results actually wildly contradict each other and probably have very different practical implications.
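To make that 5%-versus-75% point concrete, here is a hypothetical sketch (every count below is invented by me, not taken from any real study) of two cohort studies that both clear the “statistically significant” bar while reporting wildly different effect estimates:

```python
# A hypothetical illustration (all counts invented) of why the effect estimate,
# not the p-value, carries the decision-relevant information. Two made-up
# cohort studies: a huge one finding roughly a 5% risk reduction and a small
# one finding roughly a 75% reduction. Both are "statistically significant".
from math import exp, log, sqrt

def risk_ratio_ci(cases_exp, n_exp, cases_unexp, n_unexp):
    """Risk ratio (exposed vs unexposed) with a Wald-style 95% CI."""
    rr = (cases_exp / n_exp) / (cases_unexp / n_unexp)
    se = sqrt(1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp)
    return rr, exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)

studies = {
    "huge study, ~5% reduction":   (9500, 200_000, 10_000, 200_000),
    "small study, ~75% reduction": (5, 200, 20, 200),
}
for name, counts in studies.items():
    rr, lo, hi = risk_ratio_ci(*counts)
    print(f"{name}: RR = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")

# Both confidence intervals exclude 1.0, so both results get the same
# "significant" stamp, yet a 5% and a 75% risk reduction point to very
# different decisions, and the two estimates contradict each other.
```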
Part of the problem here is the misleading dominance of physics in the mindspace of those who fancy themselves as understanding science. A lot of widely known physics science is characterized by questions of the form “does phenomenon X exist or not”, in which case rejecting the null hypothesis is exactly what we want, and every result that suggests X exists agrees with every other result that suggests X exists. Another part of the problem is that these test statistics were derived in the context of agricultural and manufacturing engineering, where someone wanted to know which choice of tech really seems to improve production (hopefully another topic for a future post). But in epidemiology, the answer to the question “does E change the probability of D” is always “yes”. It is possible to imagine living in a universe that is contracting rather than expanding. It is impossible to imagine that eating apples has exactly zero effect on the chance of an H. sapiens developing a neurodegenerative disease, across all real and potential people. So the only real question is, how much? (In the apples case, the answer is almost certainly “very little”, so we can call it zero for convenience. But the point is that the chance that it is exactly zero is infinitesimal and can be ruled out, so the remaining question is “how much?”)
Needless to say, the failure to derive and/or report the useful information -- an estimated measurement -- renders a great deal of epidemiology work useless. You might argue that knowing that the universe is expanding is useless too. But at least it fulfills a deep-seated curiosity about reality. An epidemiology result that E affects D, without providing useful information about how much, is useless in a much more profound way. It tells us nothing. (Caveat: For any E and D, there will be the initial discovery that there apparently is a substantial effect, and this might(!) be the result of an epidemiology study. That could be useful. But such discoveries constitute about 0.01% of all epidemiology, a less than once-in-a-career event. And even then, it is still much more useful if the results also provide a useful effect measurement.)
This brings us to the biggest project I completed during my sick and away-from-public-writing phase. I will highlight one point from it here and write more about it later. The project was an analysis of survey responses by risk assessors to a collection of interview-like questions that boil down to: What would you like epidemiologists to do better in order to be helpful to you?
For those who do not know, risk assessors are the unsung practical health policy makers who do things like setting maximum allowable exposure levels for particular chemicals in the workplace or the environment. They are not cavalier FDA-type regulators who just ban something if they are a little bit worried about it (and the political winds are consistent with banning). Our material world depends on people working in environments with solvents, metals, and other biologically active chemicals, as well as noise and other health hazards. So regulatory responses cannot be “this exposure causes harm sometimes; ban it!” Regulations must be about maximum allowable exposure, the levels at which protective technology is required, and related nuances. It turns out that these questions are not really different from how we should be looking at decisions about medical treatments, lifestyle choices, and other health-affecting behaviors. But while a physician can (unfortunately) just go with their gut, ignore quantification, and say “I think this treatment is worth the costs in this particular case”, a risk assessor cannot get away with saying “this just feels like a case where setting the exposure limit to 40 parts per million is the right standard.”
The overwhelming sentiment from the risk assessors in our survey is that epidemiologists fail to provide the information that they need to make good decisions. Often a study fails to collect any data that could be informative, no matter what is done with it. But in many cases there is some potentially useful data, yet the epidemiology researchers fail to report their results in a way that allows an assessment of how much exposure causes how much harm. No one is merely deeply curious about whether, say, workplace toluene exposure ever causes hearing loss, like they are about the expansion of the universe. The results are only useful if we know how much harm occurs at what levels of exposure, so we can decide how to manage the risk. A lot of resources go into producing epidemiology results that are profoundly useless.
Many of our risk assessor subjects expressed great frustration with this, and with the resulting need to base their decisions on laboratory studies of rodents instead. Epidemiologists have one job here, and they are not doing it. Frankly, it is not clear that rodent lab studies -- in spite of the allure of the precise and replicable measurements they provide -- are really any more informative than the aforementioned “going with one’s gut”. The unit of study is quite different from an actual human body, the dosing of toxins used in the lab is a poor proxy for actual worldly exposures, and the measured outcomes are seldom exactly the disease outcome of interest. But that is another story. (Also the subjects are cute little sentient beings with their own lives and the ability to suffer, and we should not be torturing them, especially to find out stuff that is only a lousy proxy for what we really want to know. But that is another other story.)
But it gets worse. That toluene and hearing loss example I mentioned is from another project I am currently working on, in which I am playing the role of the consumer of epidemiology results and thus experiencing the frustration of our aforementioned survey respondents. One thing in particular that I noticed (in addition to the general problem of not reporting decision-relevant results) is that some epidemiology researchers in this space deliberately designed their studies to amplify any potential signal, presumably to maximize their chances of getting a statistically significant result. For example, they selected only the candidate study subjects whom they judged to be most likely to experience the effect if it exists. Here we again see the folly of channeling physics science. In cosmology, we often legitimately want to do everything possible to boost the signal because we are looking for something that we can barely see, and if we can manage to see something, then that alone can be sufficient. But if a signal in occupational epidemiology needs to be boosted to even be detected, chances are it is not a priority for risk management. And regarding the present topic, the results of a boosted signal are not a good measure of the real-world typical effects that are needed for making optimal risk management decisions.
Some of the epidemiologists’ signal boosting tactics, like seeking out the population group that is guessed to have the largest risk, are still legitimate science, just not useful science in this context. That method can provide proof of concept. But we already have proof of concept in this case (the same authors who are writing papers that serve no purpose other than proof of concept unironically cite, in their introductions, the existing studies that already prove the concept); this is not one of those 0.01% cases. Sometimes biasing a study toward “vulnerable populations” can suggest that greater protection is needed for classes of higher-risk individuals, but since we don’t even have any idea what might be the right level of protection for the average worker, this is not a useful question yet.
Other signal boosting tactics are not legitimate science, and are full-on dishonest. There are infinitely many choices of what statistical model to run on the data. Choices include everything from how to code a particular variable (e.g., just put the raw value in the model; take the logarithm or make some other monotonic transformation; dichotomize it, which offers a further choice of cutpoints for defining high versus low) to what statistical method to run (e.g., logistic; linear), each of which builds in important assumptions that are ignored in the analysis. If the single reported statistical model was chosen over many attempted alternatives because it generated a clearer signal from the particular data (which will always combine the real effect of E on D and some random scattering around that) -- as often seems to be the case, but is approximately never admitted -- then the reported results will be biased upward. By that I mean that if you applied the same statistical model to a hypothetical new dataset that measured the same worldly events but with new random errors, you would almost certainly get a weaker association of E and D, because the model was designed to “optimize” the result based on the particular random variation that the first dataset had.
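If you want to see that upward bias for yourself, here is a toy simulation of my own construction (invented numbers, nothing to do with any real study): the exposure genuinely has a modest effect, but the cutpoint used to dichotomize it is chosen to maximize the apparent signal, and then that same chosen model is applied to fresh data measuring the same thing:

```python
# A toy simulation (my own construction, not from any real study) of the
# "try many models, keep the best one" problem. The exposure really does have
# a modest effect on the outcome, but the cutpoint used to dichotomize
# exposure is chosen to maximize the apparent high-vs-low difference.
import random
import statistics as stats

random.seed(1)
TRUE_SLOPE = 0.3  # the genuine, modest effect of exposure on the outcome

def make_data(n=200):
    """Simulate one study: exposure levels plus a noisy outcome."""
    exposure = [random.uniform(0, 10) for _ in range(n)]
    outcome = [TRUE_SLOPE * e + random.gauss(0, 3) for e in exposure]
    return exposure, outcome

def high_low_diff(exposure, outcome, cut):
    """Mean outcome in the 'high exposure' group minus the 'low' group."""
    high = [y for e, y in zip(exposure, outcome) if e >= cut]
    low = [y for e, y in zip(exposure, outcome) if e < cut]
    return stats.mean(high) - stats.mean(low)

candidate_cuts = [c / 2 for c in range(2, 19)]  # cutpoints 1.0, 1.5, ..., 9.0

exp1, out1 = make_data()
best_cut = max(candidate_cuts, key=lambda c: high_low_diff(exp1, out1, c))
reported = high_low_diff(exp1, out1, best_cut)

exp2, out2 = make_data()  # a fresh dataset measuring the same thing
replicated = high_low_diff(exp2, out2, best_cut)

print(f"cutpoint chosen because it maximized the signal: {best_cut}")
print(f"reported high-vs-low difference (original data): {reported:.2f}")
print(f"same model applied to fresh data:                {replicated:.2f}")

# The reported estimate is biased upward: the cutpoint was tuned to the first
# dataset's random noise, so on new data the association will typically shrink.
```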
A less subtle version of this is having many different candidate endpoints (e.g., hearing loss as measured by established definition A, B, or C; seven different tested sound frequencies, any of which might show hearing loss while the others do not; separate analyses for different demographics of the population) and focusing all the headline reporting on the one measure that had the most dramatic effect. In my current research, there are papers that show hearing loss at low frequency but not high, and that show it at high frequency but not low, and the naive commentary is that these agree and reinforce the conclusion that there is a problem. No! They flatly contradict each other. At least when this tactic is employed, a careful reader can often find the other results in the body of the paper (“hmm, your title and abstract scream that this exposure is harmful for women, but your results show your data offers comparable support for the conclusion that it is beneficial for men, even though there is no reason to expect a gender difference; did you perchance chop up your population until you found a subset with a result you could market?”). This contrasts with the tactic of trying a lot of statistical models and only reporting the one that had the “best” result, which leaves the reader with no way of knowing the alternative candidate results.
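The same sort of toy simulation (again, entirely invented numbers, not drawn from the hearing-loss literature) shows what the many-endpoints tactic buys you even when every true effect is exactly zero:

```python
# Another toy simulation (invented numbers, not from the hearing-loss
# literature) of the many-endpoints tactic: measure seven endpoints, e.g.
# seven sound frequencies, where the true effect is zero at every one of
# them, and always headline whichever endpoint looks most dramatic.
import random
import statistics as stats

random.seed(2)
N_ENDPOINTS = 7    # e.g., seven tested frequencies
N_STUDIES = 2000   # simulated studies with no true effect anywhere
SE = 1.0           # standard error of each endpoint's effect estimate

false_alarms = 0
headline_sizes = []
for _ in range(N_STUDIES):
    estimates = [random.gauss(0, SE) for _ in range(N_ENDPOINTS)]
    headline = max(estimates, key=abs)         # the "most dramatic" endpoint
    headline_sizes.append(abs(headline))
    if abs(headline) > 1.96 * SE:              # crosses the usual p < 0.05 bar
        false_alarms += 1

print(f"null studies that can still headline a 'significant' endpoint: "
      f"{false_alarms / N_STUDIES:.0%}")
print(f"average size of the headlined estimate: "
      f"{stats.mean(headline_sizes):.2f} standard errors")

# With seven chances, roughly 30% of studies of a completely null effect can
# still trumpet a "significant" finding, and the headlined estimate sits far
# from zero even though every true effect is exactly zero.
```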
If you challenge a typical epidemiology researcher with the observation that these tactics are both bad scientific methodology and outright dishonest, you will generally get a response not of defensiveness, but of obliviousness. The researchers genuinely do not even realize they are doing something sketchy, something that interferes with getting the most accurate effect estimates, and thus being most useful to the world. “What are you talking about? That’s what everybody does!”
It is worth noting that occupational health researchers are pretty much as good as it gets, among non-clinical epidemiologists, in terms of wanting to do good science to be genuinely useful, yet they still have these problems. We are not talking about the pseudo-scientist moralizers of “public health” who blatantly twist their methods in any way possible to “show” that their personal bugaboo (alcohol, vaping, vaccination) is not merely a sin, but also harmful. We are not even talking about subfields like nutritional epidemiology which are so deeply down rabbit holes of other obviously bad methods that their results are pretty worthless, even though they do not seem to be deliberately trying to mislead.
Consider another common flavor of bad measurement. A typical headline from a nutritional epidemiology paper is something like, “eating more broccoli can reduce your risk of heart attack by 27%.” Even ignoring systematic problems, like the study doing a lousy job of measuring broccoli intake and eating more broccoli being strongly associated with eating less meat and with other healthy behaviors that cannot be fully controlled for, that tantalizing observation is probably a measurement with no practical meaning. It is almost always something like, “comparing the heart attack risk for the lowest quintile of broccoli consumption (the 20% of subjects who reported the least consumption) versus the highest quintile, the risk is 27% lower for the latter as compared to the former.” Ok, yeah, it is a measurement rather than no measurement, but that promised 27% reduction is only available to you if you are currently at the lowest end of consumption and you make the rather unlikely transition clear up to the highest end. It is not a measure of the effect of real available actions in the world.
I find myself hearing the voice of an occupational risk assessor reading this and screaming, “What are you complaining about?! I would love to have an epidemiology study that broke out exposure levels by quintiles and reported the effect estimate for each, whatever the stupid headlines said. With that I could assess where to cap the exposure to keep the risk from being too high. The studies I get that just dichotomously compare an ‘exposed’ group to a control group give me nothing useful.” Yeah, ok, fair.
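To make the imagined risk assessor’s point concrete, here is the kind of band-by-band summary they are asking for, with every number (baseline risk, risks per band, and the tolerance for excess risk) invented by me purely for illustration:

```python
# A sketch of the kind of reporting the imagined risk assessor is asking for:
# risk broken out by exposure band, rather than a single exposed-vs-control
# contrast. Every number here is hypothetical, invented for illustration.
BASELINE_RISK = 0.02     # risk of the outcome at negligible exposure
MAX_EXCESS_RISK = 0.01   # the largest excess risk the assessor will tolerate

# (upper bound of the exposure band in ppm, estimated risk in that band)
risk_by_band = [
    (10, 0.021),
    (20, 0.024),
    (40, 0.029),
    (80, 0.041),
    (160, 0.070),
]

for cap, risk in risk_by_band:
    excess = risk - BASELINE_RISK
    verdict = "acceptable" if excess <= MAX_EXCESS_RISK else "too high"
    print(f"exposures up to {cap:>3} ppm: excess risk {excess:.3f} ({verdict})")

# With band-by-band estimates the decision is straightforward: in this made-up
# example the cap would land at 40 ppm. A dichotomous exposed-vs-control
# comparison offers nothing to base that number on.
```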
That imagined dressing down is a reminder that the real key here is that epidemiology needs to be designed to be fit for purpose. Epidemiology’s value comes from providing actionable information, presented in a way that informs decisions. We are not very interested in the mere existence of a phenomenon, as cosmology often is. Nor are we engaged in dichotomous philosophical questions about goodness (sure, philosophy Substack can convince us that saving 10^100 shrimp from torture is morally more important than improving one human life, but in order to act optimally, we need a measure of the goodness of saving a particular finite number). Indeed, epidemiology cannot provide absolute dichotomous answers like those at all, because the results are all relative, only true for a particular population, with particular other exposures, today’s technology, etc. Boiling drinking water can prevent a lot of disease... for many populations now, and for a lot of other populations a century and a half ago, but not today in my neighborhood.
The best decision-informing results will always involve measurement, which is the present point. But that is only one necessary step toward better integrating the needs of the consumers of epidemiology into the design and reporting of the research.