Some Perils of Statistical Education

Let’s talk about some perils of getting educated in statistics1.

As with my discussion on the perils of education, I’m not interested in the obvious things everyone talks about. Statistics themselves are intrinsically perilous, not to mention the classic issues around “How to Lie With Statistics”. But I’m not talking about the issues around misuse of statistics… I’m talking about at least one peril that arises from the very process of becoming educated in them.

Statistics Are Exactly 33 and 1/3rd Percent Interesting

There are three basic cases for statistics:

  1. The signal is so overwhelmingly strong that analysis is almost unnecessary.

  2. The signal is borderline. This is when you need to pull out your statistics, use them carefully, and tease apart all the relevant bits of the data to ensure that you pull only truth from a confusing mass of data.

  3. The signal is so weak that there is no hope of ever finding it no matter how many statistics you throw at the problem.

    It is possible that drinking one cup of coffee a day lowers your chance of gall bladder cancer by precisely 0.00000034%, but if that is the case, the world will never know. No conceivable study could ever be run that would have sufficient statistical power and sufficient confidence in the removal of confounding factors to be able to nail down that number.
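
A quick back-of-the-envelope calculation shows why. Below is a sketch using the standard two-proportion sample-size formula, with a 0.6% baseline lifetime risk assumed purely for illustration (the exact baseline doesn't matter much; the conclusion is the same for any plausible figure):

```python
from math import ceil

# Required sample size per group for a two-sided two-proportion z-test
# at alpha = 0.05 (z = 1.96) with 80% power (z = 0.84).
def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    delta = abs(p1 - p2)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

baseline = 0.006                 # assumed lifetime risk, for illustration
reduction = 0.00000034 / 100     # "0.00000034%" as a fraction
n = n_per_group(baseline, baseline - reduction)
print(f"{n:.2e} participants per coffee-drinking group")
```

That works out to roughly 10^16 participants per group, about a million times the population of Earth, before we even start worrying about confounders.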

The “interesting” one is the second one. It’s the one where we spend pretty much all of our time with our statistics.

It is also the least common case that you will encounter.

In order, the cases you will encounter are:

  1. Signals so weak or nonexistent they will never be teased out of any conceivable data set.
  2. Signals so strong statistical analysis is essentially redundant.
  3. Signals that are present, but require powerful tools, carefully used, to extract.

The first is the almost infinite set of possible connections of no value. This set of hypotheses is far larger than any other, and it is relevant only because its sheer size is enough to cause the problem I mentioned earlier: the overwhelming number of these hypotheses makes it quite hard to keep them out of your analysis when you’re looking for an explanation for some fact. Each is individually unlikely, but there are so many of them that they will find every crack in your methodology, the way rain will find the hole in your roof despite the unlikeliness of any particular raindrop being the one that gets in.

The second are the bread and butter of day-to-day life. If I am thirsty, with very high probability I will be less thirsty if I drink potable water2. If I jump off a roof, I have a very good chance of serious injury. I do not need to jump off of 100 roofs and subject the results to deep statistical analysis to tease out a signal; the signal from even one roof is more than enough.

There’s another useful class of these instances: things that a statistically-uneducated person may not see as having a very strong signal, but which nevertheless do. This is the case I lay out in How To Objectively Tell Who Is Right About The Vaccine: with just a bit of basic statistical math, almost everyone here in late 2021 has the data to determine that something is out of the ordinary and wrong. But this signal lies in the gap between normal human intuition and what is obvious to someone educated in statistics.

Normal human intuition can fail the other way too, of course, treating something as certain when simple statistics shows it is not. A proper education can help avoid these errors as well.

Last and least, we have the “interesting” signals. My heading notwithstanding, these are far less than 33 and 1/3rd percent of the cases.

A subtle point: it is likely that more things in the universe fall into the “interesting” category than into the “strong signal” category. But you will encounter more things in the “strong signal” category, precisely because strong signals are easily discovered. We’re surrounded by “interesting” things, but we’re even more surrounded by hypotheses in the weak category, such that whenever we think we’ve got a theory about an interesting signal, it is more likely to be a weak case.

Or, to put it another way, the very fact that you had to drag out statistics to find a signal is already weak evidence that there is no signal3.

Science And Statistics

I think it’s a major flaw of modern science that so much work is done here on the margins. By all means, well-done science should carefully do the analysis to show that something believed to be a strong signal truly is strong; a professional should show their work. But the sciences ought to spend more of their time on things with enormous statistical significance, where the statistics hardly need to be analyzed at all to show the effect is significant.

Because another peril of a statistics education is confusing “statistically significant” with “important”. This confusion is endemic to the entire modern science machine. Now, I think it may be nearly impossible to get a statistics education without some professor at some point teaching this distinction, maybe even banging on it a bit, but judging by the actions of the science community, it clearly has not had sufficient effect.

If you have to run a multi-decade longitudinal study on hundreds of thousands of participants to produce the result that eating more than 3 eggs a day raises your risk of heart disease by 0.4% +/- 0.2% at a 95% confidence level… who cares? I don’t care how statistically significant that result is; it’s not useful.

But Krymneth, of course it’s useful, that’s a lot of potential heart disease.

No, it isn’t, and the reason is that this is the flip side of the p-hacking problem. On its own, it may seem like an important number, but now consider it as merely one entry in a matrix of all possible food habits vs. all possible impacts each may have on the body if pursued over decades. Suddenly, this one number that seemed so important simply because we were looking at it is lost in a sea of very similar numbers. With enough data, everything on this chart is statistically significant, too.

It is very unlikely that 0.4% is a particularly large or important number on that chart. There are hundreds or thousands of far larger and more important numbers on it. If you could somehow obtain a copy of that chart and list the top 100 most interesting interventions, it is very unlikely that “eat fewer eggs to experience less heart disease” would even come close to making the list. A rational being using this chart to pursue health can’t justify going thousands of rows down the list; diminishing returns would set in long before you got to a 0.4% influence on a condition already influenced far more by many other, more important factors.
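
The “sea of similar numbers” effect is easy to demonstrate. The simulation below builds a hypothetical grid of 2,000 food-habit/health-outcome pairs in which every single effect is pure noise, then counts how many of them nevertheless look “significant”:

```python
import random

random.seed(0)

# A hypothetical grid of 2,000 food-habit vs. health-outcome pairs,
# every one of them pure noise: under a true null hypothesis, the
# p-value of a test is uniformly distributed on [0, 1].
n_tests = 2000
p_values = [random.random() for _ in range(n_tests)]

naive_hits = sum(p < 0.05 for p in p_values)
# Bonferroni correction: demand p < alpha / (number of tests).
corrected_hits = sum(p < 0.05 / n_tests for p in p_values)

print(f"'significant' at p < 0.05: {naive_hits}")
print(f"after Bonferroni correction: {corrected_hits}")
```

On average, about 100 of the 2,000 pure-noise cells clear the naive p < 0.05 bar; corrections that account for the size of the chart, like Bonferroni, are what weed them back out.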

A statistical education, by necessity, is almost entirely about this tiny minority of cases where statistics can miraculously pull signal out of noise and discern whispers in a stadium. It has to be; that’s where all the interesting education is to be had. But it trains a statistician to overestimate both how often that case comes up and the importance of marginal results, by forgetting to contextualize them inside the relevant statistical universe.

I am tarring with a broad brush here. Happily, there are people aware of this problem and working against it. There are a variety of good statistical techniques for contextualizing results against the size of their statistical universe, and for letting you go shopping for hypotheses in a data set while still accounting for the p-hacking issues.

This post is not anti-intellectual or anti-statistics, because there’s a lot of very interesting, mathematically-grounded, and powerful work that has been done in the field. This is not written out of disrespect for statistics, it is written from a place of desiring proper respect for statistics. Just as I said in my first Perils of Education piece, it is a mistake to throw the baby out with the bathwater here and think all statistics is bunk because the discipline is misused so often. The flaw lies in the misuse.

A great example I found in the wild: this YouTube video on Ezekiel’s prophecy about what would happen to Tyre. As the presenter there goes through, many prophecies in the Bible contain many details about the prophesied event. Let’s say 10, just to pick a round number. One can sort of wing it, guesstimate probabilities for the various bits and pieces, and often come up with a fairly small number for the probability of all of them being true, because it doesn’t take very many numbers below 1 multiplied together to become very small. In this case, that Tyre would be overthrown “someday” isn’t a particularly bold prophecy, but the details about how it was done were very specific and thus low probability. Skeptics who understand statistics less well than they think will often peck away at one or two of the specific details, then declare the prophecy disproved, without stopping to think about the fact that there’s still a whackload of signal left. Even if they were successful at turning a prophecy from a one-in-a-trillion shot into a one-in-25-billion shot… that’s not actually all that impressive4.
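
The multiplication involved is simple to sketch. The numbers below are purely illustrative placeholders, not the video’s actual estimates:

```python
from math import prod

# Ten hypothetical prophecy details, each guesstimated at 1-in-10 odds.
details = [0.1] * 10
print(f"all ten details: 1 in {1 / prod(details):,.0f}")

# Concede two details entirely to the skeptic (probability 1 means
# "no signal there at all"): eight details remain.
conceded = [1.0, 1.0] + [0.1] * 8
print(f"eight of ten: 1 in {1 / prod(conceded):,.0f}")
```

Knocking out two of the ten details moves the odds from one in ten billion to one in a hundred million: still a long shot far beyond anything chance plausibly explains.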

You Are Almost A Data Scientist Now

Data science is big these days, but one of the open secrets of the field, which data scientists know but those employing them don’t want to hear, is that data science follows exactly the pattern I outlined above. People expect to be able to throw large data sets at data scientists and say “Find me something interesting. Surely there is something in there!”

But data generally has the same breakdown:

  1. Most of the things you might want to extract from data aren’t there at all. There are even mathematical reasons for this, like the curse of dimensionality. Just because you want a question answered does not obligate the data to answer it.

  2. Then, the true dirty secret of data science: much of the time the answer is so blindingly obvious we hardly needed to do any data science work in the first place. If we take “how often a customer calls our phone support” and “how they rated their support afterwards” and ask for the odds of a customer dropping our product, then the answer is that the customers most likely to drop our product, based on this data set, are the ones who call support a lot and give it poor ratings.

  3. Only after all the questions one could ask about the data fall to the previous two categories do we finally get to the ones where the data scientist uses all their immense training to cleverly extract some non-obvious signal from the pile of data.

    But this is still ultimately a minority case across all data sets and possible questions.
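
The second case is easy to make concrete. The sketch below fabricates a customer data set (every number invented purely for illustration) in which churn is driven by support calls and ratings, then “discovers” exactly what anyone would have guessed:

```python
import random

random.seed(1)

# Synthetic customers, invented for illustration: churn probability
# rises with support calls and falls with the support rating given.
customers = []
for _ in range(10_000):
    calls = random.randint(0, 10)
    rating = random.randint(1, 5)
    p_churn = max(0.0, min(1.0, 0.05 + 0.04 * calls - 0.03 * rating))
    customers.append((calls, rating, random.random() < p_churn))

def churn_rate(rows):
    rows = list(rows)
    return sum(churned for _, _, churned in rows) / len(rows)

unhappy = [c for c in customers if c[0] >= 7 and c[1] <= 2]
happy = [c for c in customers if c[0] <= 2 and c[1] >= 4]
print(f"many calls, low ratings: {churn_rate(unhappy):.0%} churn")
print(f"few calls, high ratings: {churn_rate(happy):.0%} churn")
```

Nobody needed sophisticated modeling to learn that unhappy frequent callers leave.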

With this understanding, you are now, let’s say, 25% of the way towards being a data scientist5. Congrats.

AI Is Mostly Statistics

As a double bonus, AI is mostly statistics, so everything said above about “statistics” applies to AI as well.

The most interesting things about AI are the things where we know there is a signal, we just lack the sophistication to extract them. For instance, speech extraction is a hard problem, but we know that a recorded audio file of someone speaking can have its speech extracted, precisely because we do it as humans. It has taken decades of research and some very sophisticated algorithms to get to where it mostly works, but we always knew there was some way to do it. We have a pretty good idea that a well-equipped “self-driving” car with more sensors than a human has can drive down the road, because a human could drive the car with the same input. It’s just really hard to extract that signal. The techniques to do so are generally labeled “Artificial Intelligence”, though they still amount to a whole bunch of statistics once you understand them.

It’s kind of a case of a sufficient difference in quantity becoming a difference of quality on its own. Statistics as thought of in school is generally extracting one number from some pile of data; AI tends to deal with matrices of numbers in the millions or billions, and does statistics on statistics to the n’th degree, but it’s still statistics, just at a much larger scale.

The world may be full of other problems like that, where we don’t know it’s possible to extract the signal because we can’t already do it as humans, but such problems are almost impossible to find. If we did not already know that speech recordings could be turned into words, we’d probably have had a hard time discovering it. There are too many such possibilities to explore them one by one. If the little “interesting” hypotheses are flanked by numerous nearby “uninteresting” ones, these big interesting matters are flanked by exponentially more wrong hypotheses to test.

So those remain the exceptions. In the meantime, a lot of what is called “AI”, like recommendation engines, remains firmly stuck in the problems discussed above. It’s pretty obvious that recommendation engines, no matter how many fancy statistics are thrown at them, are ultimately just recommending more videos like the ones you’ve seen. YouTube does not see you watch a particular kitten video, note the exact segment you rewound, then see you watch a particular kitchen repair video, do a lot of math, and come to the correct conclusion, through astonishing leaps of logic no human could possibly have kept up with, that you’d like to watch a retrospective on the making of Animaniacs.

It just recommends more cat and kitchen repair videos.

AI can’t find signal where there just isn’t any.

  1. I didn’t expect this to become a series, but it turns out there’s a lot of perils. ↩︎

  2. Note that this is technically not 100%, because it can happen, and has happened, that one may be thirsty and yet drinking does not decrease the thirst. When this happens it is certainly a malfunction in the body’s machinery, but it does happen.

    Nevertheless, the signal here is huge, and no one needs to sit down and do a careful statistical analysis to be sure that in general, drinking potable water reduces thirst.

    Of course I have to specify potable water because it is well known that drinking ocean water increases thirst. ↩︎

  3. I want to emphasize the “had to” here. A scientist may be obligated by a journal to produce a particular analysis, and they may show a p = 0.00000000073 result. But they probably didn’t need statistics to get that result, unless the data set was simply too large for human comprehension. I’m talking about the case where you had to drag out the stats because it’s p = 0.26 and you couldn’t just eyeball the data for significance. ↩︎

  4. By Biblical standards, if the prophecy is wrong at all it has failed. But the skeptics are not operating by Biblical standards. So if someone hits “merely” a 1 in 25 million long shot in a 90% fulfilled prophecy, we may as Christians still be unimpressed, but as humans that still leaves something to be explained! ↩︎

  5. There is a slight chance I may be off by a couple of orders of magnitude there. Dunno. I’m still analyzing the data. ↩︎
