Tuesday, February 19, 2013

Large numbers and small

Most of us have heard or read about the Law of Large Numbers (LLN). In a few words: given an unknown population of things, each thing independent of the others, we can estimate the average value of the 'things' by drawing a 'large' sample from the population. The larger the sample, the closer its average tends to be to the true average.

Example: the age of male adults in the user population
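To make the example concrete, here is a minimal Python sketch. The population is invented for illustration (100,000 ages drawn from a bell curve centered on 40 and clipped to the range 18 to 90), and the names `population`, `true_mean`, and `sample_mean` are hypothetical, so treat the numbers as a demonstration rather than real user data.

```python
import random
import statistics

rng = random.Random(2013)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 adult ages, invented for illustration
population = [min(90, max(18, round(rng.gauss(40, 12)))) for _ in range(100_000)]
true_mean = statistics.mean(population)

def sample_mean(n):
    """Average age in a random sample of n people, drawn without replacement."""
    return statistics.mean(rng.sample(population, n))

# Compare the sample average with the true average as the sample grows
for n in (10, 100, 1_000, 10_000):
    estimate = sample_mean(n)
    print(f"n={n:>6}  sample mean={estimate:6.2f}  error={estimate - true_mean:+.2f}")
```

In a run like this the error typically shrinks as n grows, which is the LLN at work: the sample stands in for the whole population.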

Fair enough. Obviously, in the project world, the LLN saves a ton of money and time. We only need a sample to figure out the population... this is what is behind a lot of testing, but not a seemingly infinite amount of testing. Or, this is what supports polling the user population rather than asking every person in the population whether they like this or that about the product we're developing.

Now comes the Law of Small Numbers, to wit: Small samples yield more extreme results than is typical of the population at large, and small samples produce these results fairly often. 

Who says this? Daniel Kahneman, on page 110 of his recent book, Thinking, Fast and Slow.
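The claim is easy to check with a quick simulation. The sketch below assumes a hypothetical user poll in which exactly half of the population likes a feature, and it calls a poll "extreme" when 30% or fewer, or 70% or more, of the respondents say they like it. The function name `extreme_share`, the thresholds, and the trial count are all assumptions made for illustration.

```python
import random

def extreme_share(sample_size, trials=2_000, true_rate=0.5, seed=110):
    """Fraction of simulated polls whose 'like' rate is <=30% or >=70%."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    extreme = 0
    for _ in range(trials):
        # Each respondent independently likes the feature with probability true_rate
        likes = sum(rng.random() < true_rate for _ in range(sample_size))
        # Integer comparison avoids floating-point edge cases at the thresholds
        if 10 * likes <= 3 * sample_size or 10 * likes >= 7 * sample_size:
            extreme += 1
    return extreme / trials

for n in (10, 100, 1_000):
    print(f"sample size {n:>5}: {extreme_share(n):.1%} of polls look extreme")
```

With only 10 respondents, the exact chance of an extreme poll is 352/1024, or about one in three, even though the true rate is a perfectly unremarkable 50%. With 100 respondents, extreme polls all but disappear.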

Small numbers per se do not cause extreme results; the extreme results are just an artifact of sampling. And a small population may not cause extreme results either. In other words, just because a particular true population is small doesn't mean its average value is extreme.

OK, now we know this... what's the issue here?

The issue is a bias in the way we tend to look at data.

It's the "a bird in the hand is worth two in the bush" thing.

We have a perfectly reasonable bias in favor of certainty over doubt. That is, given the certainty of the data from the small sample, we're more likely to go with it (because it's here and now and in front of us) than doubt it, throw it away, and redesign the experiment.

And many times we'd be very wrong; perhaps extremely wrong, depending on the extremes of the true population.

Bottom line: lies, damn lies, and statistics! Beware those with green eyeshades.