Monday, March 14, 2022

The THREE things to know about statistics

Number One: It's a bell, unless it's not
For nearly all of us, when approaching something statistical, the bell-shaped distribution comes to mind right away. And we know that the average outcome is the value at the peak of the curve.

Why is it so useful that it's the default go-to?  

Because many, if not most, natural phenomena with a bit of randomness tend to have a "central tendency": a preferred state or value. In the absence of outside influence, random outcomes tend to cluster around the center, giving rise to the symmetry about the central value and the idea of "central tendency".

Defaulting to the bell shape really isn't lazy thinking; in fact, it's a useful starting point when there is a paucity of data.
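
To see the clustering for yourself, here is a minimal sketch in Python. The setup is my own assumption for illustration, not project data: each "outcome" is the total of 10 fair dice, so many small random influences add up to one result. The function name simulate_sums and the seed are hypothetical choices.

```python
import random
import statistics

def simulate_sums(trials=10_000, dice=10, seed=42):
    """Return `trials` totals, each the sum of `dice` fair six-sided dice."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.randint(1, 6) for _ in range(dice)) for _ in range(trials)]

totals = simulate_sums()
mean = statistics.mean(totals)
stdev = statistics.pstdev(totals)

# Share of outcomes that land within one standard deviation of the mean.
within_one_sd = sum(abs(t - mean) <= stdev for t in totals) / len(totals)

print(f"mean = {mean:.2f}, standard deviation = {stdev:.2f}")
print(f"share of totals within one standard deviation: {within_one_sd:.0%}")
```

The theoretical average is 35 (10 dice at 3.5 each), and roughly two-thirds of the totals should fall within one standard deviation of it, which is the bell shape showing up without anyone asking for it.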

In an earlier posting, I went at this a different way, linking to a paper on the seven dangers in averages. Perhaps that's worth a re-read.

Number Two: the 80/20 rule, etc.

When there's no average with symmetrical boundaries (in other words, no central tendency), we generally fall back to the 80/20 rule, to wit: 80% of the outcomes are a consequence of 20% of the driving events.

The Pareto distribution, which gives rise to the 80/20 rule, and its close cousin, the Exponential distribution, are the mathematical underpinnings for understanding many project events for which there is no central tendency. (see photo display below) 
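
Where does the "80/20" come from? A small sketch, assuming a pure Pareto distribution with shape parameter alpha greater than 1: the share of the total driven by the largest fraction p of events is p raised to the power (1 - 1/alpha). The function name top_share and the sample alpha values are my own, chosen only for illustration.

```python
import math

def top_share(top_fraction, alpha):
    """Share of the total held by the largest `top_fraction` of outcomes
    under a Pareto distribution with shape `alpha` (requires alpha > 1)."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1, so shares are undefined")
    return top_fraction ** (1 - 1 / alpha)

# The shape value that reproduces 80/20 exactly: log(5) / log(4), about 1.16.
ALPHA_80_20 = math.log(5) / math.log(4)

for alpha in (ALPHA_80_20, 1.5, 2.0, 3.0):
    share = top_share(0.2, alpha)
    print(f"alpha = {alpha:.2f}: top 20% of events drive {share:.0%} of the total")
```

Note that 80/20 is only one member of the family. As alpha grows, the tail gets lighter and the top 20% account for a smaller share, so treat "80/20" as a rule of thumb rather than a constant.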

Jurgen Appelo, an agile business consultant, cites the nature of a customer requirement as an example of the "not-a-bell" phenomenon. His assertion:
The assumption people make is that, when considering change requests or feature requests from customers, they can identify the “average” size of such requests, and calculate “standard” deviations to either side. It is an assumption (and mistake)...  Customer demand is, by nature, an non-linear thing. If you assume that customer demand has an average, based on a limited sample of earlier events, you will inevitably be surprised that some future requests are outside of your expected range

What's next to happen?
Many things that are important to the project manager are not repetitive events that cluster around an average. The question becomes: what's the most likely "next event"? Three distributions that address the "what's next" question are these:

  • The Pareto histogram [commonly used for evaluating low-frequency, high-impact events in the context of many other small-impact events],
  • The Exponential distribution [commonly used for evaluating system device failure probabilities], and
  • The Poisson distribution [commonly used for evaluating arrival rates, like the arrival rate of new requirements].
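
The last two of these can be sketched in a few lines of Python using only the standard formulas. The numbers below are hypothetical, picked just to show the arithmetic: an average of 2 new requirements per month, and a device with a mean time between failures of 500 hours.

```python
import math

def poisson_pmf(k, rate, interval=1.0):
    """P(exactly k arrivals in `interval`) when arrivals average `rate` per unit time."""
    mean = rate * interval
    return math.exp(-mean) * mean ** k / math.factorial(k)

def exponential_cdf(t, rate):
    """P(the next event, such as a failure, happens within time t) at a constant `rate`."""
    return 1 - math.exp(-rate * t)

# Hypothetical inputs, not real project data.
p_none = poisson_pmf(0, rate=2)                                   # no new requirements this month
p_four_or_more = 1 - sum(poisson_pmf(k, rate=2) for k in range(4))  # a busy month
p_fail_100h = exponential_cdf(100, rate=1 / 500)                  # failure in the first 100 hours

print(f"P(no new requirements in a month)   = {p_none:.3f}")
print(f"P(four or more in a month)          = {p_four_or_more:.3f}")
print(f"P(device fails within 100 hours)    = {p_fail_100h:.3f}")
```

Both formulas assume a constant underlying rate. If requirements arrive in bursts, or the device wears out with age, these numbers are only a first approximation.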

Number Three: In the absence of data, guess!
Good grief! Guess?! Yes. But follow a methodology (*):
  • Hypothesize a risk event or risky outcome (this is one part of the guess, namely the probability that the hypothesis is correct)
  • Seek real data or evidence that validates the hypothesis (**)
  • Whatever you find as evidence, or lack thereof, use it to modify or correct the hypothesis so that it comes closer to the available evidence.
  • Repeat as necessary
(*) This methodology is, in effect, a form of Bayesian reasoning, which is useful for risk analysis of single events about which there is little, if any, history to support a bell curve or Pareto analysis. Bayes is about uncertain events that are conditioned by the probability of influencing circumstances, environment, experience, etc. (Your project: Find the Titanic. So, what's the probability that you can find the Titanic at point X, your first guess?)

(**) You can guess at first about what the data should be, but in the absence of any real knowledge, it's 50/50 that you're guessing right. After all, the probability of evidence is conditioned on a correct hypothesis. Indeed, this is commonly called the Bayes likelihood: the probability of evidence given a specific hypothesis.
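
Here is a minimal sketch of one pass through the guess-and-correct loop, using the Titanic search as the example. Everything numeric is an assumption of mine for illustration: three candidate locations X, Y, and Z, a first guess of 50% for X, and an 80% chance that a search finds the wreck if it really is there. The function name update_after_miss is hypothetical.

```python
def update_after_miss(prior, searched, p_detect):
    """Bayes update of location probabilities after searching `searched` and finding nothing.

    prior: dict mapping location -> probability the object is there (sums to 1).
    p_detect: probability the search finds the object if it really is there.
    """
    # Likelihood of the evidence ("nothing found") under each hypothesis.
    likelihood = {loc: (1 - p_detect) if loc == searched else 1.0 for loc in prior}
    unnormalized = {loc: prior[loc] * likelihood[loc] for loc in prior}
    total = sum(unnormalized.values())  # overall probability of finding nothing
    return {loc: p / total for loc, p in unnormalized.items()}

# The guess: hypothetical first estimates of where the wreck lies.
prior = {"X": 0.5, "Y": 0.3, "Z": 0.2}

# The evidence: point X was searched and nothing was found.
posterior = update_after_miss(prior, searched="X", p_detect=0.8)

for loc, p in posterior.items():
    print(f"{loc}: {prior[loc]:.0%} before the search, {p:.0%} after")
```

With these inputs, X drops from 50% to about 17%, while Y rises to 50% and Z to about 33%. The next search goes to Y, and the loop repeats with the posterior as the new guess.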
