Monday, October 20, 2014

The statistics of failure



The Ebola crisis raises the issue of the statistics of failure. Suppose your project is to design the protocols for safe treatment of patients by health workers, or to design the haz-mat suits they wear -- what failure rate would you accept, and what would be your assumptions?

In my latest book, "Managing Project Value", in Chapter 5: Judgment and Decision-making as Value Drivers, I take up the conjunctive and disjunctive risks in complex systems. Here's how I define these $10 words for project management:
  • Conjunctive: equivalent to AND. The risk that not everything will work.
  • Disjunctive: equivalent to OR. The risk that at least one thing will fail.
Here's how to think about this:
  • Conjunctive: the probability that everything will work is way lower than the probability that any one thing will work.
    Example: 25 things have to work for success; each has a 99.9% chance of working (1 failure per thousand). The chance that all 25 will work simultaneously (assuming they all operate independently) is 0.999^25, or about 0.975 (25 failures per thousand).
  • Disjunctive: the probability that at least one thing will fail is way higher than the probability that any one thing will fail.
    Example: 25 things have to work for success; each has 1 chance in a thousand of failing, 0.001. The chance of at least one failure among all 25 is about 0.025, or 25 chances in a thousand* (see the short sketch just after this list).
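Here is a minimal sketch in Python of the arithmetic in the two examples above, using the same illustrative assumptions (25 independent components, each with 0.999 reliability):

```python
# Illustrative only: 25 independent components, each with a 0.999
# probability of working, matching the examples above.
n = 25
p_component_works = 0.999

# Conjunctive (AND): probability that ALL components work
p_all_work = p_component_works ** n        # ~0.975

# Disjunctive (OR): probability that AT LEAST ONE component fails
p_any_fail = 1 - p_all_work                # ~0.025

print(f"P(all {n} work)       = {p_all_work:.3f}")
print(f"P(at least one fails) = {p_any_fail:.3f}")
```

Note that the two numbers are complements of each other, which is why both views lead to the same conclusion.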
So, whether you come at it conjunctively or disjunctively, you get the same answer: complex systems are way more vulnerable than any one of their parts. So... to get really good system reliability, you have to be nearly perfect with every component.

Introduce the human factor

So, now we come to the juncture of humans and systems. Suffice it to say humans don't work to three-nines reliability. Thus, we need security in depth: if an operator blows through one safeguard, there's another one to catch the error.
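A rough, hypothetical illustration of why layered safeguards help -- assume (purely for illustration) that each independent layer is breached 1 time in 100 when challenged:

```python
# Hypothetical numbers for illustration: two independent safeguards,
# each breached 1 time in 100 when challenged.
p_breach_single = 0.01

# Harm gets through only if BOTH layers fail
p_breach_both = p_breach_single * p_breach_single   # 0.0001, or 1 in 10,000

print(f"Single layer breached: {p_breach_single:.4f}")
print(f"Both layers breached:  {p_breach_both:.4f}")
```

Of course, the multiplication only holds to the extent the layers really are independent; a common-cause failure can defeat both at once.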

John Villasenor has a very thoughtful post (and no math!) on this very point: "Statistics Lessons: Why blaming health care workers who get Ebola is wrong". His point: hey, it isn't all going to work all the time! Didn't we know that? We should have, of course.

Dr. Villasenor writes:
... blaming health workers who contract Ebola sidesteps the statistical elephant in the room: The protocol ... appears not to recognize the probabilities involved as the number of contacts between health workers and Ebola patients continues to grow.

This is because if you do something once that has a very low probability of a very negative consequence, your risks of harm are low. But if you repeat that activity many times, the laws of probability ... will eventually catch up with you.
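To put that point in rough numbers (the per-contact figure below is purely illustrative, not an estimate of actual transmission risk), assume a small probability of a protocol failure on any one patient contact and watch the cumulative risk grow with repetition:

```python
# Illustrative only: assume a hypothetical 1-in-1,000 chance of a
# protocol failure per patient contact, independent contact to contact.
p_fail_per_contact = 0.001

for n_contacts in (1, 10, 100, 500, 1000):
    # Probability of at least one failure over n_contacts exposures
    p_at_least_one = 1 - (1 - p_fail_per_contact) ** n_contacts
    print(f"{n_contacts:5d} contacts -> P(at least one failure) = {p_at_least_one:.3f}")
```

At 1,000 contacts the cumulative chance of at least one failure is roughly 63%, even though each individual contact looked nearly risk-free.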

And in another related posting, Villasenor writes about what lessons we can learn about critical infrastructure security. He posits:
  • We're way out of balance on how much information we collect and who can possibly use it effectively; indeed, the information overload may damage decision-making
  • Moving directly to blame the human element often takes latent system issues off the table
  • Infrastructure vulnerabilities arise from accidents as well as premeditated threats
  • The human element is vitally important to making complex systems work properly
  • Complex systems can fail when the assumptions of users and designers are mismatched
That last one screams for the embedded user during development --


*For those interested in the details, this issue is governed by the binomial distribution, which tells us the probability of seeing a given number of failures (or successes) among many independent events. You can compute a binomial probability on a spreadsheet with the built-in binomial formula relatively easily.
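For the curious, a minimal check of that footnote using the binomial distribution directly (this sketch uses Python's standard-library math.comb; a spreadsheet's built-in binomial function gives the same result):

```python
from math import comb

n = 25          # independent components
p_fail = 0.001  # per-component probability of failure

# Binomial probability of exactly k failures among n components
def p_exactly(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_zero_failures = p_exactly(0, n, p_fail)   # ~0.975
p_at_least_one = 1 - p_zero_failures        # ~0.025

print(f"P(zero failures)      = {p_zero_failures:.4f}")
print(f"P(at least one fails) = {p_at_least_one:.4f}")
```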


