Musings on project management: Cascading risks

Monday, December 15, 2014

Cascading risks

Everyone who's done risk management failure analysis for a while on large scale systems has likely done it the "reductionist" way with failure mode analysis, decomposition or reduction into trees of interconnected components, and the like. All good stuff, to be sure, and reduction methods fit the model of subsystems with defined interfaces, or if in the software domain, then: API (application programming interfaces)

Now comes a posting from Matthew Squair with a keen observation: Rather than a hierarchy of subsystems that is the architecture of large scale systems, we are more likely to see subsystems as networks with perhaps multiple interconnecting nodes. Now, the static models used in reductionist methods may not predict the most important failures.

"You see the way of human inventiveness is to see the potential of things and race ahead to realise them, even though our actual understanding lags behind. As it was with the development of steam engines, so too with the development of increasingly interdependent and critical networks. Understanding it seems is always playing catchup with ability ...

A fundamental property of interdependent networks is that failure of nodes in one network can cause failures of dependent nodes in other networks. These failures can then recurse and cascade across the systems. "

Of course, such an idea has been in the PMI Risk Practice Manual for some time in the guise of linked and cascading cause-effects where myriad small effects add up a big problem. And in the scheduling community we've recognized for years that static models, like the PERT model, are greatly inferior to dynamic models for schedule network analysis for just the point Squair makes: performance at nodes is often the Achilles' heel of schedule networks; no static model is going to pick it up and report it properly.

Squair goes on to tell us:

"... we find that redundant systems were significantly degraded or knocked out, not by direct fragment damage but by the failure of other systems."

Ooops! Attention project managers: what this guy is telling us is that it's not enough to right down a hierarchy of requirements, and in the software domain its really not enough to write a bunch of user stories... if that's all you do, you'll miss too important elements

Architecture that doesn't fail is a lot more complex than just segregating data from function, or bolting one subsystem to another
Requirement "statements" and "gather requirements" rarely give you the dynamic requirements that are the stuff of real system performance

What to do: of course a reference model is always great, especially if its a dynamically testable model; if you don't have a reference model, then design test procedures and protocols that are dynamic, ever expanding as you "integrate" so that you've got a more accurate prediction of the largest scale possible.