Tim Falzone and Ben Treynor Sloss at Google:

In the face of increasing system complexity and emerging challenges, we at Google are always asking ourselves: what’s next? How can we continue to push the boundaries of reliability and safety?

To address these challenges, Google SRE has embraced systems theory and control theory. We have adopted the STAMP (System-Theoretic Accident Model and Processes) framework, developed by Professor Nancy Leveson at MIT, which shifts the focus from preventing individual component failures to understanding and managing complex system interactions.

The root cause assigned to a system failure is often subjective, because it depends on the questions the investigators ask. Asking different questions leads to different outcomes:

Instead of asking “What software service failed?” we ask “What interactions between parts of the system were inadequately controlled?” In complex systems, most accidents result from interactions between components that are all functioning as designed, but collectively produce an unsafe state.

The concept of a system entering a hazard state is a useful one.

Hazard states are not system failures; they are unsafe conditions that can lead to failures. Automated and manual tooling that detects when the system is in a hazard state can help prevent disasters.
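To make the idea concrete, here is a minimal sketch of a hazard-state check in Python. It is not taken from the Google article: the `SystemState` fields, the 20% headroom threshold, and the hazard name are all hypothetical, chosen only to illustrate a state in which nothing has failed and yet the combination of conditions is unsafe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SystemState:
    """Hypothetical snapshot of a system; the fields are illustrative only."""
    rollout_in_progress: bool
    spare_capacity_fraction: float  # 0.0 to 1.0
    failed_components: int


@dataclass(frozen=True)
class Hazard:
    """A named unsafe condition, defined over the whole system state."""
    name: str
    is_active: Callable[[SystemState], bool]


# Assumed example hazard: rolling out a change while headroom is low.
# Neither condition is a failure on its own; together they are unsafe.
HAZARDS: List[Hazard] = [
    Hazard(
        name="insufficient headroom during rollout",
        is_active=lambda s: s.rollout_in_progress
        and s.spare_capacity_fraction < 0.2,
    ),
]


def active_hazards(state: SystemState, hazards: List[Hazard]) -> List[str]:
    """Return the names of all hazards that hold in the given state."""
    return [h.name for h in hazards if h.is_active(state)]


if __name__ == "__main__":
    # No component has failed, yet the system is in a hazard state.
    state = SystemState(
        rollout_in_progress=True,
        spare_capacity_fraction=0.1,
        failed_components=0,
    )
    for name in active_hazards(state, HAZARDS):
        print(f"HAZARD: {name}")
```

The point of the sketch is that the check looks at the interaction between conditions rather than at component health: `failed_components` is zero throughout, so a monitor that only watched for failures would stay silent.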
