Last week, I mistakenly imported a TypeScript file using letter case that differed from the file name on disk. Some tools accepted the import. The import-x/extensions ESLint rule failed because of the case mismatch, but its error message gave no hint that letter case was the problem.
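A minimal sketch of the kind of check that would have pointed at the real problem, assuming the directory's entries are already available as a list of names. The `checkImportCase` function and the `CaseCheck` type are hypothetical and not part of ESLint, import-x, or TypeScript. Comparing with `toLowerCase()` only approximates how case-insensitive file systems fold names.

```typescript
// Hypothetical helper: the names and the return shape are illustrative only.
type CaseCheck =
  | { kind: "exact" }
  | { kind: "case-mismatch"; onDisk: string }
  | { kind: "missing" };

// Compare a requested file name against the names actually present in a directory.
function checkImportCase(entries: readonly string[], requested: string): CaseCheck {
  // An exact match means the import is fine on any file system.
  if (entries.includes(requested)) {
    return { kind: "exact" };
  }
  // Otherwise look for an entry that differs only by letter case.
  // Assumption: simple lowercasing is close enough for this illustration.
  const lower = requested.toLowerCase();
  const match = entries.find((entry) => entry.toLowerCase() === lower);
  return match !== undefined
    ? { kind: "case-mismatch", onDisk: match }
    : { kind: "missing" };
}

// Example: the file on disk is "Button.ts", but the import asked for "button.ts".
const result = checkImportCase(["Button.ts", "index.ts"], "button.ts");
if (result.kind === "case-mismatch") {
  console.log(`Did you mean "${result.onDisk}"? The names differ only by letter case.`);
}
```

In Node, the `entries` list could come from `fs.readdirSync(dir)`, which returns names as they are stored on disk.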
In the face of increasing system complexity and emerging challenges, we at Google are always asking ourselves: what’s next? How can we continue to push the boundaries of reliability and safety?
To address these challenges, Google SRE has embraced systems theory and control theory. We have adopted the STAMP (System-Theoretic Accident Model and Processes) framework, developed by Professor Nancy Leveson at MIT, which shifts the focus from preventing individual component failures to understanding and managing complex system interactions.
System failures often have subjective root causes. Asking different questions leads to different outcomes:
Instead of asking “What software service failed?” we ask “What interactions between parts of the system were inadequately controlled?” In complex systems, most accidents result from interactions between components that are all functioning as designed, but collectively produce an unsafe state.
The concept of a system entering a hazard state is a useful one.
Hazard states are not system failures; they are unsafe conditions that can lead to failures. If automated and manual tools maintain awareness of when the system is in a hazard state, they can help prevent disasters.
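To make the distinction concrete, here is a minimal sketch of a hazard check that is separate from failure detection. Everything in it is an assumption made for illustration: the `SystemState` fields, the `Hazard` interface, the `activeHazards` function, and the example hazard are hypothetical and are not taken from STAMP or from Google's tooling.

```typescript
// All names and conditions here are hypothetical, chosen only for illustration.
interface SystemState {
  healthyReplicas: number;
  requiredReplicas: number; // minimum needed to serve the current load
  rolloutInProgress: boolean;
}

interface Hazard {
  name: string;
  // True when the system is in this unsafe condition, even if nothing has failed yet.
  holds: (state: SystemState) => boolean;
}

const hazards: Hazard[] = [
  {
    // Every component works as designed, but the combination leaves no margin:
    // a rollout is taking replicas out of service while there is no spare capacity.
    name: "no-spare-capacity-during-rollout",
    holds: (s) => s.rolloutInProgress && s.healthyReplicas <= s.requiredReplicas,
  },
];

// Return the names of all hazards the system is currently in.
function activeHazards(state: SystemState, known: readonly Hazard[] = hazards): string[] {
  return known.filter((h) => h.holds(state)).map((h) => h.name);
}

// Nothing has failed here: all required replicas are healthy and serving.
const current: SystemState = { healthyReplicas: 3, requiredReplicas: 3, rolloutInProgress: true };
console.log(activeHazards(current)); // prints [ 'no-spare-capacity-during-rollout' ]
```

A check like this reports an unsafe condition while the service is still healthy, which is the point at which a tool or an operator can still act, for example by pausing the rollout.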
This blog series creates a small operating system in the Rust programming language. Each post is a small tutorial and includes all needed code, so you can follow along if you like.