In brief
A real-world control system disruption revealed how quickly confidence can erode after recovery
When production resumes, unresolved dependencies and temporary decisions often remain
Hesitation after a failure is natural—but stability isn’t the same as safety
Stewardship means sequencing decisions after the crisis, not reacting all at once
When Assumptions Meet Reality
The incident didn’t start with a control system problem.
It started with a sudden power failure.
The plant lost power abruptly—everything went dark, including the physical servers running the control system. Like many facilities, the plant had a UPS in place to protect against exactly this scenario. It was assumed to work. It had always been there. Nothing in daily operation suggested otherwise.
But when the power dropped, the UPS didn’t support the load. The servers lost power without a clean shutdown, corrupting their drives. When power was restored, the server hardware was unrecoverable.
At that point, the plant faced a difficult reality. They had backups of their control system—but no working server to restore them to. There was no redundant server on site and no spare hardware immediately available. Operations were completely offline while the team scrambled to locate replacement equipment from other sites.
What made the difference was not perfection, but preparation in one critical area.
An operator had recently taken backups of the operator station virtual machines and saved them to a separate NAS location—not on the failed server. A recent archive had also been copied there. The backups were current, intact, and accessible. What was missing was a place to run them.
While the plant worked to source hardware, our team restored the operator station VM on a laptop in our office and configured it to run locally. That laptop was shipped overnight, allowing the plant to bring up a single operator station and regain visibility and control within roughly 24 hours.
It wasn’t ideal, and it was never meant to be permanent. But it created breathing room.
About a week later, the plant repurposed an older server from another site and restored additional functionality. The temporary operator station didn’t disappear. Instead, it continued operating alongside the repurposed hardware, resulting in a patchwork system held together by short-term fixes that were never intended to coexist.
The plant was producing again—but everyone involved understood the situation for what it was: a fragile middle state.
That’s where the conversation needs to shift.
What the Disruption Actually Revealed
Disruptions like this rarely generate entirely new problems. More often, they expose conditions that already existed but hadn’t been brought to light.
Before the event, the system was working. Production was running. There was no clear reason to question the underlying protections in place. That’s what made the failure so destabilizing: it exposed reliance on a critical safeguard that hadn’t been actively verified—specifically, a UPS expected to protect the servers during a power loss—combined with the absence of redundancy to keep operations running when that protection failed.
Most systems don’t become risky because they stop functioning. They become risky when key dependencies are assumed rather than understood. When shutdown protections, recovery paths, or backup strategies exist largely on the basis of expectation rather than observation, confidence erodes. Changes take longer to validate. Fewer people feel certain about how different parts of the system interact under stress.
The system may continue to run—but clarity fades.
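One practical way to replace expectation with observation is to check the protections themselves on a schedule rather than waiting for an outage to test them. The sketch below shows one possible shape for that check, assuming Network UPS Tools (NUT) is installed and the UPS is configured under the hypothetical name rackups; the thresholds are illustrative.

```python
# Minimal sketch: poll a UPS via Network UPS Tools (NUT) and flag conditions
# that would otherwise go unnoticed until a real outage. Assumes NUT is
# installed, `upsc` is on PATH, and the UPS is configured as "rackups"
# (a hypothetical name; substitute your own).
import subprocess

UPS_NAME = "rackups@localhost"  # hypothetical NUT identifier


def query(variable: str) -> str:
    """Return a single NUT variable, e.g. 'ups.status' or 'battery.charge'."""
    result = subprocess.run(
        ["upsc", UPS_NAME, variable],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def check_ups() -> list[str]:
    warnings = []
    status = query("ups.status")              # e.g. "OL", "OB DISCHRG", "OL LB"
    charge = float(query("battery.charge"))   # percent
    load = float(query("ups.load"))           # percent of rated capacity

    if "OB" in status:
        warnings.append("UPS is running on battery")
    if "LB" in status or charge < 50:
        warnings.append(f"Battery charge low: {charge:.0f}%")
    if load > 80:
        warnings.append(f"UPS load is {load:.0f}% of capacity; "
                        "it may not carry the servers through an outage")
    return warnings


if __name__ == "__main__":
    for warning in check_ups():
        print("WARNING:", warning)  # in practice, route to existing alerting
```

Run routinely and fed into whatever alerting already exists, a check like this turns “the UPS has always been there” into numbers someone actually looks at.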
Over time, this loss of clarity accumulates in reasonable ways. Redundancies are added in some areas but not others. Interfaces are layered in. Infrastructure, networks, and control logic evolve at different rates. Backup locations change. Temporary decisions persist. Each step makes sense on its own. What’s often missing is a shared understanding of how those decisions behave together when conditions are no longer ideal.
That’s why these issues tend to surface during disruptions rather than steady operation. Power loss, abrupt shutdowns, or hardware failures don’t create new risks so much as force existing assumptions to prove themselves. Where backups live, what hardware they depend on, and which protections are truly effective suddenly matter.
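The same principle applies to backups: knowing where they live is only useful if someone can also confirm they are recent and intact before they are needed. A minimal sketch, assuming the backups sit on a mounted NAS share at a hypothetical path, are stored as VM images, and have a checksum manifest written alongside them:

```python
# Minimal sketch: confirm that backups on a NAS share are recent and intact.
# The mount point, file pattern, and manifest name are hypothetical; adjust
# them to match how your backups are actually stored.
import hashlib
import time
from pathlib import Path

BACKUP_DIR = Path("/mnt/nas/operator-station-backups")  # hypothetical mount
MAX_AGE_DAYS = 7


def newest_backup(directory: Path) -> Path | None:
    candidates = sorted(directory.glob("*.vhdx"),  # assumed VM image format
                        key=lambda p: p.stat().st_mtime, reverse=True)
    return candidates[0] if candidates else None


def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backups() -> list[str]:
    problems = []
    latest = newest_backup(BACKUP_DIR)
    if latest is None:
        return ["No backups found on the NAS share"]

    age_days = (time.time() - latest.stat().st_mtime) / 86400
    if age_days > MAX_AGE_DAYS:
        problems.append(f"Newest backup is {age_days:.1f} days old")

    # Assumed manifest in sha256sum format, written when the backup was taken.
    manifest = BACKUP_DIR / "checksums.sha256"
    if manifest.exists():
        expected = dict(line.split()[::-1]
                        for line in manifest.read_text().splitlines() if line.strip())
        recorded = expected.get(latest.name)
        if recorded and recorded != sha256(latest):
            problems.append(f"Checksum mismatch for {latest.name}")
    else:
        problems.append("No checksum manifest found; integrity can't be verified")
    return problems


if __name__ == "__main__":
    for problem in check_backups():
        print("WARNING:", problem)
```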
Key takeaway
If this scenario repeated itself, how quickly could the system recover—and how confident would the team be in the recovery process?
Why Change Feels Risky After a Failure
Once stability is restored, teams often feel conflicted about what comes next.
On one hand, the disruption has exposed weaknesses that can’t be ignored. On the other hand, the system is running again—often through careful, hard-won effort and temporary measures that were never intended to be permanent. That tension shapes how decisions get made.
For operators and maintenance teams, change isn’t abstract. It’s experienced through startups, alarms, recovery procedures, and day-to-day interaction with the system. When recovery depends on workarounds or improvised steps, caution isn’t resistance—it’s a rational response to lived experience.
In situations like this, the most stressful moment isn’t always the failure itself. It’s the uncertainty around recovery. What will come back cleanly? What won’t? Which steps are repeatable, and which rely on people being in the right place at the right time? Those questions linger long after production resumes.
This is why technically sound changes can still feel risky. A modification may make sense on paper, but if it alters a fragile recovery path or introduces unfamiliar behavior during startup or shutdown, it can erode confidence.
Key takeaway
After a disruption, it’s tempting to avoid further change once the system is running again.
But mistaking short-term stability for long-term safety can quietly create new risks.
Why Sequencing Matters After the Crisis
During a disruption, priorities are usually clear.
Production is down. Visibility is lost. The immediate goal is to restore control by whatever means are available. Decisions are made quickly, often creatively, and with full awareness that they’re temporary. In those moments, sequencing happens naturally—because survival demands it.
The challenge comes later.
Once production resumes and the system is running again, it’s easy to assume the disruption is over. The alarms quiet down. The pressure lifts. And the decisions that were clearly temporary during the crisis begin to feel “good enough” to leave alone.
That’s where sequencing truly matters.
Sequencing isn’t just about breaking large objectives into smaller steps or choosing priorities over scope. It’s about recognizing that the end of the crisis is not the end of the disruption. The system may be operating, but it’s often doing so through a fragile mix of workarounds, compromises, and assumptions that were never meant to persist.
In the situation described earlier, the initial sequence was appropriate: restore production first, regain visibility, and stabilize operations enough to move out of immediate danger. A more durable stopgap followed when the older server was repurposed, easing the most fragile temporary measures even as others stayed in place. Those were the right decisions at the time.
What remains unresolved is what comes next.
This is the phase where sequencing needs to continue—beyond crisis response and into deliberate decisions about recovery, redundancy, and risk. When that next phase doesn’t happen, temporary solutions quietly become permanent, and the system carries forward all the uncertainty that the disruption revealed.
Sequencing in this context means asking: what decisions were deferred, and how long can they safely remain that way? It creates space to address recovery behavior, clarify dependencies, and reduce risk before the next disruption forces action again.
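One lightweight way to keep those deferred decisions visible is to record each one with an owner and an explicit date past which “temporary” is no longer acceptable. A small illustrative sketch, with invented entries, just to show the shape:

```python
# Illustrative sketch: a register of decisions deferred during recovery,
# each with an owner and a date past which "temporary" stops being safe.
# The entries below are invented examples, not the plant's actual list.
from dataclasses import dataclass
from datetime import date


@dataclass
class DeferredDecision:
    description: str
    owner: str
    safe_until: date  # the point where the workaround needs a real answer


REGISTER = [
    DeferredDecision("Replace laptop-hosted operator station", "Controls lead", date(2024, 9, 1)),
    DeferredDecision("Load-test the replacement UPS", "Maintenance", date(2024, 8, 1)),
    DeferredDecision("Document the VM restore procedure", "Operations", date(2024, 10, 1)),
]


def overdue(register: list[DeferredDecision], today: date) -> list[DeferredDecision]:
    """Decisions whose deferral window has already closed, most urgent first."""
    return sorted((d for d in register if d.safe_until <= today),
                  key=lambda d: d.safe_until)


if __name__ == "__main__":
    for item in overdue(REGISTER, date.today()):
        print(f"OVERDUE since {item.safe_until}: {item.description} ({item.owner})")
```

The tooling matters far less than the habit: every workaround gets a name, an owner, and an expiry date, so nothing becomes permanent by default.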
Key takeaway
After a disruption, the biggest risk isn’t the decisions made during the crisis—it’s assuming the disruption ended when production came back online.
Closing: When the System Is Running, But the Questions Aren’t Gone
Disruptions don’t end cleanly.
Even after production resumes, confidence is often thinner than before. Trust in recovery behavior has been tested, and the idea of making additional changes can feel riskier than leaving things alone. That hesitation isn’t a failure of leadership or stewardship—it’s a natural response to having just lived through uncertainty.
But a running system isn’t always a resolved one.
Temporary decisions remain in place. Dependencies that were exposed haven’t disappeared. They’ve simply stopped demanding attention. What changes in this phase isn’t the configuration—it’s the margin for error.
Stewardship isn’t about moving quickly or fixing everything at once. It’s about recognizing that recovery created a narrow window to restore clarity, rebuild confidence in how the system behaves under stress, and decide what comes next before the next disruption forces the issue.
The goal isn’t perfection. It’s making sure that if the same situation happened again, recovery wouldn’t depend on luck, improvised solutions, or borrowed time.
That’s how disruptions become turning points—not because teams bounced back, but because they chose to act on what the disruption revealed while they still had the chance.