#3797: How Self-Reverting Watchdogs Save Broken SSH Sessions

A dead man's switch for server configs that automatically rolls back risky changes when connectivity drops.

Episode Details

Episode ID: MWP-3976
Published: Jun 21
Duration: 33:34
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: fault-tolerance networking ai-agents

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The self-reverting watchdog is a dead man's switch for server configuration changes — a script that applies a modification, starts a timer, and automatically rolls back to the last known good state if you don't confirm connectivity within a set window. The pattern originated in railway safety and has been standard in network equipment for decades, but server admins rarely have built-in tooling for it. The basic structure has three phases: a pre-flight snapshot that captures current configs and validates them, applying the change and arming a background countdown, and a confirmation handshake through a channel separate from the main SSH session. The snapshot must be genuinely restorable — syntax-checked and tested — because it's your parachute. The confirmation channel is where most implementations fail: if it relies on the same SSH connection that dies with the network change, you've built an automated failure cascade. Solutions include secondary SSH services on untouched management IPs, webhook callbacks, external monitoring nodes that ping the server and report back, and the simplest approach — a file flag where you SSH in through the new connection and create a keep file before the timer expires. As AI agents increasingly make autonomous infrastructure changes without human hesitation or adrenaline, this pattern becomes essential: a twenty-line bash script can save six hours of driving to a colocation facility.

Daniel sent us this one — and I think every sysadmin who's ever stared at a frozen SSH session knows the exact feeling he's describing. You're working on a remote server, maybe two thousand miles away, you tweak the static IP or change a bridge mode setting, hit apply, and the terminal just stops. That sinking moment where you don't know if the machine is rebooting or if you just turned it into a very expensive paperweight. It's that half-second of silence that stretches into something geological. You start doing the mental math — how long would it take to drive there? Do I even have physical access? Who has the keycard this week? Daniel's asking about self-reverting watchdogs — these scripts that can automatically undo a risky change if you don't confirm it worked. What are they, how do you build them, and how do they work when you're going through an AI agent instead of SSH directly?

That frozen terminal is the universal admission that you've just made a terrible mistake. And the worst part is, you usually know it about half a second after you press enter. The command finishes, the cursor blinks once, and then nothing. That silence has a texture. It's the texture of regret.

Of course it has a texture. Regret apparently has a mouthfeel now.

It does when you're on hour three of a network migration you thought would take fifteen minutes. But here's the thing that makes this more urgent than it used to be. We are increasingly handing root access and configuration authority to AI agents that run changes autonomously. And an AI doesn't have the sinking feeling. It doesn't pause and think, hmm, that cursor's been blinking a while. It just moves on to the next command, blissfully unaware that it's locked itself out of the box. There's no adrenaline gland in a language model. No little voice that says "maybe don't."

The blast radius of a bad networking change is no longer just the human who fat-fingered a netmask. It's an agent that might run twenty more commands in that disconnected state before anyone notices. And I want to pause on that, because twenty commands is conservative. If it's a CI/CD pipeline agent doing a rolling update across a cluster, it might hit every node before the monitoring system even pages you. You wake up to a dashboard that's just red squares, every single one, and the agent's logs show it happily reporting success on all of them because it never knew it was talking to ghosts. And the traditional rescue options aren't always there. If you've got a KVM or IPMI, great, you're covered. If you're physically at the machine with a serial console, also great. But if you're at a coffee shop twelve hundred miles from the data center, those aren't options.

And that's what the prompt is getting at. The self-reverting watchdog is exactly the pattern for this situation. It's a script that applies a change, starts a timer, and if you don't actively confirm "yes, I can still reach this machine" within a set window, it automatically rolls everything back to the last known good state. Think of it as a dead man's switch for configuration changes — your hand stays on the lever while the train is moving, and if you let go, the brakes slam on.

Which is not the same as throwing a sleep command at it and hoping for the best.

No, and this is where a lot of people go wrong. They think, I'll use tmux or screen, or I'll set a quick timeout, and if something breaks I can reconnect. But if you've broken the network stack itself — if the kernel is now refusing to route traffic, or your new IP doesn't exist on the subnet — no userspace session is saving you. The kernel drops the connection. You can't reconnect because the packets literally have nowhere to go. Your tmux session is toast along with everything else. I've watched junior engineers learn this in real time. They'll start a screen session, make the change, the terminal freezes, and they say "no problem, I'll just reconnect to screen." And then they try. And then the realization hits.

There's a specific facial expression. It's the moment the brain finishes simulating the packet's journey and realizes there's no route back. The eyebrows go up slightly. The mouth opens a little.

It's the face of someone watching a car roll downhill with no driver. So the watchdog has to operate outside the session itself. It has to be a separate process that doesn't depend on your SSH connection surviving.

Let's talk about where this pattern actually came from, because I don't think it originated in systems administration. The dead man's switch goes back to railway safety in the nineteenth century — the engineer had to keep pressure on a pedal or lever, and if they became incapacitated, the train would stop automatically. It's a safety pattern that assumes failure is not just possible but inevitable, and the system should fail closed rather than fail open.

And in computing, the earliest widespread version I'm aware of was in network equipment. Cisco has the "reload in" command, and Juniper has "commit confirmed," which reverts on its own if you don't confirm. On a Cisco router, you type "reload in 5" before making a risky change, and if you don't cancel the reload after confirming everything works, the router reboots to the saved configuration. It's been standard practice in network engineering for decades. Server admins just never got the same built-in tooling. We had to build our own.

Which is strange when you think about it. A Juniper switch has better self-preservation instincts out of the box than a Linux server with twenty years of uptime. But okay, let's walk through the pattern step by step, because once you see it, it becomes one of those things you wish you'd been using for years. The basic structure has three phases. Phase one: pre-flight check and snapshot. Before you touch anything, you take a full backup of whatever config files you're about to modify and you validate that the current state is working. Phase two: you arm the watchdog, spawning a background process that starts counting down, and then you apply the change. Phase three is the confirmation handshake. You test connectivity, and if everything works, you send a signal that tells the watchdog "we're good, stand down." If that signal never arrives, the watchdog triggers a full rollback.

Phase one deserves more attention than it usually gets, because the snapshot has to be genuinely restorable. I've seen people do things like "cp /etc/network/interfaces /tmp/backup" and call it done, but they don't check whether the backup file is actually complete, or whether the syntax is valid, or whether restoring it would actually bring the network back. You need to validate your backup before you trust it with your connectivity. Run a syntax check on the config file if the tool supports it. Make sure the file isn't zero bytes. If you're paranoid — and you should be — actually apply the backup config in a test environment first. The snapshot is your parachute. You don't pack a parachute while you're falling.
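As a rough illustration of the snapshot discipline described here, the sketch below copies a config file and refuses to continue unless the copy is non-empty and byte-identical to the source. The function name snapshot_config and the .bak suffix are assumptions made for this example, not something named in the episode, and a tool-specific syntax check would still need to be layered on top.

```shell
#!/usr/bin/env bash
# Sketch of a pre-flight snapshot (hypothetical helper, not from the episode).

# snapshot_config SRC DEST_DIR
# Copies SRC into DEST_DIR and prints the backup path on success.
snapshot_config() {
    local src=$1 dest_dir=$2 backup
    # A missing or zero-byte source is not worth backing up.
    [ -s "$src" ] || { echo "refusing: $src is missing or empty" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    backup="$dest_dir/$(basename "$src").bak"
    cp -p -- "$src" "$backup" || return 1
    # Verify the copy instead of trusting cp's exit status alone.
    [ -s "$backup" ] || { echo "backup is empty" >&2; return 1; }
    cmp -s -- "$src" "$backup" || { echo "backup differs from source" >&2; return 1; }
    printf '%s\n' "$backup"
}
```

A call such as snapshot_config /etc/netplan/01-netcfg.yaml /root/watchdog would print the backup path, which a later restore step can consume. The destination directory here is only an example.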

You're buying yourself a window. You make the change, you have sixty seconds, if you can still reach the machine you disarm it, and if you can't, the machine undoes itself.

The disarm step is critical. The signal has to come through a channel that survives the configuration change. That's the part people mess up. If your confirmation relies on the same SSH session that just died because of the networking change, you've built yourself an automated failure cascade. The watchdog needs a separate confirmation channel.

Let's talk about what those channels look like, because I think that's where the real design thinking happens. If I'm changing a static IP on an interface, the main SSH connection is going to drop the moment the IP changes. How do I confirm the new config is good if I can't use the same pipe?

There are a few patterns worth knowing. The simplest: have a secondary SSH service running on a different management IP that you don't touch. Your main connection drops, you SSH in on the secondary address and run your confirmation command. If you don't have a second interface, you can use an entirely different protocol. A webhook callback, for example. You set up a simple HTTP endpoint that listens for a POST request. You make the change, curl the endpoint from the server. If the server can reach your endpoint, the network is working. Or you can reverse it — have an external monitoring service that tries to reach the server on the new configuration and sends the keep signal on your behalf.

Wait, that's interesting. The external verifier pattern. So instead of the server reaching out to confirm it's alive, you have something outside trying to reach in?

Right, and this can be more reliable in some scenarios. Imagine you've got a monitoring node on the same subnet — a Raspberry Pi tucked in the rack, or a small VPS in the same data center. That monitor continuously pings the server you're reconfiguring. After you apply the change, the watchdog on the server doesn't just count down blindly. It queries the external monitor: "Hey, can you still see me?" The monitor checks its last successful ping timestamp. If it's recent, the monitor sends back the keep signal. If the pings stopped, it stays quiet and the rollback fires.
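One way to picture the monitor's side of this exchange, as a hedged sketch: a probe records the time of the last successful ping, and a separate check decides whether that timestamp is recent enough to justify sending the keep signal. The function names, the stamp file, and the ping flags (Linux iputils syntax) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Runs on the external monitor, not on the server being reconfigured.

# record_probe HOST STAMP_FILE
# Stores the current epoch time only when a single ping succeeds.
record_probe() {
    local host=$1 stamp_file=$2
    # -c 1: send one packet; -W 1: wait at most one second (iputils ping).
    if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
        date +%s >"$stamp_file"
    fi
}

# should_send_keep STAMP_FILE MAX_AGE_SECONDS [NOW_EPOCH]
# Succeeds only if the last successful ping is at most MAX_AGE_SECONDS old.
should_send_keep() {
    local stamp_file=$1 max_age=$2 now=${3:-$(date +%s)} last
    # No recorded success at all: stay quiet and let the rollback fire.
    [ -s "$stamp_file" ] || return 1
    last=$(cat "$stamp_file")
    [ $((now - last)) -le "$max_age" ]
}
```

The monitor would call record_probe in a loop and answer the server's query with should_send_keep. How the keep signal travels back is left open here, since the episode lists several possible channels.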

The monitor is the one with the judgment call. The server doesn't have to know whether it's reachable — it asks a friend.

It's the "can you hear me now?" check. And it solves a real problem, which is that a server with a broken network stack might not be able to initiate any outbound connections at all. It can't curl your webhook, it can't SSH anywhere. But it might still be able to receive a connection on a local management interface from a monitor that's on the same switch. That local path stays up even when the routed path is dead.

That's elegant. The monitor becomes the designated survivor of your network changes. You're basically appointing a second box to be your eyes and ears.

You can implement this with something as simple as a tiny Flask app listening on the server's management interface that the monitor hits every five seconds. If the monitor misses three checks in a row, it assumes the worst and triggers the rollback via an out-of-band channel: maybe a serial console, maybe a separate VLAN, maybe even a different physical connection. The key is that the verification path and the configuration change must not share fate.

Or you could pre-share a key. Store a UUID somewhere on your local machine before you start. After the change, SSH in through whatever new connection is working and run a command like "touch /tmp/watchdog_keep". If that file exists when the timer finishes, the watchdog cancels the rollback.
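The file-flag idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function name run_watchdog and its argument order are invented for the example, and it polls once per second rather than checking only at expiry. A root-only location would be a safer home for the flag than a world-writable /tmp.

```shell
#!/usr/bin/env bash
# run_watchdog KEEP_FILE TIMEOUT_SECONDS RESTORE_COMMAND [ARGS...]
# Waits up to TIMEOUT_SECONDS for KEEP_FILE to appear. If it appears, the
# rollback is cancelled; if it never does, RESTORE_COMMAND is executed.
run_watchdog() {
    local keep_file=$1 remaining=$2
    shift 2
    while :; do
        if [ -e "$keep_file" ]; then
            rm -f -- "$keep_file"      # consume the flag so it cannot be reused
            echo "keep flag found, rollback cancelled"
            return 0
        fi
        [ "$remaining" -le 0 ] && break
        sleep 1
        remaining=$((remaining - 1))
    done
    echo "no keep flag before the deadline, rolling back"
    "$@"
}
```

For this to help, the function has to run detached from the SSH session, for example from a scheduled job, so that it keeps counting after the terminal freezes.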

The file flag is the cleanest implementation. And I want to get into the actual bash of how you build this, because it's not complicated. We're talking maybe twenty lines of script that can save you six hours of driving to a colo facility. Let me lay out a concrete example. You have a Debian 12 server running netplan, currently using DHCP on interface eth0. You want to switch it to a static IP at 192.50 with gateway 192. Here's how you'd structure the script.

Go ahead, walk us through it.

Step one: snapshot. You copy /etc/netplan/01-netcfg.yaml to something like /tmp/netplan_backup. But don't stop there. Also capture the current routing table with "ip route show" and dump it to a file. Also capture "ip addr show" so you know exactly what the interface state looked like. Sometimes gateways change underneath you, sometimes there are secondary addresses you forgot about. The more state you capture, the more precisely you can restore. Step two: you write out your new netplan config with the static IP. Step three: before you apply anything, you arm the rollback, because once the network drops you may not get another chance to type. The simplest way is to use the at command: you type "echo /path/to/restore_script.sh | at now + 1 minute." In one minute, unless that job is cancelled, the restore script runs. Step four: you run "netplan apply."

The at scheduler survives session termination.

That's the key. The at daemon is separate from your SSH session. It runs jobs in the background regardless of whether you're still connected. The at command is packaged for essentially every Linux distribution, though a minimal install may need the package added and the atd service started. It's been around since the seventies and nobody's gotten rid of it. If you schedule a rollback job with at, that job will fire whether you're logged in or not, whether the system is reachable or not. It's almost absurdly simple. You're literally just telling the system "in sixty seconds, run this script unless I tell you otherwise."
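A minimal sketch of arming and disarming with at might look like the following. The helper names arm_rollback and disarm_rollback are assumptions for this example. The job number is parsed from the "job N at ..." line that the common Linux at implementation prints on stderr, and other implementations may word that line differently.

```shell
#!/usr/bin/env bash
# arm_rollback SCRIPT MINUTES
# Queues SCRIPT with at(1) and prints the job number so it can be cancelled.
arm_rollback() {
    local script=$1 minutes=$2 out
    # at reads the command from stdin and reports the job on stderr,
    # so both output streams are captured.
    out=$(printf 'sh %q\n' "$script" | at now + "$minutes" minutes 2>&1) || return 1
    printf '%s\n' "$out" | sed -n 's/^job \([0-9][0-9]*\) at .*/\1/p'
}

# disarm_rollback JOB_ID
# Removes the queued job. Run this only after connectivity is confirmed.
disarm_rollback() {
    atrm "$1"
}
```

In use, the job number returned by arm_rollback would be noted before the change is applied, then passed to disarm_rollback from the confirmation channel once the server is reachable again.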

What does the restore script actually contain? Let's be concrete.

The restore script copies /tmp/netplan_backup.yaml back to /etc/netplan/01-netcfg.yaml, runs "netplan apply," and then maybe restarts the networking service for good measure. It should also log what it did — a timestamped entry in /var/log/watchdog.log saying "rollback triggered at such-and-such time" so you have an audit trail. Optionally, it can send you an email or a webhook notification so you know the rollback happened even if you've stepped away from your terminal.
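The restore step described here could be sketched along these lines. The function name restore_config and the log wording are assumptions. The idea it illustrates is that every failure path writes a timestamped log entry, so a missing backup does not fail silently.

```shell
#!/usr/bin/env bash
# restore_config BACKUP TARGET LOG_FILE APPLY_COMMAND [ARGS...]
# Copies BACKUP over TARGET, runs APPLY_COMMAND, and logs the outcome.
restore_config() {
    local backup=$1 target=$2 log=$3
    shift 3
    local stamp
    stamp=$(date '+%Y-%m-%dT%H:%M:%S%z')
    if [ ! -s "$backup" ]; then
        echo "$stamp rollback FAILED: $backup missing or empty" >>"$log"
        return 1
    fi
    if ! cp -p -- "$backup" "$target"; then
        echo "$stamp rollback FAILED: could not write $target" >>"$log"
        return 1
    fi
    if ! "$@"; then
        echo "$stamp rollback FAILED: '$*' exited non-zero" >>"$log"
        return 1
    fi
    echo "$stamp rollback triggered: restored $target from $backup" >>"$log"
}
```

With the paths mentioned in the episode, the call would be restore_config /tmp/netplan_backup.yaml /etc/netplan/01-netcfg.yaml /var/log/watchdog.log netplan apply. Because it always restores from the same known good copy, running it twice gives the same result.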

The scheduler is your insurance policy. Then before that minute expires, you reconnect — on the new IP, on a different port, through a cellular failover, whatever your confirmation channel is — and if everything is working, you cancel the at job.

Crucially, if you can't reconnect, you do nothing. The timer runs out, at fires your restore script, which copies the backup netplan config back in place and runs netplan apply again. You wait about a minute for the network to settle, and the server is back on its original DHCP address. You SSH back in like nothing happened.

Except your pride. The pride file does not get restored from backup.

The pride partition has no redundancy and cannot be fscked. We've learned this the hard way.

There's actually a fun historical footnote here. The "at" command was part of Version 7 Unix, released in 1979. It was written by someone who was tired of remembering to run jobs at specific times. And for forty-five years, it's just been sitting there, quietly reliable, while we built Kubernetes clusters and CI/CD pipelines and serverless functions on top of it. And now it turns out to be the perfect primitive for self-reverting configuration changes. Sometimes the old tools are the right tools.

The Unix philosophy of small, composable tools that do one thing well pays off in ways the original authors couldn't have predicted. at wasn't designed for watchdog timers. It was designed for "run this report at midnight." But because it's simple and it doesn't depend on session state, it slots perfectly into this use case. That's the beauty of the Unix toolchain.

Okay, so that's the basic pattern with at. But what about cases where at isn't the right tool? Suppose you need something more robust — a watchdog that actively monitors and doesn't just fire blind. What's the systemd approach?

Systemd is the upgrade path. Instead of a blind timer, you can use a oneshot service with systemd-run. The command "systemd-run --on-active=60 /path/to/restore-script" is roughly the equivalent of the at approach. But here's where systemd gives you real power. You can create a custom service unit that monitors for your confirmation signal and only triggers the rollback if the signal hasn't arrived. A oneshot with RemainAfterExit set to no means the service fires once and cleans itself up. If you combine that with a bash script that checks for the existence of a flag file, you've built a proper dead man's switch.

I'm imagining the workflow. You write the unit file, you start the systemd timer, you apply your config change, reconnect through your secondary channel, and if all is well, you run systemctl stop on the watchdog timer. If you can't reconnect, the timer expires and systemd runs the rollback script. The timer outlives your SSH session because it's a system service.

And systemd timers are more reliable than background sleep processes. A sleep subprocess can theoretically die if its parent session gets nuked in a weird way. The systemd init system manages its own child processes and they survive session termination cleanly. Plus you get logging through journald for free, you get dependency management, and the countdown runs on a monotonic clock, so a wall-clock jump like an NTP correction can't shorten or stretch your window. It's the production-grade version of the pattern.
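The systemd variant of arm and disarm can be sketched as a pair of small wrappers. The unit name net-rollback and the DRY_RUN switch are assumptions added for this example. The underlying commands are systemd-run with --unit and --on-active, which creates a transient timer, and systemctl stop on that timer to cancel it.

```shell
#!/usr/bin/env bash
# Set DRY_RUN=1 to print commands instead of executing them, which is handy
# for rehearsing the sequence on a machine you cannot afford to break.
run_cmd() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# arm_timer UNIT_NAME SECONDS RESTORE_SCRIPT
# Creates a transient UNIT_NAME.timer that runs RESTORE_SCRIPT once
# SECONDS have elapsed.
arm_timer() {
    run_cmd systemd-run --unit="$1" --on-active="$2" "$3"
}

# disarm_timer UNIT_NAME
# Stopping the transient timer cancels the pending rollback.
disarm_timer() {
    run_cmd systemctl stop "$1.timer"
}
```

A rehearsal with DRY_RUN=1 arm_timer net-rollback 60 /path/to/restore-script prints the exact command that would be run, which makes it easy to review before doing it for real.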

There's also the trap approach, right? Using trap signals to catch disconnection.

Trap is tempting but dangerous here. Trap catches signals within a specific shell session, like EXIT or HUP. The problem is that if your connection drops at the kernel level — because the IP stack is broken — the system might never send a clean HUP to the process. The session just dies in a way that bypasses the signal handler. Trap works for clean exits, like closing a terminal window locally. It's not reliable for network-level failures. I've seen too many scripts that relied on trap EXIT to undo things and they failed right when you needed them most.

The failure mode is failure itself. Like a fire alarm that only works when the room is not on fire.

The fire alarm powered by thermal sensors that melt at the exact temperature of a working fire.

Prioritize at and systemd. Skip trap for the rollback case. Now let's talk about the edge cases that actually hurt. What if the rollback itself breaks something?

This is where snapshot discipline matters. Always restore from a known good copy of the config files, never from memory or from a secondary script that tries to reverse engineer what you just did. Idempotent rollback design: your restore script should be the opposite of your apply script in a strict, symmetric way. If the apply script copied the backup to a temp location and then wrote the new config, the rollback copies that temp file back and reapplies. Even if the rollback runs multiple times, the result is the same — the known good config. That's idempotent. And you also want to test on a local VM first. Spin up a container, run through the entire watchdog cycle including a simulated disconnection where you literally kill the network interface, and see if it recovers. Don't wait until you're on a production machine at two in the morning to find out that your backup netplan file had a syntax error.

The two AM test. The most rigorous and least forgiving quality assurance framework in existence. I've done this to myself. I once wrote a rollback script that referenced a backup path with a typo in the directory name. The backup was there, the script just couldn't find it. I discovered this at one in the morning, eight hundred miles from the server. The rollback fired, failed silently, and I spent the next two hours on the phone with a data center technician who had to physically walk to the rack and plug in a crash cart. All because of a missing slash.

That's the kind of mistake you only make once. After that, you test your rollback scripts like you test your backups — by actually running them. Not by reading them and thinking "yeah, that looks right." Actually execute the restore in a VM and verify the network comes back. It's the same principle as disaster recovery drills. If you haven't tested it, it doesn't work.

The Linux Foundation actually did a survey on this, right? What were the numbers?

The Linux Foundation did a survey last year and found that forty-three percent of sysadmins have experienced a bricked remote server due to a networking config change, with an average recovery time north of four hours. Four point two hours, specifically. That is a staggering amount of productivity torched by one bad config. And the worst part — most of those respondents said the change itself was trivial. A typo in the netmask. A missing gateway. A DNS server that wasn't reachable from the new subnet.

Forty-three percent. Honestly, that's lower than I expected. I thought every sysadmin over the age of thirty has at least one story that starts with "I was changing the switch port VLAN..."

Eighteen percent claimed they had three or more. They called it the unplanned off-site maintenance experience.

Adrenaline is a component of the job, apparently. There's a certain kind of sysadmin who almost misses that feeling once they move into management. The cold sweat. The racing heart. The sudden clarity of exactly how badly you've messed up.

That's the culture, right? We laugh about it because we've all been there. But now fold in AI agents, and it gets less funny. An agent that's making configuration changes autonomously doesn't realize it's lost connectivity. It doesn't know to stop. It just keeps generating commands. If it's applying netplan changes and then moving on to firewall rules and service restarts, you've gone from a single broken config to a cascade of changes with no human in the loop. By the time you check the agent's logs, you've got thirty changes to untangle and no clear idea of which one cut you off.

The agent needs to be the one confirming the watchdog, not a human. And that means the agent needs an independent channel for verification. It can't just rely on the same SSH connection it used to apply the change.

This is the agentic watchdog pattern. You bake the confirmation into the agent's workflow. The agent arms the watchdog, then applies the change through whatever channel it normally uses: a Python script, a REST API, an exec call on a server. From that point the clock is running. The agent has to reconnect through an out-of-band channel to confirm. Maybe it's an MQTT topic it subscribes to that reports the server's health status. Maybe it's an HTTPS health check endpoint that returns a 200 only if networking is intact. Whatever the channel is, it has to be isolated from the config change itself.
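For the HTTPS health check variant, a hedged sketch of the gate an agent could call between steps is shown below. The function names and the five-second limit are assumptions. The point is that a failed check returns non-zero, so the agent stops issuing commands and the watchdog stays armed.

```shell
#!/usr/bin/env bash
# health_ok URL
# Succeeds only when the endpoint answers with HTTP 200 within five seconds.
health_ok() {
    local url=$1 code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || return 1
    [ "$code" = 200 ]
}

# confirm_or_halt URL KEEP_COMMAND [ARGS...]
# Runs KEEP_COMMAND (the disarm step) if the check passes. Otherwise it
# returns non-zero so the calling agent stops making further changes.
confirm_or_halt() {
    local url=$1
    shift
    if health_ok "$url"; then
        "$@"
    else
        echo "health check failed: leaving the watchdog armed and halting" >&2
        return 1
    fi
}
```

The check is meant to run from wherever the out-of-band channel lives, against a health URL of your own choosing, with the disarm command passed as the remaining arguments.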

Like running a management network that you never reconfigure at the same time as the production network. Old-world networking best practice, applied to agents.

And you can even get fancy here. The agent spawns a sub-agent on a different network interface — say the management NIC — whose sole job is to verify connectivity on the primary NIC and issue the keep signal. If the sub-agent can't reach the main interface's health check, it stays silent and the rollback fires.

Split-brain patterns start showing up the moment you give autonomy to something with root access. And I want to flag something here, which is that we're essentially building a conscience for the agent. A little voice that says "are you sure you're still connected? verify before proceeding." It's the machine equivalent of the sinking feeling the human gets when the terminal freezes. We're encoding our own anxiety into the automation.

That's exactly what it is. We're taking the human instinct of "wait, did that work?" and turning it into a protocol. The watchdog is automated doubt. And that's a healthy thing to build into systems. Doubt is a safety mechanism. Certainty is what gets you into trouble.

That's what the prompt was after. Whether it's a human on a terminal or an AI agent driving root changes, the watchdog pattern stays the same structurally. The only difference is who's sending the keep signal and through what channel. Everything else — the snapshot, the apply, the timer, the rollback — remains identical. That's the beauty of it. The watchdog doesn't care who made the mistake.

The watchdog is profoundly egalitarian in its distrust.

Let's put some flesh on these bones. Let's say I'm administering things directly. I need to switch a Kubernetes node to use a new CNI plugin across a remote link. Where does my watchdog slot in?

Kubernetes context makes this fascinating because the node's identity to the cluster is the heartbeat. The kubelet communicates with the API server based on the node's network configuration. If you break that, the node goes NotReady. So you design the watchdog around the kubelet's registration. Before you apply the CNI change, you capture the node's current config and its scheduler state — specifically, you want to cordon the node first so no new pods get scheduled while you're messing with networking. You apply the plugin, then the watchdog polls the API server — from a separate management host, crucially — to verify the node shows as Ready within ninety seconds. If it doesn't, the watchdog uses kubectl commands or direct config restores to roll back.

The cluster itself becomes your confirmation channel. Node Ready equals the keep signal, but delivered through an API call that's external to the node you're working on.

The node can't be trusted to report its own health after a networking change. That's like asking a patient who just had head surgery whether they're feeling okay. You need an external observer. The API server, queried from a different node or a management workstation, is that observer. If the node doesn't check in within the window, the watchdog cordons it, drains any stray pods, and reverts the CNI config.
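The external-observer check for a Kubernetes node might be sketched like this, run from a management host with kubectl access. The function name and the rollback command passed to it are hypothetical. One caveat worth stating: the Ready condition can lag behind a broken kubelet for a short grace period, so a real check would probably wait that out before trusting a Ready answer.

```shell
#!/usr/bin/env bash
# Runs on a management host, never on the node being changed.
# Assumes the node was cordoned before the CNI change was applied.

# verify_node_or_rollback NODE TIMEOUT_SECONDS ROLLBACK_COMMAND [ARGS...]
verify_node_or_rollback() {
    local node=$1 timeout=$2
    shift 2
    if kubectl wait --for=condition=Ready "node/$node" --timeout="${timeout}s"; then
        # Keep signal: the API server sees the node, so allow scheduling again.
        kubectl uncordon "$node"
    else
        echo "node $node not Ready after ${timeout}s, rolling back" >&2
        "$@"
    fi
}
```

With the ninety-second window from the episode, the call would be verify_node_or_rollback followed by the node name, 90, and whatever command restores the previous CNI configuration.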

It's almost elegant — the infrastructure keeps the agent honest. The cluster is the lie detector.

There's an interesting tradeoff with NixOS-style system rebuilds. NixOS keeps an atomic history of system generations and lists them in your bootloader, so it's essentially a built-in self-reverter. If you're inside the Nix declarative universe, every configuration change is a new generation, and you can roll back to the previous generation from the bootloader menu if something breaks. It rivals the watchdog pattern in terms of safety. Outside that universe, when your configuration is spread across multiple packages and tools, that kind of rollback is out of reach. The true value of watchdogs is that they aren't limited to resources entirely under one version manager.

A word of caution, though: you can misuse the safety net. You keep applying a config that fails and auto-rolls back, cycling without ever investigating the root cause. The watchdog becomes a crutch. "Oh, it'll just roll back" becomes the excuse for not testing changes properly.

I've watched exactly this in a SaltStack environment. That automatic handler went through an identical apply-puke-rollback loop thirteen times at three in the morning because some underlying library dependency had changed. Each node treated the change identically, so the error was a silent choreographed disaster. Thirteen nodes, thirteen rollbacks, thirteen log entries that all looked the same, and nobody noticed until the morning because the monitoring dashboard just showed everything as green — the nodes had all rolled back successfully, after all.

Sisyphus worked at a colo. Someone give that example a paper name. "The Sisyphean Rollback: A Case Study in Automated Failure."

It's insidious because at 3 AM those log entries get ignored. You burn countless hours of attention for zero change to the infrastructure. The systems are working perfectly to undo the change, and that very perfection masks the fact that the change is fundamentally broken.

And in true Sisyphean fashion, the monitoring alerts don't fire because technically nothing is down. Everything is up. It's just that every attempt to improve things fails silently and reverts. You're running in place at full speed.

Use a retry counter with a hard limit so you never recycle into an infinite loop. Rate-limit the applies so that by the third automated reversal, the change is frozen and the push pipelines turn off entirely. Otherwise the log entries stay below your alerting thresholds and you never see them.

Let's give those counter thresholds some teeth. Three strikes and the system doesn't just roll back. It alerts, it locks the config, it demands human intervention. The watchdog needs its own watchdog.
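The three-strikes idea could be sketched with a counter file and a lock file per change. The names record_rollback and apply_allowed, the state directory layout, and the alert text are all assumptions for illustration. A real setup would route the alert to a pager rather than stderr.

```shell
#!/usr/bin/env bash
# record_rollback STATE_DIR CHANGE_ID [MAX]
# Increments the rollback count for CHANGE_ID and locks the change once
# the count reaches MAX (default 3).
record_rollback() {
    local state_dir=$1 change_id=$2 max=${3:-3} count_file count=0
    mkdir -p "$state_dir" || return 1
    count_file="$state_dir/$change_id.rollbacks"
    if [ -f "$count_file" ]; then
        count=$(cat "$count_file")
    fi
    count=$((count + 1))
    printf '%s\n' "$count" >"$count_file"
    if [ "$count" -ge "$max" ]; then
        : >"$state_dir/$change_id.locked"
        echo "ALERT: $change_id rolled back $count times, locked pending human review" >&2
    fi
}

# apply_allowed STATE_DIR CHANGE_ID
# Succeeds only while the change has not been locked.
apply_allowed() {
    [ ! -e "$1/$2.locked" ]
}
```

The apply pipeline would call apply_allowed before every attempt, and the restore script would call record_rollback each time it fires. Removing the lock file is then a deliberate human action.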

This is where two-phase commit patterns show up, which connects to distributed systems theory. The agent's change stays provisional until an outside observer confirms that connectivity survived, and only then does it become permanent. Side-channel confirmation becomes the main safeguard.

Name that secondary box the "Doubt Handler." It regards terrible configs with a faintly accusatory gaze. You could almost sell it as a service: we don't trust your changes, and neither should you.

[laughing] A dedicated, monastic fail-validator. A box in the corner of the data center whose entire job is to look at your config changes and say "I have concerns."

None of this makes the confirmation channel foolproof, though. There's still a race between the restore and the keep signal: if the rollback starts while your confirmation is in flight, you lose the connection anyway. Sixty seconds isn't automatically "plenty." If the timing is wrong, you lose regardless.

The window is tight and it's getting tighter. As networks get faster, as automation runs at machine speed, sixty seconds starts to feel like an eternity in some contexts and impossibly short in others. If your rollback takes forty-five seconds to complete and your keep signal has to traverse three network hops, you're cutting it very close. You need to profile your actual rollback time and set your window accordingly. Measure, don't guess.

Listen — I've now reverted the same dropped table disaster across three different boardrooms while listening to dinner party talk. It doesn't merge. The contexts don't blend. You're making small talk about vacation plans while mentally replaying the exact sequence of commands that got you into this mess.

From a planning standpoint, the advice stays stable. If your confirmation shares the same session and network path as the change, a stall on that path takes the confirmation down with it. So prepare carefully. The full layout is: snapshot exactly, then run the watchdog with its confirmation on an independent channel, and keep the handshake simple and the intervals small, safe, and predictable. People tend to over-intellectualize this. A small script and a flag file do the job, and the approach repeats cleanly.

Yes, and the pattern hasn't changed because the foundations of networking haven't changed much over a very long time, despite movement at the fringes. IP is IP. A route is a route. The fundamentals don't care about your orchestration layer.

Ultimately, safe infrastructure requires preparation: the right tooling, and a reliable confirmation channel to close the loop on the heartbeat logic, ideally one that runs through something external to the box. The watchdog is not a clever hack. It's a first-class infrastructure primitive. It deserves the same rigor you'd apply to your backup strategy or your monitoring stack. Because when it works, you don't notice it. And when it doesn't work, you notice it at 2 AM from twelve hundred miles away.

That's the thing I want to leave people with. The watchdog pattern is not complicated. It's twenty lines of bash and a scheduled task. The barrier to entry is almost nothing. The barrier to doing it well — testing it, verifying the rollback, setting up the out-of-band confirmation channel — that's a little higher, but still entirely achievable. And the alternative is being the person on the phone with the data center technician at 2 AM, trying to explain which rack, which server, which crash cart. That's a call you only want to make once.

Or zero times. Zero is the target number of those calls. So build the watchdog. Test it on a VM. Test it on a staging server. Then sleep better knowing that your next network change has an undo button that actually works, even when you don't.

We'll fold it there. Daniel, we hope that answers your question. To everyone else: go set up your at jobs. Your future self, stranded at a coffee shop somewhere, will thank you.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
