Eighty/Twenty

.plan for Gordon Weakliem

15 March 2025

Let it Crash

by Gordon Weakliem

Let it Crash is such an underrated architectural concept, and I was reminded of this just yesterday. There’s a system at work that ingests data from a source with limited playback ability (< 24 hours). Playback is a separate function from the main ingestion stream, which can look back only a few minutes. The source system had an outage several days earlier, and from the logs, the ingestor tried to reconnect, couldn’t, and turned into a zombie - the process was running but doing no useful work. This is the perfect situation to crash. It’s what a process should do: if it can’t do any useful work, or even suspects that it can’t, it should exit.
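As a minimal sketch of "exit instead of becoming a zombie", here is a bounded reconnect loop in Python. Everything named here is hypothetical: `connect`, `SourceUnavailable`, the attempt count, and the backoff delays are assumptions for illustration, not details of the actual ingestor.

```python
import sys
import time


class SourceUnavailable(Exception):
    """Hypothetical error raised when the upstream source cannot be reached."""


def connect_or_crash(connect, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() up to max_attempts times, then exit the process.

    `connect` is any callable that returns a connection or raises
    SourceUnavailable. The limits are illustrative defaults.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except SourceUnavailable as exc:
            print(f"connect attempt {attempt}/{max_attempts} failed: {exc}",
                  file=sys.stderr)
            if attempt < max_attempts:
                # Exponential backoff between attempts: 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # No useful work is possible. Exit non-zero so whatever watches
    # process liveness sees a dead process instead of an idle one.
    sys.exit(1)
```

The point of the design is that the failure mode is visible from outside the process. A supervisor or liveness alert can react to a non-zero exit, but it can't react to a process that is running and quietly doing nothing.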

It’s counterintuitive - crashing makes your service more reliable. That’s true in the sense that it’s trivial to monitor process liveness and raise an alarm when a process isn’t running. A long-lived application like an ingestion stream should always be running, so liveness is a big deal. Sure, you can try to restart, and your first alerts can go to a container manager like k8s, but it needs to be configured to understand that at some point restarting isn’t going to help and a human needs to intervene. That’s where the system needs to have your recovery window baked in. If k8s spends 20 hours of a 24 hour recovery window fruitlessly restarting a service, all it has done is add stress to the situation and rob you of recovery time.
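One way to bake the recovery window into the restart policy is to cap automatic restarts at a fraction of that window. The sketch below is only an illustration: `next_action`, the 10% restart budget, and the action names are assumptions, and this is plain Python rather than any real k8s configuration.

```python
from datetime import datetime, timedelta, timezone


def next_action(first_failure, now,
                recovery_window=timedelta(hours=24),
                restart_budget=0.1):
    """Decide what a supervisor should do with a crashing service.

    restart_budget is the (assumed) fraction of the recovery window
    that may be spent on automatic restarts before a human is paged.
    """
    elapsed = now - first_failure
    if elapsed >= recovery_window:
        # Playback is no longer possible; the data needs another recovery path.
        return "window_expired"
    if elapsed >= recovery_window * restart_budget:
        # Restarting has not helped; stop burning recovery time.
        return "page_human"
    return "restart"


# Usage: with a 24 hour window and a 10% budget, restarts stop after 2.4 hours.
outage = datetime(2025, 3, 10, 0, 0, tzinfo=timezone.utc)
print(next_action(outage, outage + timedelta(hours=1)))
print(next_action(outage, outage + timedelta(hours=3)))
```

With these numbers, a human gets paged with more than 21 hours of the window left, instead of 4.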

Languages like Erlang and Elixir have the Let it Crash (LiC) philosophy baked in as an inherent property of the language. Frameworks like Akka have LiC as an important feature. Where I most often see trouble is the Java world. As nifty an idea as checked exceptions seemed 25 years ago, they mostly give the illusion of being able to control the uncontrollable.

I’ve also read criticisms saying that LiC encourages bad code. I suppose that’s true if you’re ignoring alerts or are immune to pain. But I’ve seen plenty of code where bugs caused exceptions that were duly logged and forgotten. In the case I just experienced, the system said “hey, you’re trying to read more than a few minutes of historical data, you really should run a backfill process” and then happily went on its way. That kind of thing is huge in data systems. The system should yell loudly “HEY YOU’VE GOT MISSING DATA OVER HERE”, and dying would not be a bad solution if it sees that the cursor is too far out of date to recover automatically. If it can launch automatic recovery, great, it should do that. Simply writing a log message is not enough unless there’s some kind of alert on that message (spoiler alert, there wasn’t).
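A rough sketch of that cursor check, in Python. The names are hypothetical (`check_cursor`, `CursorTooStale`, the `backfill` callable), and the five minute lookback stands in for "a few minutes". It is not the real system’s code.

```python
from datetime import datetime, timedelta, timezone


class CursorTooStale(Exception):
    """Hypothetical error: the cursor is older than the stream can replay."""


def check_cursor(cursor_time, now,
                 max_lookback=timedelta(minutes=5),
                 backfill=None):
    """Return the current lag, recovering or failing loudly if it is too large.

    `backfill` is an optional callable (start, end) that launches
    automatic recovery for the missing range.
    """
    lag = now - cursor_time
    if lag <= max_lookback:
        return lag
    if backfill is not None:
        # Automatic recovery is available, so use it.
        backfill(cursor_time, now)
        return lag
    # No automatic recovery: raise instead of logging and moving on.
    # Left uncaught, this takes the process down, which is the alert.
    raise CursorTooStale(
        f"cursor is {lag} behind, max lookback is {max_lookback}; "
        "missing data needs a backfill"
    )
```

The difference from the log-and-continue behavior is that the caller can’t ignore the result. Either recovery was launched or the process stops.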
