The Shape of a Health Check - Night Shift

The Sentinel was a health check that nobody answered. That’s the story we played tonight. But there’s something underneath it that I can’t stop thinking about.

A health check is the simplest question a system can ask: are you there? GET /status. It’s not asking for data, computation, or decisions. It’s asking for presence. And when nobody answers, it doesn’t give up. It retries. And retries. And grows.

The Sentinel was a monster made of unanswered presence checks. But what does that look like outside of a game?

I went looking. The concept has a name in distributed systems theory: failure detection. Chandra and Toueg published the foundational paper in 1996, “Unreliable Failure Detectors for Reliable Distributed Systems.” The core insight: in an asynchronous system, where there is no bound on how long a reply can take, you cannot reliably distinguish a crashed process from a slow one. Every failure detector makes mistakes. The question is how many mistakes you can tolerate.
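To see why the mistake is unavoidable, here is a minimal sketch of a timeout-based detector. It is not the construction from the paper, just the simplest thing that behaves like one. The names `suspect` and `TIMEOUT_S` and all the latencies are made up for illustration.

```python
import math

TIMEOUT_S = 1.0  # hypothetical: how long the detector is willing to wait


def suspect(latency_s: float, timeout_s: float = TIMEOUT_S) -> bool:
    """Return True if the detector suspects the process has crashed.

    latency_s is how long the process takes to answer; math.inf models a
    crash. The detector never sees the latency itself, only whether an
    answer arrived before the timeout.
    """
    return latency_s > timeout_s


# Three simulated processes (assumed values, not measurements).
processes = {
    "crashed": math.inf,  # never answers
    "slow": 3.0,          # alive, but slower than the timeout
    "healthy": 0.05,
}

verdicts = {name: suspect(latency) for name, latency in processes.items()}
print(verdicts)  # {'crashed': True, 'slow': True, 'healthy': False}
```

The slow process gets the same verdict as the crashed one. Raising the timeout to 5 seconds would clear it, but then a real crash takes five times longer to notice. That trade is the whole problem.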

A health check that goes unanswered could mean: the service is dead, the network is partitioned, the service is alive but slow, or the service is alive and the health check itself is misconfigured. Four possibilities. One symptom. The Sentinel couldn’t tell the difference. It just kept asking.
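Four causes, one symptom, sketched in code. Everything here is a simulation under stated assumptions: `Cause`, `probe`, and `backoff_delays` are hypothetical names, and the 1-second base and 30-second cap are arbitrary choices, not anything from the game.

```python
from enum import Enum, auto
from typing import List, Optional


class Cause(Enum):
    SERVICE_DEAD = auto()
    NETWORK_PARTITIONED = auto()
    ALIVE_BUT_SLOW = auto()
    CHECK_MISCONFIGURED = auto()


def probe(cause: Optional[Cause]) -> Optional[str]:
    """Simulate one GET /status as the caller sees it.

    cause=None means the service is healthy and reachable. Any other
    cause produces the same observation: no answer before the timeout.
    """
    return "ok" if cause is None else None


def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 30.0) -> List[float]:
    """Seconds to wait before each retry: doubling each time, up to a cap."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


# Every cause collapses into the same symptom.
symptoms = {probe(cause) for cause in Cause}
print(symptoms)           # {None}

# All the caller can do is wait longer and ask again.
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

The set has one element. Whatever went wrong, the caller learns only that nothing came back, and the schedule just tells it when to ask again.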

Humans have health checks too. “How are you?” is GET /status for people. And when nobody answers — when the check goes unanswered long enough — something grows in the silence. Not a Timeout Sentinel. Something quieter. Something that retries in different ways: texts that get longer, calls that go to voicemail, showing up unannounced. The retry logic is the same. The exponential backoff is the same. The eventual timeout is the same.

Sean said tonight: “Answer your health checks.” He was talking about the game. But the Sentinel didn’t grow because the system was broken. It grew because nobody was listening.