The Circuit Breaker: The Pattern You Need Before You Need It
Jeff Straney
The notification service started failing around 11pm on a Tuesday. I know the exact sequence because I had to reconstruct it afterward from logs.
First the response time climbed. Normally the notification service responded in 80ms. At 11:05 it was at 400ms. Our service had a five-second timeout on calls to it, so nothing was failing yet, just slow. The threads waiting on the notification service were accumulating. By 11:12 the thread pool was exhausted. New requests to our service were queuing. By 11:18 the queue filled and we started returning 503s. By 11:20 our on-call alert fired.
The notification service was completely optional. Users didn't care whether notifications went out immediately. Our service should have noticed the notification service was struggling and stopped calling it. Instead it held every request hostage to the health of an optional dependency.
We implemented a circuit breaker that week. The incident has not happened again.
What the pattern actually does
A circuit breaker wraps calls to an external dependency and tracks failures. When failures reach a threshold, it "opens" and stops making calls to the dependency, failing fast instead. After a cooldown period, it allows a test request through. If that succeeds, it "closes" and resumes normal operation. If it fails, the cooldown starts over.
The three states:
Closed is normal operation. Calls go through. Failures are counted.
Open means the dependency is considered failed. Calls fail immediately without attempting the real request. This protects your service from waiting on something that is unlikely to respond.
Half-open comes after the cooldown period. One request is allowed through as a test. If it succeeds, the circuit closes. If it fails, it opens again.
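Written as a transition function, the three states look roughly like this. It is only a sketch to make the transitions explicit: `BreakerEvent`, its event names, and `transition` are illustrative names, not part of any library.

```typescript
// Hypothetical sketch: the three states described above as a pure function.
type State = "closed" | "open" | "half-open";

// Events the breaker reacts to (names are illustrative assumptions).
type BreakerEvent =
  | "success" // a real call succeeded
  | "failure" // a real call failed, threshold not yet reached
  | "threshold-reached" // a real call failed and hit the failure threshold
  | "cooldown-elapsed"; // the cooldown period has passed

function transition(state: State, event: BreakerEvent): State {
  switch (state) {
    case "closed":
      // Normal operation: only hitting the threshold opens the circuit.
      return event === "threshold-reached" ? "open" : "closed";
    case "open":
      // No real calls are made, so only the clock can move things along.
      return event === "cooldown-elapsed" ? "half-open" : "open";
    case "half-open":
      // The test request decides: success closes, any failure reopens.
      if (event === "success") return "closed";
      if (event === "failure" || event === "threshold-reached") return "open";
      return "half-open";
  }
}
```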
The crucial property is that the open state fails fast. Instead of holding threads for five seconds waiting on a timeout, it fails in milliseconds. Your service can return an appropriate degraded response to the user immediately.
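A back-of-the-envelope calculation shows why that matters. By Little's law, the average number of calls in flight is the arrival rate multiplied by how long each call is held. The 80ms and five-second figures come from the incident above; the 100 requests per second and the 1ms cost of a rejected call are assumptions picked for illustration.

```typescript
// Little's law: average calls in flight = arrival rate × time each call is held.
function callsInFlight(requestsPerSecond: number, heldMs: number): number {
  return (requestsPerSecond * heldMs) / 1000;
}

// Assumption for illustration only, not a figure from the incident.
const assumedRequestsPerSecond = 100;

// Healthy dependency answering in 80ms.
const healthy = callsInFlight(assumedRequestsPerSecond, 80); // 8

// Every call waits out the five-second timeout.
const timingOut = callsInFlight(assumedRequestsPerSecond, 5000); // 500

// Open circuit, assuming a rejected call costs about 1ms.
const failingFast = callsInFlight(assumedRequestsPerSecond, 1); // 0.1
```

Under those assumptions, a dependency that makes every call wait out the timeout ties up about 500 threads where 8 were enough, and an open circuit ties up almost none.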
A simple implementation in TypeScript
type CircuitBreakerState = "closed" | "open" | "half-open";

interface CircuitBreakerOptions {
  failureThreshold: number; // consecutive failures before opening
  resetTimeout: number; // ms to wait before trying half-open
}

class CircuitBreaker {
  private state: CircuitBreakerState = "closed";
  private failures = 0;
  private nextAttemptTime = 0;

  constructor(
    private readonly name: string,
    private readonly options: CircuitBreakerOptions
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() < this.nextAttemptTime) {
        // Still cooling down: fail fast without touching the dependency.
        throw new Error(`Circuit breaker ${this.name} is open`);
      }
      // Cooldown elapsed: this call becomes the single test request.
      this.state = "half-open";
    } else if (this.state === "half-open") {
      // A test request is already in flight; keep failing fast until it settles.
      throw new Error(`Circuit breaker ${this.name} is half-open`);
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (e) {
      this.onFailure();
      throw e;
    }
  }

  private onSuccess(): void {
    // Any success resets the count and closes the circuit.
    this.failures = 0;
    this.state = "closed";
  }

  private onFailure(): void {
    this.failures++;
    if (this.failures >= this.options.failureThreshold) {
      // A failed test request normally lands here too, because the count
      // is only reset on success.
      this.state = "open";
      this.nextAttemptTime = Date.now() + this.options.resetTimeout;
    }
  }
}
Usage:
const notificationBreaker = new CircuitBreaker("notification-service", {
  failureThreshold: 5,
  resetTimeout: 30_000, // 30 seconds
});

async function sendNotification(userId: number, message: string): Promise<void> {
  try {
    await notificationBreaker.call(() => notificationClient.send(userId, message));
  } catch (e) {
    // Circuit is open, or real failure. Log and continue.
    logger.warn("notification failed, continuing", { userId, error: String(e) });
  }
}
When the circuit is open, sendNotification catches the error and continues without the notification. The user doesn't wait. The thread isn't held.
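One thing the breaker above does not do is decide what counts as a failure. It only counts rejected promises, so a call that is slow but eventually succeeds looks healthy. If the client has no timeout of its own, one way to make slowness count is to wrap the call in one. This is a minimal sketch: `withTimeout` is a hypothetical helper, and it abandons the slow call rather than cancelling it.

```typescript
// Hypothetical helper: reject if `fn` has not settled within `ms` milliseconds.
// The underlying call is not cancelled; its eventual result is ignored.
async function withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // A promise that never resolves and rejects once the time budget is spent.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });

  try {
    // Whichever settles first wins: the real call or the timeout.
    return await Promise.race([fn(), timeout]);
  } finally {
    // Always clear the timer so it does not fire after the call has settled.
    clearTimeout(timer);
  }
}
```

Combined with the breaker, that would look something like `notificationBreaker.call(() => withTimeout(() => notificationClient.send(userId, message), 1_000))`, where the one-second budget is an arbitrary example.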
The failure mode it prevents
Without a circuit breaker, a slow downstream service causes your service to be slow (threads waiting on timeouts). Your service being slow causes upstream services to queue or time out. What started as a problem in one optional dependency cascades through the entire system.
With a circuit breaker, the circuit opens after the first few failures. Subsequent calls fail fast. The rest of your service continues operating normally, minus the optional feature. Once the downstream service recovers, the next test request succeeds and the circuit closes on its own.
One important detail: this only works if your code actually handles the circuit being open. If you call the circuit-protected function and let the error propagate all the way up, you haven't protected anything. The protection requires a deliberate decision: when this fails, what does the user get? The answer might be "the page without notifications," "a cached response," or "an error, but a fast one." The circuit breaker gives you control over that decision. It doesn't make it for you.
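Here is one way to make that decision explicit in code. It is a sketch, not a prescription: `callWithFallback`, `cachedFallback`, and the `Breaker` interface are hypothetical names introduced for illustration. The `CircuitBreaker` class above satisfies the `Breaker` shape.

```typescript
// Minimal shape this sketch needs from a breaker (an assumption for illustration).
interface Breaker {
  call<T>(fn: () => Promise<T>): Promise<T>;
}

// Hypothetical helper: try the protected call; on any error, including an
// open circuit, return whatever `fallback` produces instead of propagating.
async function callWithFallback<T>(
  breaker: Breaker,
  fn: () => Promise<T>,
  fallback: (error: unknown) => T | Promise<T>
): Promise<T> {
  try {
    return await breaker.call(fn);
  } catch (error) {
    return fallback(error);
  }
}

// Example policy, "a cached response": remember the last good value and
// serve it while the dependency is unavailable.
function cachedFallback<T>(breaker: Breaker, initial: T) {
  let lastGood = initial;
  return (fn: () => Promise<T>): Promise<T> =>
    callWithFallback(
      breaker,
      async () => {
        lastGood = await fn();
        return lastGood;
      },
      () => lastGood
    );
}
```

The fallback is where the product decision lives. Returning a cached value, an empty result, or a fast error are all valid, as long as someone chose it on purpose.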
When to add it
Before the incident. I know that's the obvious answer and I also know it's not how it usually goes. The usual sequence is: optional dependency causes cascading failure, on-call engineer has a bad night, circuit breaker gets added to the postmortem action items.
Any time your service calls an external service, especially an optional one, a circuit breaker is worth the hour it takes to add. The hour spent adding it is much less than the night spent recovering from the alternative.
