PGH Web

Web Solutions from Pittsburgh

The Subtle Bug That Lives in Your Retry Logic

Jeff Straney

I once debugged a problem where an API call was failing consistently with a 400 error. Our code was retrying. Five attempts, five identical failures. We were not fixing the problem. We were just wasting time retrying something that would never succeed.

The real cost showed up downstream. A dependent service was timing out waiting for the result. The request was not timing out fast; it was timing out slow, retrying all the way down.

That is when I learned the distinction between retriable errors and non-retriable errors, and why most retry logic gets it backwards.

Retriable vs Non-Retriable

A retriable error is one where the cause is transient. Network timeout. Service temporarily unavailable. Connection pool exhausted. Rate limited. These will probably succeed if you try again.

A non-retriable error is one where the cause is permanent. Invalid credentials. Malformed request. Not found. Unauthorized. Retrying will not fix these. The same error will happen again.

The mistake is treating them the same. Code that retries a 401 Unauthorized is worse than code that fails immediately, because it turns a quick failure into a slow one. Code that fails immediately on a network timeout is worse than code that retries, because it gives up on a transient problem.

What Actually Happens

When you retry a request that is failing for a permanent reason, you are adding delay to a problem that was already a problem. The user or the dependent service is now waiting longer for bad news. If there is cascading failure (timeouts triggering more retries), you are amplifying it.

I watched a payment service go down because the error handling was retrying on every error, including 502 Bad Gateway. The gateway was having issues. The service kept retrying. The retries piled up. The service became slower, which made the gateway more overloaded, which made more requests fail, which made more retries happen. The right choice would have been to fail fast on the 502s and let the downstream service handle it.

How to Know the Difference

The rule of thumb: if retrying would produce a different result, it is retriable. If it would produce the same error, it is not.

Retriable: network timeout (next attempt might succeed), service temporarily unavailable (it might recover), rate limited (you can try again later).

Non-retriable: 400 bad request (the request is still bad), 401 unauthorized (you are still not authorized), 404 not found (it is still not there), 500 with a permanent cause (code has a bug).

The trap is that 500 errors can be both. A 500 from an overloaded database might be retriable. A 500 from a bug in the service is not. You cannot always tell from the error code.

The Practical Implementation

Most retry logic looks like this:

var resp *Response
var err error
for i := 0; i < maxRetries; i++ {
    resp, err = callService()
    if err == nil {
        return resp, nil
    }
    // always retry, even when the error can never change
}
return nil, err // retries exhausted

Better retry logic looks like this:

var resp *Response
var err error
for i := 0; i < maxRetries; i++ {
    resp, err = callService()
    if err == nil {
        return resp, nil
    }
    if !isRetriable(err) {
        return nil, err // fail fast on permanent errors
    }
    // back off before the next attempt
    time.Sleep(backoffDuration(i))
}
return nil, err // retries exhausted

The isRetriable function is the key. It has to know which errors are worth retrying.

func isRetriable(err error) bool {
    // network timeouts are retriable
    if errors.Is(err, context.DeadlineExceeded) {
        return true
    }

    // rate limits are retriable (rateLimitError is a sentinel defined elsewhere)
    if errors.Is(err, rateLimitError) {
        return true
    }

    // check the HTTP status code; errors.As unwraps wrapped errors,
    // which a plain type assertion would miss
    var httpErr interface{ StatusCode() int }
    if errors.As(err, &httpErr) {
        code := httpErr.StatusCode()
        // 429 (rate limited), 503 (unavailable), 504 (gateway timeout)
        return code == 429 || code == 503 || code == 504
    }

    // everything else is not retriable
    return false
}

This is not perfect. A 500 error might be permanent or transient. But it is better than retrying everything or nothing.

The Backoff Part

Retrying is not enough. You also need to back off. If you retry immediately, you are not giving the temporary problem time to resolve. You are just hammering the same failing service.

Exponential backoff is the pattern: wait a small amount, then wait longer on each retry.

attempt 1: no wait
attempt 2: wait 100ms
attempt 3: wait 200ms
attempt 4: wait 400ms
attempt 5: wait 800ms

This gives the service time to recover while still trying again. It also prevents thundering herd: if 1000 clients all fail and retry immediately, you get 1000 simultaneous requests. If they all back off, the load comes back more gradually.

The Real Cost

I have seen services brought down by retry logic that did not distinguish between retriable and non-retriable errors. They amplified failure instead of recovering from it. The fix is always the same: stop retrying things that will never succeed, and add backoff to things that might.

This is not a new idea. But it is a subtle enough bug that I have seen it in production code from teams that really should have known better. The error code does not tell you whether to retry. The nature of the error does. Building that distinction into your retry logic is the difference between resilience and cascading failure.