What to Do When the Fast Path Breaks (And How to Explain Why It Was Worth the Risk)
Jeff Straney
There is a category of solution that is fast, elegant, and wrong in a specific way: it works 99% of the time and fails catastrophically in the 1% case.
A cache that is perfect for read-heavy workloads but breaks if you get invalidation wrong. A denormalized table that speeds up queries but becomes a nightmare if the source data changes. An optimization that is correct 99 times out of 100 but fails in production under the exact load pattern you didn't anticipate.
These solutions are not mistakes. They are bets. And the problem is not that you took the bet. The problem is usually that you did not plan for the moment when you lose it.
Why Fast Paths Matter
Speed matters. Not as an abstract virtue, but because fast systems are used differently than slow ones. A dashboard that loads in 200ms changes behavior. A cache that reduces latency by 10x enables new features. A denormalized table that makes a query 100x faster opens up capabilities that were not feasible before.
Sometimes the only way to unlock those capabilities is to accept that your solution is not rock-solid in all cases. It is good enough for the cases that matter, and it fails gracefully (or not) in the edge cases.
The mistake is thinking that you don't need a plan for those edge cases because they won't happen. They will. Maybe not this quarter. But eventually, under load, or under the specific pattern of access you didn't anticipate, the 1% will happen.
Planning the Exit Strategy
The thing I learned is to build the fast path with the slow path in mind. You don't necessarily have to build the slow path upfront. But you have to know what it looks like and how you would get there if you had to.
Example: you build a cache for expensive computations. The cache is perfect when invalidation happens correctly. You have tests for it. You are confident. But you also know that if the invalidation strategy is wrong, you will serve stale data. So you build in a way that makes it easy to turn off the cache or to add a validation layer that detects stale data and recomputes.
You do not document "what if the cache breaks." You document "here is what the code looks like if we can't use the cache and here is how much slower it will be." That knowledge makes the code change fast when it needs to happen.
Example: you denormalize a table to speed up reads. The denormalization is correct today. You have the logic to keep it in sync. But you know that if the sync logic breaks, you have stale data. So you add a timestamp that lets you know when the denormalized table was last updated and a way to force a full rebuild if something seems off.
You do not say "if the denormalization breaks we will have to rewrite the whole thing." You say "if the denormalization breaks we can still read from the normalized table, and here is the query that does that, and it will be slower but it will be correct."
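Here is one way that can look, as a sketch. The schema, the `customer_totals` rollup table, and the staleness budget are all hypothetical, but the structure matches the plan above: a freshness timestamp on the denormalized row, the normalized query written down as the fallback, and an escape hatch to force a rebuild.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customer_totals (customer_id INTEGER PRIMARY KEY,
                              total REAL, updated_at REAL);
INSERT INTO orders VALUES (1, 7, 10.0), (2, 7, 5.0);
INSERT INTO customer_totals VALUES (7, 15.0, strftime('%s','now'));
""")

MAX_STALENESS_SECONDS = 300  # hypothetical freshness budget

def customer_total(customer_id: int) -> float:
    row = conn.execute(
        "SELECT total, updated_at FROM customer_totals WHERE customer_id=?",
        (customer_id,)).fetchone()
    if row and time.time() - row[1] <= MAX_STALENESS_SECONDS:
        return row[0]  # fast path: denormalized total, fresh enough
    # Slow path: the documented fallback query over the normalized table.
    # Slower, but correct regardless of the sync logic's state.
    return conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id=?",
        (customer_id,)).fetchone()[0]

def rebuild_total(customer_id: int) -> None:
    # Escape hatch: recompute from the source of truth and overwrite.
    total = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id=?",
        (customer_id,)).fetchone()[0]
    conn.execute("INSERT OR REPLACE INTO customer_totals VALUES (?, ?, ?)",
                 (customer_id, total, time.time()))
```

The fallback query is the one sentence of documentation that matters most: it is the thing you run when the timestamp says the fast path can no longer be trusted.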
When It Actually Breaks
And then it does break. Not the fast path. The assumption about the fast path. You discover that there is a case you did not test. Or a load pattern you did not anticipate. Or a change in behavior that breaks the invariant the optimization relied on.
At that moment, you have two choices. You can panic and start from scratch, or you can follow the exit strategy you planned and migrate to something slower but more reliable.
The second choice is only possible if you thought about it in advance. If you did not, you are now rewriting the system under production load while pages are going down.
I have lived both. The panic version takes days. The planned version takes hours, and you already have tests for the fallback path because you knew it existed.
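Those fallback tests can be as simple as a parity check: run the fast path and the slow path over the same data and assert they agree. Everything below is invented for illustration (the incremental-total class, the recompute function), but the shape is the point — the fallback is exercised in CI long before the 2am page.

```python
def slow_total(orders: list) -> float:
    # Fallback path: always recompute from the source of truth.
    return sum(orders)

class FastTotal:
    # Hypothetical fast path: a running total, updated incrementally.
    # If the increment logic ever drifts, the parity test below catches it.
    def __init__(self) -> None:
        self.orders: list = []
        self.total = 0.0

    def add(self, amount: float) -> None:
        self.orders.append(amount)
        self.total += amount  # the incremental update is the optimization

def test_fast_and_slow_paths_agree() -> None:
    ft = FastTotal()
    for amount in (10.0, 5.0, 2.5):
        ft.add(amount)
    # Parity check: if the fast path diverges from a full recompute,
    # this fails in CI instead of in production.
    assert ft.total == slow_total(ft.orders)

test_fast_and_slow_paths_agree()
```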
The Risk Assessment
This is not a call to avoid the fast path. It is a call to be intentional about the risk you are taking.
When you choose the fast solution over the robust one, you are saying: "I think this will work for the foreseeable future, and if it doesn't, I know what to do." That is a reasonable bet for a small feature or a non-critical system. It is not a reasonable bet if the cost of failure is catastrophic.
The cost is usually not just performance. It is trust. If you ship a fast solution that works great for a year and then breaks hard, you have lost credibility. If you ship a fast solution knowing the failure mode and you have a plan for it, you look like you know what you are doing.
I have been the engineer who shipped the fast path and the engineer who had to fix it when it broke at 2am. I would rather be the engineer who shipped it and had thought about what breaks. The difference is as simple as asking "what does this look like if this assumption is wrong?" and writing it down.
Most of the time the assumption holds and the fast path stays fast. Some of the time it doesn't. Being ready for that some of the time is what separates engineering from gambling.
