{"id":"bu6hdivbfwscldd","title":"Modern Engineering Demands the Full Stack","slug":"modern-engineering-demands-the-full-stack","summary":"Full-stack engineering matters because problems rarely stay in one layer. AI services, APIs, UI state, hosting, and failure handling all meet in the same user experience.","imageUrl":"https://briancrabtree.me/images/journal-modern-engineering-demands-the-full-stack.webp","category":"Engineering Mandate","date":"2025-12-17","featured":false,"likes":42,"author":"Brian Crabtree","content":"<h2>The Fallacy of the Simple Retry</h2>\n\n<p>The most common and most dangerous reaction to a service outage is the immediate retry. It is the knee-jerk reaction of a script written by someone who has never had to carry a pager. The logic seems sound on the surface: the request failed, so try again. However, in a distributed system, this creates what we call a thundering herd. If an upstream AI provider is struggling to process requests and returns a 503, having ten thousand of your client instances retry at the same instant does not fix the problem. It buries the provider deeper, keeping it down longer, and it ties up your own threads and resources waiting for responses that are never coming. You are essentially DDoS-ing your own dependency.</p>\n\n<p>To fix this, we have to implement automated retries with exponential backoff and jitter. This is non-negotiable architectural hygiene. Exponential backoff means we double the wait time before each subsequent retry: wait one second, then two, then four, and so on. But even that is not enough: if every client backs off by exactly the same amount, they all still retry in synchronized waves. This is where jitter comes in. We add random variance to the wait time to desynchronize the clients, smoothing the traffic spike into a manageable curve. This logic should not live in ad hoc application code, where it can be forgotten or implemented inconsistently. 
It belongs in your service mesh or your HTTP client configuration, applied globally and ruthlessly.</p>\n\n<h2>Circuit Breaking as Preventative Amputation</h2>\n\n<p>There comes a point where retrying is futile. If the AI service has been down for five minutes, the next request is statistically likely to fail as well. Continuing to hammer a dead endpoint is a waste of your infrastructure's compute cycles and network bandwidth. We need a mechanism to stop the bleeding, and that mechanism is the circuit breaker. Borrowed from electrical engineering, the concept is simple: when failures reach a certain threshold, the breaker \"trips\" and opens the circuit. For a set period, all attempts to call that service fail immediately without even trying to hit the network.</p>\n\n<p>This fail-fast behavior is critical for system stability. It frees up resources instantly. Instead of a thread hanging for thirty seconds waiting for a timeout, it returns an error in milliseconds. This allows your application to degrade gracefully or switch to a fallback mechanism immediately. Think of it as preventative amputation for your architecture; you sacrifice the limb (the AI feature) to save the body (the rest of the application). Once the cooldown period passes, the circuit breaker enters a half-open state and allows a single \"test\" request through. If it succeeds, the circuit closes and traffic resumes. If it fails, the breaker stays open and the cooldown starts over. This is how you survive an outage without your own monitoring dashboard lighting up red across the board.</p>\n\n<h2>The Asynchronous Imperative</h2>\n\n<p>One of the worst habits web developers have picked up is the tendency to do everything synchronously. We treat HTTP requests like function calls, blocking the user's browser while we wait for a server to think. With AI services, which can take anywhere from two seconds to a minute to generate a response, this is architectural suicide. 
Coupling your user's interface directly to the latency of a third-party inference engine is a guarantee of frustration. We need to aggressively decouple these processes using message queues.</p>\n\n<p>When a user requests an AI generation, your request handler should not call the provider directly. Instead, it should toss a message into a queue (RabbitMQ, Kafka, SQS, pick your poison) and immediately return a status to the user saying \"we are working on it.\" A separate pool of workers pulls jobs from that queue and handles the interaction with the slow, flaky AI service. If the service goes down, the queue just fills up. As long as the queue is durable, no data is lost, and no user request times out waiting on the provider. The workers can churn through the backlog when the service recovers. This also gives you the ability to throttle consumption. If you only have ten workers, each handling one job at a time, you will never hit the AI provider with more than ten concurrent requests, effectively rate-limiting yourself and preventing overage charges or rate-limit bans.</p>\n\n<p>Furthermore, we must talk about the graveyard of failed requests, known as the Dead-Letter Queue (DLQ). If a message still fails processing after all retries and backoff attempts, it should not just vanish into the ether. It gets moved to a DLQ. This allows you to inspect the failures later. Was it a malformed prompt? A specific model outage? You can't fix bugs you can't see, and the DLQ is your forensic evidence locker.</p>\n\n<h2>Defensive Caching and Stale Data</h2>\n\n<p>The cheapest request is the one you never make. In the context of expensive AI tokens, caching is not just a performance optimization; it is a cost-saving and reliability measure. While AI responses are often non-deterministic, many business use cases generate identical prompts for identical contexts. If user A asks for a summary of a specific document, and user B asks for the same summary ten minutes later, there is absolutely no reason to pay the latency and dollar cost to re-generate that text. 
Stick a distributed cache like Redis in front of your inference layer.</p>\n\n<p>We can take this a step further with the \"stale-while-revalidate\" pattern. If the cached data has recently expired, serve it anyway while triggering a background refresh. In a world where the upstream service might be throwing 503 errors, showing a user a summary that is an hour old is infinitely better than showing them an error stack trace. This approach masks the fragility of the dependency from the end user. Cache invalidation remains one of the hardest problems in computer science, but for AI outputs, we can often afford to be looser with our Time-To-Live (TTL) settings. Over-eager invalidation negates the benefit, while a robust cache acts as a shock absorber for your entire system.</p>\n\n<h2>The Art of Graceful Degradation</h2>\n\n<p>Eventually, despite your queues, your circuit breakers, and your fancy caching strategies, the service will be totally unavailable. This is where we separate the engineers from the framework users. What does your application do when the AI brain is lobotomized? If your answer is \"show a generic error modal,\" you have failed. We need to implement semantic fallbacks. If the AI summarizer is down, fall back to displaying the first paragraph of the text. If the recommendation engine is dead, show the most popular items from the last 24 hours.</p>\n\n<p>In some critical systems, it might even be worth maintaining a \"dumber,\" smaller model hosted locally or on cheaper infrastructure. It won't be as smart as the massive cloud model, but it will be available. It is the difference between a luxury car going into \"limp mode\" so you can get to a mechanic, and the car simply exploding on the highway. Your users generally prefer the limp mode. They might notice the quality drop, but they can complete their task. Reliability is the most important feature you can ship. Nobody cares how clever your prompt engineering is if the server returns a 503. 
Build for failure, expect the worst, and your systems might just survive the hype cycle. For a related angle I keep coming back to, see <a href=\"/journal/how-this-site-is-built/\">How This Site Is Built (Reference Stack)</a>.</p>","tags":["Fullstack Development","Software Engineering","Web Architecture","DevOps","Technical Leadership"],"views":102}