Engineering for Failure: The Five Decays of AI Systems

February 25, 2026

Most teams don't have a reliability problem with AI. They have a visibility problem.

You’ve probably seen this pattern many times. The model gives a confident answer. Later, you discover it was wrong, and there’s no obvious reason why. It just missed something important.

That’s not a single bug you can patch; it’s a failure mode you haven’t designed for yet.

After building AI-driven systems under real conditions (long sessions, large documents, changing requirements, provider updates), one lesson keeps repeating:

AI systems don’t “get worse.” They silently decay.

In this article, I’ll explain what that decay looks like and how to design against it.

We'll cover:
1. How AI actually processes information
2. Five specific, predictable ways AI execution degrades in production
3. Five architectural patterns to design against each failure mode

Let’s start with the mechanics.


How AI Actually Works: Tokens, Context and Attention

Before we talk about failure modes, one mechanical concept matters more than any other: tokens.

When you read “Does this integration require multi-factor authentication?”, you see words, understand the sentence, and hold the meaning in your head.

When an LLM sees the same sentence, it sees something closer to:
[Does] [this] [integration] [require] [multi] [-factor] [authentication] [?] — eight separate tokens. Each token triggers probability distributions over what might come next, based on patterns learned during training. The model is not reading in the human sense; it is statistically continuing a sequence.

That distinction has three major consequences.

First: Tokens are the unit of capacity.
Your instructions, the files you upload, the conversation history, and the model’s responses all share the same fixed-size buffer called a context window. When that buffer fills, the oldest content is removed entirely; it isn’t weakened or archived. The model has no persistent memory or hidden storage, only a sliding window that is rebuilt on every request.
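That sliding-window behavior can be sketched as a toy model. Everything here is illustrative: the whitespace "tokenizer" and the `ContextWindow` class are stand-ins, and real windows are orders of magnitude larger — but the overflow behavior is the same.

```python
from collections import deque

class ContextWindow:
    """Toy model of a fixed-size context window: oldest tokens fall off silently."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.tokens = deque()

    def append(self, text: str):
        # Crude whitespace splitting stands in for a real tokenizer.
        for tok in text.split():
            self.tokens.append(tok)
        # On overflow, the oldest tokens are dropped entirely, with no warning.
        while len(self.tokens) > self.max_tokens:
            self.tokens.popleft()

    def contents(self) -> str:
        return " ".join(self.tokens)

window = ContextWindow(max_tokens=8)
window.append("Never mix customer data.")                         # early constraint
window.append("Here is a long follow-up question about endpoints")
print("Never" in window.contents())  # False: the constraint was truncated away
```

The model downstream only ever sees `window.contents()`; it has no way to report that the constraint existed and was dropped.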

Second: Attention is not uniform.
Even when a document fits entirely inside the context window, the model does not process it evenly. Transformers compute attention globally, but signal strength varies by position. In practice, long documents tend to show stronger signal at the beginning and around your query, and weaker signal in the middle — exactly where exceptions, edge cases, and compliance details often live. The model doesn’t intentionally skip that middle section, but it does weight it less.

Third: Fluency is not accuracy.
The model is trained to produce coherent, confident-sounding output. It is not trained to tell you when its signal was weak or incomplete. The gap between linguistic certainty and actual evidential coverage is where most high-impact failures originate.

With that foundation in place, we can look at what goes wrong in production.


Two Ways AI Fails You

There are two failure categories worth caring about.

Capability failure means the model cannot do the task in principle. It cannot generate cryptographically secure randomness. It cannot reliably prove complex theorems. No amount of prompting will fix that; you’re using the wrong tool.

Execution failure means the model can do the task, but fails to do it reliably under real-world conditions.

Execution failures are subtle. The model doesn’t tell you that context was truncated, or that it glossed over a critical detail on line 310, or that its behavior changed after a provider update last Tuesday. It still returns an answer that sounds confident and well-formed.

Capability failures are usually obvious. Execution failures are not. And execution failures are where almost every real production incident comes from.

In practice, I see five repeatable patterns of execution failure. I call them the Five Decays.


The Five Decays of AI Systems

These are not edge cases. They follow directly from how language models work. Each is predictable, and none of them announce themselves.

1. Context Decay: “What you said first fades first.”
2. Attention Decay: “Large files become skimmed files.”
3. Intent Drift: “Your goal changes; the AI doesn’t notice.”
4. Behavioral Regression: “Yesterday’s success, today’s failure.”
5. Confidence Inflation: “Less reading, more certainty.”

Let’s walk through each one.


Decay 1: Context Decay

Most developers assume that a long session implies long memory. It doesn’t.

You set constraints at the start of a session:
“Never mix customer data. Always use metric units. Return JSON only.”

Twenty turns later, those constraints may no longer be present in the context at all. They aren’t deprioritized; they’ve simply been pushed out of the window.

Two mechanisms drive this:

  • Position bias means tokens near the current edges of the window — especially recent ones — exert stronger influence. Instructions buried in the middle have a weaker signal.
  • Truncation means that when the context window fills, the oldest tokens are cut off. The model continues fluently with whatever remains and does not flag that anything has been removed.

Long sessions feel like they are building up shared memory. Mechanically, they behave like a bounded buffer under constant pressure.


Decay 2: Attention Decay

In this case, nothing was truncated. The entire document fits inside the context window. The model has technically “seen” all 450 lines.

But seeing is not the same as reading carefully.

Imagine a 450-line security specification. You ask: “Does this support SSO?” The model replies: “Yes, OAuth 2.0 with SAML fallback.” You follow up: “Quote the SAML configuration lines.” Now the model responds: “I see OAuth, but cannot find a SAML implementation.”

The SAML section was there the whole time. It simply received less attention. The beginning of the document and the region around your latest question got strong signal. The deep middle — where the SAML details lived — was weighted least.

The result is that a critical detail on line 310 ends up in the worst possible place in the context for attention. The model sounds like it read everything. Statistically, it treated different parts of the document very differently.


Decay 3: Intent Drift

Intent Drift is more subtle because nothing was removed and nothing is obviously low-signal. The problem here is recency optimization.

LLMs prioritize the most recent instructions in the prompt. They don’t maintain a stable, persistent notion of “what we’re really doing” unless that goal is restated and reinforced.

Watch how a realistic session evolves:

  • Turn 1: “Debug this authentication error.” (Fix one bug.)
  • Turn 6: “Actually, can you clean up this auth logic?” (Scope expands.)
  • Turn 12: “Let’s redesign the auth flow.” (Architecture changes.)
  • Turn 20: “Why did this break production?” (There’s now an incident.)

No one stopped and declared a scope change. Each step felt like a small adjustment. The model isn’t “forgetting” the original goal; it is simply optimizing for the strongest current signal, which is usually the most recent request.

You believe you’re on a continuous thread. The model is now solving a different problem.


Decay 4: Behavioral Regression

You tested a prompt. It worked well. You shipped it. Three weeks later, the same prompt produces noticeably different behavior. You haven’t changed anything on your side.

LLM providers may update model weights, fine-tuning layers, safety systems, routing logic between model variants, and inference optimizations — often without changing the public model name. Unlike traditional APIs, where outputs are constrained by fixed schemas, LLM outputs are free-form text. Regression often shows up as semantic drift, not obvious breakage.

You start to see small shifts:

  • JSON formatting changes slightly
  • Refusals become stricter
  • Constraint adherence weakens
  • Tone moves from neutral to more assertive

Nothing crashes, but the invariants you were relying on erode.

With traditional software, past validation gives you a reasonable guarantee of future behavior. With LLM-based systems, that assumption does not hold.


Decay 5: Confidence Inflation

This is the most dangerous decay because it hides all the others.

As context decays, as attention becomes uneven, as the underlying signal weakens, the model does not naturally become more cautious. Very often, it becomes more decisive.

The reason is simple: the model is optimized for fluent, confident language, not calibrated uncertainty. If the strongest signals in a document suggest that MFA is required, it will generalize confidently, even if the exception that disproves that rule lives in a low-attention region on line 310.

You will not see a warning like: “I may have missed something in the middle.”
You are more likely to see: “This integration definitely requires MFA for all endpoints for enhanced security.”

Humans tend to equate confident language with reliability. The model’s tone stays strong even when its evidential coverage is patchy. That gap — between how certain the answer sounds and how much was actually checked — is what leads to many of the most costly failures.

The pattern is familiar: context weakened, attention uneven, intent shifted, behavior drifted — and the output still sounds authoritative.


Distrust as Architecture

At this point, it is tempting to say: “We just need better prompts.”

We don’t. Prompts are instructions. Architecture is control. When failure modes are structural, mitigation has to be structural as well.

“Distrust” here is not emotional skepticism toward AI. It’s an engineering stance:

Do not design systems that depend on optimistic assumptions about LLM behavior.

Each decay maps to a concrete countermeasure. Five decays, five controls.


Pattern 1: Time-Bounded Context (counters Context Decay)

You can’t eliminate Context Decay, but you can control the time horizon where you assume memory is valid.

Core principle: Don’t assume long sessions are safe.

Instead:

  • Cap conversation length (e.g., 20 turns or 48 hours)
  • Periodically restate critical constraints instead of assuming they survived
  • Force summarization before continuing very long threads
  • Timestamp and version rules and documentation
  • Treat “earlier in the chat” as volatile memory, not durable storage
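These caps can be enforced outside the model with a small session guard. A minimal sketch, assuming the limits above — the `SessionGuard` class, thresholds, and restatement interval are all illustrative choices, not a specific library:

```python
import time

MAX_TURNS = 20
MAX_AGE_SECONDS = 48 * 3600  # 48 hours

class SessionGuard:
    """Enforces time-bounded context: caps turns and session age,
    and periodically re-injects critical constraints into the prompt."""
    def __init__(self, constraints: list[str], restate_every: int = 5):
        self.constraints = constraints
        self.restate_every = restate_every
        self.started = time.time()
        self.turns = 0

    def next_prompt(self, user_message: str) -> str:
        self.turns += 1
        if self.turns > MAX_TURNS or time.time() - self.started > MAX_AGE_SECONDS:
            raise RuntimeError("Session expired: start fresh with a summary.")
        # Restate constraints every N turns instead of trusting the window.
        if self.turns % self.restate_every == 0:
            preamble = "ACTIVE CONSTRAINTS:\n" + "\n".join(f"- {c}" for c in self.constraints)
            return f"{preamble}\n\n{user_message}"
        return user_message

guard = SessionGuard(["Return JSON only", "Always use metric units"])
for _ in range(4):
    guard.next_prompt("...")
print(guard.next_prompt("Does the spec support SSO?").startswith("ACTIVE CONSTRAINTS"))  # True
```

The point is not the specific numbers; it is that expiration and restatement become code paths rather than hopes.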

For documentation, embed validity windows directly in the text:

“This spec is valid until 2026‑03‑01. If the current date is later than this, refuse to answer and request updated documentation.”

If memory is inherently unstable, make expiration explicit. Decay is no longer silent; it becomes a defined event.


Pattern 2: Forced Evidence Extraction + Completeness Enumeration (counters Attention Decay)

Attention Decay can't be fixed inside the model. So we instrument it from outside.

Make reading observable. If the model claims it read a document, it must prove it.

Weak prompt:

Does this integration support bulk operations?

Distrust prompt:

Quote the exact lines (with line numbers) that define bulk operation endpoints. 

Format: [Line X–Y: 'exact quote']. 

If no explicit evidence exists, respond exactly: 'NO EXPLICIT EVIDENCE FOUND'.

No citation means no claim. If the answer depends on line 310, the model must surface line 310. If it didn't read it closely enough to cite it, it cannot fake precision.
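The "no citation means no claim" rule can be enforced mechanically. A sketch that validates model output against the citation format from the prompt above (the function and regex are my own illustrative names):

```python
import re

NO_EVIDENCE = "NO EXPLICIT EVIDENCE FOUND"
# Matches citations like [Line 45-60: 'exact quote'] (hyphen or en dash).
CITATION = re.compile(r"\[Line\s+(\d+)\s*[-–]\s*(\d+):\s*'([^']+)'\]")

def validate_answer(answer: str) -> dict:
    """Reject any claim that carries neither a line-level citation nor the
    explicit no-evidence marker. Returns parsed citations for spot checks."""
    if answer.strip() == NO_EVIDENCE:
        return {"status": "no_evidence", "citations": []}
    citations = CITATION.findall(answer)
    if not citations:
        raise ValueError("Claim rejected: no line-level evidence cited.")
    return {"status": "cited",
            "citations": [(int(a), int(b), quote) for a, b, quote in citations]}

result = validate_answer("[Line 45–60: 'POST /bulk accepts arrays'] Bulk ops supported.")
print(result["citations"])  # [(45, 60, 'POST /bulk accepts arrays')]
```

The parsed line ranges can then be diffed against the source document to confirm the quote actually appears where the model says it does.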

For large documents, add Completeness Enumeration before any answer:

Before answering:
1. List all major sections identified (with line ranges)
2. Mark each as:
   - FULLY REVIEWED
   - PARTIALLY REVIEWED
   - NOT REVIEWED
3. Then answer with coverage caveats

Now you can see what was covered, what was skimmed, what was ignored. This doesn't eliminate Attention Decay — but it exposes it. Exposed decay is manageable. Silent decay is not.

With that instruction in place, the output might look like:

Sections detected:

• Authentication (1–120) – FULLY REVIEWED
• Endpoints (121–300) – PARTIALLY REVIEWED
• Security (301–450) – NOT REVIEWED

Answer based on partial coverage.
Security section not analyzed.

Pattern 3: Explicit Resets (counters Intent Drift)

Intent Drift happens because the model optimizes for recency. So we defend the goal.

Every 10 turns:

  • Summarize the original objective
  • List any scope changes introduced
  • Ask "Is the primary goal unchanged?"
  • Do not proceed without confirmation

When switching tasks, declare the change structurally with an explicit Context Reset protocol:

Previous objective:
    [Debug authentication bug]
Proposed objective:
    [Redesign authentication module]
Question:
    Treat as continuation or start new session?

This creates structural friction. Instead of letting scope evolve implicitly through casual conversation, we force it into the open. Intent Drift is silent. Explicit Resets make it visible.


Pattern 4: Stability Guards + Regression Testing (counters Behavioral Regression)

Behavioral Regression is external drift. You don't control it. So you instrument it.

  • Maintain regression test suites for critical prompts — treat prompts like code
  • Snapshot and diff outputs across model versions, not just correctness but constraint adherence, formatting, refusal behavior, tone
  • Before deploying a new model version, run a shadow evaluation: same prompt suite against both old and new, diff behavior before switching
  • In production, use canary prompts — known-answer queries that run continuously to detect drift early
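One way to make free-form drift diffable is to reduce each output to a behavioral fingerprint before comparing. A sketch — the specific traits are illustrative; real suites would track whatever invariants your system depends on:

```python
import json

def _parses_as_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def behavior_fingerprint(output: str) -> dict:
    """Reduce a free-form model output to comparable behavioral traits,
    so drift shows up as a concrete diff rather than a vague feeling."""
    return {
        "is_json": _parses_as_json(output),
        "refused": output.lower().startswith(("i can't", "i cannot", "sorry")),
        "length_bucket": len(output) // 200,  # coarse length band
    }

def diff_traits(old: dict, new: dict) -> list[str]:
    return [k for k in old if old[k] != new[k]]

old = behavior_fingerprint('{"mfa": true}')
new = behavior_fingerprint("Sorry, I cannot help with that.")
print(diff_traits(old, new))  # ['is_json', 'refused']
```

Run the same canary prompts on a schedule, fingerprint the responses, and alert on any non-empty diff against the stored baseline.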

If JSON formatting breaks, refusals tighten unexpectedly, or constraints weaken — you detect it immediately rather than in a post-incident review.

You cannot freeze the model. But you can make drift observable.


Pattern 5: Mandatory Uncertainty (counters Confidence Inflation)

Confidence Inflation happens because the model is rewarded for fluency, not for being well-calibrated. To counter this, you have to engineer calibration from the outside.

Avoid asking for fake precision (“Give me a probability between 0 and 1 with two decimal places”). Instead, require structured uncertainty alongside the answer.

For example, after each substantive answer, append a block like:

CONFIDENCE LEVEL: [LOW / MEDIUM / HIGH]

EVIDENCE:
• [Line X–Y: quote]

GAPS:
• Sections not reviewed
• Assumptions made

RISK:
• What could invalidate this answer

Simple rules for the confidence levels:

  • HIGH → direct citation, full section coverage
  • MEDIUM → partial coverage or inferred logic
  • LOW → missing sections or weak signals

The output might look like:

Answer:
Supports SSO via OAuth 2.0.

CONFIDENCE LEVEL: MEDIUM

EVIDENCE:
- Lines 45–60 describe OAuth flows

GAPS:
- Section 4 (SAML) not fully reviewed

RISK:
- Legacy endpoints may differ

Now the model has to surface what it relied on, what it inferred, and what it didn’t check. You haven’t eliminated mistakes, but you have reduced the likelihood of people trusting a brittle answer blindly.
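The simple confidence rules can even be applied mechanically, rather than trusting the model's self-assessment. A sketch, assuming the coverage labels from Pattern 2 (the function name and inputs are illustrative):

```python
def assign_confidence(has_direct_citation: bool, coverage: dict[str, str]) -> str:
    """Apply the rules above: any unreviewed section forces LOW; HIGH needs a
    direct citation plus full coverage; everything else is MEDIUM."""
    statuses = set(coverage.values())
    if "NOT REVIEWED" in statuses or not coverage:
        return "LOW"
    if has_direct_citation and statuses == {"FULLY REVIEWED"}:
        return "HIGH"
    return "MEDIUM"

coverage = {
    "Authentication (1-120)": "FULLY REVIEWED",
    "Security (301-450)": "PARTIALLY REVIEWED",
}
print(assign_confidence(True, coverage))  # MEDIUM
```

Deriving the level from observed coverage, instead of asking the model how confident it feels, keeps Confidence Inflation out of the calibration loop.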


The Architecture at a Glance

Failure Mode → Engineering Control

  • Context Decay → Time-Bounded Context
  • Attention Decay → Forced Evidence Extraction + Completeness Enumeration
  • Intent Drift → Explicit Resets
  • Behavioral Regression → Stability Guards + Regression Testing
  • Confidence Inflation → Mandatory Uncertainty

None of this makes the model itself smarter. It makes the system safer. Capability is rarely your limiting factor; reliable execution almost always is.


The Shift That Matters

When behavior is deterministic, we try to prove correctness. When behavior is probabilistic, we focus on constraining failure. LLM systems are probabilistic components.

So the core questions are not “Which model is best?” or “Can I trust this?” Those are too vague to be useful.

Much better questions are:

  • How does this fail under load?
  • How will I notice if quality degrades?
  • What happens when it’s confidently wrong?
  • Which of the decays apply to this workflow?
  • What control do I have in place for each one?

AI doesn’t need to be flawless to create value, but its failures need to be bounded. If you can answer “What happens when it’s wrong?” clearly and concretely, you’ve moved from experimentation to engineering.



This article grew out of a workshop I run for senior engineers and technical leaders. If your team is building AI-driven systems in production and you’d like to bring a version of this session to your organization, feel free to reach out.