Duplicate side effects
A retry creates a second charge, record, or message because the workflow never stored a stable idempotency boundary.
Ambiguous partial success
The API call failed or timed out, but the write may already have happened and the workflow cannot tell safely.
Stale verification
A follow-up read lags the write, so the agent advances or retries from a false picture of state.
Wide-batch damage
One failure in a large loop turns recovery into manual diffing because the work unit was too large to resume cleanly.
The useful production question is not “did the API usually respond?” It is “when the workflow dies halfway through, what exact state tells the next worker whether to retry, verify, or stop?”
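One way to give a retry a stable idempotency boundary is to derive the key from the task identity and stage instead of generating a new one per attempt. The sketch below is a minimal illustration, not a provider-specific recipe: the function name `idempotency_key` and the field names are assumptions made for the example.

```python
import hashlib
import json


def idempotency_key(task_id: str, stage: str, payload: dict) -> str:
    """Derive a deterministic key so every retry of the same step reuses it."""
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable
    # regardless of the order in which the payload dict was built.
    canonical = json.dumps(
        {"task_id": task_id, "stage": stage, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The first attempt and a later retry build the payload in a different order
# but still produce the same key.
first = idempotency_key(
    "onboarding-abc123", "contact_creation",
    {"email": "user@example.com", "plan": "pro"},
)
retry = idempotency_key(
    "onboarding-abc123", "contact_creation",
    {"plan": "pro", "email": "user@example.com"},
)
```

Because the key depends only on durable task facts, a worker that restarts with no memory of the first attempt can still send the same key, provided the target API or your own dedupe table honors it.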
1. State management is the multiplier on API reliability
Two workflows can call the same API and get very different outcomes. One checkpoints before a write, verifies after the side effect, and can resume from known state. The other just retries when something looks wrong. The first absorbs a 2am failure. The second creates duplicate writes or stalls until a human untangles it.
That is why low-scoring APIs feel expensive even before they go fully down. The defensive-code tax comes from state uncertainty at failure boundaries: ambiguous 500s, unclear retries, stale verification reads, and multi-step writes that can partially succeed.
Reliable agent systems do not assume the happy path. They make failure states legible enough that recovery is a normal branch, not an emergency improvisation.
2. Know which state your workflow is carrying
Ephemeral state
In-memory plan state, intermediate results, and context-window material. Fine to lose if it can be rebuilt from durable truth.
Checkpoint state
The last confirmed stage, inputs, outputs, and timestamps needed to resume after a restart or ambiguous failure.
Provider-context state
The target model, tokenizer, context budget, reasoning mode, and compression rules active when this step was planned or executed.
Durable side-effect state
Records, payments, messages, and external resources already written to another system. You cannot undo these; you verify what exists and branch from there.
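The four kinds of state above can be kept apart in code so that only the durable parts are persisted. This is a rough sketch under assumed names (`TaskState`, `ProviderContext`, `SideEffect`, and every field are hypothetical), meant only to show the separation.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ProviderContext:
    provider: str
    model: str
    context_budget_tokens: int
    reasoning_mode: str


@dataclass
class SideEffect:
    system: str        # external system that was written to
    resource_id: str   # identifier returned by that system
    verified: bool     # whether a follow-up read confirmed the write


@dataclass
class TaskState:
    task_id: str
    stage: str                          # checkpoint state
    provider_context: ProviderContext   # provider-context state
    side_effects: list[SideEffect] = field(default_factory=list)  # durable side-effect state
    scratch: dict = field(default_factory=dict)  # ephemeral state, rebuilt on resume

    def to_durable(self) -> dict:
        """Return only what must survive a restart; scratch is dropped."""
        record = asdict(self)
        record.pop("scratch")
        return record


state = TaskState(
    task_id="onboarding-abc123",
    stage="contact_creation",
    provider_context=ProviderContext("example-provider", "example-model", 8000, "standard"),
    scratch={"draft_plan": ["create contact", "send welcome"]},
)
durable = state.to_durable()
```

Dropping `scratch` at the persistence boundary makes the rule explicit: anything in it must be rebuildable from the durable record.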
Multi-LLM routing makes context state provider-specific
Recent notes on multi-LLM context management make the hidden state boundary concrete: a conversation that fits one provider can overflow another because tokenizers, context windows, reasoning controls, and response formats do not line up. If the workflow can switch models, provider context becomes recoverable state, not just prompt plumbing.
- Checkpoint which provider and model were active when context was counted, trimmed, summarized, or compressed.
- Do not reuse one token estimate across providers; tokenizer mismatch can turn a safe resume into a silent overflow.
- Store the reasoning budget or mode that shaped the prior response so a fallback model does not inherit a state contract it cannot reproduce.
- When routing changes mid-workflow, re-measure context for the new target before the next side effect, not after the overflow error.
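The bullets above can be sketched as a guard that runs before the next side effect. Everything here is illustrative: `ContextCheckpoint`, `prepare_context`, and the `count_tokens` callable are hypothetical names, and the toy counter stands in for a real per-provider tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ContextOverflow(Exception):
    """Raised when the measured context exceeds the target's budget."""


@dataclass(frozen=True)
class ContextCheckpoint:
    provider: str
    model: str
    token_count: int
    reasoning_mode: str


def prepare_context(
    checkpoint: ContextCheckpoint,
    provider: str,
    model: str,
    reasoning_mode: str,
    context_limit: int,
    messages: Sequence[str],
    count_tokens: Callable[[str, str, Sequence[str]], int],
) -> ContextCheckpoint:
    """Return a checkpoint valid for the target route, or raise before any side effect."""
    route_changed = (checkpoint.provider, checkpoint.model) != (provider, model)
    if route_changed or checkpoint.reasoning_mode != reasoning_mode:
        # Never reuse a token estimate measured for a different route.
        measured = count_tokens(provider, model, messages)
        checkpoint = ContextCheckpoint(provider, model, measured, reasoning_mode)
    if checkpoint.token_count > context_limit:
        raise ContextOverflow(
            f"{checkpoint.token_count} tokens exceeds limit {context_limit} for {provider}/{model}"
        )
    return checkpoint


def toy_count(provider: str, model: str, messages: Sequence[str]) -> int:
    # Stand-in tokenizer: provider-b is assumed to split text twice as finely.
    words = sum(len(m.split()) for m in messages)
    return words * 2 if provider == "provider-b" else words


messages = ["summarize the onboarding ticket", "include billing status"]
saved = ContextCheckpoint("provider-a", "model-a", 7, "standard")
same = prepare_context(saved, "provider-a", "model-a", "standard", 10, messages, toy_count)
rerouted = prepare_context(saved, "provider-b", "model-b", "standard", 20, messages, toy_count)
```

With the same messages, the rerouted checkpoint carries a different count than the saved one, which is the mismatch the guard exists to catch.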
A persistent chat assistant needs recoverable state beyond the VM staying up
Fresh persistent-assistant builds on cloud VMs and chat channels make the state boundary concrete. Keeping a process online is not the same as proving the workflow can resume after a restart, webhook replay, model timeout, or half-delivered Telegram response. The assistant needs durable task state before it speaks, not just a long-lived host.
- Persist conversation cursor, channel message id, delivery status, tool plan, and last verified side effect before the assistant answers through chat.
- Separate process uptime from workflow continuity. A VM restart, worker redeploy, or Telegram webhook replay should resume from checkpointed truth, not from chat history alone.
- Store which host, model route, credential lane, and channel adapter handled the step so recovery can distinguish infrastructure restart from model, auth, or delivery drift.
- Require a replay drill that kills the assistant after planning but before reply delivery, then proves whether the next worker should send, verify, or stop.
Pair this with persistent memory governance and session observability: continuity is safe only when inherited context, delivery state, and side-effect state stay separately inspectable.
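The replay drill described above comes down to one question the next worker must answer from persisted delivery state alone. A minimal sketch, assuming a hypothetical three-value delivery status stored with each reply record:

```python
from enum import Enum


class Delivery(str, Enum):
    PLANNED = "planned"      # reply composed and checkpointed, send not attempted
    SENDING = "sending"      # send attempted, outcome unknown
    DELIVERED = "delivered"  # channel confirmed and message id recorded


def next_action(record: dict) -> str:
    """Decide what a restarted worker should do with a pending reply."""
    status = Delivery(record["delivery_status"])
    if status is Delivery.PLANNED:
        return "send"    # nothing left the process, so sending is safe
    if status is Delivery.SENDING:
        return "verify"  # ask the channel before risking a duplicate
    return "stop"        # already delivered


killed_after_planning = {"delivery_status": "planned", "channel_message_id": None}
killed_mid_send = {"delivery_status": "sending", "channel_message_id": None}
finished = {"delivery_status": "delivered", "channel_message_id": "42"}
```

The design choice is to write the `sending` status before the send call, so a crash during delivery lands in the verify branch rather than silently resending.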
Fixing one agent bug should leave evidence for the next one
Fresh operator debugging stories keep showing the same pattern: the first fix does not end the incident; it reveals the next weak boundary. A sales-chatbot bug turns into a retrieval issue, then a stale session issue, then a retry or escalation issue. Treat each fix as a state transition, not a cleanup note, or the next failure starts from folklore.
- Preserve the last verified good state before applying a fix so the next failure has a clean comparison point.
- Record which symptom the fix targeted, which provider route changed, and which downstream side effect must be rechecked.
- Do not collapse chained failures into one incident note; store each handoff so recovery can distinguish root cause from the next exposed weakness.
- Link recovery evidence to observability traces and loop budgets, because the next bug is often hidden in cost, retries, or stale route context.
The follow-through belongs in observability traces and loop-budget evidence, not just in a prompt changelog.
A fresh checkout is a recovery test, not just developer hygiene
Recent discussion of AI coding harnesses is really a state-management warning. If an agent can continue work only because one repo already contains the right prompts, fixtures, cached tokens, and shell context, the workflow has not proven restart safety. The harness itself becomes recoverable state.
- Checkpoint the harness version, instruction file, fixture pack, and test command before the agent changes code or invokes external tools.
- Run the recovery path from a fresh checkout, not only from the warm repo that still has cached state, local env, or a human-curated shell history.
- Store the failing test, denied-neighbor case, and policy bundle with the work item so the next session can distinguish missing context from missing authority.
- Treat harness drift as state drift: if route cards, schemas, or fixtures changed since the last checkpoint, re-verify before replaying side effects.
Pair this with capability-first onboarding: the first useful action should be reproducible from a clean workbench before the lane gets more authority.
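Harness drift can be detected by fingerprinting the files the agent depends on and comparing against the value stored at the last checkpoint. The sketch below assumes hypothetical file names (`instructions.md`, `fixtures.json`) and a hypothetical helper pair; a real harness would track its own file list.

```python
import hashlib
import tempfile
from pathlib import Path


def harness_fingerprint(root: Path, tracked: list[str]) -> str:
    """Hash the tracked harness files in a stable order."""
    digest = hashlib.sha256()
    for rel in sorted(tracked):
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        # A missing file raises FileNotFoundError, which is the desired signal
        # on a fresh checkout that lacks part of the harness.
        digest.update((root / rel).read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def harness_drifted(root: Path, tracked: list[str], checkpointed: str) -> bool:
    """True when the harness no longer matches the checkpointed fingerprint."""
    return harness_fingerprint(root, tracked) != checkpointed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    tracked = ["instructions.md", "fixtures.json"]
    (root / "instructions.md").write_text("run tests before commit\n", encoding="utf-8")
    (root / "fixtures.json").write_text('{"case": 1}\n', encoding="utf-8")
    checkpointed = harness_fingerprint(root, tracked)
    unchanged = harness_drifted(root, tracked, checkpointed)
    # A fixture edit after the checkpoint should force re-verification.
    (root / "fixtures.json").write_text('{"case": 2}\n', encoding="utf-8")
    drifted = harness_drifted(root, tracked, checkpointed)
```

Storing the fingerprint with the work item lets the next session tell "the harness changed" apart from "the task failed" before it replays any side effect.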
A manual checkpoint without a reason is not recovery evidence
Durable checkpoints are supposed to tell the next worker which chain head was safe to resume from and why it was captured. If an admin path can sign a blank reason, the checkpoint still looks official but loses the operator intent needed during recovery. Normalize the reason first, reject empty values, and only then snapshot the audit, billing, score, or receipt stream.
- Normalize the checkpoint reason before the outbox, audit stream, billing stream, score-audit chain, or execution-receipt head can observe it.
- Reject missing, blank, or whitespace-only reasons as typed parameter failures instead of signing a meaningless manual snapshot.
- Preserve the normalized reason with stream name, source head hash, source sequence, verification status, flush choice, and operator metadata.
- Treat the reason as recovery evidence: if tomorrow's worker cannot tell why the checkpoint exists, the checkpoint is not enough to resume safely.
Pair this with observability evidence: signed state is useful only when the reason, stream, and verification policy explain the recovery decision.
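The normalize-then-reject order described above might look like the following. This is a sketch under assumptions: `normalize_reason`, `InvalidCheckpointReason`, `build_checkpoint`, and the record fields are hypothetical names, and signing is left out entirely.

```python
import unicodedata


class InvalidCheckpointReason(ValueError):
    """Typed parameter failure for a missing or blank checkpoint reason."""


def normalize_reason(reason: object) -> str:
    """Collapse whitespace and reject reasons that carry no operator intent."""
    if not isinstance(reason, str):
        raise InvalidCheckpointReason("reason must be a string")
    # str.split() with no arguments splits on any run of whitespace,
    # so tabs, newlines, and repeated spaces all collapse to single spaces.
    normalized = " ".join(unicodedata.normalize("NFC", reason).split())
    if not normalized:
        raise InvalidCheckpointReason("reason must not be blank")
    return normalized


def build_checkpoint(stream: str, source_head_hash: str, source_sequence: int,
                     reason: object, operator: str) -> dict:
    """Validate the reason first, so no stream ever observes a blank one."""
    clean_reason = normalize_reason(reason)
    return {
        "stream": stream,
        "source_head_hash": source_head_hash,
        "source_sequence": source_sequence,
        "reason": clean_reason,
        "operator": operator,
    }


checkpoint = build_checkpoint("audit", "abc123", 41, "  pre-migration   snapshot \n", "ops@example.com")
```

Raising a dedicated exception type keeps a blank reason distinguishable from other validation failures in logs and API responses.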
3. Four recovery patterns separate resilient agents from fragile demos
Checkpoint before destructive writes
Before any call that creates a side effect you cannot casually undo, store the stage, inputs, and task identity. A timeout after that point becomes a resumable branch instead of a mystery.
{
  "task_id": "onboarding-abc123",
  "stage": "contact_creation",
  "input": { "email": "user@example.com", "plan": "pro" },
  "status": "in_progress",
  "started_at": "2026-04-03T10:00:00Z"
}
Verify after the side effect, not just after the call
A successful response does not guarantee the external state is visible or durable yet. The safe pattern is call, verify, then advance the workflow. That extra read is often cheaper than cleaning up a silent mismatch later.
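A verification read usually needs a short, bounded wait, because the read path may lag the write. The sketch below assumes a hypothetical `read` callable that returns the record or `None`; the attempt count and delays are placeholders, not recommendations.

```python
import time
from typing import Callable, Optional


def verify_after_write(
    read: Callable[[], Optional[dict]],
    expected: dict,
    attempts: int = 5,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the read shows the expected fields, or give up."""
    for attempt in range(attempts):
        observed = read()
        if observed is not None and all(observed.get(k) == v for k, v in expected.items()):
            return True
        if attempt < attempts - 1:
            # Exponential backoff between reads; no sleep after the last one.
            sleep(delay_seconds * (2 ** attempt))
    return False


# Simulated lagging read: the record becomes visible on the third read.
responses = iter([None, None, {"id": "c_1", "email": "user@example.com"}])
confirmed = verify_after_write(
    lambda: next(responses),
    {"email": "user@example.com"},
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
```

A `False` result does not prove the write failed. It only means the workflow should take the conservative branch instead of advancing or blindly retrying.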
Scope failure recovery to the smallest useful unit
If a single loop iteration can touch 500 records, one ambiguous failure forces a manual diff. If the work is checkpointed in smaller units, restart cost stays bounded. Match checkpoint frequency to the amount of rework you can honestly afford.
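To make the bounded-rework idea concrete, the sketch below processes 500 records in chunks of 50 and saves a cursor after each chunk. The names `process_in_chunks`, `load_cursor`, and `save_cursor` are hypothetical, and the in-memory cursor stands in for a durable store.

```python
from typing import Callable, Sequence


def process_in_chunks(
    records: Sequence[int],
    chunk_size: int,
    handle: Callable[[int], None],
    load_cursor: Callable[[], int],
    save_cursor: Callable[[int], None],
) -> None:
    """Handle records chunk by chunk, checkpointing the cursor after each chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = load_cursor()
    for offset in range(start, len(records), chunk_size):
        chunk = records[offset:offset + chunk_size]
        for record in chunk:
            handle(record)
        # Only advance the cursor once the whole chunk has been handled.
        save_cursor(offset + len(chunk))


cursor = {"value": 0}
handled = []
pending_failures = {123}


def flaky_handle(record: int) -> None:
    # Simulate one ambiguous failure on record 123, then succeed on retry.
    if record in pending_failures:
        pending_failures.discard(record)
        raise TimeoutError(f"record {record} timed out")
    handled.append(record)


records = list(range(500))
cursor_after_failure = None
try:
    process_in_chunks(records, 50, flaky_handle, lambda: cursor["value"],
                      lambda v: cursor.update(value=v))
except TimeoutError:
    cursor_after_failure = cursor["value"]

# Resume: restarts at the last saved cursor, not at record 0.
process_in_chunks(records, 50, flaky_handle, lambda: cursor["value"],
                  lambda v: cursor.update(value=v))
```

Records 100 through 122 are handled twice in this run, so the handler must be idempotent; the chunk size is what caps that rework.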
Design explicit recovery paths, not just error handlers
Error handling logs what went wrong. Recovery design tells the next worker what to do next. Ambiguous state needs a branch for verify-and-advance, safe retry, or conservative stop.
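def recover(task_id):
    # task_store, retry_from, and verify_and_advance come from the surrounding workflow.
    state = task_store.get(task_id)
    if not state or state["status"] == "complete":
        return None  # nothing to resume
    if state["stage"] == "pre_create":
        # Checkpointed before the create call: retry from the saved input.
        return retry_from("pre_create", state["input"])
    if state["stage"] == "post_create":
        # The write may have landed: confirm it before moving on.
        return verify_and_advance(state["output"])
    # Unrecognized stage: stop conservatively instead of guessing.
    raise RuntimeError(f"unrecognized stage {state['stage']!r} for task {task_id}")
4. The API you choose determines how much state complexity you inherit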
Stripe
Idempotency is first-class, state transitions are explicit, and verification reads are reliable. Recovery paths stay small because the contract does more work for you.
Anthropic
Structured errors and clear retry semantics keep checkpoint design simple. Partial completion is legible enough that restart logic stays computable.
HubSpot
Multi-step association flows, cross-object lag, and the absence of first-class idempotency push state uncertainty back into your own code.
Salesforce
Bulk and REST paths behave differently, sandbox truth diverges from production, and governor limits create hidden state branches that need heavy instrumentation.
The pattern is consistent: providers with clearer idempotency, structured errors, and reliable verification reads let your workflow stay simple. Providers that hide state transitions force you to build the missing truth layer yourself.
5. A minimal state design is enough to avoid most 2am disasters
You do not need full event sourcing to get most of the benefit. A task store keyed by task id, a checkpoint function before and after consequential stages, a recovery function for restart paths, and verification reads after writes cover the majority of practical failures.
That minimal design turns a crash from “someone needs to inspect the logs tomorrow” into “resume from the last durable checkpoint and verify what already happened.”
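A task store of that size fits in a few dozen lines. The sketch below uses SQLite from the Python standard library; the `TaskStore` class, table layout, and field names are assumptions for illustration, not a prescribed schema.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Optional


class TaskStore:
    """Minimal durable checkpoint store keyed by task id."""

    def __init__(self, path: str = ":memory:") -> None:
        # Use a file path instead of ":memory:" for state that survives restarts.
        self._db = sqlite3.connect(path)
        with self._db:
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS tasks ("
                "task_id TEXT PRIMARY KEY, stage TEXT NOT NULL, "
                "status TEXT NOT NULL, payload TEXT NOT NULL, updated_at TEXT NOT NULL)"
            )

    def checkpoint(self, task_id: str, stage: str, status: str, payload: dict) -> None:
        """Write the latest known state for a task in one transaction."""
        now = datetime.now(timezone.utc).isoformat()
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO tasks (task_id, stage, status, payload, updated_at) "
                "VALUES (?, ?, ?, ?, ?)",
                (task_id, stage, status, json.dumps(payload), now),
            )

    def get(self, task_id: str) -> Optional[dict]:
        row = self._db.execute(
            "SELECT stage, status, payload FROM tasks WHERE task_id = ?", (task_id,)
        ).fetchone()
        if row is None:
            return None
        return {"task_id": task_id, "stage": row[0], "status": row[1],
                "payload": json.loads(row[2])}


store = TaskStore()
# Checkpoint before the consequential call, then again once it is verified.
store.checkpoint("onboarding-abc123", "contact_creation", "in_progress",
                 {"email": "user@example.com"})
store.checkpoint("onboarding-abc123", "contact_creation", "complete",
                 {"contact_id": "c_1"})
resumed = store.get("onboarding-abc123")
```

A recovery function can then branch on `resumed["stage"]` and `resumed["status"]` instead of on log archaeology.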
6. Use state questions during API selection, not after procurement
When an agent workflow needs to run unattended, state design belongs in the selection checklist. Ask whether the API supports idempotency, whether failures are structured enough to classify safely, whether verification reads are trustworthy, and whether multi-step operations can partially complete.
Every “no” becomes state complexity in your implementation. That is the hidden cost Rhumb’s execution and failure-mode evaluation is trying to surface before the contract is signed or the integration is built.
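The selection questions above can be turned into a small checklist that lists the state work each unfavorable answer implies. The keys and wording below are hypothetical and only illustrate the mapping; they are not Rhumb's evaluation criteria.

```python
# Each key is phrased so that True is the favorable answer.
CHECKLIST = {
    "supports_idempotency": "store and replay your own idempotency keys",
    "structured_errors": "classify ambiguous failures yourself before retrying",
    "trustworthy_verification_reads": "poll and reconcile reads after every write",
    "multi_step_is_atomic": "checkpoint each sub-step of multi-step operations",
}


def inherited_state_work(answers: dict[str, bool]) -> list[str]:
    """List the state work implied by every 'no' (unanswered counts as 'no')."""
    return [work for question, work in CHECKLIST.items()
            if not answers.get(question, False)]


candidate = {
    "supports_idempotency": False,
    "structured_errors": True,
    "trustworthy_verification_reads": True,
    # "multi_step_is_atomic" left unanswered on purpose
}
todo = inherited_state_work(candidate)
```

Treating an unanswered question as a "no" is a deliberately conservative choice: unknown behavior at a failure boundary costs the same defensive code as known-bad behavior.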
Start with one bounded workflow that can resume cleanly before you widen the surface
If your real pain is ambiguous retries, duplicate writes, and manual morning cleanup, the next move is not more connectors. Start with one governed path where checkpointing, credential scope, and recovery semantics are explicit before the workflow spreads across more tools.
E-007 prompt: if the recovery plan depends on one MCP call not replaying the wrong side effect, send the route, unsafe neighbor, credential lane, budget owner, repeat volume, and receipt or typed-denial proof before broadening the loop.
If this page is the recovery frame, these four pages carry the same problem into live operator surfaces: what breaks in multi-step loops, how APIs fail under unattended use, what scope and principal still define the safe lane, and what to inspect before trusting a provider at all.
- What actually breaks once retries, backoff, and multi-step execution are live.
- Failure engineering is mostly about making bad states visible early enough to contain them.
- Recovery only stays safe when scope, acting principal, and evidence remain legible after the first failed step.
- Use a concrete checklist for idempotency, error semantics, verification, and recovery before production.
Recovery design fails next at rate limits, credential churn, and shared agent loops
Once restart logic is explicit, the next operator questions are how loops stay bounded, how shared budgets avoid retry storms, and how credentials survive rotation without turning one failure into a fleet-wide outage.
- How retry budgets, governors, and shared-rate control keep a bad loop from compounding.
- Why rotation, expiry, and revocation are state problems as much as auth problems.
- Shared MCP stays safe only when tenant context survives retries, quotas, and session reuse.