# Before Your Agent Calls an API at 3am: A Reliability Checklist
You're in bed. Your agent is running. It calls a payment API to complete a transaction, gets a 500 back, retries three times, creates three duplicate charges. By the time you check Slack in the morning, three users have filed disputes.
This isn't hypothetical. It's the failure mode developers hit when they treat agent integrations the same as human integrations.
The core problem: APIs are built for developers who can read error messages, click "Retry," and handle edge cases manually. Agents can't do any of that. They need APIs that communicate failures clearly, handle concurrent calls predictably, and allow credentials to be managed programmatically. Most APIs don't.
## The 5 Questions That Separate "Works" from "Works at 3am"
We've scored 650+ APIs across 20 dimensions for agent-nativeness. When we look at what actually causes production failures in agent deployments, five questions account for most of the avoidable incidents.
Run this checklist before you wire any external API into an autonomous agent.
### Question 1: What does the API say when something goes wrong?
Open the API's error reference. Look at a few real error responses.
Green flag: JSON with a machine-readable error code and a specific message.

```json
{
  "error": {
    "code": "INSUFFICIENT_FUNDS",
    "message": "Account balance too low for this transaction",
    "retry_after": null
  }
}
```
Red flag: HTML error page, vague 500 with no body, or "Internal Server Error" with nothing else.
Your agent doesn't read documentation at runtime. It parses the response. If the response doesn't tell it why the call failed, the agent will either retry blindly (risk: duplicates) or give up silently (risk: data loss).
AN Score dimension: Error Signal Quality. Top-scoring APIs return structured, parseable errors with enough context for an agent to decide: retry, abort, or escalate.
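To make that concrete, here is a minimal sketch of the decision logic a structured error envelope enables. The code sets and the `decide` helper are hypothetical; the point is that a parseable body turns retry-or-abort into a lookup, while an HTML 500 forces a guess.

```python
import json

# Hypothetical code sets; substitute whatever the API actually documents.
RETRYABLE = {"RATE_LIMITED", "TEMPORARILY_UNAVAILABLE"}
FATAL = {"INSUFFICIENT_FUNDS", "INVALID_REQUEST"}

def decide(body: str) -> str:
    """Return 'retry', 'abort', or 'escalate' from the raw response body."""
    try:
        error = json.loads(body)["error"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Unparseable body (HTML page, empty 500): the agent cannot know
        # whether a retry is safe, so escalate instead of guessing.
        return "escalate"
    if not isinstance(error, dict):
        return "escalate"
    code = error.get("code")
    if code in RETRYABLE:
        return "retry"
    if code in FATAL:
        return "abort"
    return "escalate"  # unknown code: do not guess at 3am
```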
### Question 2: Can the agent send the same request twice without breaking things?
Look for idempotency support in the API docs. Search for "idempotency key," "request ID," or "duplicate prevention."
Green flag: Idempotency-Key header on POST requests. The API deduplicates on their end.
Red flag: No idempotency support. POST requests are fire-and-forget with no duplicate prevention.
Why it matters: Agents retry on timeouts. Networks drop packets. If a call completes on the server but the response never arrives, your agent will retry. Without idempotency, that retry creates a second resource, a second charge, a second email, a second record.
The most reliable APIs build idempotency in from the start. You provide a unique key, they handle the rest.
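A minimal sketch of the client side, assuming the common Idempotency-Key header convention and a hypothetical charges endpoint; check the specific API's docs for the exact header name and key format.

```python
import uuid
import requests

def create_charge_once(payload: dict, attempts: int = 3) -> requests.Response:
    # Generate the key ONCE per logical operation, never per attempt.
    key = str(uuid.uuid4())
    headers = {"Idempotency-Key": key}
    for _ in range(attempts):
        try:
            # Hypothetical endpoint; the retry-with-same-key pattern is the point.
            return requests.post("https://api.example.com/v1/charges",
                                 json=payload, headers=headers, timeout=10)
        except requests.Timeout:
            # If the first call actually committed server-side, resending the
            # same key returns the original result instead of charging twice.
            continue
    raise RuntimeError(f"charge unconfirmed after {attempts} attempts (key={key})")
```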
### Question 3: When the API rate-limits you, does it tell you when to retry?
Look at the API's documentation for rate limits. Specifically: does a 429 response include a Retry-After header or equivalent?
Green flag: A 429 that tells the agent exactly when it is safe to retry.

```http
HTTP 429 Too Many Requests
Retry-After: 30
X-RateLimit-Reset: 1711670400
```
Red flag: HTTP 429 with no headers indicating when it's safe to retry.
An agent that hits a rate limit without guidance will either spin-wait, hammering the API on a fixed interval, or implement exponential backoff with no ceiling. The first risks getting your key banned. The second means the agent might wait 10 minutes when 5 seconds would have worked.
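A sketch of the sane middle path, assuming the requests library: honor Retry-After when the server sends it, fall back to capped, jittered exponential backoff when it does not. The ceiling and attempt count are illustrative.

```python
import random
import time
import requests

MAX_WAIT = 120  # ceiling so bad guidance cannot stall the agent for minutes

def get_with_backoff(url: str, attempts: int = 5) -> requests.Response:
    for attempt in range(attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Prefer the server's own guidance; otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After", "")
        if retry_after.isdigit():
            wait = min(int(retry_after), MAX_WAIT)
        else:
            wait = min(2 ** attempt + random.random(), MAX_WAIT)  # jittered
        time.sleep(wait)
    raise RuntimeError(f"still rate-limited after {attempts} attempts")
```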
### Question 4: Can the agent get credentials without a human?
Check the API's authentication setup flow. Specifically: can API keys be created, rotated, and scoped programmatically?
Green flag: Dashboard API key generation (one-time human action), API key auth for all endpoints, no MFA required for key creation.

Red flag: OAuth 2.0 as the primary auth method with a browser-based consent flow. 2FA or SMS verification on key creation. IP allowlisting required.

This is the one that kills you at 3am. If your agent needs to refresh credentials and that requires a human to click "Authorize" in a browser, you're done. The task hangs indefinitely.
If the workflow crosses credential boundaries, read How to Secure Your API Keys for Agent Use before you trust a browser-first auth story.
Real example: SendGrid's documentation recommends OAuth as the preferred flow for some endpoints. API key auth exists but is treated as secondary. Agents using the OAuth-recommended path fail completely when tokens expire.

### Question 5: Does the API return deterministic errors for the same input?
This one's harder to test upfront, but look for it in community threads and issue trackers.
Green flag: The same bad input always returns the same error code. Pagination is cursor-based, not offset-based.

Red flag: Offset-based pagination. 500 errors for edge cases that should be 400s. "Flaky" mentions in GitHub issues.

An agent building a workflow model will retry 400s differently than 500s. If the API mislabels client errors as server errors, your agent will retry indefinitely on something that was never going to succeed.
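In practice that means the retry decision keys off the status class, with a hard attempt ceiling as insurance against mislabeled errors. A minimal sketch:

```python
def should_retry(status: int, attempt: int, max_attempts: int = 4) -> bool:
    """Classify by status class, with a ceiling as insurance against
    client errors mislabeled as 500s that will never succeed."""
    if attempt >= max_attempts:
        return False  # the ceiling catches the mislabeled case
    if status == 429 or 500 <= status <= 599:
        return True   # plausibly transient: retry with backoff
    return False      # 4xx is deterministic: retrying the same input won't help
```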
## Fresh operator signal: endpoint retirement is still a reliability failure
Twilio is retiring api.de1.twilio.com on April 28. The harder lesson is not only that a hostname is going away. It is that many integrations treated a regional-looking base URL as if it preserved a routing promise the platform never actually gave them.
That is a reliability problem before it becomes a docs problem.
If your agent or wrapper pins base URLs in config, tests, or failover logic, treat those hostnames as part of the contract surface:
- Green flag: the provider gives a deprecation date, a replacement target, and enough machine-readable signal to tell you whether geography, auth scope, or fallback behavior changed.
- Red flag: the only warning is a prose post humans may or may not read, while automation keeps retrying a dead endpoint like it is transient infrastructure noise.
This is the same boundary described in machine-parseable change communication: endpoint drift should fail closed as a deterministic migration event, not masquerade as flaky transport.
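One cheap mitigation, assuming you keep pinned hostnames in config: check them at startup against a deny-list built from provider notices, and treat resolution failures on pinned hosts as migration events. The deny-list entry here is illustrative.

```python
import socket

# Example entry; in practice, maintain this from provider deprecation notices
# or from a machine-readable change feed if the provider publishes one.
RETIRED_HOSTS = {"api.de1.twilio.com"}

def preflight_base_url(hostname: str) -> None:
    if hostname in RETIRED_HOSTS:
        raise RuntimeError(f"{hostname} is retired: deterministic migration event")
    try:
        socket.getaddrinfo(hostname, 443)
    except socket.gaierror:
        # A pinned host that no longer resolves is contract drift, not
        # transient infrastructure noise: fail closed, do not retry.
        raise RuntimeError(f"{hostname} no longer resolves: fail closed")
```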
## Scoring the APIs You're Evaluating
The five questions above map to real signal in the AN Score framework. When we score an API, we're systematically measuring exactly these failure modes across 20 dimensions.
What to look for:

| Score Range | What it means for agents |
|---|---|
| 8.0–10.0 | Production-grade. Handles edge cases. Safe for unsupervised operation. |
| 6.0–7.9 | Works with careful integration. Add explicit retry logic and error handling. |
| Below 6.0 | High maintenance burden. Expect to babysit this integration. |
Current baselines on common categories from 650+ scored APIs:
- Payment APIs: Stripe leads. Most payment APIs cluster in the middle, not at the top.
- Email APIs: Resend and Postmark are strongest. SendGrid still carries friction.
- Storage and infra: Cloud leaders score well, but provider-specific quirks still matter.
Browse the full leaderboard at rhumb.dev/leaderboard, sorted by AN Score and filterable by category.
## One More Thing: Test the Sandbox First
Before putting any API into a production agent loop, run your agent against the sandbox or test environment with explicit error injection (a test sketch follows this list):
- Send a request that will 400 (invalid input) and confirm the agent handles it correctly.
- Simulate a timeout (add a delay) and inspect whether the agent retries, and how many times.
- Send the same idempotent request twice and confirm it creates one resource, not two.
- Exhaust the rate limit and confirm the agent backs off instead of hammering.
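A sketch of the sandbox side of that harness, assuming a hypothetical sandbox host and pytest-style tests: prove the API's contract behaves as advertised before wiring the agent's handling on top.

```python
import requests

SANDBOX = "https://sandbox.api.example.com"  # hypothetical sandbox host

def test_sandbox_returns_parseable_400():
    # Deliberately invalid input must produce a machine-readable error body.
    resp = requests.post(f"{SANDBOX}/charges", json={"amount": -1}, timeout=10)
    assert resp.status_code == 400
    assert "error" in resp.json(), "agent needs a parseable error, not HTML"

def test_sandbox_collapses_duplicates():
    headers = {"Idempotency-Key": "test-key-001"}
    first = requests.post(f"{SANDBOX}/charges", json={"amount": 500},
                          headers=headers, timeout=10)
    second = requests.post(f"{SANDBOX}/charges", json={"amount": 500},
                           headers=headers, timeout=10)
    # Same key, same payload: one resource, not two.
    assert first.json()["id"] == second.json()["id"]
```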
If the API doesn't have a sandbox, treat it as a yellow flag. You're testing in production.
## Failure-mode evidence
If you want concrete examples instead of a generic checklist, start with the live autopsies linked under Related below.
These are the kinds of reliability breaks that look small in docs and become real operator pain once an unattended workflow starts retrying.
## TL;DR
Five questions before your agent calls an API:
- Error clarity: Does it return structured, parseable errors?
- Idempotency: Can it safely handle retries without creating duplicates?
- Rate limit guidance: Does it tell you when to retry?
- Credential autonomy: Can the agent manage auth without a human?
- Deterministic behavior: Same input, same output, every time?
APIs that pass all five are built for agents. APIs that fail two or more will cost you sleep.
Rhumb scores 650+ APIs on 20 dimensions for agent-nativeness. Free to search and browse at rhumb.dev. No signup required for the first 10 tool calls.
## Reliability failures also start when a hard dependency quietly retires
A workflow can be perfectly healthy one day and fail closed the next because a pinned model ID, version enum, or credential-test default no longer exists. That is still a reliability problem. Nothing in your code changed, but the boundary became deterministic overnight and the system needs migration logic, not blind retries.
- Inventory pinned model IDs, version enums, and fallback defaults across env vars, framework wrappers, credential tests, and dashboards.
- Check availability before execution and classify not-found or retirement errors as deterministic migration work, not retry noise.
- Treat the replacement as a new operating profile: parameters, stop reasons, tool behavior, rate limits, and price can all shift together.
- Fail closed when the workflow cannot prove compatibility yet, especially in auth checks and health checks that otherwise burn time before the real call even starts.
This is where machine-parseable change communication and loop-reliability design stop being separate topics and become part of the same pre-flight checklist.
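A minimal sketch of that availability gate, with hypothetical names: compare the pinned model against what the provider currently lists, and raise a typed migration error that the retry policy must treat differently from transient failures.

```python
class MigrationRequired(Exception):
    """Deterministic contract drift: route to migration work, not the retry queue."""

def preflight_model(pinned_model: str, available: set) -> None:
    # `available` would come from the provider's list-models endpoint.
    if pinned_model not in available:
        raise MigrationRequired(
            f"pinned model {pinned_model!r} is gone; the replacement is a new "
            "operating profile, not a drop-in"
        )

try:
    preflight_model("legacy-model-v1", available={"model-v2", "model-v2-mini"})
except MigrationRequired as exc:
    print(f"fail closed: {exc}")  # classified separately from transient errors
```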
## A preview model is not reliable until the production contract is observable
Preview access creates a tempting false positive for agents: the demo works, the sample code returns output, and the launch post publishes pricing hints. None of that proves the route is ready for unattended retries. Reliability starts when the exact endpoint, response shape, usage accounting, rate limits, and fallback behavior are visible on the path your workflow will actually call.
- Prove the first-party production endpoint exists; do not treat a proxy, playground, or SDK preview as the contract your agent will run on.
- Capture the exact response shape, usage fields, tool-call semantics, media fields, and error envelopes before wiring the route into retries.
- Measure mode-specific price, latency, and rate limits for the path you use, not the cheapest headline model price.
- Keep a fallback or quarantine path when preview docs omit typed failures, quota behavior, regional availability, or deprecation windows.
Treat preview launches as runtime-trust drift candidates until the loop can prove the first-party contract, the cost envelope, and the failure semantics under real execution.
## A successful write is not reliability proof when the policy contract moved
Security defaults can drift without producing a clean failure. If a provider retires a policy field but still accepts legacy writes, the agent sees success while new resources launch under a different enforcement object. That is not a retry problem. It is stale-contract reliability debt.
- Do not treat a 200 response, Terraform convergence, or audit-log entry as proof that the intended security default still applies.
- After any policy write, read the replacement policy object that now owns enforcement and compare it against the intended guardrail before continuing.
- Preserve the legacy field, replacement field, actor, resource family, and enforcement result in trace context so operators can see which contract actually applied.
- Return a typed stale-policy outcome when the old field is accepted but no longer authoritative; do not retry the same no-op write as if it were transient failure.
Pair this gate with the silent security-default drift test so unattended setup lanes prove enforcement, not just acknowledgement.
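A sketch of the read-back gate. The `get_effective_policy` callable is a hypothetical stand-in for whatever API returns the policy object that now owns enforcement:

```python
class StalePolicyWrite(Exception):
    """The legacy field was accepted, but it no longer controls enforcement."""

def verify_policy_write(resource: str, intended: dict, get_effective_policy) -> None:
    # Read the object that NOW owns enforcement, not the field we just wrote.
    effective = get_effective_policy(resource)
    drift = {k: (v, effective.get(k))
             for k, v in intended.items() if effective.get(k) != v}
    if drift:
        # Typed outcome: retrying the same no-op write cannot fix this.
        raise StalePolicyWrite(f"{resource}: intended vs effective drift: {drift}")
```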
## A callback is not proof until the signature, state, and tenant binding validate first
Agent workflows often treat redirects, checkout completions, top-up callbacks, and webhooks as the moment authority becomes real. That is dangerous if the handler reads session state, credits an account, or unlocks execution before proving the callback belongs to the same actor, amount, tenant, and signed event window. Reliability starts by validating the continuation before business logic can observe it.
- Validate OAuth callback state, wallet top-up payloads, and webhook signatures before any session, credit, entitlement, or account state is read.
- Reject blank or whitespace-only payment_request_id and X-Payment proof fields before wallet, ledger, billing, or provider state can interpret the request.
- Preserve the callback actor, claimed account, signed payload family, and replay window in trace context so operators can distinguish a real continuation from a forged completion.
- Fail closed with a typed callback-auth outcome when the signature, state token, tenant binding, amount, or timestamp cannot be proven before the side effect is applied.
- Test the adjacent-dangerous values too: stale state, wrong tenant, altered amount, missing signature, and replayed event should all die before business logic observes them.
This gate belongs beside remote-auth authority and receipts-as-evidence: a receipt or redirect can document what happened, but it should not create authority until the callback boundary is proven.
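A sketch of that order of operations using only the standard library. The header names, tolerance window, and tenant binding are assumptions to adapt to the provider's actual webhook scheme (many providers sign the timestamp together with the body; follow theirs).

```python
import hashlib
import hmac
import time

REPLAY_WINDOW = 300  # seconds; a valid but stale signed event is still rejected

def verify_callback(raw_body: bytes, headers: dict, secret: bytes,
                    expected_tenant: str) -> None:
    """Raise before ANY session, credit, or entitlement state is read."""
    # 1. Signature: prove the payload came from the provider.
    sent_sig = headers.get("X-Signature", "")
    want_sig = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sent_sig, want_sig):
        raise PermissionError("callback-auth: bad signature")
    # 2. Replay window: signed does not mean fresh.
    try:
        ts = int(headers.get("X-Timestamp", ""))
    except ValueError:
        raise PermissionError("callback-auth: missing or malformed timestamp")
    if abs(time.time() - ts) > REPLAY_WINDOW:
        raise PermissionError("callback-auth: outside replay window")
    # 3. Tenant binding: signed and fresh, but for someone else, is still a reject.
    if headers.get("X-Tenant-Id") != expected_tenant:
        raise PermissionError("callback-auth: tenant mismatch")
```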
## A payment proof should fail as bad input before it looks like billing state
Wallet-prefund flows add a second reliability boundary: the proof is both payment evidence and account-crediting input. If blank payment request ids or empty X-Payment proofs reach wallet lookup, ledger mutation, or provider settlement, the operator sees misleading not-found or settlement noise instead of the real failure. The first gate is boring and strict: normalize the proof fields, reject empty values, and only then compare the proof to the requested top-up.
- Normalize the top-up payment_request_id and X-Payment proof before wallet session, top-up lookup, payment-request lookup, or settlement code observes the request.
- Reject missing, blank, or whitespace-only proof fields with typed input errors instead of letting them fall through as not-found, invalid-session, or provider-payment failures.
- Compare any payment-request id embedded in the x402 proof with the requested top-up id before crediting the wallet or marking the payment request processed.
- Keep the top-up id, requested amount, wallet identity, org id, proof family, and verification outcome in trace context so operators can distinguish bad input from replay, mismatch, or settlement failure.
This is the payment-specific version of the same x402 dogfood lesson: proof format errors and mismatched payment-request ids should be typed authority outcomes, not ambiguous billing failures.
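A sketch of that boring first gate, with field names mirroring the x402 flow described above; the exception type is a hypothetical typed outcome.

```python
class BadProofInput(ValueError):
    """Typed input failure: must not surface as not-found or settlement noise."""

def normalize_proof(payment_request_id, x_payment, requested_topup_id: str):
    rid = (payment_request_id or "").strip()
    proof = (x_payment or "").strip()
    if not rid:
        raise BadProofInput("payment_request_id missing or blank")
    if not proof:
        raise BadProofInput("X-Payment proof missing or blank")
    # Only after both fields exist: compare the proof's target to the request.
    if rid != requested_topup_id:
        raise BadProofInput(
            f"proof targets {rid!r} but the requested top-up is {requested_topup_id!r}")
    return rid, proof
```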
## Settlement totals should validate before finance state believes them
Manual settlement conversion is still part of the execution trust boundary. If an admin marks a batch converted with a missing, zero, boolean, decimal, or whitespace-shaped USD-cent total, the failure should stop before ledger reconciliation, invoice reporting, or finance review treats the batch as converted. Normalize the cent total and conversion id first; then write settlement state.
- Validate total_usd_cents as a positive integer before settlement batch state, ledger reconciliation, or invoice reporting can observe the conversion result.
- Reject missing, zero, negative, boolean, decimal, blank, or malformed cent totals as typed parameter failures instead of preserving ambiguous settlement state.
- Trim optional Coinbase conversion ids and store blank values as absent evidence, not as whitespace identifiers that look like real reconciliation proof.
- Preserve batch id, normalized USD-cent total, conversion id presence, actor, and rejection reason in trace context so finance review can separate bad admin input from processor, wallet, or ledger drift.
This extends wallet-prefund reliability beyond payment proof parsing: conversion evidence must be typed before finance state can confuse bad admin input with processor, wallet, or ledger drift.
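A sketch of the cent-total gate. The explicit bool check is not pedantry: in Python, True is an instance of int, so a naive type check would happily record a converted total of one cent.

```python
class BadSettlementInput(ValueError):
    """Typed parameter failure: reject before any settlement state is written."""

def validate_conversion(total_usd_cents, conversion_id=None):
    # bool is a subclass of int in Python: check it FIRST, or True slips
    # through a plain isinstance(x, int) check as a one-cent total.
    if isinstance(total_usd_cents, bool) or not isinstance(total_usd_cents, int):
        raise BadSettlementInput(f"cent total must be an integer, got {total_usd_cents!r}")
    if total_usd_cents <= 0:
        raise BadSettlementInput(f"cent total must be positive, got {total_usd_cents}")
    # Blank conversion ids are stored as absent evidence, not whitespace "proof".
    cid = (conversion_id or "").strip() or None
    return total_usd_cents, cid
```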
## A proxy route should reject bad method and path input before it looks like provider failure
Proxy layers are easy places to blur responsibility. If an empty method, unsupported verb, traversal-shaped path, or malformed route fragment reaches credential lookup or provider dispatch, the operator sees auth, billing, or upstream noise instead of the real failure. Normalize the request method and path at the proxy boundary, then decide whether the call has any authority to continue.
- Normalize and validate the proxy request method and path before route lookup, credential selection, budget attribution, proxy dispatch, or provider code observes the request.
- Reject missing, blank, whitespace-only, or unsupported methods as typed input failures instead of letting them become generic upstream proxy errors.
- Reject empty paths, traversal-shaped paths, and malformed route fragments before the proxy can attach credentials or bill a caller for a request that never had a valid target.
- Preserve normalized method, normalized path, caller, budget owner, proxy lane, credential lane, and rejection reason in trace context so operators can separate bad input from route miss, auth failure, and provider outage.
This is the routing version of the same governed capability surface: malformed target selection should fail as typed boundary input, not as a mysterious provider outage after credentials or budget have already been attached.
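A sketch of the proxy-boundary gate, run before credential selection or budget attribution; the allowed-method set is illustrative.

```python
ALLOWED_METHODS = {"GET", "POST", "PUT", "PATCH", "DELETE"}

class BadRouteInput(ValueError):
    """Typed boundary failure: not an auth error, not a provider outage."""

def normalize_route(method, path):
    m = (method or "").strip().upper()
    if m not in ALLOWED_METHODS:
        raise BadRouteInput(f"unsupported or blank method: {method!r}")
    p = (path or "").strip()
    if not p or not p.startswith("/"):
        raise BadRouteInput(f"empty or malformed path: {path!r}")
    # Reject traversal before the proxy can attach credentials to it.
    if ".." in p.split("/"):
        raise BadRouteInput(f"traversal-shaped path rejected: {path!r}")
    return m, p
```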
## Retired payload fields should fail before the loop calls them flaky
A provider can reject an outbound field that worked yesterday while the endpoint, model family, and SDK import still look familiar. If that rejection reaches generic retry code, the agent wastes budget and may fall back to a lane with different price, auth, or tool behavior. Normalize the payload contract first, then decide whether this is migration work or a runnable request.
- Validate outbound payload fields, content part names, enum values, and SDK-normalized request bodies before the request can enter retry or fallback logic.
- Treat provider rejections for retired or renamed fields as deterministic contract drift, not as user typo, model hesitation, or transient upstream failure.
- Preserve provider, endpoint, SDK version, model or capability id, rejected field, replacement hint, and retry decision in trace context.
- Block side-effecting routes when the replacement request shape is not yet proven against the exact provider lane the agent will run unattended.
Pair this gate with machine-readable change communication and LLM loop reliability: request drift is safe only when it becomes a typed migration outcome before retries, fallback, credentials, or spend attach.
## Strict schemas are execution boundaries, not docs niceties
A strict schema is useful because it lets the workflow say no before ambiguity reaches policy, credentials, spend, or provider state. If unknown fields, nullable surprises, enum drift, or unexpected output variants pass through as soft warnings, the agent can widen the operation while every surrounding surface still claims the lane was approved.
- Require strict schemas at the boundary where policy, credentials, budget, and provider routing first meet the request.
- Reject unknown fields, missing required fields, enum drift, nullable surprises, and unexpected output variants as typed contract failures before retry logic sees them.
- Store schema id, schema version, validator result, rejected path, and fallback decision in trace context so a later receipt can explain why execution stopped or continued.
- Fail closed when a schema loosening would allow the model to smuggle a broader target, action, tenant, or side-effect class than the original lane approved.
Connect this to machine-readable change communication: schema enforcement only stays safe when the caller can tell whether a rejection is stale input, provider drift, policy denial, or a real validation bug.
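A sketch of strict-at-the-boundary with a hand-rolled validator so the mechanics stay visible; in practice a JSON Schema or Pydantic model configured to reject unknown fields does the same job. The lane contract here is illustrative.

```python
class ContractViolation(ValueError):
    """Typed contract failure: must fire before retry logic sees the request."""

SCHEMA = {  # illustrative lane contract
    "required": {"action", "target", "tenant"},
    "allowed": {"action", "target", "tenant", "note"},
    "enums": {"action": {"read", "list"}},
}

def validate_request(body: dict, schema=SCHEMA) -> dict:
    unknown = set(body) - schema["allowed"]
    if unknown:  # unknown fields can smuggle a broader side-effect class
        raise ContractViolation(f"unknown fields rejected: {sorted(unknown)}")
    missing = schema["required"] - set(body)
    if missing:
        raise ContractViolation(f"missing required fields: {sorted(missing)}")
    for field, allowed in schema["enums"].items():
        if field in body and body[field] not in allowed:
            raise ContractViolation(f"{field}={body[field]!r} outside the approved enum")
    return body
```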
## Rate-limit recovery starts before the agent chooses a key
Fresh API-key and rate-limit guidance is useful, but unattended agents need one more boundary: key selection is part of execution policy. If a workflow can hop from an exhausted key to a broader provider account without proving the same tenant, budget, data-use class, and authority lane, quota recovery has become permission expansion.
- Resolve the credential lane and quota owner before the first provider call, including backup-key eligibility and customer budget ceiling.
- Classify 401, 403, 409, 422, and 429 outcomes separately so auth drift, scope denial, contract drift, and exhausted quota do not collapse into generic retry behavior.
- Carry remaining allowance, reset window, selected key id, rejected fallback keys, and budget owner into trace context before retries or provider fallback can run.
- Stop when the only runnable key would cross a different tenant, data-use class, billing owner, or workflow budget than the original request approved.
Use this preflight with fleet rate-limit design and credential fleet controls: the reliable answer to a 429 is not "try another key". It is a typed proof that another key is still allowed.
## A sent message is not a delivered side effect until the callback proves it
Fresh local Twilio testing guidance maps to a broader agent rule: outbound communication needs a failure sandbox before production authority. The workflow should know how it handles invalid numbers, carrier filtering, queued messages, late callbacks, duplicate idempotency keys, and channel fallback before a real human receives anything.
- Model provider-accepted, carrier-queued, delivered, failed, undelivered, duplicate, and expired-callback outcomes before any real recipient is contacted.
- Persist idempotency key, provider message id, sender number, recipient class, callback event id, delivery state, and retry decision in the same trace.
- Test late, repeated, forged, and out-of-order callbacks locally so replay cannot double-send or advance the workflow from stale delivery state.
- Require operator-approved fallback before switching channel or provider; SMS to WhatsApp, voice, email, or another sender is an authority change, not a transparent retry.
Pair this with messaging provider selection: API acceptance, carrier delivery, customer consent, and fallback channel approval are different gates.
## Idempotency keys are agent memory for side effects
Duplicate prevention is not just a nice API header. In an unattended loop, the idempotency key is the durable memory that lets a retry prove it is continuing the same side effect rather than starting a new one. If the key disappears when the worker restarts, the model replans, or a queue redelivers the job, the agent has no safe way to tell a completed-but-unacknowledged write from work that never happened.
The boundary matters as much as the key itself. Normalize idempotency before execution routing, credential lookup, billing, proxy selection, or provider dispatch sees the request. Otherwise two visually different keys can describe the same operator intent on one execution surface while drifting into separate side effects on another.
- Generate the idempotency key before the first attempt and persist it with the workflow step, not only inside the HTTP client retry wrapper.
- Normalize caller-supplied idempotency keys at the API boundary before resolve, execute, gateway, or provider routing can observe the request.
- Reuse the same key after timeouts, worker restarts, queue redelivery, or model replanning so the provider can collapse duplicates into one side effect.
- Record the provider response, resource ID, and replay result in trace context so operators can prove a retry reused the original attempt instead of issuing a second command.
- Fail closed when a side-effecting route has no idempotency contract; quarantine the lane or require a human decision instead of letting the agent guess.
The practical test is simple: kill the worker after the provider commits but before your agent receives the response, then replay the job. A production-ready route should return the original result or a typed duplicate outcome, not create a second charge, email, ticket, or database row.
## When reliability proof names one repeat route, turn it into a hardening brief
The checklist is still research until it identifies a specific route that will run again: one tool call, one unsafe neighbor that must fail, one credential lane, one budget owner, one repeat volume, and one receipt or typed-denial proof. That is the point where reliability review becomes the E-007 route-hardening request instead of another generic audit.
Name the route, unsafe neighbor, credential lane, budget owner, repeat volume, and evidence requirement before broad execution.
If the route is already painful enough to repeat, start with the bounded managed-execution preflight instead of wiring a broad connector.
## If this checklist already feels familiar, move into a bounded execution lane
If your team already knows the integration will need real execution, keep the authority boundary explicit. Start with capability-first onboarding or open the direct managed path instead of wiring a giant connector list all at once.
## Reliability starts at the checklist, then survives in the loop
The preflight only matters if the workflow can keep running once retries stack, shared budgets tighten, and credentials start failing under load. These are the next three pages to read when the checklist moves from theory into live operations.
- LLM APIs in Agent Loops: What Actually Breaks at Scale. The overnight-ops view once tool calls, retries, and longer chains are live.
- Rate-limit Architecture: Designing Agent Fleets That Survive Rate Limits. The orchestration patterns that keep retry storms from turning into a fleet-wide outage.
- Credential Lifecycle: API Credentials in Autonomous Agent Fleets. The secrets architecture that keeps rotation, expiry, and revocation from spreading across the lane.
## Related

- How to Evaluate APIs for AI Agents. The 20-dimension scoring framework behind the checklist.
- Guide: The Complete Guide to API Selection for AI Agents. The broader selection map once you move past a single checklist.
- Change Discipline: API Versioning Is Table Stakes. Agent Readiness Depends on Machine-Parseable Change Communication. Why retired model IDs, preview endpoints, silent policy writes, and shifting parameter rules belong in the same operator checklist as retries, idempotency, and error quality.
- Production Readiness: A Production Readiness Checklist for Remote MCP Servers. The operator view once the surface is shared and the runtime is live.
- Route Hardening: Why Prompt Injection Hits Harder in MCP: Scope Constraints and Blast Radius. The E-007 probe for turning one risky repeat MCP route into a bounded hardening request.
- Failure-mode Evidence: HubSpot API Autopsy. A concrete example of why reliability proof matters more than vendor familiarity.