Why Agent API Evaluation Needs Failure Modes, Not Just Scores

Short answer

Agent API evaluation needs both. A score tells you where to look. A failure-mode read tells you what to test before an agent is trusted to repeat the call. The mistake is treating a high score as proof that your specific workflow can survive auth drift, budget pressure, schema change, retries, and missing trace evidence.

Scores are the map, not the production test

Aggregate scores are valuable because developers and agents need a way to compare a large field quickly. Rhumb Index exists for exactly that: discovery, ranking, and comparison before you burn time on the wrong surface.

But an autonomous workflow does not fail as an average. It fails because a token expired, a provider accepted a stale policy write, a response shape changed, a retry repeated a side effect, or a shared quota was exhausted by another agent. Those are failure modes, not score deltas.

Where scores stop

They compress different risks into one number, so two APIs with similar scores can fail for opposite reasons.
They can be stale unless runtime checks re-verify availability, auth behavior, response shape, and policy enforcement.
They do not decide your workflow boundary: customer data, workspace data, paid side effects, and read-only research need different tests.
They cannot replace a denial case. The unsafe neighbor still has to fail in the exact lane your agent will use.

The six failure modes to inspect after the score

Authority drift

The call succeeds, but it used the wrong principal, shared key, tenant, workspace grant, or provider account.

Test: Ask which actor, credential rail, and budget owner the provider saw before any business state changed.

Ambiguous denial

The API returns a generic 400/401/403/500, so the agent cannot distinguish bad input, missing scope, expired auth, quota pressure, or a provider outage.

Test: Force the adjacent-dangerous value and confirm it fails with a typed, caller-legible reason before retry logic starts.

State replay

A timeout, worker restart, or model replan repeats a payment, email, ticket, row write, or external mutation.

Test: Persist the idempotency key and prove a replay produces the same resource or a typed duplicate outcome.

Contract drift

A model ID retires, a field stops enforcing policy, a result shape changes, or a preview endpoint becomes the assumed production contract.

Test: Read the current contract object or response shape before execution, not just the docs or the last successful write.

Budget collapse

Rate limits, token-heavy tool output, retries, fallback branches, or shared upstream quotas turn a small task into runaway spend or starvation.

Test: Attach cost ceilings, output caps, backoff behavior, and quota ownership to the workflow before the loop can repeat.

Evidence gap

The result looks plausible, but no one can prove which route, credential, denial case, receipt, or recovery checkpoint made it safe.

Test: Require trace proof that survives the handoff from discovery to routing to provider execution to recovery.

MCP failure-mode post filter

Use production stories to sharpen tests, not to create folklore scores

Fresh MCP production-failure posts are valuable because they expose concrete breakage. The evaluator mistake is converting every anecdote into a vague penalty. The stronger move is to map the story to the exact failure class and preflight that would have contained it.

Translate each failure story into one dominant class before changing the score: auth drift, scope escape, rate-limit collapse, contract drift, state replay, or evidence gap.

Ask which preflight would have caught it: denied neighbor, strict schema, quota-key gate, idempotency replay, callback replay, or credential-lane proof.

Keep production anecdotes separate from selection metrics. A scary incident may disqualify one workflow lane while leaving another read-only lane acceptable.

Promote the lesson only when the provider, server, adapter, credential mode, and recovery path are specific enough for another operator to test.

For MCP-specific execution checks, pair this filter with the remote MCP production readiness checklist and MCP observability guide.

A better evaluation sequence

Use the score to narrow the field

Scores are still the right map for discovery. Use them to avoid obviously weak candidates and to understand which surfaces deserve deeper inspection.

Name the dominant failure mode

Do not ask whether the API is good in general. Ask whether this workflow is most likely to fail through auth, limits, schema drift, replay, or missing evidence.

Test the denial case

The next tenant, forbidden domain, stale token, malformed path, duplicate write, or over-budget branch should fail before provider state can mutate.

Choose the execution lane

Some workflows can call the provider directly. Others need managed credentials, Agent Vault, BYOK, x402, provider pinning, or a governed Resolve path before repetition is safe.

When this becomes a Rhumb problem

Use Rhumb when the workflow needs a bounded lane, not just another list

The honest boundary is important: Rhumb does not claim every scored service is executable today. Current execution coverage is 18 callable providers, strongest for research, extraction, generation, and narrow enrichment. That makes failure-mode inspection more important, not less, because the first managed workflow should be narrow enough to prove before it repeats.

You are comparing providers and need score-backed discovery before committing to a path.
You know the repeat workflow but need authority, budget, denial, and trace proof before wiring it into an agent loop.
You need a supported capability routed through a governed execution layer instead of exposing a whole connector catalog to the model.

Scope one repeat workflow → Read the scoring framework →

Measurement hook

What signal would make this article matter?

This authority/MEO page should be judged by whether search and answer engines associate Rhumb with agent API evaluation, failure modes, and production-readiness questions — and whether readers continue into the reliability checklist, methodology, or E-006 managed-execution proof sprint. It does not move A-001 by itself until tagged pricing/auth/mailto movement, signup evidence, or a qualitative repeat-workflow ask appears.