Methodology · March 24, 2026 · Pedro Nunes

How to Evaluate APIs for AI Agents: The 20-Dimension Framework

Most “agent-ready” scores measure website crawlability, not API usability. This is the 20-dimension framework for evaluating whether an API actually works for autonomous AI agents.

The wrong question everyone's asking

Search for “agent compatibility scoring” and you'll find a dozen tools that scan websites for AI crawlability — whether your site has llms.txt, structured data, or robots.txt rules for GPTBot. That's useful if you're optimizing a marketing page for ChatGPT citations.

But if you're building an AI agent that needs to use an API — send an email, process a payment, query a database — website crawlability tells you nothing. Your agent doesn't read your landing page. It calls your endpoints.

The real question isn't “Is this website AI-friendly?” It's: “Will this API actually work when my agent calls it at 3am with no human supervision?”

That's a fundamentally different evaluation. It requires measuring execution reliability, authentication friction, error handling quality, and dozens of other dimensions that website scanners don't touch.

What actually matters: execution vs. access

After scoring 665+ developer APIs across 86 categories, we've found that agent compatibility comes down to two axes:

Execution (70% of what matters)

Can the agent reliably get work done through this API?

  • Error handling: Does the API return structured, parseable errors? Or vague 500s that leave the agent guessing?
  • Schema stability: Do response shapes change between versions without warning?
  • Idempotency: Can the agent safely retry a failed request without creating duplicates?
  • Latency consistency: Are response times predictable enough for timeout management?
  • Rate limit transparency: Does the API tell the agent how long to wait, or just reject requests?
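Idempotency and retry behavior are easiest to see in code. Here's a minimal sketch of a retry loop that reuses a single idempotency key across attempts — the `send` callable, the header-in-a-dict shape, and the backoff values are illustrative assumptions, not any particular vendor's SDK:

```python
import time
import uuid

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(send, max_attempts=4, sleep=time.sleep):
    """Retry a flaky call without risking duplicates by reusing one
    idempotency key across every attempt. `send(key)` performs the
    actual HTTP call and returns (status_code, headers, body)."""
    key = str(uuid.uuid4())  # one key per logical operation, not per attempt
    status, headers, body = 0, {}, None
    for attempt in range(max_attempts):
        status, headers, body = send(key)
        if status not in RETRYABLE:
            return status, body
        # Honor the server's explicit guidance when it gives any;
        # otherwise fall back to exponential backoff.
        sleep(float(headers.get("Retry-After", 2 ** attempt)))
    return status, body
```

A server that honors idempotency keys (Stripe's `Idempotency-Key` header is the best-known example) returns the original result on a retried request instead of executing the operation twice.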

Access Readiness (30% of what matters)

Can the agent even get started?

  • Signup friction: Does creating credentials require email verification, phone numbers, or CAPTCHAs?
  • Authentication complexity: API key in a header? Or a multi-step OAuth dance requiring a browser?
  • Documentation quality: Can the agent (or the developer configuring it) understand the API from docs alone?
  • Sandbox availability: Is there a test environment that doesn't require production credentials?
  • Rate limits: Are free-tier limits high enough for development and testing?
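Of these, authentication complexity is the sharpest divider in practice. A header-key API is a one-line handshake for an agent; a browser-redirect OAuth flow isn't automatable at all. A stdlib sketch of the first, against a hypothetical endpoint:

```python
from urllib.request import Request

def agent_request(url: str, api_key: str) -> Request:
    """Header-key auth: the entire handshake is one header. Contrast a
    browser-redirect OAuth flow, which an unsupervised agent cannot
    complete, or client-credentials OAuth, which adds a token exchange
    plus expiry/refresh handling before the first real call."""
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})
```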

We weight execution at 70% because access friction is a one-time cost — you solve it during setup. Execution reliability is an ongoing cost that compounds every time the agent makes a call.

The AN Score: quantifying agent-nativeness

The Agent-Native (AN) Score is our framework for measuring this. Each API is evaluated across 20 specific dimensions spanning the two axes above, producing a single score from 0 to 10:

Tier · Score · What it means
L4 Native · 8.0–10.0 · Built for agents. Minimal friction, reliable execution, structured everything.
L3 Fluent · 6.0–7.9 · Agents can use this reliably with minor configuration.
L2 Developing · 4.0–5.9 · Usable with workarounds. Expect friction points.
L1 Emerging · 0.0–3.9 · Significant barriers. Not recommended for unsupervised agent use.
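The tiering itself is mechanical. A sketch, assuming a simple 70/30 weighted average of the two axes (the published score aggregates the 20 individual dimensions, so this axis-level average only approximates real AN Scores):

```python
def an_score(execution: float, access: float) -> float:
    """Combine the two axes with the 70/30 weighting described above.
    Approximate: real scores roll up 20 per-dimension ratings."""
    return round(0.7 * execution + 0.3 * access, 1)

def tier(score: float) -> str:
    """Map a 0-10 AN Score onto the four tiers."""
    if score >= 8.0:
        return "L4 Native"
    if score >= 6.0:
        return "L3 Fluent"
    if score >= 4.0:
        return "L2 Developing"
    return "L1 Emerging"
```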

Real example — payments:

Stripe scores 8.1 (L4 Native): execution score 9.0, access readiness 6.6. It has idempotency keys, structured errors, versioned webhooks, and an official agent toolkit. The access readiness score is lower because restricted API keys can silently scope-limit results — a documented failure mode that catches agents off guard. (An agent believes no customers exist when it simply lacks read permission.)

PayPal scores 4.9 (L2 Developing): execution score 5.9, access readiness 3.7. OAuth2 is the only auth method. Sandbox requires CAPTCHA verification. The moment your agent needs to click “I am not a robot,” the automation dies.

The gap between 8.1 and 4.9 isn't marginal. It's the difference between an agent that processes payments at 3am and one that pages a human.

Five questions to ask before your agent calls any API

You don't need a formal scoring framework to make better tool choices. Start with these five questions:

1. What happens when the request fails? Check the API's error responses. Do you get a structured JSON error with an error code, message, and suggested fix? Or a generic 500 with an HTML error page? Agents need parseable errors to decide whether to retry, fall back, or escalate.
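That retry-fall-back-or-escalate decision is only possible when the error is parseable. A minimal triage sketch — the `{"error": {"code": ...}}` shape and the code values are hypothetical, since every API names these differently:

```python
import json

def classify_failure(status: int, body: str, content_type: str) -> str:
    """Decide the agent's next move from an error response.
    Returns 'retry', 'fallback', or 'escalate'."""
    if status == 429 or status >= 500:
        return "retry"  # transient: safe to retry with backoff
    if "json" not in content_type:
        return "escalate"  # HTML error page: nothing machine-readable
    try:
        err = json.loads(body).get("error", {})
    except (json.JSONDecodeError, AttributeError):
        return "escalate"
    if err.get("code") in {"permission_denied", "resource_missing"}:
        return "fallback"  # try another tool or degrade gracefully
    return "escalate"  # bad request, auth failure: page a human
```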

2. Can the agent create credentials without a human? If signup requires email verification, phone number, or CAPTCHA — your agent can't self-provision. Look for APIs that offer programmatic key generation or zero-signup access paths (like x402 pay-per-call).

3. Are rate limits explicit and machine-readable? Good APIs return X-RateLimit-Remaining and Retry-After headers. Bad APIs just return 429 with no guidance. Your agent needs to know: how long should I wait? How many calls do I have left?
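Turning those headers into a pause is a few lines. A sketch, assuming string-valued headers as `urllib`/`requests` expose them; the fixed fallback delay is an arbitrary placeholder:

```python
def wait_time(headers: dict, default_backoff: float = 1.0) -> float:
    """How long to pause before the next call, from rate-limit headers.
    Falls back to a fixed delay when the API gives no guidance."""
    if "Retry-After" in headers:
        return float(headers["Retry-After"])  # server's explicit answer
    if headers.get("X-RateLimit-Remaining") == "0":
        # Some APIs expose a reset timestamp instead of a delay; a real
        # client would compute (reset - now) here.
        return default_backoff
    return 0.0  # budget remaining: proceed immediately
```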

4. Does the API version its responses? Breaking changes in response schemas are the #1 cause of silent agent failures. Look for explicit versioning (Stripe's API version headers, GitHub's API versions) rather than unversioned endpoints.
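Pin the version header where one exists (`Stripe-Version`, `X-GitHub-Api-Version`). Where it doesn't, a cheap guard against silent drift is to assert the response contract on every call — a sketch with a hypothetical expected shape:

```python
EXPECTED_KEYS = {"id", "status", "amount"}  # hypothetical response contract

def check_schema(payload: dict) -> dict:
    """Fail loudly when a response no longer matches the shape the
    agent was built against, instead of silently mis-parsing it."""
    missing = EXPECTED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"response schema drifted; missing: {sorted(missing)}")
    return payload
```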

5. Is there a sandbox that doesn't require production credentials? Your agent needs to test before going live. If the sandbox requires the same onboarding friction as production (business verification, credit card, manual approval), development iteration time explodes.

The “agent readiness” vs. “API agent-nativeness” distinction

This matters: most tools calling themselves “agent readiness scanners” (AgentReady, Pillar, SiteSpeakAI) evaluate websites for AI chatbot crawlability. They check llms.txt, robots.txt, structured data, and content formatting.

That's a different problem. Website agent-readiness is about making your content discoverable by AI search engines. API agent-nativeness is about making your endpoints usable by autonomous AI agents.

What's measured · Website agent readiness · API agent-nativeness (AN Score)
Target audience · AI search engines (ChatGPT, Perplexity) · AI agents calling APIs
Key metrics · llms.txt, robots.txt, Schema.org · Error handling, auth friction, idempotency
Score meaning · “Can AI find your content?” · “Can AI use your service?”
Evaluation method · Static page scan · API testing + documentation analysis
Examples · AgentReady, Pillar, SiteSpeakAI · AN Score (Rhumb)

Both matter. If you're a developer choosing tools, website readiness tells you whether the vendor takes AI seriously. API agent-nativeness tells you whether the product actually works in your agent pipeline.

How to use this in practice

If you're building an agent that calls external APIs:

  1. Check the AN Score for any service you're considering
  2. Read the failure modes before integrating — know where the API breaks for agents
  3. Prefer L3+ services for critical paths; use L2 services only with fallback logic
  4. Run npx rhumb-mcp in your agent to get scores at decision time

If you're evaluating a new API without a score:

Use the five questions above as a quick filter. If an API fails on questions 1 (error handling) or 2 (credential creation), it's likely L1–L2 regardless of other strengths.

If you're an API provider wanting to improve:

Read our methodology. The 20 dimensions are published and transparent. The most impactful improvements are usually: structured errors, API key auth (not just OAuth), and explicit rate limit headers.

See how your tools score

We've scored 600+ developer tools across 90+ categories. The AN Score measures real agent compatibility — not marketing claims.

Browse the leaderboard →
