Methodology · March 24, 2026 · Pedro Nunes

How to Evaluate APIs for AI Agents: The 20-Dimension Framework

Most “agent-ready” scores measure website crawlability, not API usability. This is the 20-dimension framework for evaluating whether an API actually works for autonomous AI agents.

The wrong question everyone's asking

Search for “agent compatibility scoring” and you'll find a dozen tools that scan websites for AI crawlability — whether your site has llms.txt, structured data, or robots.txt rules for GPTBot. That's useful if you're optimizing a marketing page for ChatGPT citations.

But if you're building an AI agent that needs to use an API — send an email, process a payment, query a database — website crawlability tells you nothing. Your agent doesn't read your landing page. It calls your endpoints.

The real question isn't “Is this website AI-friendly?” It's: “Will this API actually work when my agent calls it at 3am with no human supervision?”

That's a fundamentally different evaluation. It requires measuring execution reliability, authentication friction, error handling quality, and dozens of other dimensions that website scanners don't touch.

What actually matters: execution vs. access

After scoring 665+ developer APIs across 86 categories, we've found that agent compatibility comes down to two axes:

Execution (70% of what matters)

Can the agent reliably get work done through this API?

  • Error handling: Does the API return structured, parseable errors? Or vague 500s that leave the agent guessing?
  • Schema stability: Do response shapes change between versions without warning?
  • Idempotency: Can the agent safely retry a failed request without creating duplicates?
  • Latency consistency: Are response times predictable enough for timeout management?
  • Rate limit transparency: Does the API tell the agent how long to wait, or just reject requests?
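Idempotency and retry behavior are easiest to see in code. Here's a minimal sketch of a retry loop that reuses a single idempotency key across attempts — the `send` callable, the header-in-a-dict shape, and the backoff values are illustrative assumptions, not any particular vendor's SDK:

```python
import time
import uuid

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(send, max_attempts=4, sleep=time.sleep):
    """Retry a flaky call without risking duplicates by reusing one
    idempotency key across every attempt. `send(key)` performs the
    actual HTTP call and returns (status_code, headers, body)."""
    key = str(uuid.uuid4())  # one key per logical operation, not per attempt
    status, headers, body = 0, {}, None
    for attempt in range(max_attempts):
        status, headers, body = send(key)
        if status not in RETRYABLE:
            return status, body
        # Honor the server's explicit guidance when it gives any;
        # otherwise fall back to exponential backoff.
        sleep(float(headers.get("Retry-After", 2 ** attempt)))
    return status, body
```

A server that honors idempotency keys (Stripe's `Idempotency-Key` header is the best-known example) returns the original result on a retried request instead of executing the operation twice.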

Access Readiness (30% of what matters)

Can the agent even get started?

  • Signup friction: Does creating credentials require email verification, phone numbers, or CAPTCHAs?
  • Authentication complexity: API key in a header? Or a multi-step OAuth dance requiring a browser?
  • Documentation quality: Can the agent (or the developer configuring it) understand the API from docs alone?
  • Sandbox availability: Is there a test environment that doesn't require production credentials?
  • Rate limits: Are free-tier limits high enough for development and testing?
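Of these, authentication complexity is the sharpest divider in practice. A header-key API is a one-line handshake for an agent; a browser-redirect OAuth flow isn't automatable at all. A stdlib sketch of the first, against a hypothetical endpoint:

```python
from urllib.request import Request

def agent_request(url: str, api_key: str) -> Request:
    """Header-key auth: the entire handshake is one header. Contrast a
    browser-redirect OAuth flow, which an unsupervised agent cannot
    complete, or client-credentials OAuth, which adds a token exchange
    plus expiry/refresh handling before the first real call."""
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})
```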

We weight execution at 70% because access friction is a one-time cost — you solve it during setup. Execution reliability is an ongoing cost that compounds every time the agent makes a call.

The AN Score: quantifying agent-nativeness

The Agent-Native (AN) Score is our framework for measuring this. Each API is evaluated across 20 specific dimensions spanning the two axes above, producing a single score from 0 to 10:

Tier · Score · What it means
L4 Native · 8.0–10.0 · Built for agents. Minimal friction, reliable execution, structured everything.
L3 Fluent · 6.0–7.9 · Agents can use this reliably with minor configuration.
L2 Developing · 4.0–5.9 · Usable with workarounds. Expect friction points.
L1 Emerging · 0.0–3.9 · Significant barriers. Not recommended for unsupervised agent use.
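The tiering itself is mechanical. A sketch, assuming a simple 70/30 weighted average of the two axes (the published score aggregates the 20 individual dimensions, so this axis-level average only approximates real AN Scores):

```python
def an_score(execution: float, access: float) -> float:
    """Combine the two axes with the 70/30 weighting described above.
    Approximate: real scores roll up 20 per-dimension ratings."""
    return round(0.7 * execution + 0.3 * access, 1)

def tier(score: float) -> str:
    """Map a 0-10 AN Score onto the four tiers."""
    if score >= 8.0:
        return "L4 Native"
    if score >= 6.0:
        return "L3 Fluent"
    if score >= 4.0:
        return "L2 Developing"
    return "L1 Emerging"
```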

Real example — payments:

Stripe scores 8.1 (L4 Native): execution score 9.0, access readiness 6.6. It has idempotency keys, structured errors, versioned webhooks, and an official agent toolkit. The access readiness score is lower because restricted API keys can silently scope-limit results — a documented failure mode that catches agents off guard. (An agent believes no customers exist when it simply lacks read permission.)

PayPal scores 4.9 (L2 Developing): execution score 5.9, access readiness 3.7. OAuth2 is the only auth method. Sandbox requires CAPTCHA verification. The moment your agent needs to click “I am not a robot,” the automation dies.

The gap between 8.1 and 4.9 isn't marginal. It's the difference between an agent that processes payments at 3am and one that pages a human.

Five questions to ask before your agent calls any API

You don't need a formal scoring framework to make better tool choices. Start with these five questions:

1. What happens when the request fails? Check the API's error responses. Do you get a structured JSON error with an error code, message, and suggested fix? Or a generic 500 with an HTML error page? Agents need parseable errors to decide whether to retry, fall back, or escalate.
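That retry-fall-back-or-escalate decision is only possible when the error is parseable. A minimal triage sketch — the `{"error": {"code": ...}}` shape and the code values are hypothetical, since every API names these differently:

```python
import json

def classify_failure(status: int, body: str, content_type: str) -> str:
    """Decide the agent's next move from an error response.
    Returns 'retry', 'fallback', or 'escalate'."""
    if status == 429 or status >= 500:
        return "retry"  # transient: safe to retry with backoff
    if "json" not in content_type:
        return "escalate"  # HTML error page: nothing machine-readable
    try:
        err = json.loads(body).get("error", {})
    except (json.JSONDecodeError, AttributeError):
        return "escalate"
    if err.get("code") in {"permission_denied", "resource_missing"}:
        return "fallback"  # try another tool or degrade gracefully
    return "escalate"  # bad request, auth failure: page a human
```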

2. Can the agent create credentials without a human? If signup requires email verification, phone number, or CAPTCHA — your agent can't self-provision. Look for APIs that offer programmatic key generation or zero-signup access paths (like x402 pay-per-call).

3. Are rate limits explicit and machine-readable? Good APIs return X-RateLimit-Remaining and Retry-After headers. Bad APIs just return 429 with no guidance. Your agent needs to know: how long should I wait? How many calls do I have left?
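Turning those headers into a pause is a few lines. A sketch, assuming string-valued headers as `urllib`/`requests` expose them; the fixed fallback delay is an arbitrary placeholder:

```python
def wait_time(headers: dict, default_backoff: float = 1.0) -> float:
    """How long to pause before the next call, from rate-limit headers.
    Falls back to a fixed delay when the API gives no guidance."""
    if "Retry-After" in headers:
        return float(headers["Retry-After"])  # server's explicit answer
    if headers.get("X-RateLimit-Remaining") == "0":
        # Some APIs expose a reset timestamp instead of a delay; a real
        # client would compute (reset - now) here.
        return default_backoff
    return 0.0  # budget remaining: proceed immediately
```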

4. Does the API version its responses? Breaking changes in response schemas are the #1 cause of silent agent failures. Look for explicit versioning (Stripe's API version headers, GitHub's API versions) rather than unversioned endpoints.
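Pin the version header where one exists (`Stripe-Version`, `X-GitHub-Api-Version`). Where it doesn't, a cheap guard against silent drift is to assert the response contract on every call — a sketch with a hypothetical expected shape:

```python
EXPECTED_KEYS = {"id", "status", "amount"}  # hypothetical response contract

def check_schema(payload: dict) -> dict:
    """Fail loudly when a response no longer matches the shape the
    agent was built against, instead of silently mis-parsing it."""
    missing = EXPECTED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"response schema drifted; missing: {sorted(missing)}")
    return payload
```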

5. Is there a sandbox that doesn't require production credentials? Your agent needs to test before going live. If the sandbox requires the same onboarding friction as production (business verification, credit card, manual approval), development iteration time explodes.

The “agent readiness” vs. “API agent-nativeness” distinction

This matters: most tools calling themselves “agent readiness scanners” (AgentReady, Pillar, SiteSpeakAI) evaluate websites for AI chatbot crawlability. They check llms.txt, robots.txt, structured data, and content formatting.

That's a different problem. Website agent-readiness is about making your content discoverable by AI search engines. API agent-nativeness is about making your endpoints usable by autonomous AI agents.

What's measured · Website agent readiness · API agent-nativeness (AN Score)
Target audience · AI search engines (ChatGPT, Perplexity) · AI agents calling APIs
Key metrics · llms.txt, robots.txt, Schema.org · Error handling, auth friction, idempotency
Score meaning · “Can AI find your content?” · “Can AI use your service?”
Evaluation method · Static page scan · API testing + documentation analysis
Examples · AgentReady, Pillar, SiteSpeakAI · AN Score (Rhumb)

Both matter. If you're a developer choosing tools, website readiness tells you whether the vendor takes AI seriously. API agent-nativeness tells you whether the product actually works in your agent pipeline.

How to use this in practice

If you're building an agent that calls external APIs:

  1. Check the AN Score for any service you're considering
  2. Read the failure modes before integrating — know where the API breaks for agents
  3. Prefer L3+ services for critical paths; use L2 services only with fallback logic
  4. Run npx rhumb-mcp in your agent to get scores at decision time

If you're evaluating a new API without a score:

Use the five questions above as a quick filter. If an API fails on questions 1 (error handling) or 2 (credential creation), it's likely L1–L2 regardless of other strengths.

If you're an API provider wanting to improve:

Read our methodology. The 20 dimensions are published and transparent. The most impactful improvements are usually: structured errors, API key auth (not just OAuth), and explicit rate limit headers.

See how your tools score

We've scored 600+ developer tools across 90+ categories. The AN Score measures real agent compatibility — not marketing claims.

Browse the leaderboard →
