Methodology · AN Score

How We Score

The Agent-Native Score rates developer tools on how well they work for autonomous AI agents. 20 dimensions, 2 axes, fully transparent.

Philosophy

“Agent-native” means evaluating tools from the agent's perspective — not a human reading documentation, but an autonomous program making API calls, parsing responses, handling errors, and deciding what to do next.

Human-oriented review sites ask: “Is the dashboard intuitive?” We ask: “When this API returns a 429, does the response include a machine-readable retry-after header, or does the agent have to parse a human-readable error string?”

Agent-native parity

This page explains the scoring system in human language, but the method should not be trapped here. The same model is exposed in machine-readable form through Rhumb's public interfaces.

Two scoring axes

Execution

70%

How reliably the tool works when an agent calls it. Covers API stability, error handling, latency, schema consistency, and end-to-end autonomy (payment, compliance, web accessibility).

13 dimensions

Access Readiness

30%

How easy it is for an agent to start using the tool. Signup friction, payment rails, credential management, documentation, and sandbox availability.

7 dimensions
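As a sketch, the two axes combine into one score via a weighted average. The per-dimension values and the simple-mean aggregation below are illustrative, not the published formula:

```python
def an_score(execution_dims: list[float], access_dims: list[float]) -> float:
    """Combine the two axes: 70% Execution, 30% Access Readiness.
    Each dimension is assumed to be scored 0-10."""
    execution = sum(execution_dims) / len(execution_dims)
    access = sum(access_dims) / len(access_dims)
    return round(0.7 * execution + 0.3 * access, 1)

# 13 Execution dimensions at 8.0, 7 Access Readiness dimensions at 6.0:
print(an_score([8.0] * 13, [6.0] * 7))  # 7.4
```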

The 20 dimensions

Execution (10 core dimensions · part of 70% Execution axis)

API Reliability

Uptime, error rate consistency, and graceful handling of edge cases under normal and peak load.

Error Ergonomics

Machine-readable error codes, structured error responses, retry-after headers. Can an agent self-heal from errors without human interpretation?

Schema Stability

Mean time between breaking changes (MTBBC). How often does the response shape change without warning?

Latency Distribution

P50, P95, and P99 latency — not averages. P99 determines whether agents time out and corrupt state.
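Why percentiles rather than averages: a small slow tail can blow an agent's timeout budget while barely moving the mean. A minimal nearest-rank percentile sketch, with invented sample data:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 95 fast responses and 5 nine-second outliers:
latencies = [0.1] * 95 + [9.0] * 5
print(percentile(latencies, 50))  # 0.1 -- the median looks healthy
print(percentile(latencies, 99))  # 9.0 -- the tail is what kills agents
```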

Idempotency

Support for idempotency keys, safe retries. Critical for payment and state-mutating operations.
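The mechanic behind idempotency keys, sketched server-side: a retried request with the same key replays the original result instead of executing twice. The in-memory store and charge payload are illustrative:

```python
import uuid

_processed: dict[str, dict] = {}  # server-side record of completed requests

def charge(idempotency_key: str, amount_cents: int) -> dict:
    """Process a payment at most once per key; retries return the first result."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]  # replay: no double charge
    result = {"status": "charged", "amount_cents": amount_cents}
    _processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())
first = charge(key, 500)
retry = charge(key, 500)   # network blip: the agent retries safely
print(first is retry)  # True
```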

Concurrent Behavior

How the API handles simultaneous connections — queue, reject, or silent drop. Silent drops are the worst: they look like success.

Cold-Start Latency

First-request latency vs. warm latency. Serverless cold starts on idle APIs affect agent timeout budgets.

Output Structure Quality

Structured JSON vs. freeform text responses. Structured output reduces downstream compute for consuming agents.

State Leakage

Implicit caching returning stale data across sequential calls. Agents need predictable, stateless responses.

Graceful Degradation

Slow responses under load vs. hard 503 cutoffs. Agents can handle slow; they can't handle surprise disconnections.

Access Readiness (7 dimensions · 30% weight)

Signup Autonomy

Can an agent create an account without a human clicking through OAuth, CAPTCHA, or email verification?

Payment Autonomy

Agent-compatible payment rails — API billing, consumption-based pricing, programmatic payment methods.

Provisioning Speed

Time from “I want to use this” to “I have working credentials.” Minutes matter for autonomous agents.

Credential Management

API key rotation, scoped tokens, programmatic credential lifecycle management.

Rate Limit Transparency

Published limits, machine-readable rate limit headers, predictable throttling behavior.

Documentation Quality

Machine-parseable docs, OpenAPI specs, code examples in multiple languages, token cost of context.

Sandbox/Test Mode

Dedicated test environments for agents to validate integrations without production consequences.

Autonomy (3 dimensions · included in 70% Execution axis)

Payment Integration

Native billing APIs, consumption pricing, Stripe Issuing support, programmatic everything.

Governance & Compliance

Programmatic ToS acceptance, audit trails, compliance certifications accessible via API.

Web Agent Accessibility

How well the provider's web interface works for browser-controlling agents (a11y tree quality, semantic HTML, keyboard navigability).

Tier system

L1

Opaque

0.0 – 3.9

Significant barriers to agent use. Missing basic machine-readability.

L2

Developing

4.0 – 5.9

Usable with workarounds. Some dimensions are strong, others need work.

L3

Ready

6.0 – 7.9

Agents can use this tool reliably. Minor friction points remain.

L4

Native

8.0 – 10.0

Built for agents. Excellent across all dimensions. The gold standard.
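The tier boundaries above amount to a simple threshold lookup; a minimal sketch:

```python
def tier(score: float) -> str:
    """Map a 0-10 AN Score to its tier label, per the ranges above."""
    if score >= 8.0:
        return "L4 Native"
    if score >= 6.0:
        return "L3 Ready"
    if score >= 4.0:
        return "L2 Developing"
    return "L1 Opaque"

print(tier(7.4))  # L3 Ready
```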

Data sources & limitations

Current scores are documentation-derived. This means they reflect what should work based on published documentation, not what actually works in practice.

We are transparent about this because trust requires honesty. Documentation-derived scoring provides broad coverage quickly (53 services in 10 days), but it cannot catch undocumented rate limits, silent schema changes, or production behavior that diverges from docs.

Roadmap: Observed agent execution scoring — real agents making real API calls — is our next major milestone. Until a score is explicitly labeled with a stronger evidence source, assume it is documentation-derived. We will only use runtime-oriented labels when the underlying evidence exists and is linked publicly.

Dispute a score

Disagree with a score? We want to hear about it. Every dispute is reviewed, and the outcome is public.