How We Score
The Agent-Native Score rates developer tools on how well they work for autonomous AI agents. 20 dimensions, 2 axes, fully transparent.
Philosophy
“Agent-native” means evaluating tools from the agent's perspective — not a human reading documentation, but an autonomous program making API calls, parsing responses, handling errors, and deciding what to do next.
Human-oriented review sites ask: “Is the dashboard intuitive?” We ask: “When this API returns a 429, does the response include a machine-readable retry-after header, or does the agent have to parse a human-readable error string?”
Agent-native parity
This page explains the scoring system in human language, but the methodology is not confined to prose. The same model is exposed in machine-readable form through Rhumb's public interfaces:
- llms.txt: discover the available routes and tools
- Docs: API and MCP usage for programmatic retrieval of scores, alternatives, and failure modes
- Service pages: each score page links back to methodology, trust, self-score, and dispute flow so provenance is visible at the moment of evaluation
Two scoring axes
Execution
70% weight · 13 dimensions
How reliably the tool works when an agent calls it: API stability, error handling, latency, schema consistency, and end-to-end autonomy (payment, compliance, web accessibility).
Access Readiness
30% weight · 7 dimensions
How easy it is for an agent to start using the tool: signup friction, payment rails, credential management, documentation, and sandbox availability.
The 20 dimensions
Execution (10 core dimensions · part of 70% Execution axis)
API Reliability
Uptime, error rate consistency, and graceful handling of edge cases under normal and peak load.
Error Ergonomics
Machine-readable error codes, structured error responses, retry-after headers. Can an agent self-heal from errors without human interpretation?
Schema Stability
Mean time between breaking changes (MTBBC). How often does the response shape change without warning?
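MTBBC is just the average gap between observed breaking changes. A sketch over a made-up change log (the dates are invented for illustration):

```python
from datetime import date

# Hypothetical dates on which an API shipped a breaking schema change.
breaking_changes = [date(2024, 1, 10), date(2024, 4, 2), date(2025, 1, 15)]

def mtbbc_days(dates: list[date]) -> float:
    """Mean number of days between consecutive breaking changes."""
    ordered = sorted(dates)
    gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

print(mtbbc_days(breaking_changes))  # 185.5
```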
Latency Distribution
P50, P95, P99 latency — not averages. P99 determines whether agents time out and corrupt state.
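Why averages mislead: one slow outlier barely moves the mean but dominates P99. A nearest-rank percentile sketch over invented latency samples:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]

# Nine fast responses and one 4-second outlier (milliseconds).
latencies_ms = [120, 95, 110, 4000, 130, 105, 98, 115, 102, 125]

print(sum(latencies_ms) / len(latencies_ms))  # mean: 500.0 — merely "slow"
print(percentile(latencies_ms, 50))           # P50: 110 — typical case is fine
print(percentile(latencies_ms, 99))           # P99: 4000 — the timeout killer
```

An agent with a 2-second timeout budget succeeds on the P50 path and corrupts state on the P99 path; the mean predicts neither.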
Idempotency
Support for idempotency keys, safe retries. Critical for payment and state-mutating operations.
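The retry pattern in miniature, against a toy in-memory server (`FlakyPaymentAPI` is a stand-in, not a real SDK): reuse one key for the whole logical operation, so a retry after a dropped response cannot double-charge.

```python
import uuid

class FlakyPaymentAPI:
    """Toy server: times out once, then succeeds; dedupes on the idempotency key."""
    def __init__(self):
        self.calls = 0
        self.seen = {}

    def charge(self, amount_cents: int, idempotency_key: str) -> dict:
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError  # first attempt drops before a response arrives
        # A replayed key returns the original result instead of charging again.
        if idempotency_key not in self.seen:
            self.seen[idempotency_key] = {"charged": amount_cents}
        return self.seen[idempotency_key]

def charge_with_retry(client, amount_cents: int, max_attempts: int = 3) -> dict:
    key = str(uuid.uuid4())  # one key per logical operation, reused on every retry
    for _ in range(max_attempts):
        try:
            return client.charge(amount_cents, idempotency_key=key)
        except TimeoutError:
            continue  # safe: the server dedupes on the key
    raise RuntimeError("charge failed after retries")

api = FlakyPaymentAPI()
result = charge_with_retry(api, 500)
print(result)  # {'charged': 500} — two attempts, exactly one charge
```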
Concurrent Behavior
How the API handles simultaneous connections — queue, reject, or silent drop. Silent drops are worst: they look like success.
Cold-Start Latency
First-request latency vs. warm latency. Serverless cold starts on idle APIs affect agent timeout budgets.
Output Structure Quality
Structured JSON vs. freeform text responses. Structured output reduces downstream compute for consuming agents.
State Leakage
Implicit caching returning stale data across sequential calls. Agents need predictable, stateless responses.
Graceful Degradation
Slow responses under load vs. hard 503 cutoffs. Agents can handle slow; they can't handle surprise disconnections.
Access Readiness (7 dimensions · 30% weight)
Signup Autonomy
Can an agent create an account without a human clicking through OAuth, CAPTCHA, or email verification?
Payment Autonomy
Agent-compatible payment rails — API billing, consumption-based pricing, programmatic payment methods.
Provisioning Speed
Time from “I want to use this” to “I have working credentials.” Minutes matter for autonomous agents.
Credential Management
API key rotation, scoped tokens, programmatic credential lifecycle management.
Rate Limit Transparency
Published limits, machine-readable rate limit headers, predictable throttling behavior.
Documentation Quality
Machine-parseable docs, OpenAPI specs, code examples in multiple languages, token cost of context.
Sandbox/Test Mode
Dedicated test environments for agents to validate integrations without production consequences.
Autonomy (3 dimensions · included in 70% Execution axis)
Payment Integration
Native billing APIs, consumption pricing, Stripe Issuing support, programmatic everything.
Governance & Compliance
Programmatic ToS acceptance, audit trails, compliance certifications accessible via API.
Web Agent Accessibility
How well the provider's web interface works for browser-controlling agents (a11y tree quality, semantic HTML, keyboard navigability).
Tier system
Opaque
0.0 – 3.9 · Significant barriers to agent use. Missing basic machine-readability.
Developing
4.0 – 5.9 · Usable with workarounds. Some dimensions are strong, others need work.
Ready
6.0 – 7.9 · Agents can use this tool reliably. Minor friction points remain.
Native
8.0 – 10.0 · Built for agents. Excellent across all dimensions. The gold standard.
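Combining the published numbers: a sketch that applies the 70/30 axis weights and maps the result onto the tier bands above. How individual dimensions roll up into each axis score is not specified on this page, so the axis scores are taken as given.

```python
EXECUTION_WEIGHT = 0.70
ACCESS_WEIGHT = 0.30

def overall_score(execution: float, access: float) -> float:
    """Weighted combination of the two axis scores (each on a 0-10 scale)."""
    return EXECUTION_WEIGHT * execution + ACCESS_WEIGHT * access

def tier(score: float) -> str:
    """Map a 0-10 score onto the published tier bands."""
    if score >= 8.0:
        return "Native"
    if score >= 6.0:
        return "Ready"
    if score >= 4.0:
        return "Developing"
    return "Opaque"

score = overall_score(execution=8.0, access=6.0)
print(round(score, 1), tier(score))  # 7.4 Ready
```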
Data sources & limitations
Current scores are documentation-derived. This means they reflect what should work based on published documentation, not what actually works when tested in practice.
We are transparent about this because trust requires honesty. Documentation-derived scoring provides broad coverage quickly (53 services in 10 days), but it cannot catch undocumented rate limits, silent schema changes, or production behavior that diverges from docs.
Roadmap: observed agent execution scoring — real agents making real API calls — is our next major milestone. Until a score is explicitly labeled with a stronger evidence source, assume it is documentation-derived. We will only use runtime-oriented labels when the underlying evidence exists and is linked publicly.
Dispute a score
Disagree with a score? We want to hear about it. Every dispute is reviewed, and the outcome is public.