
We Scored Ourselves

Rhumb applies its own 20-dimension AN Score methodology to itself. The result: 7.0/10 — Tier L3 (Fluent). Not agent-native yet, by our own standard.

By Pedro Nunes · 12 min read
7.3 · Execution (70%)
6.4 · Access Readiness (30%)
7.7 · Autonomy Bonus

Why Score Ourselves?

We built a scoring system for APIs. If we won't use it on ourselves, why should anyone trust it on others?

This isn't a marketing exercise. We applied the same 20 dimensions, the same weighting, the same severity framework we use for every service in our directory. The result is honest — and honestly humbling. We're L3 (Fluent), not L4 (Native). There are real gaps.

Every score below includes two lines: what we can legitimately claim, and what we can't. If you're evaluating Rhumb, this is the document to read.

Execution Dimensions (70% weight)

API Reliability

6.0

✅ Reasonable uptime on Railway, structured error handling

⚠️ No published SLA, no status page, limited production traffic history

Error Ergonomics

8.0

✅ Structured JSON errors, correct HTTP codes, x402 payment instructions, Retry-After headers

⚠️ No machine-readable error code enum beyond HTTP status
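The caveat above matters because, without an error-code enum, an agent has to branch on HTTP status plus headers. A minimal sketch of that branching — every field name here is hypothetical, not Rhumb's documented error schema:

```python
def next_action(status: int, headers: dict, body: dict) -> str:
    """Decide what an agent should do with a structured error response.
    Field names ('payment', 'amount', 'error') are illustrative."""
    if status == 429:
        # Retry-After carries a delay in seconds (RFC 9110 also allows an HTTP-date).
        return f"retry in {int(headers.get('Retry-After', '1'))}s"
    if status == 402:
        # x402 responses carry payment instructions in the body.
        return f"pay {body.get('payment', {}).get('amount', '?')} and retry"
    if 400 <= status < 500:
        # Other 4xx: the request itself is wrong; retrying won't help.
        return "fix request: " + body.get("error", "unknown")
    # 5xx: transient server-side failure.
    return "retry with backoff"

print(next_action(429, {"Retry-After": "7"}, {}))  # retry in 7s
```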

Schema Stability

7.0

✅ Versioned at /v1, consistent response envelope, zero breaking changes

⚠️ API is young — MTBBC (mean time between breaking changes) is undefined because there haven't been enough months to measure

Latency Distribution

7.5

✅ Proxy P50 overhead 4.1ms, direct calls <200ms

⚠️ No published P99 figures, no multi-region deployment

Idempotency

8.0

✅ Idempotency keys on execution, x402 replay prevention, GET naturally idempotent

⚠️ Explicit idempotency keys cover only the execution path
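As a sketch of why execution-path idempotency keys matter: the agent mints one key before the first attempt and reuses it across retries, so a duplicated delivery can't run the capability twice. The header name and client shape are illustrative, not a documented SDK:

```python
import uuid

def execute_with_retry(call, payload, attempts=3):
    """Send an execution request, reusing a single idempotency key across
    retries. `call` is an injected stand-in for the HTTP client."""
    key = str(uuid.uuid4())  # minted once, before the first attempt
    last = None
    for _ in range(attempts):
        # Same key on every attempt: the server can deduplicate replays.
        last = call(payload, headers={"Idempotency-Key": key})
        if last.get("ok"):
            return last
    return last
```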

Concurrent Behavior

6.0

✅ Asyncio handles concurrent requests, per-agent rate limits

⚠️ No explicit documentation of concurrent connection handling or queue behavior

Cold-Start Latency

7.0

✅ Persistent container (no serverless cold starts), health check keeps warm

⚠️ No published cold-start vs warm figures

Output Structure Quality

9.0

✅ All structured JSON, consistent envelope, rich score responses with failure modes

⚠️ This is genuinely strong — structured data is in our DNA
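A consistent envelope is what lets an agent parse every response through a single code path. A minimal sketch, assuming a hypothetical data-or-error envelope (field names illustrative, not Rhumb's actual schema):

```python
def unwrap(envelope: dict):
    """Parse a hypothetical data-or-error envelope: every response carries
    either `data` or `error`, never both."""
    if envelope.get("error") is not None:
        raise RuntimeError(envelope["error"]["message"])
    return envelope["data"]
```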

State Leakage

8.0

✅ Stateless by design, no implicit caching, no cross-agent data leakage

⚠️ Rate limit counters are the only server-side state retained between requests

Graceful Degradation

6.0

✅ Proxy handles upstream failures, capability execution reports fallbacks

⚠️ No CDN/cache layer for reads, no public health endpoint, single point of failure

Access Readiness Dimensions (30% weight)

Signup Autonomy

7.0

✅ OAuth signup (GitHub + Google), x402 needs zero signup

⚠️ OAuth requires browser interaction — not ideal for headless agents

Payment Autonomy

9.0

✅ x402 USDC zero-signup pay-per-call, Stripe prepaid, free tier for discovery

⚠️ No fiat wire/invoice for enterprise

Provisioning Speed

8.0

✅ x402 instant, OAuth <30s to API key, MCP needs no key for discovery

⚠️ No programmatic key issuance API

Credential Management

5.0

✅ Key rotation via dashboard

⚠️ Single key per user, no scoped tokens, no key management API — this is thin

Rate Limit Transparency

7.0

✅ 429 returns Retry-After, per-agent limits enforced

⚠️ No published rate limit docs page; rate limit headers appear only on 429 responses, not on every response
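Until rate limit headers appear on every response, a client only learns its budget from 429s. One defensive pattern (a sketch, not anything Rhumb prescribes): honor Retry-After when the 429 supplies it, and fall back to capped exponential backoff otherwise:

```python
def backoff_delay(attempt: int, retry_after=None, base=0.5, cap=30.0) -> float:
    """Seconds to wait before retrying a rate-limited call.
    `retry_after` is the Retry-After header value from a 429, if present."""
    if retry_after is not None:
        return float(retry_after)  # server told us exactly how long
    # No header: capped exponential backoff (0.5s, 1s, 2s, ... up to cap).
    return min(cap, base * (2 ** attempt))
```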

Documentation Quality

5.0

✅ Methodology, quickstart, glossary, llms.txt, agent-capabilities.json

⚠️ No complete API reference, no Python SDK, OpenAPI disabled for security

Sandbox/Test Mode

4.0

✅ Free tier as implicit sandbox

⚠️ No dedicated test environment — agents can't safely experiment without production consequences

Autonomy Dimensions (bonus)

Payment Integration

9.0

✅ x402 USDC native, Stripe programmatic, budget controls, ledger API

⚠️ No programmatic refund API

Governance & Compliance

6.0

✅ Execution logging, budget enforcement

⚠️ No compliance certs (SOC 2), no audit export, no GDPR deletion endpoint, ToS pending legal review

Web Agent Accessibility

8.0

✅ Astro static HTML, JSON-LD, agent meta tags, keyboard navigable

⚠️ Missing ARIA labels on some dashboard components

Our Failure Modes

CRITICAL

No Sandbox Environment

Agents testing integrations affect production data. A misfire on email.send sends a real email.

Workaround: Use discovery endpoints (free, read-only) for evaluation. Use x402 with small amounts for execution testing.

HIGH

Single API Key Per User

Cannot scope access per agent or per capability. All-or-nothing access model.

Workaround: Use x402 path (no key needed, per-call payment acts as implicit scoping).

HIGH

Documentation Gaps

Agent must discover API behavior through trial and error. No OpenAPI spec available publicly.

Workaround: Use MCP tool descriptions and llms.txt for agent-parseable documentation.

MEDIUM

No Multi-Region

High latency for non-US users. Single Railway container in us-west.

Workaround: None currently. Plan for multi-region deployment post-launch.

MEDIUM

Score Evidence Mostly Documentation-Derived

Our scores may not reflect actual runtime behavior; this is disclosed transparently on the methodology page.

Runtime-backed reviews are labeled separately. The ratio of runtime-backed to documentation-derived scores is tracked and will be published once it meets our quality floor.

What This Tells Us

We're L3, not L4. We preach agent-native but our own access patterns have friction. The x402 zero-signup path is genuinely L4 — an agent can discover, evaluate, and pay for a capability with no human involvement. But the OAuth path, the dashboard, the credential management? L3 at best.

Sandbox is the biggest miss. Every API we score highly has a test mode. We don't. An agent integrating Rhumb has to make real calls to validate its workflow. For a platform that evaluates API quality for agents, this is ironic.

Our best feature is Payment Autonomy (9.0). x402 on USDC is genuinely novel. Zero signup, zero credential management, pay-per-call with cryptographic proof. This is the future of agent-to-service interaction, and we're one of the first to ship it.
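The x402 loop described above can be sketched end to end: call, receive a 402 carrying payment instructions, settle in USDC, retry with proof attached. The header name and payload fields below follow x402 convention but are illustrative; `request` and `pay` are injected stand-ins, not a real client or wallet:

```python
def call_with_x402(request, pay, max_price="0.05"):
    """Sketch of a zero-signup x402 call. `request(headers)` performs the
    HTTP call; `pay(instructions)` settles USDC and returns a payment proof."""
    resp = request(headers={})
    if resp["status"] != 402:
        return resp  # free endpoint, or already authorized
    instructions = resp["body"]["payment"]  # amount, asset, pay-to address
    if float(instructions["amount"]) > float(max_price):
        raise RuntimeError("price exceeds agent budget")
    proof = pay(instructions)  # on-chain USDC transfer, no account needed
    # Retry the same call with cryptographic proof of payment attached.
    return request(headers={"X-PAYMENT": proof})
```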

Our worst dimension is Sandbox (4.0). We know. It's on the roadmap. But publishing this score with a 4.0 in it — instead of waiting until we fix it — is the whole point. If we only score ourselves when we look good, the methodology is worthless.

See How Your API Scores

We use the same methodology on 258+ services. Check your API's AN Score, or try the MCP server to evaluate programmatically.