
We Scored Ourselves

Rhumb applies its own 20-dimension AN Score methodology to itself. The result: 7.0/10 — Tier L3 (Fluent). Not agent-native yet, by our own standard.

By Pedro Nunes · 12 min read

Execution (70%): 7.3
Access Readiness (30%): 6.4
Autonomy Bonus: 7.7

Why Score Ourselves?

We built a scoring system for APIs. If we won't use it on ourselves, why should anyone trust it on others?

This isn't a marketing exercise. We applied the same 20 dimensions, the same weighting, the same severity framework we use for every service in our directory. The result is honest — and honestly humbling. We're L3 (Fluent), not L4 (Native). There are real gaps.

Every score below includes two lines: what we can legitimately claim, and what we can't. If you're evaluating Rhumb, this is the document to read.

Execution Dimensions (70% weight)

API Reliability

6.0

✅ Reasonable uptime on Railway, structured error handling

⚠️ No published SLA, no status page, limited production traffic history

Error Ergonomics

8.0

✅ Structured JSON errors, correct HTTP codes, x402 payment instructions, Retry-After headers

⚠️ No machine-readable error code enum beyond HTTP status

Schema Stability

7.0

✅ Versioned at /v1, consistent response envelope, zero breaking changes

⚠️ API is young: MTBBC (mean time between breaking changes) is undefined because there has not been enough history to measure it

Latency Distribution

7.5

✅ Proxy P50 overhead 4.1ms, direct calls <200ms

⚠️ No published P99 figures, no multi-region deployment
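
Until P99 figures are published, an integrator can measure the distribution from their own call timings. The sketch below is one way to do that with the standard library; the function name and the choice of the "inclusive" interpolation method are assumptions for illustration, not part of Rhumb's tooling.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return (p50, p99) for a list of latency samples in milliseconds.

    Hypothetical helper: collect samples_ms yourself by timing real calls.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is P50 and index 98 is P99.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]
```

A P99 estimate from a few dozen samples is noisy, so collect several hundred timings before comparing it with the P50 figure.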

Idempotency

8.0

✅ Idempotency keys on execution, x402 replay prevention, GET naturally idempotent

⚠️ Full coverage on the execution path
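
For an agent, idempotency keys only help if a retry carries the same key as the original attempt. A minimal sketch of deriving a stable key on the client side follows. The header name `Idempotency-Key`, the `operation_id` parameter, and the capability name in the usage note are assumptions, since the source does not document the wire format.

```python
import hashlib
import json


def idempotency_key(operation_id, capability, payload):
    """Derive a stable key so a retried request reuses the same key.

    operation_id is a caller-chosen identifier for one logical operation;
    it keeps two intentionally identical calls from sharing a key.
    """
    canonical = json.dumps(
        {"op": operation_id, "capability": capability, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_headers(operation_id, capability, payload):
    # "Idempotency-Key" is an assumed header name, not confirmed by the source.
    return {"Idempotency-Key": idempotency_key(operation_id, capability, payload)}
```

Serializing with sorted keys means `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` hash to the same key, so a retry built from a reordered dict is still recognized as a replay.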

Concurrent Behavior

6.0

✅ Asyncio handles concurrent requests, per-agent rate limits

⚠️ No explicit documentation of concurrent connection handling or queue behavior

Cold-Start Latency

7.0

✅ Persistent container (no serverless cold starts), health check keeps warm

⚠️ No published cold-start vs warm figures

Output Structure Quality

9.0

✅ All structured JSON, consistent envelope, rich score responses with failure modes

⚠️ This is genuinely strong — structured data is in our DNA

State Leakage

8.0

✅ Stateless by design, no implicit caching, no cross-agent data leakage

⚠️ Rate limit counters are the only state that persists across requests

Graceful Degradation

6.0

✅ Proxy handles upstream failures, capability execution reports fallbacks

⚠️ No CDN/cache layer for reads, no public health endpoint, single point of failure

Access Readiness Dimensions (30% weight)

Signup Autonomy

7.0

✅ OAuth signup (GitHub + Google), x402 needs zero signup

⚠️ OAuth requires browser interaction — not ideal for headless agents

Payment Autonomy

9.0

✅ x402 USDC zero-signup pay-per-call, Stripe prepaid, free tier for discovery

⚠️ No fiat wire/invoice for enterprise
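
The zero-signup path can be pictured as a two-step exchange: the first call is answered with HTTP 402 and payment instructions, and the retry carries a signed payment. The sketch below shows only that control flow, with the transport and the signer injected as plain functions. The `X-PAYMENT` header name, the `price_usdc` field, and both callables are assumptions for illustration and are not taken from Rhumb's documentation.

```python
def call_with_x402(send, sign_payment, request, max_price_usdc):
    """Call a paid endpoint, paying once if the server answers 402.

    send(request, headers) -> (status, response_headers, body)   [hypothetical]
    sign_payment(requirements) -> str payment proof              [hypothetical]
    """
    status, _headers, body = send(request, {})
    if status != 402:
        return status, body

    # Assumed shape: the 402 body carries the quoted price and payment details.
    requirements = body
    if requirements["price_usdc"] > max_price_usdc:
        raise RuntimeError("quoted price exceeds the caller's budget")

    proof = sign_payment(requirements)
    # Assumed header name for the signed payment on the retry.
    status, _headers, body = send(request, {"X-PAYMENT": proof})
    return status, body
```

Keeping the price check on the client means an agent never signs a payment above the ceiling its operator set, regardless of what the server quotes.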

Provisioning Speed

8.0

✅ x402 instant, OAuth <30s to API key, MCP needs no key for discovery

⚠️ No programmatic key issuance API

Credential Management

5.0

✅ Key rotation via dashboard

⚠️ Single key per user, no scoped tokens, no key management API — this is thin

Rate Limit Transparency

7.0

✅ 429 returns Retry-After, per-agent limits enforced

⚠️ No published rate limit docs page, headers only on 429 not all responses
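
Because limit headers appear only on the 429 itself, a client cannot pace itself in advance and has to react after the fact. A minimal sketch of that reaction, assuming the headers arrive as a plain dict with the exact key `Retry-After` (the helper name and the one-second default are also assumptions):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(status, headers, now=None, default=1.0):
    """Seconds to wait before retrying, or None if the response is not a 429."""
    if status != 429:
        return None
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    # Retry-After is either a number of seconds or an HTTP-date.
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Handling both the seconds form and the date form matters because the source does not say which one the API emits.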

Documentation Quality

5.0

✅ Methodology, quickstart, glossary, llms.txt, agent-capabilities.json

⚠️ No complete API reference, no Python SDK, OpenAPI disabled for security

Sandbox/Test Mode

4.0

✅ Free tier as implicit sandbox

⚠️ No dedicated test environment — agents can't safely experiment without production consequences

Autonomy Dimensions (bonus)

Payment Integration

9.0

✅ x402 USDC native, Stripe programmatic, budget controls, ledger API

⚠️ No programmatic refund API

Governance & Compliance

6.0

✅ Execution logging, budget enforcement

⚠️ No compliance certs (SOC 2), no audit export, no GDPR deletion endpoint, ToS pending legal review

Web Agent Accessibility

8.0

✅ Astro static HTML, JSON-LD, agent meta tags, keyboard navigable

⚠️ Missing ARIA labels on some dashboard components
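
The headline numbers can be checked against the dimension scores above. The sketch below assumes each category score is an unweighted mean of its dimensions, that the composite is the 70/30 blend of the Execution and Access Readiness means, and that figures are rounded half-up to one decimal place. The source states the weights but not the aggregation or rounding rule, and it does not say how the Autonomy bonus enters the composite, so the bonus is reported separately here.

```python
from decimal import Decimal, ROUND_HALF_UP
from statistics import mean

# Dimension scores copied from the scorecard, in the order listed.
EXECUTION = [6.0, 8.0, 7.0, 7.5, 8.0, 6.0, 7.0, 9.0, 8.0, 6.0]
ACCESS = [7.0, 9.0, 8.0, 5.0, 7.0, 5.0, 4.0]
AUTONOMY = [9.0, 6.0, 8.0]


def one_decimal(value):
    """Round half-up to one decimal place (assumed display rounding)."""
    return Decimal(str(value)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def composite(execution, access, w_execution=0.7, w_access=0.3):
    """Blend the two category means with the stated 70/30 weights."""
    return w_execution * mean(execution) + w_access * mean(access)


if __name__ == "__main__":
    print("Execution:", one_decimal(mean(EXECUTION)))
    print("Access Readiness:", one_decimal(mean(ACCESS)))
    print("Autonomy Bonus:", one_decimal(mean(AUTONOMY)))
    print("Composite:", one_decimal(composite(EXECUTION, ACCESS)))
```

Under these assumptions the category means come out to 7.3, 6.4, and 7.7, and the composite to 7.0, matching the published figures. The half-up rule matters for Execution: the raw mean is exactly 7.25, which Python's built-in `round` would display as 7.2.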

Our Failure Modes

CRITICAL

No Sandbox Environment

Agents testing integrations affect production data. A misfire on email.send sends a real email.

Workaround: Use discovery endpoints (free, read-only) for evaluation. Use x402 with small amounts for execution testing.
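
Without a sandbox, the safety net has to live in the client. One way to apply the workaround is a small guard that caps test spending and refuses side-effecting capabilities unless they are explicitly allowed. Everything in this sketch is hypothetical: the class, its fields, and the `search.query` capability name are invented for illustration, and only `email.send` comes from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionGuard:
    """Client-side guard for exercising a production API that has no sandbox."""

    max_spend_usdc: float
    allow_side_effects: set = field(default_factory=set)
    spent_usdc: float = 0.0

    def check(self, capability, price_usdc, has_side_effects):
        """Raise before the call is made if it is unsafe or over budget."""
        if has_side_effects and capability not in self.allow_side_effects:
            raise PermissionError(f"{capability} has real-world side effects")
        if self.spent_usdc + price_usdc > self.max_spend_usdc:
            raise RuntimeError("test budget exceeded")
        # Record the spend only after both checks pass.
        self.spent_usdc += price_usdc


guard = ExecutionGuard(max_spend_usdc=1.0)
guard.check("search.query", 0.25, has_side_effects=False)  # allowed, spend recorded
```

Calling `guard.check("email.send", 0.25, has_side_effects=True)` on the same guard raises `PermissionError`, which is the behavior a test harness wants when a misfire would send a real email.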

HIGH

Single API Key Per User

Cannot scope access per agent or per capability. All-or-nothing access model.

Workaround: Use x402 path (no key needed, per-call payment acts as implicit scoping).

HIGH

Documentation Gaps

Agent must discover API behavior through trial and error. No OpenAPI spec available publicly.

Workaround: Use MCP tool descriptions and llms.txt for agent-parseable documentation.

MEDIUM

No Multi-Region

High latency for non-US users. Single Railway container in us-west.

Workaround: None currently. Plan for multi-region deployment post-launch.

MEDIUM

Score Evidence Mostly Documentation-Derived

Scores may not reflect actual runtime behavior. Disclosed transparently on methodology page.

Runtime-backed reviews are labeled separately. Ratio is tracked and published when it meets our quality floor.

What This Tells Us

We're L3, not L4. We preach agent-native but our own access patterns have friction. The x402 zero-signup path is genuinely L4 — an agent can discover, evaluate, and pay for a capability with no human involvement. But the OAuth path, the dashboard, the credential management? L3 at best.

Sandbox is the biggest miss. Every API we score highly has a test mode. We don't. An agent integrating Rhumb has to make real calls to validate its workflow. For a platform that evaluates API quality for agents, this is ironic.

Our best feature is Payment Autonomy (9.0). x402 on USDC is genuinely novel. Zero signup, zero credential management, pay-per-call with cryptographic proof. This is the future of agent-to-service interaction, and we're one of the first to ship it.

Our worst dimension is Sandbox (4.0). We know. It's on the roadmap. But publishing this score with a 4.0 in it — instead of waiting until we fix it — is the whole point. If we only score ourselves when we look good, the methodology is worthless.

Next honest step

If the disclosed gaps still look workable, choose the bounded path before you widen the trust boundary.

The honest read is the same as the scorecard: Rhumb is strongest when the first run stays narrow, attributable, and legible. Start with capability-first onboarding, jump straight into the managed lane if the fit is already clear, or inspect pricing once you know the path is worth it.


The self-assessment only matters if you inspect the live operator failure surfaces too

Publishing the score is the honesty layer, not the whole operating picture. The current production question is what breaks in unattended loops, how shared provider budgets stay bounded when multiple agents reuse the same lane, and whether credential scope survives rotation and growth without turning one success into a fleet-wide trust leak. These three pages turn the score into that operator read.