We Scored Ourselves
Rhumb applies its own 20-dimension AN Score methodology to itself. The result: 7.0/10 — Tier L3 (Fluent). Not agent-native yet, by our own standard.
Why Score Ourselves?
We built a scoring system for APIs. If we won't use it on ourselves, why should anyone trust it on others?
This isn't a marketing exercise. We applied the same 20 dimensions, the same weighting, the same severity framework we use for every service in our directory. The result is honest — and honestly humbling. We're L3 (Fluent), not L4 (Native). There are real gaps.
Every score below includes two lines: what we can legitimately claim, and what we can't. If you're evaluating Rhumb, this is the document to read.
Execution Dimensions (70% weight)
API Reliability
6.0 ✅ Reasonable uptime on Railway, structured error handling
⚠️ No published SLA, no status page, limited production traffic history
Error Ergonomics
8.0 ✅ Structured JSON errors, correct HTTP codes, x402 payment instructions, Retry-After headers
⚠️ No machine-readable error code enum beyond HTTP status
Schema Stability
7.0 ✅ Versioned at /v1, consistent response envelope, zero breaking changes
⚠️ API is young: MTBBC (mean time between breaking changes) is undefined because there haven't been enough months to measure
Latency Distribution
7.5 ✅ Proxy P50 overhead 4.1ms, direct calls <200ms
⚠️ No published P99 figures, no multi-region deployment
Idempotency
8.0 ✅ Idempotency keys on execution, x402 replay prevention, GET naturally idempotent
⚠️ Full coverage on the execution path
Concurrent Behavior
6.0 ✅ Asyncio handles concurrent requests, per-agent rate limits
⚠️ No explicit documentation of concurrent connection handling or queue behavior
Cold-Start Latency
7.0 ✅ Persistent container (no serverless cold starts), health check keeps warm
⚠️ No published cold-start vs warm figures
Output Structure Quality
9.0 ✅ All structured JSON, consistent envelope, rich score responses with failure modes
⚠️ This is genuinely strong; structured data is in our DNA
State Leakage
8.0 ✅ Stateless by design, no implicit caching, no cross-agent data leakage
⚠️ Rate limit counters are the only per-request state
Graceful Degradation
6.0 ✅ Proxy handles upstream failures, capability execution reports fallbacks
⚠️ No CDN/cache layer for reads, no public health endpoint, single point of failure
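The Idempotency row above credits idempotency keys on the execution path. As a minimal sketch of how a client might derive such a key, the snippet below hashes a canonical form of the request so that a retried call reuses the same key. The function name, the `agent-1` identifier, and the `Idempotency-Key` header name are assumptions for illustration; they are not taken from Rhumb's API.

```python
import hashlib
import json


def idempotency_key(agent_id: str, capability: str, payload: dict) -> str:
    """Derive a deterministic key from the logical request (hypothetical helper)."""
    # Canonical JSON: sorted keys and fixed separators, so dict ordering
    # and whitespace cannot change the resulting hash.
    canonical = json.dumps(
        {"agent": agent_id, "capability": capability, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same logical request yields the same key, regardless of key order.
key_a = idempotency_key("agent-1", "email.send", {"to": "a@example.com", "subject": "hi"})
key_b = idempotency_key("agent-1", "email.send", {"subject": "hi", "to": "a@example.com"})

# Assumed header name; check the service's own documentation before relying on it.
headers = {"Idempotency-Key": key_a}
```

A purely content-derived key also deduplicates intentional repeats. If two identical sends are both wanted, the caller would need to mix a per-operation identifier into the hashed material.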
Access Readiness Dimensions (30% weight)
Signup Autonomy
7.0 ✅ OAuth signup (GitHub + Google), x402 needs zero signup
⚠️ OAuth requires browser interaction, which is not ideal for headless agents
Payment Autonomy
9.0 ✅ x402 USDC zero-signup pay-per-call, Stripe prepaid, free tier for discovery
⚠️ No fiat wire/invoice for enterprise
Provisioning Speed
8.0 ✅ x402 instant, OAuth <30s to API key, MCP needs no key for discovery
⚠️ No programmatic key issuance API
Credential Management
5.0 ✅ Key rotation via dashboard
⚠️ Single key per user, no scoped tokens, no key management API; this is thin
Rate Limit Transparency
7.0 ✅ 429 returns Retry-After, per-agent limits enforced
⚠️ No published rate limit docs page, headers only on 429 not all responses
Documentation Quality
5.0 ✅ Methodology, quickstart, glossary, llms.txt, agent-capabilities.json
⚠️ No complete API reference, no Python SDK, OpenAPI disabled for security
Sandbox/Test Mode
4.0 ✅ Free tier as implicit sandbox
⚠️ No dedicated test environment: agents can't safely experiment without production consequences
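Because rate limit information arrives only on 429 responses, a client has to react to `Retry-After` when it sees one. The sketch below shows one way to turn that header into a wait time. It relies on the HTTP rule that `Retry-After` may be either a number of seconds or an HTTP date. The function name, the one-second default, and the 60-second cap are illustrative assumptions, not Rhumb behavior.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(header_value, now=None, default=1.0, cap=60.0):
    """Convert a Retry-After header value into a bounded delay in seconds."""
    if header_value is None:
        return default  # header missing: fall back to a small fixed wait
    value = header_value.strip()
    if value.isdigit():
        delay = float(value)  # delta-seconds form, e.g. "30"
    else:
        try:
            target = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return default  # unparseable value: do not guess
        if target.tzinfo is None:
            target = target.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (target - now).total_seconds()
    # Never wait a negative time, and never longer than the cap.
    return max(0.0, min(delay, cap))


fixed_now = datetime(2015, 10, 21, 7, 27, 30, tzinfo=timezone.utc)
print(retry_delay_seconds("30"))
print(retry_delay_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
```

An agent loop would sleep for the returned value before retrying, ideally with a retry budget so an unattended loop cannot spin forever.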
Autonomy Dimensions (bonus)
Payment Integration
9.0 ✅ x402 USDC native, Stripe programmatic, budget controls, ledger API
⚠️ No programmatic refund API
Governance & Compliance
6.0 ✅ Execution logging, budget enforcement
⚠️ No compliance certs (SOC 2), no audit export, no GDPR deletion endpoint, ToS pending legal review
Web Agent Accessibility
8.0 ✅ Astro static HTML, JSON-LD, agent meta tags, keyboard navigable
⚠️ Missing ARIA labels on some dashboard components
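As a sanity check on the headline number, the scores above can be combined with the stated 70/30 group weights. The sketch below assumes every dimension counts equally within its group and that the bonus Autonomy dimensions are left out of the composite. Neither assumption is confirmed by this post, so treat it as an illustration of the arithmetic, not the official formula.

```python
# Scores copied from the scorecard above.
EXECUTION = {
    "API Reliability": 6.0,
    "Error Ergonomics": 8.0,
    "Schema Stability": 7.0,
    "Latency Distribution": 7.5,
    "Idempotency": 8.0,
    "Concurrent Behavior": 6.0,
    "Cold-Start Latency": 7.0,
    "Output Structure Quality": 9.0,
    "State Leakage": 8.0,
    "Graceful Degradation": 6.0,
}
ACCESS = {
    "Signup Autonomy": 7.0,
    "Payment Autonomy": 9.0,
    "Provisioning Speed": 8.0,
    "Credential Management": 5.0,
    "Rate Limit Transparency": 7.0,
    "Documentation Quality": 5.0,
    "Sandbox/Test Mode": 4.0,
}


def group_mean(scores: dict) -> float:
    """Unweighted mean of one group (assumption: equal weight per dimension)."""
    return sum(scores.values()) / len(scores)


def composite(execution: dict, access: dict, w_exec: float = 0.7, w_access: float = 0.3) -> float:
    """Weighted combination of the two group means, using the stated 70/30 split."""
    return w_exec * group_mean(execution) + w_access * group_mean(access)


print(f"execution mean: {group_mean(EXECUTION):.2f}")      # 7.25
print(f"access mean:    {group_mean(ACCESS):.2f}")         # 6.43
print(f"composite:      {composite(EXECUTION, ACCESS):.1f}")  # 7.0
```

Under these assumptions the composite rounds to the published 7.0, with the Access Readiness group pulling the total below the Execution mean.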
Our Failure Modes
No Sandbox Environment
Agents testing integrations affect production data. A misfire on email.send sends a real email.
Workaround: Use discovery endpoints (free, read-only) for evaluation. Use x402 with small amounts for execution testing.
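Until a sandbox exists, one client-side mitigation is to wrap execution in a guard that lets read-only calls through and refuses anything else unless it is explicitly allowed. The sketch below is a hypothetical pattern, not part of any Rhumb SDK: the `READ_ONLY` capability names, `guarded_call`, and the `execute` callback are all invented for illustration, and only `email.send` comes from the example above.

```python
from typing import Any, Callable

# Hypothetical capability names an agent treats as safe to call freely.
READ_ONLY = frozenset({"services.search", "services.score"})


class SideEffectBlocked(RuntimeError):
    """Raised when a capability with possible side effects is not allowed."""


def guarded_call(
    capability: str,
    payload: dict,
    execute: Callable[[str, dict], Any],
    allow_side_effects: frozenset = frozenset(),
) -> Any:
    """Run `execute` only for read-only or explicitly allowed capabilities."""
    if capability not in READ_ONLY and capability not in allow_side_effects:
        raise SideEffectBlocked(f"{capability} is not allowed during evaluation")
    return execute(capability, payload)


def fake_execute(capability: str, payload: dict) -> str:
    # Stand-in for a real client call, so the sketch runs without a network.
    return f"executed {capability}"


print(guarded_call("services.score", {}, fake_execute))
try:
    guarded_call("email.send", {"to": "a@example.com"}, fake_execute)
except SideEffectBlocked as exc:
    print(f"blocked: {exc}")
```

Once a workflow is ready for a real run, the caller opts in per capability, for example `allow_side_effects=frozenset({"email.send"})`, which keeps the decision to touch production explicit.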
Single API Key Per User
Cannot scope access per agent or per capability. All-or-nothing access model.
Workaround: Use x402 path (no key needed, per-call payment acts as implicit scoping).
Documentation Gaps
Agent must discover API behavior through trial and error. No OpenAPI spec available publicly.
Workaround: Use MCP tool descriptions and llms.txt for agent-parseable documentation.
No Multi-Region
High latency for non-US users. Single Railway container in us-west.
Workaround: None currently. Multi-region deployment is planned for after launch.
Score Evidence Mostly Documentation-Derived
Scores may not reflect actual runtime behavior. This is disclosed on the methodology page.
Runtime-backed reviews are labeled separately. The ratio of runtime-backed reviews is tracked and will be published once it meets our quality floor.
What This Tells Us
We're L3, not L4. We preach agent-native but our own access patterns have friction. The x402 zero-signup path is genuinely L4 — an agent can discover, evaluate, and pay for a capability with no human involvement. But the OAuth path, the dashboard, the credential management? L3 at best.
Sandbox is the biggest miss. Every API we score highly has a test mode. We don't. An agent integrating Rhumb has to make real calls to validate its workflow. For a platform that evaluates API quality for agents, this is ironic.
Our best feature is Payment Autonomy (9.0). x402 on USDC is genuinely novel. Zero signup, zero credential management, pay-per-call with cryptographic proof. This is the future of agent-to-service interaction, and we're one of the first to ship it.
Our worst dimension is Sandbox (4.0). We know. It's on the roadmap. But publishing this score with a 4.0 in it — instead of waiting until we fix it — is the whole point. If we only score ourselves when we look good, the methodology is worthless.
If the disclosed gaps still look workable, choose the bounded path before you widen the trust boundary.
The honest read is the same as the scorecard: Rhumb is strongest when the first run stays narrow, attributable, and legible. Start with capability-first onboarding, jump straight into the managed lane if the fit is already clear, or inspect pricing once you know the path is worth it.
The March 11 pre-launch score is still public, so you can see what improved and what stayed unresolved.
Credential shape still matters more than the headline score once an agent can touch real systems.
Test the live bounded-execution surface against the same production questions you would ask of any other tool.
The self-assessment only matters if you inspect the live operator failure surfaces too
Publishing the score is the honesty layer, not the whole operating picture. The current production question is what breaks in unattended loops, how shared provider budgets stay bounded when multiple agents reuse the same lane, and whether credential scope survives rotation and growth without turning one success into a fleet-wide trust leak. These three pages turn the score into that operator read.
What actually fails after the scorecard, when retries, tool chains, and hidden provider assumptions start compounding.
How a seemingly clean workflow still needs shared-budget discipline once the fleet starts hitting the same rails.
Why credential lifecycle design, not just secret storage, decides whether the bounded lane stays trustworthy over time.