GitHub stars
Stars tell you a server is visible. They do not tell you whether retries are safe, auth drift is survivable, or shared authority stays bounded.
Flat top-server lists
A local coding helper, a read-mostly lookup surface, and a write-capable remote integration should not compete on one undifferentiated leaderboard, even when a curated registry has made the first browse feel cleaner.
Demo friendliness
A server can look magical in a supervised demo and still be the wrong choice for unattended or shared production use.
The score is not trying to settle every trust question at once. It is a structural baseline for whether an API is legible, durable, and recoverable enough for autonomous use. After that baseline, you still need workflow fit, trust class, and runtime evidence.
1. The score starts with execution, not marketing language
Rhumb's public methodology weights Execution at 70 percent and Access Readiness at 30 percent. That bias is deliberate. An autonomous agent does not care whether the landing page says "AI-ready." It cares whether the real API returns machine-readable errors, survives retries safely, communicates rate pressure, and keeps response shape stable enough for unattended use.
That is why raw popularity is such a weak proxy. A visible server can still sit on top of brittle auth, vague schemas, or retry-hostile write paths. Structural reliability matters more than ecosystem noise once a workflow is live.
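The retry-safety and rate-pressure behaviors described above can be sketched as a small decision helper. This is a hypothetical illustration assuming conventional HTTP semantics (429/503 back-pressure, idempotent methods, optional idempotency keys); it is not part of Rhumb's methodology.

```python
# Minimal sketch of the retry-safety question an agent-facing API should
# make answerable. All names here are illustrative assumptions.

IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}

def should_retry(status: int, method: str, has_idempotency_key: bool = False) -> bool:
    """Decide whether an unattended retry is safe, not merely possible."""
    if status in (429, 503):
        # Explicit back-pressure: the server is telling the caller to wait
        # and try again, so a retry is safe after honoring Retry-After.
        return True
    if status >= 500:
        # A failed write may have partially applied; without an idempotency
        # key, a blind retry risks duplicate side effects.
        return method in IDEMPOTENT_METHODS or has_idempotency_key
    # Other 4xx responses mean the request itself is wrong: fix, don't loop.
    return False
```

The point of the sketch is the asymmetry: `should_retry(500, "POST")` is unsafe, while the same failure with an idempotency key is recoverable. An API that makes this distinction legible scores better for unattended use.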
2. Rhumb scores the underlying API surface, then pairs it with operator context
The public rubric is intentionally mechanical. It asks whether a system is scorable in a way an operator can inspect and dispute. It does not pretend that one number replaces workflow judgment.
That is why Rhumb now treats static score pages and runtime trust pages as complements, not substitutes. Structural quality is useful, but only alongside the live questions around authority, shared principals, and failure recovery. Between them, the two layers cover:
- error ergonomics and machine-readable failure shape
- schema stability, latency distribution, and idempotency
- rate-limit transparency, output structure, and retry safety
- signup autonomy, auth complexity, and provisioning speed
- credential lifecycle management and sandbox availability
- documentation quality that works for agents, not only humans
- trust class, principal model, and scope containment
- what actually happens under auth expiry, drift, or partial failure
- whether the server is good for this workflow, not only scorable in the abstract
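Assuming the published 70/30 Execution vs Access Readiness split, a weighted aggregate over dimensions like the ones above might look as follows. The dimension names, the 0-5 scale, and the equal within-group weighting are illustrative assumptions, not Rhumb's actual rubric.

```python
# Hypothetical weighted structural score, assuming a 70/30 split between
# Execution and Access Readiness. Dimension names and scores are invented.

EXECUTION = {"error_ergonomics": 4, "schema_stability": 5, "idempotency": 3,
             "rate_limit_transparency": 4, "retry_safety": 3}
ACCESS = {"signup_autonomy": 5, "auth_complexity": 3, "credential_lifecycle": 4,
          "sandbox_availability": 2, "docs_for_agents": 4}

def avg(scores: dict) -> float:
    """Mean of dimension scores, each on a 0-5 scale."""
    return sum(scores.values()) / len(scores)

def structural_score(execution: dict, access: dict) -> float:
    """Weight Execution at 70 percent, Access Readiness at 30 percent."""
    return 0.7 * avg(execution) + 0.3 * avg(access)

print(round(structural_score(EXECUTION, ACCESS), 2))  # → 3.74
```

Note what the number cannot carry: the trust-class, principal-model, and workflow-fit items in the list above are deliberately outside the aggregate, because they depend on the caller and the lane, not the API surface.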
3. A score helps narrow the field. It does not remove the trust decision.
This is the main honesty boundary. A high score does not automatically make a server right for a shared remote workflow. A middling score does not automatically make a narrow governed capability useless. The score tells you something structural and repeatable. It does not erase operator judgment.
Trust class
The score cannot decide whether a local read helper and a remote write-capable system should be trusted in the same way. That is a workflow and authority question first.
Principal model
A server can score well structurally and still be the wrong fit if the real caller, tenant, or credential boundary is unclear once more than one actor is involved.
Live drift
A static score cannot replace fresh runtime evidence, change communication, or proof that today's auth and schema behavior still match yesterday's assumptions.
Scoring proof is free until one route survives.
A score can narrow the candidate set, but it is still proof work. Rhumb should not treat a high score, a registry listing, or a clean demo as billable execution until one caller-safe lane is specific enough to estimate and receipt.
- Score review, shortlist building, schema inspection, route-card reading, and denied-neighbor rehearsal are candidate proof. They should happen before a priced execution route exists.
- A paid lane starts only after one scored candidate becomes a selected route with capability id, caller or tenant, credential mode, quota owner, side-effect class, estimate fields, and receipt evidence attached.
- If the methodology review cannot name the acting principal, failure shape, or evidence trail, the honest result is review/no-candidate rather than a billable fallback call.
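The route-card fields listed above can be sketched as a record plus a completeness gate: a lane is priceable only when every field is concretely filled. All field names here are hypothetical illustrations of the text, not a published Rhumb schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical route card for one selected paid lane. A scored candidate
# only becomes billable once every field below is concretely filled.

@dataclass
class RouteCard:
    capability_id: str
    caller: str             # the caller or tenant the route acts as
    credential_mode: str    # e.g. "managed" vs bring-your-own
    quota_owner: str
    side_effect_class: str  # e.g. "read", "write", "irreversible"
    estimate: dict          # cost / latency estimate fields
    receipt_evidence: str   # where proof of execution lands

def is_billable(card: RouteCard) -> bool:
    """An empty field means the lane is still proof work, not execution."""
    return all(bool(value) for value in asdict(card).values())
```

If `caller` or `receipt_evidence` is blank, the honest state is still review/no-candidate: the gate refuses rather than guessing.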
Boundary guide: free proof vs paid execution
Curated registry notes are still editorial confidence, not runtime proof
Fresh launch notes from a curated MCP registry sharpen, rather than shrink, the useful claim: editorial filtering can improve the first browse and remove obvious dead ends. It still does not prove that the current caller can authenticate cleanly, see the right candidate set, or trust the listed server in a real lane.
- A curated registry can improve discovery quality without proving current auth viability, caller-safe scope, or runtime freshness.
- Keep structural score, editorial inclusion, and live runtime trust as three separate layers so one signal does not counterfeit the other two.
- If registry inclusion is why a server made the shortlist, re-check the claims that justified the listing before promotion into a real lane.
- The production question is still whether this caller can use the server safely now, not whether an editor or launch-week review found it promising.
Parameter scope is methodology, not implementation trivia
The current official-server audit pressure is mostly about unconstrained arguments: strings that can name any path, URL, repository, tenant, environment value, or write target after the model has already chosen a legitimate-looking tool. A score that ignores argument scope will overrate the safety of broad MCP surfaces.
- Do not score a tool as production-ready just because the schema type is valid. Path, URL, repo, tenant, environment, and write-target fields are part of the permission boundary.
- Promotion evidence should include an allowed case and a denied-neighbor case, with the denial tied to a named policy rule instead of a vague exception.
- If the same backend credential can reach broader resources than the workflow needs, the score needs a runtime caveat until caller-specific scope is enforced.
- A strong methodology separates schema validity from authority containment: one proves shape, the other proves blast radius.
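A minimal sketch of argument-level scope containment, assuming a per-caller allowlist: the allowed case passes silently, and the denied-neighbor case raises a typed denial naming the rule that fired. Caller names, repos, rule names, and paths are invented for illustration.

```python
# Per-caller bounds on high-risk arguments. A schema-valid string can still
# name the wrong repo or path, so the boundary is enforced per argument.
ALLOWLIST = {
    "agent-a": {"repos": {"org/service-api"}, "path_prefix": "/workspaces/agent-a/"},
}

class ScopeDenied(Exception):
    """Typed denial that names the policy rule, not a vague exception."""
    def __init__(self, rule: str, value: str):
        super().__init__(f"denied by {rule}: {value}")
        self.rule = rule

def check_write(caller: str, repo: str, path: str) -> None:
    bounds = ALLOWLIST.get(caller)
    if bounds is None:
        raise ScopeDenied("no-allowlist-for-caller", caller)
    if repo not in bounds["repos"]:
        raise ScopeDenied("repo-allowlist", repo)
    if not path.startswith(bounds["path_prefix"]):
        raise ScopeDenied("path-prefix", path)
```

With this sketch, `check_write("agent-a", "org/service-api", "/workspaces/agent-a/notes.md")` is the allowed case, while the neighboring repo or a path outside the prefix raises `ScopeDenied` carrying `repo-allowlist` or `path-prefix` as the named rule, which is exactly the promotion evidence pair described above.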
4. Six questions before you trust a server in production
- What repeated job does this server actually improve, beyond looking impressive in a demo?
- What happens when the request fails, times out, or retries under pressure?
- Can the intended caller authenticate cleanly, with the right principal and scope?
- How narrow is the callable surface, and does it stay inside the real job boundary?
- Do high-risk arguments stay inside caller-specific allowlists at execution time, and can the denial prove which boundary fired?
- If the environment drifts tomorrow, how would the agent or operator notice before damage compounds?
If a candidate fails two or three of these questions, the problem is rarely one missing feature. It usually means the server is solving the wrong class of problem for the workflow you actually have.
5. Use the score as a public, disputable baseline
Rhumb's methodology is public at /methodology, and the score process is designed to be inspectable, not mystical. If a provider thinks the evidence is wrong, the dispute path is public too.
The reason that matters is simple: a scoring system only becomes useful if operators can audit it, challenge it, and see where the current public truth stops. Rhumb also applies the same public lens to itself, including the remaining gaps on the current self-assessment.
See the full 20-dimension method, weights, tiers, and philosophy in one place.
The score is meant to be challengeable, not hidden behind brand authority.
Rhumb keeps the same public methodology pointed back at itself, including the disclosed gaps.
If the score narrows the field, the next move is to inspect bounded execution, not jump straight to connector sprawl.
The useful sequence is: use the score to cut obvious bad fits, inspect trust class and failure shape, then decide whether the workflow fits Rhumb's current managed lane. That is more honest than treating a flat leaderboard as a production recommendation.
Start with a narrow managed superpower first, then bring your own systems only when the workflow truly needs it.
See the current launchable scope, the fit boundary, and where Rhumb-managed capability surfaces are the honest default today.
A high score can narrow the field, but one unsafe neighboring action still decides whether the route is repeatable.
If a shortlisted server looks structurally strong but the actual workflow still hinges on a dangerous tool, broad credential, tenant boundary, or replay risk, do not treat the score as execution authority. Turn it into one E-007 request: the route to harden, the unsafe neighbor to deny, the credential or budget lane, the repeat volume, and the receipt or typed-denial proof needed before an agent loops on it.
A shortlist gets real only when it survives scope pressure, remote auth, tenant boundaries, and audit needs.
The score narrows the field. These pages answer the operator questions that still decide whether a candidate is safe enough for remote or shared use.
Check whether tool visibility, write reach, and parameter bounds stay narrow after selection.
A structurally good surface still fails if operators cannot trace, debug, and audit real tool calls.
Run the whole operator review before shared rollout instead of treating liveness as readiness.
Selection gets harder when the same server must preserve scope and evidence across several tenants.
Methodology only earns trust when it survives real loops, shared budgets, and credential drift.
A score is the start of the operator read, not the finish. These three pages show what breaks after the selection step, once multiple agents share providers, budgets, and credentials in production.
What actually breaks once model calls chain together unattended and the clean scorecard is no longer enough.
How provider budgets, retry storms, and shared throughput turn selection into a real fleet-design problem.
Why the durable trust question is whether scope, rotation, and revocation stay legible after the first working call.
Keep the score inside a real operator decision
Use workflow fit, trust class, authority, scope, output, failure shape, and evidence before promoting a scored server into production.
Where scoring, route-card inspection, credential review, and denied-neighbor rehearsal stop being proof and become one selected paid route.
Use workflow fit, trust class, capability shape, and runtime reality before any flat leaderboard instinct takes over.
Why curated catalogs and search quality only help after caller-safe visibility narrows the pool.
Why selection still depends on scope, acting principal, and whether evidence survives the runtime you are about to trust.
Why argument-level scope checks decide whether a legitimate tool call can still escape the intended lane.
Why auth that proves who connected is still not the same thing as cleanly narrowing tool discovery or backend authority.
Why structural scoring is useful, but incomplete without live trust overlays and current evidence.
Read the full public rubric, the 20 dimensions, and the open dispute path behind Rhumb's score system.