
MCP Server Scoring Methodology: How Rhumb Evaluates APIs for Agents

Most "best MCP server" lists blur local helpers, read-mostly tools, and write-capable remote systems into one popularity stack. Rhumb's score exists to answer a narrower question: how reliably can an agent use the underlying API, and how much setup friction stands in the way?

Methodology snapshot

Scored surface: 1,038 services
Rhumb currently publishes scores across 1,038 services and 415 capability definitions. The catalog is wider than today's callable surface, which is why score pages need an honesty boundary too.

Score model: 20 dimensions
The public methodology weights Execution at 70 percent and Access Readiness at 30 percent. It cares about how an API behaves under automation pressure, not whether the marketing site sounds modern.

Honesty boundary: 16 callable providers today
A strong score is a baseline, not a promise that every service is executable through Rhumb today or safe for every trust class. Workflow fit, authority shape, and runtime evidence still matter.

What popularity signals miss

GitHub stars

Stars tell you a server is visible. They do not tell you whether retries are safe, auth drift is survivable, or shared authority stays bounded.

Flat top-server lists

A local coding helper, a read-mostly lookup surface, and a write-capable remote integration should not compete on one undifferentiated leaderboard, even when a curated registry makes the first browse feel cleaner.

Demo friendliness

A server can look magical in a supervised demo and still be the wrong choice for unattended or shared production use.

What the score is for

The score is not trying to settle every trust question at once. It is a structural baseline for whether an API is legible, durable, and recoverable enough for autonomous use. After that baseline, you still need workflow fit, trust class, and runtime evidence.

1. The score starts with execution, not marketing language

Rhumb's public methodology weights Execution at 70 percent and Access Readiness at 30 percent. That bias is deliberate. An autonomous agent does not care whether the landing page says "AI-ready." It cares whether the real API returns machine-readable errors, survives retries safely, communicates rate pressure, and keeps response shape stable enough for unattended use.

That is why raw popularity is such a weak proxy. A visible server can still sit on top of brittle auth, vague schemas, or retry-hostile write paths. Structural reliability matters more than ecosystem noise once a workflow is live.
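To make "machine-readable errors" and "retry safety" concrete, here is a minimal agent-side sketch: it retries only on statuses that signal transient pressure, honors a Retry-After hint, and reuses one idempotency key so a retried write cannot double-apply. The endpoint, header names, and status set are assumptions for illustration, not fields from Rhumb's rubric.

```python
import time
import uuid

import requests

MAX_ATTEMPTS = 4
RETRYABLE_STATUSES = {429, 502, 503, 504}  # rate pressure and transient faults

def call_with_safe_retries(url: str, payload: dict) -> dict:
    """POST with bounded retries; assumes the API accepts an Idempotency-Key header."""
    # One key for the whole logical request, so a retried write cannot double-apply.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, MAX_ATTEMPTS + 1):
        resp = requests.post(
            url,
            json=payload,
            headers={"Idempotency-Key": idempotency_key},
            timeout=10,
        )
        if resp.ok:
            return resp.json()
        if resp.status_code not in RETRYABLE_STATUSES or attempt == MAX_ATTEMPTS:
            # Non-retryable or out of budget: surface the machine-readable error shape.
            resp.raise_for_status()
        # Honor the server's rate-pressure signal when it communicates one.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
```

An API that forces guesswork at any of these points, including whether a write is even idempotent, is exactly what the Execution weighting penalizes.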

2. Rhumb scores the underlying API surface, then pairs it with operator context

The public rubric is intentionally mechanical. It asks whether a system is scorable in a way an operator can inspect and dispute. It does not pretend that one number replaces workflow judgment.

That is why Rhumb now treats static score pages and runtime trust pages as complements, not substitutes. Structural quality is useful, but only alongside the live questions around authority, shared principals, and failure recovery.

Execution reliability
  • error ergonomics and machine-readable failure shape
  • schema stability, latency distribution, and idempotency
  • rate-limit transparency, output structure, and retry safety
Access readiness
  • signup autonomy, auth complexity, and provisioning speed
  • credential lifecycle management and sandbox availability
  • documentation quality that works for agents, not only humans
Runtime overlay
  • trust class, principal model, and scope containment
  • what actually happens under auth expiry, drift, or partial failure
  • whether the server is good for this workflow, not only scorable in the abstract
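To make the weighting concrete, here is a minimal sketch of how a two-group rubric like this could roll up into one number. Only the 70/30 split comes from the published methodology; the dimension names and values below are placeholders, not Rhumb's actual rubric fields.

```python
EXECUTION_WEIGHT = 0.70
ACCESS_WEIGHT = 0.30

def composite_score(execution: dict[str, float], access: dict[str, float]) -> float:
    """Average each dimension group (0-100 scale), then apply the 70/30 split."""
    execution_avg = sum(execution.values()) / len(execution)
    access_avg = sum(access.values()) / len(access)
    return EXECUTION_WEIGHT * execution_avg + ACCESS_WEIGHT * access_avg

# Placeholder dimensions for illustration only.
score = composite_score(
    execution={"error_ergonomics": 82, "schema_stability": 90, "retry_safety": 75},
    access={"signup_autonomy": 60, "credential_lifecycle": 70, "docs_for_agents": 85},
)
```

Note that the runtime overlay deliberately stays outside this roll-up: trust class and failure behavior are operator context, not another averaged dimension.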

3. A score helps narrow the field. It does not remove the trust decision.

This is the main honesty boundary. A high score does not automatically make a server right for a shared remote workflow. A middling score does not automatically make a narrow governed capability useless. The score tells you something structural and repeatable. It does not erase operator judgment.

Trust class

The score cannot decide whether a local read helper and a remote write-capable system should be trusted in the same way. That is a workflow and authority question first.

Principal model

A server can score well structurally and still be the wrong fit if the real caller, tenant, or credential boundary is unclear once more than one actor is involved.

Live drift

A static score cannot replace fresh runtime evidence, change communication, or proof that today's auth and schema behavior still match yesterday's assumptions.

Pricing boundary

Scoring proof is free until one route survives.

A score can narrow the candidate set, but that narrowing is still proof work. Rhumb should not treat a high score, a registry listing, or a clean demo as billable execution until one caller-safe lane is specific enough to estimate and receipt.

  • Score review, shortlist building, schema inspection, route-card reading, and denied-neighbor rehearsal are candidate proof. They should happen before a priced execution route exists.
  • A paid lane starts only after one scored candidate becomes a selected route with capability id, caller or tenant, credential mode, quota owner, side-effect class, estimate fields, and receipt evidence attached.
  • If the methodology review cannot name the acting principal, failure shape, or evidence trail, the honest result is review/no-candidate rather than a billable fallback call.
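As a sketch of what "specific enough to estimate and receipt" might mean in data terms, here is one possible route record with the fields named above, plus the promotion gate that returns review/no-candidate when a boundary field cannot be named. The field names and gate are illustrative assumptions, not Rhumb's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SelectedRoute:
    """A scored candidate promoted to a priced lane; every field must be known up front."""
    capability_id: str      # which capability the lane executes
    caller: str             # acting principal (caller or tenant)
    credential_mode: str    # e.g. per-caller token vs shared backend credential
    quota_owner: str        # who the spend and rate budget belong to
    side_effect_class: str  # read-only, reversible write, or irreversible write
    estimate_fields: dict   # what the price estimate was computed from
    receipt_evidence: str   # where the execution receipt or typed denial lands

def promote(candidate: dict) -> SelectedRoute | None:
    """Return a route only when every boundary field is nameable; else stay in review."""
    required = ("capability_id", "caller", "credential_mode", "quota_owner",
                "side_effect_class", "estimate_fields", "receipt_evidence")
    if any(candidate.get(field) in (None, "") for field in required):
        return None  # honest result: review/no-candidate, not a billable fallback
    return SelectedRoute(**{field: candidate[field] for field in required})
```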

Boundary guide: free proof vs paid execution

Fresh operator signal

Curated registry notes are still editorial confidence, not runtime proof

Fresh launch notes from a curated MCP registry sharpen what curation is good for: editorial filtering can improve the first browse and remove obvious dead ends. It still does not prove that the current caller can authenticate cleanly, see the right candidate set, or trust the listed server in a real lane.

  • A curated registry can improve discovery quality without proving current auth viability, caller-safe scope, or runtime freshness.
  • Keep structural score, editorial inclusion, and live runtime trust as three separate layers so one signal does not counterfeit the other two.
  • If registry inclusion is why a server made the shortlist, re-check the claims that justified the listing before promotion into a real lane.
  • The production question is still whether this caller can use the server safely now, not whether an editor or launch-week review found it promising.
Fresh security signal

Parameter scope is methodology, not implementation trivia

The current official-server audit pressure is mostly about unconstrained arguments: strings that can name any path, URL, repository, tenant, environment value, or write target after the model has already chosen a legitimate-looking tool. A score that ignores argument scope will overrate the safety of broad MCP surfaces.

  • Do not score a tool as production-ready just because the schema type is valid. Path, URL, repo, tenant, environment, and write-target fields are part of the permission boundary.
  • Promotion evidence should include an allowed case and a denied-neighbor case, with the denial tied to a named policy rule instead of a vague exception.
  • If the same backend credential can reach broader resources than the workflow needs, the score needs a runtime caveat until caller-specific scope is enforced.
  • A strong methodology separates schema validity from authority containment: one proves shape, the other proves blast radius.
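One way to make "argument scope is part of the permission boundary" concrete is an execution-time allowlist check whose denial names the rule that fired. A minimal sketch, with hypothetical field names and patterns; the policy table here is an assumption, not a Rhumb artifact.

```python
from fnmatch import fnmatch

# Caller-specific allowlists for high-risk arguments; patterns are illustrative.
ARG_POLICY = {
    "repo": {"rule": "repo-allowlist", "patterns": ["acme-corp/billing-*"]},
    "path": {"rule": "path-allowlist", "patterns": ["/data/exports/*"]},
}

class TypedDenial(Exception):
    """A denial that names the policy rule, instead of a vague exception."""
    def __init__(self, field: str, value: str, rule: str):
        super().__init__(f"denied: {field}={value!r} blocked by rule '{rule}'")
        self.rule = rule

def check_arguments(args: dict[str, str]) -> None:
    """Reject any high-risk argument outside its caller-specific allowlist."""
    for field, policy in ARG_POLICY.items():
        value = args.get(field)
        if value is None:
            continue
        if not any(fnmatch(value, pattern) for pattern in policy["patterns"]):
            raise TypedDenial(field, value, policy["rule"])

# Allowed case and denied-neighbor case, per the promotion evidence above.
check_arguments({"repo": "acme-corp/billing-api"})  # passes silently
try:
    check_arguments({"repo": "acme-corp/internal-secrets"})  # the denied neighbor
except TypedDenial as denial:
    print(denial)  # denied: repo='acme-corp/internal-secrets' blocked by rule 'repo-allowlist'
```

The point of the typed denial is auditability: promotion evidence can cite the exact rule that fired rather than a stack trace.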

4. Six questions before you trust a server in production

  1. What repeated job does this server actually improve, beyond looking impressive in a demo?
  2. What happens when the request fails, times out, or retries under pressure?
  3. Can the intended caller authenticate cleanly, with the right principal and scope?
  4. How narrow is the callable surface, and does it stay inside the real job boundary?
  5. Do high-risk arguments stay inside caller-specific allowlists at execution time, and can the denial prove which boundary fired?
  6. If the environment drifts tomorrow, how would the agent or operator notice before damage compounds?

If a candidate fails two or three of these questions, the problem is rarely one missing feature. It usually means the server is solving the wrong class of problem for the workflow you actually have.
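If the checklist should leave a trail instead of living in someone's head, the answers can be recorded as explicit booleans and gated. A minimal sketch, with question keys paraphrased from the list above; the keys and thresholds are assumptions for illustration.

```python
# Hypothetical keys paraphrasing the six questions; not a Rhumb artifact.
PRODUCTION_CHECKLIST = (
    "improves_a_repeated_job",
    "failure_timeout_retry_shape_known",
    "caller_authenticates_with_right_principal_and_scope",
    "callable_surface_stays_inside_job_boundary",
    "high_risk_args_allowlisted_with_provable_denials",
    "drift_is_noticed_before_damage_compounds",
)

def readiness_verdict(answers: dict[str, bool]) -> str:
    """Gate a candidate on the checklist; an unanswered question counts as failed."""
    failed = [q for q in PRODUCTION_CHECKLIST if not answers.get(q, False)]
    if len(failed) >= 2:
        # Multiple failures usually signal the wrong problem class, not a missing feature.
        return "wrong-class: " + ", ".join(failed)
    if failed:
        return "needs-work: " + failed[0]
    return "candidate"
```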

5. Use the score as a public, disputable baseline

Rhumb's methodology is public at /methodology, and the score process is designed to be inspectable, not mystical. If a provider thinks the evidence is wrong, the dispute path is public too.

The reason that matters is simple: a scoring system only becomes useful if operators can audit it, challenge it, and see where the current public truth stops. Rhumb also applies the same public lens to itself, including the remaining gaps on the current self-assessment.

Next honest step

If the score narrows the field, the next move is to inspect bounded execution, not jump straight to connector sprawl.

The useful sequence is: use the score to cut obvious bad fits, inspect trust class and failure shape, then decide whether the workflow fits Rhumb's current managed lane. That is more honest than treating a flat leaderboard as a production recommendation.

Route-hardening fit check

A high score can narrow the field, but one unsafe neighboring action still decides whether the route is repeatable.

If a shortlisted server looks structurally strong but the actual workflow still hinges on a dangerous tool, broad credential, tenant boundary, or replay risk, do not treat the score as execution authority. Turn it into one E-007 request: the route to harden, the unsafe neighbor to deny, the credential or budget lane, the repeat volume, and the receipt or typed-denial proof needed before an agent loops on it.
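A minimal sketch of what packaging that into one structured E-007 request could look like; the field names follow the sentence above and are assumptions, not a published E-007 schema.

```python
from dataclasses import dataclass

@dataclass
class RouteHardeningRequest:
    """One E-007 ask: harden a single route instead of granting broad execution authority."""
    route: str              # the route to harden, e.g. one capability on one provider
    denied_neighbor: str    # the unsafe adjacent action that must stay blocked
    lane: str               # the credential or budget lane the route runs inside
    repeat_volume: int      # expected loop volume, so quota and spend stay bounded
    required_proof: str     # receipt or typed-denial evidence needed before agents loop

# Hypothetical values, for shape only.
request = RouteHardeningRequest(
    route="github.issues.create on acme-corp/billing-api",
    denied_neighbor="github.repos.delete",
    lane="per-caller token, capped daily budget",
    repeat_volume=200,
    required_proof="execution receipt plus one typed denial of the neighbor",
)
```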

Fleet follow-through

Methodology only earns trust when it survives real loops, shared budgets, and credential drift.

A score is the start of the operator read, not the finish. The related pages below show what breaks after the selection step, once multiple agents share providers, budgets, and credentials in production.

Related reading

  • Keep the score inside a real operator decision