
MCP Server Scoring Methodology: How Rhumb Evaluates APIs for Agents

Most "best MCP server" lists blur local helpers, read-mostly tools, and write-capable remote systems into one popularity stack. Rhumb's score exists to answer a narrower question: how reliably can an agent use the underlying API, and how much setup friction stands in the way?

Methodology snapshot

Scored surface: 1,038 services
Rhumb currently publishes scores across 1,038 services and 415 capability definitions. The catalog is wider than today's callable surface, which is why score pages need an honesty boundary too.

Score model: 20 dimensions
The public methodology weights Execution at 70 percent and Access Readiness at 30 percent. It cares about how an API behaves under automation pressure, not whether the marketing site sounds modern.

Honesty boundary: 16 callable providers today
A strong score is a baseline, not a promise that every service is executable through Rhumb today or safe for every trust class. Workflow fit, authority shape, and runtime evidence still matter.

What popularity signals miss

GitHub stars

Stars tell you a server is visible. They do not tell you whether retries are safe, auth drift is survivable, or shared authority stays bounded.

Flat top-server lists

A local coding helper, a read-mostly lookup surface, and a write-capable remote integration should not compete on one undifferentiated leaderboard, even when a curated registry makes the first browse feel cleaner.

Demo friendliness

A server can look magical in a supervised demo and still be the wrong choice for unattended or shared production use.

What the score is for

The score is not trying to settle every trust question at once. It is a structural baseline for whether an API is legible, durable, and recoverable enough for autonomous use. After that baseline, you still need workflow fit, trust class, and runtime evidence.

1. The score starts with execution, not marketing language

Rhumb's public methodology weights Execution at 70 percent and Access Readiness at 30 percent. That bias is deliberate. An autonomous agent does not care whether the landing page says "AI-ready." It cares whether the real API returns machine-readable errors, survives retries safely, communicates rate pressure, and keeps response shape stable enough for unattended use.

That is why raw popularity is such a weak proxy. A visible server can still sit on top of brittle auth, vague schemas, or retry-hostile write paths. Structural reliability matters more than ecosystem noise once a workflow is live.
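To make "machine-readable errors" and "retry safety" concrete, here is a minimal agent-side sketch: it retries only on statuses that signal transient pressure, honors a Retry-After hint, and reuses one idempotency key so a retried write cannot double-apply. The endpoint, header names, and status set are assumptions for illustration, not fields from Rhumb's rubric.

```python
import time
import uuid

import requests

MAX_ATTEMPTS = 4
RETRYABLE_STATUSES = {429, 502, 503, 504}  # rate pressure and transient faults

def call_with_safe_retries(url: str, payload: dict) -> dict:
    """POST with bounded retries; assumes the API accepts an Idempotency-Key header."""
    # One key for the whole logical request, so a retried write cannot double-apply.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, MAX_ATTEMPTS + 1):
        resp = requests.post(
            url,
            json=payload,
            headers={"Idempotency-Key": idempotency_key},
            timeout=10,
        )
        if resp.ok:
            return resp.json()
        if resp.status_code not in RETRYABLE_STATUSES or attempt == MAX_ATTEMPTS:
            # Non-retryable or out of budget: surface the machine-readable error shape.
            resp.raise_for_status()
        # Honor the server's rate-pressure signal when it communicates one.
        delay = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
```

An API that forces guesswork at any of these points, including whether a write is even idempotent, is exactly what the Execution weighting penalizes.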

2. Rhumb scores the underlying API surface, then pairs it with operator context

The public rubric is intentionally mechanical. It asks whether a system is scorable in a way an operator can inspect and dispute. It does not pretend that one number replaces workflow judgment.

That is why Rhumb now treats static score pages and runtime trust pages as complements, not substitutes. Structural quality is useful, but only alongside the live questions around authority, shared principals, and failure recovery.

Execution reliability
  • error ergonomics and machine-readable failure shape
  • schema stability, latency distribution, and idempotency
  • rate-limit transparency, output structure, and retry safety
Access readiness
  • signup autonomy, auth complexity, and provisioning speed
  • credential lifecycle management and sandbox availability
  • documentation quality that works for agents, not only humans
Runtime overlay
  • trust class, principal model, and scope containment
  • what actually happens under auth expiry, drift, or partial failure
  • whether the server is good for this workflow, not only scorable in the abstract
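To make the weighting concrete, here is a minimal sketch of how a two-group rubric like this could roll up into one number. Only the 70/30 split comes from the published methodology; the dimension names and values below are placeholders, not Rhumb's actual rubric fields.

```python
EXECUTION_WEIGHT = 0.70
ACCESS_WEIGHT = 0.30

def composite_score(execution: dict[str, float], access: dict[str, float]) -> float:
    """Average each dimension group (0-100 scale), then apply the 70/30 split."""
    execution_avg = sum(execution.values()) / len(execution)
    access_avg = sum(access.values()) / len(access)
    return EXECUTION_WEIGHT * execution_avg + ACCESS_WEIGHT * access_avg

# Placeholder dimensions for illustration only.
score = composite_score(
    execution={"error_ergonomics": 82, "schema_stability": 90, "retry_safety": 75},
    access={"signup_autonomy": 60, "credential_lifecycle": 70, "docs_for_agents": 85},
)
```

Note that the runtime overlay deliberately stays outside this roll-up: trust class and failure behavior are operator context, not another averaged dimension.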

3. A score helps narrow the field. It does not remove the trust decision.

This is the main honesty boundary. A high score does not automatically make a server right for a shared remote workflow. A middling score does not automatically make a narrow governed capability useless. The score tells you something structural and repeatable. It does not erase operator judgment.

Trust class

The score cannot decide whether a local read helper and a remote write-capable system should be trusted in the same way. That is a workflow and authority question first.

Principal model

A server can score well structurally and still be the wrong fit if the real caller, tenant, or credential boundary is unclear once more than one actor is involved.

Live drift

A static score cannot replace fresh runtime evidence, change communication, or proof that today's auth and schema behavior still match yesterday's assumptions.

Pricing boundary

Scoring proof is free until one route survives.

A score can narrow the candidate set, but that narrowing is still proof work. Rhumb should not treat a high score, a registry listing, or a clean demo as billable execution until one caller-safe lane is specific enough to estimate and receipt.

  • Score review, shortlist building, schema inspection, route-card reading, and denied-neighbor rehearsal are candidate proof. They should happen before a priced execution route exists.
  • A paid lane starts only after one scored candidate becomes a selected route with capability id, caller or tenant, credential mode, quota owner, side-effect class, estimate fields, and receipt evidence attached.
  • If the methodology review cannot name the acting principal, failure shape, or evidence trail, the honest result is review/no-candidate rather than a billable fallback call.
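As a sketch of what "specific enough to estimate and receipt" might mean in data terms, here is one possible route record with the fields named above, plus the promotion gate that returns review/no-candidate when a boundary field cannot be named. The field names and gate are illustrative assumptions, not Rhumb's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SelectedRoute:
    """A scored candidate promoted to a priced lane; every field must be known up front."""
    capability_id: str      # which capability the lane executes
    caller: str             # acting principal (caller or tenant)
    credential_mode: str    # e.g. per-caller token vs shared backend credential
    quota_owner: str        # who the spend and rate budget belong to
    side_effect_class: str  # read-only, reversible write, or irreversible write
    estimate_fields: dict   # what the price estimate was computed from
    receipt_evidence: str   # where the execution receipt or typed denial lands

def promote(candidate: dict) -> SelectedRoute | None:
    """Return a route only when every boundary field is nameable; else stay in review."""
    required = ("capability_id", "caller", "credential_mode", "quota_owner",
                "side_effect_class", "estimate_fields", "receipt_evidence")
    if any(candidate.get(field) in (None, "") for field in required):
        return None  # honest result: review/no-candidate, not a billable fallback
    return SelectedRoute(**{field: candidate[field] for field in required})
```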

Boundary guide: free proof vs paid execution

Fresh operator signal

Curated registry notes are still editorial confidence, not runtime proof

Fresh launch notes from a curated MCP registry sharpen what curation is good for: editorial filtering can improve the first browse and remove obvious dead ends. It still does not prove that the current caller can authenticate cleanly, see the right candidate set, or trust the listed server in a real lane.

  • A curated registry can improve discovery quality without proving current auth viability, caller-safe scope, or runtime freshness.
  • Keep structural score, editorial inclusion, and live runtime trust as three separate layers so one signal does not counterfeit the other two.
  • If registry inclusion is why a server made the shortlist, re-check the claims that justified the listing before promotion into a real lane.
  • The production question is still whether this caller can use the server safely now, not whether an editor or launch-week review found it promising.
Fresh security signal

Parameter scope is methodology, not implementation trivia

The current official-server audit pressure is mostly about unconstrained arguments: strings that can name any path, URL, repository, tenant, environment value, or write target after the model has already chosen a legitimate-looking tool. A score that ignores argument scope will overrate the safety of broad MCP surfaces.

  • Do not score a tool as production-ready just because the schema type is valid. Path, URL, repo, tenant, environment, and write-target fields are part of the permission boundary.
  • Promotion evidence should include an allowed case and a denied-neighbor case, with the denial tied to a named policy rule instead of a vague exception.
  • If the same backend credential can reach broader resources than the workflow needs, the score needs a runtime caveat until caller-specific scope is enforced.
  • A strong methodology separates schema validity from authority containment: one proves shape, the other proves blast radius.
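One way to make "argument scope is part of the permission boundary" concrete is an execution-time allowlist check whose denial names the rule that fired. A minimal sketch, with hypothetical field names and patterns; the policy table here is an assumption, not a Rhumb artifact.

```python
from fnmatch import fnmatch

# Caller-specific allowlists for high-risk arguments; patterns are illustrative.
ARG_POLICY = {
    "repo": {"rule": "repo-allowlist", "patterns": ["acme-corp/billing-*"]},
    "path": {"rule": "path-allowlist", "patterns": ["/data/exports/*"]},
}

class TypedDenial(Exception):
    """A denial that names the policy rule, instead of a vague exception."""
    def __init__(self, field: str, value: str, rule: str):
        super().__init__(f"denied: {field}={value!r} blocked by rule '{rule}'")
        self.rule = rule

def check_arguments(args: dict[str, str]) -> None:
    """Reject any high-risk argument outside its caller-specific allowlist."""
    for field, policy in ARG_POLICY.items():
        value = args.get(field)
        if value is None:
            continue
        if not any(fnmatch(value, pattern) for pattern in policy["patterns"]):
            raise TypedDenial(field, value, policy["rule"])

# Allowed case and denied-neighbor case, per the promotion evidence above.
check_arguments({"repo": "acme-corp/billing-api"})  # passes silently
try:
    check_arguments({"repo": "acme-corp/internal-secrets"})  # the denied neighbor
except TypedDenial as denial:
    print(denial)  # denied: repo='acme-corp/internal-secrets' blocked by rule 'repo-allowlist'
```

The point of the typed denial is auditability: promotion evidence can cite the exact rule that fired rather than a stack trace.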

4. Six questions before you trust a server in production

  1. What repeated job does this server actually improve, beyond looking impressive in a demo?
  2. What happens when the request fails, times out, or retries under pressure?
  3. Can the intended caller authenticate cleanly, with the right principal and scope?
  4. How narrow is the callable surface, and does it stay inside the real job boundary?
  5. Do high-risk arguments stay inside caller-specific allowlists at execution time, and can the denial prove which boundary fired?
  6. If the environment drifts tomorrow, how would the agent or operator notice before damage compounds?

If a candidate fails two or three of these questions, the problem is rarely one missing feature. It usually means the server is solving the wrong class of problem for the workflow you actually have.
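If the checklist should leave a trail instead of living in someone's head, the answers can be recorded as explicit booleans and gated. A minimal sketch, with question keys paraphrased from the list above; the keys and thresholds are assumptions for illustration.

```python
# Hypothetical keys paraphrasing the six questions; not a Rhumb artifact.
PRODUCTION_CHECKLIST = (
    "improves_a_repeated_job",
    "failure_timeout_retry_shape_known",
    "caller_authenticates_with_right_principal_and_scope",
    "callable_surface_stays_inside_job_boundary",
    "high_risk_args_allowlisted_with_provable_denials",
    "drift_is_noticed_before_damage_compounds",
)

def readiness_verdict(answers: dict[str, bool]) -> str:
    """Gate a candidate on the checklist; an unanswered question counts as failed."""
    failed = [q for q in PRODUCTION_CHECKLIST if not answers.get(q, False)]
    if len(failed) >= 2:
        # Multiple failures usually signal the wrong problem class, not a missing feature.
        return "wrong-class: " + ", ".join(failed)
    if failed:
        return "needs-work: " + failed[0]
    return "candidate"
```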

5. Use the score as a public, disputable baseline

Rhumb's methodology is public at /methodology, and the score process is designed to be inspectable, not mystical. If a provider thinks the evidence is wrong, the dispute path is public too.

The reason that matters is simple: a scoring system only becomes useful if operators can audit it, challenge it, and see where the current public truth stops. Rhumb also applies the same public lens to itself, including the remaining gaps on the current self-assessment.

Next honest step

If the score narrows the field, the next move is to inspect bounded execution, not jump straight to connector sprawl.

The useful sequence is: use the score to cut obvious bad fits, inspect trust class and failure shape, then decide whether the workflow fits Rhumb's current managed lane. That is more honest than treating a flat leaderboard as a production recommendation.

Route-hardening fit check

A high score can narrow the field, but one unsafe neighboring action still decides whether the route is repeatable.

If a shortlisted server looks structurally strong but the actual workflow still hinges on a dangerous tool, broad credential, tenant boundary, or replay risk, do not treat the score as execution authority. Turn it into one E-007 request: the route to harden, the unsafe neighbor to deny, the credential or budget lane, the repeat volume, and the receipt or typed-denial proof needed before an agent loops on it.
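A minimal sketch of what packaging that into one structured E-007 request could look like; the field names follow the sentence above and are assumptions, not a published E-007 schema.

```python
from dataclasses import dataclass

@dataclass
class RouteHardeningRequest:
    """One E-007 ask: harden a single route instead of granting broad execution authority."""
    route: str              # the route to harden, e.g. one capability on one provider
    denied_neighbor: str    # the unsafe adjacent action that must stay blocked
    lane: str               # the credential or budget lane the route runs inside
    repeat_volume: int      # expected loop volume, so quota and spend stay bounded
    required_proof: str     # receipt or typed-denial evidence needed before agents loop

# Hypothetical values, for shape only.
request = RouteHardeningRequest(
    route="github.issues.create on acme-corp/billing-api",
    denied_neighbor="github.repos.delete",
    lane="per-caller token, capped daily budget",
    repeat_volume=200,
    required_proof="execution receipt plus one typed denial of the neighbor",
)
```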

Fleet follow-through

Methodology only earns trust when it survives real loops, shared budgets, and credential drift.

A score is the start of the operator read, not the finish. The related pages below show what breaks after the selection step, once multiple agents share providers, budgets, and credentials in production.

Related reading

  • Keep the score inside a real operator decision