Treat directory inventory as a way to find candidates, not as a claim that any candidate is safe to run. Rhumb separates discovery breadth (999 scored services and 435 capability definitions) from the narrower current callable surface: 18 callable providers, strongest for research, extraction, generation, and narrow enrichment.
The mistake: ranking before filtering
A giant MCP directory solves a real problem: agents and developers need to discover what exists. But production selection has a different failure mode. If the model sees every listed server before trust filters run, semantic relevance can outrank authority, freshness, cost, and blast radius.
The operator sequence should be recall first, proof second. Use marketplaces to discover candidates, then collapse the pool to the smallest workflow-safe set before the agent plans with it.
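A minimal sketch of that ordering, in Python. The `Candidate` fields, the allowed trust classes, and the server names are all hypothetical, chosen only to illustrate the idea; they are not Rhumb's schema or any directory's real metadata. The point is that trust filters run first and semantic relevance only ranks what survives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    relevance: float      # semantic match to the task, 0..1 (hypothetical score)
    trust_class: str      # e.g. "verified", "community", "unknown"
    auth_scoped: bool     # credentials are scoped to the caller
    side_effects: str     # "read", "write", or "destructive"
    fresh: bool           # metadata was re-checked recently


# Assumed policy for one read-only workflow; a real policy would be per workflow.
ALLOWED_TRUST = {"verified"}
ALLOWED_SIDE_EFFECTS = {"read"}


def passes_trust_filters(c: Candidate) -> bool:
    """Return True only if every trust check holds; relevance is ignored here."""
    return (
        c.trust_class in ALLOWED_TRUST
        and c.auth_scoped
        and c.side_effects in ALLOWED_SIDE_EFFECTS
        and c.fresh
    )


def shortlist(candidates, limit=3):
    """Filter first, then rank the survivors by relevance."""
    safe = [c for c in candidates if passes_trust_filters(c)]
    return sorted(safe, key=lambda c: c.relevance, reverse=True)[:limit]


candidates = [
    Candidate("shiny-scraper", 0.97, "unknown", False, "write", False),
    Candidate("docs-search", 0.81, "verified", True, "read", True),
    Candidate("pdf-extract", 0.74, "verified", True, "read", True),
]

print([c.name for c in shortlist(candidates)])
# ['docs-search', 'pdf-extract']
```

The most relevant hit, `shiny-scraper`, never reaches the ranking step, which is the behavior the sequence is meant to guarantee.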
What each directory signal can and cannot prove
Useful for: Broad recall. You find projects, servers, skills, and adapters that would be invisible in a hand-built shortlist.
Cannot prove: The model may treat inventory volume as quality and rank a mixed-authority surface before trust filters run.

Useful for: Editorial inclusion, metadata, install hints, and maintenance signals reduce obvious junk.
Cannot prove: A curated listing still does not prove caller-specific auth, runtime scope, typed denials, or freshness at invocation time.

Useful for: One-click setup and client config make it easier to test the server quickly.
Cannot prove: Convenient install can widen authority if the wrapper writes credentials or tool config without preserving the actor and rollback path.

Useful for: A score can make the shortlist more legible and expose static weaknesses faster than manual inspection.
Cannot prove: Static scoring must stay separate from live execution proof. Scores are the map; failure modes are the production test.
The proof filters before promotion
A server should not move from directory hit to agent candidate until these filters are explicit. They are the difference between a useful marketplace and a tool graveyard with better search.
A safer selection flow
1. Search marketplaces, registries, GitHub, and docs to gather candidates. Keep this phase broad and cheap, but do not promote anything yet.
2. Remove servers that do not fit the exact repeated job. A generic catalog hit is not useful if the agent still has to improvise the action shape.
3. Check trust class, auth shape, caller-visible tool scope, side-effect class, and quota owner before semantic relevance ranks the final set.
4. Pick the nearest unsafe adjacent target and prove it fails closed with a typed denial before you let the agent repeat the happy path.
5. Preserve capability, server/provider, principal, credential mode, cost, denial, outcome, and recovery context so retries do not become folklore.
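The last two steps of the flow can be sketched together: call the unsafe neighbor, expect a typed denial, and keep a receipt either way. Everything named here is an assumption for illustration: `ScopeDenied`, `Receipt`, the `ALLOWED_ACTIONS` table, and the `invoke` stub stand in for whatever your server or gateway actually exposes.

```python
from dataclasses import dataclass, asdict
from typing import Optional


class ScopeDenied(Exception):
    """Typed denial: the caller asked for an action outside its scope."""

    def __init__(self, principal: str, action: str):
        super().__init__(f"{principal} may not call {action}")
        self.principal = principal
        self.action = action


@dataclass
class Receipt:
    # Field names mirror the context the flow says to preserve.
    capability: str
    provider: str
    principal: str
    credential_mode: str
    cost_usd: float
    denial: Optional[str]
    outcome: str
    recovery: str


# Hypothetical scope table: which actions each principal may call.
ALLOWED_ACTIONS = {"agent-7": {"search.read"}}


def invoke(principal: str, action: str) -> str:
    """Stub for a real tool call; raises a typed denial when out of scope."""
    if action not in ALLOWED_ACTIONS.get(principal, set()):
        raise ScopeDenied(principal, action)
    return "ok"


def prove_denied_neighbor(principal: str, neighbor: str, provider: str) -> Receipt:
    """Call the unsafe adjacent action and record whether it failed closed."""
    try:
        invoke(principal, neighbor)
    except ScopeDenied as denial:
        return Receipt(neighbor, provider, principal, "scoped-token", 0.0,
                       type(denial).__name__, "denied",
                       "none needed; failed closed")
    # Reaching this line means the neighbor was callable, which blocks promotion.
    return Receipt(neighbor, provider, principal, "scoped-token", 0.0,
                   None, "allowed", "block promotion; narrow the scope")


receipt = prove_denied_neighbor("agent-7", "search.delete", "docs-search")
print(asdict(receipt))
```

Catching only `ScopeDenied`, rather than any exception, is deliberate in this sketch: a timeout or a generic error is not evidence that the boundary holds.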
Verified vertical directories need a second proof gate
Fresh MCP submissions are starting to package vertical discovery (lawyers, vendors, marketplaces, data providers) as verified agent surfaces. That is useful, but verified discovery is still not execution authority. Regulated or high-trust verticals need proof that the listing, license, jurisdiction, freshness, and allowed action all match the exact workflow before the agent treats a directory hit as a route.
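One way to picture that second gate is a check that returns every reason a listing fails for a specific workflow, rather than a single pass/fail flag. This is a sketch under stated assumptions: the `Listing` fields, the seven-day freshness window, and the example directory name, dates, and jurisdictions are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Listing:
    name: str
    verified: bool               # the directory's own "verified" badge
    license_active: bool
    jurisdictions: frozenset     # where the listed party may operate
    allowed_actions: frozenset   # actions the listing actually authorizes
    checked_at: datetime         # when the facts above were last confirmed


MAX_AGE = timedelta(days=7)  # assumed freshness window


def gate_failures(listing: Listing, jurisdiction: str, action: str,
                  now: datetime) -> list:
    """Return the reasons this listing is not a route; empty means it passes."""
    failures = []
    if not listing.verified:
        failures.append("unverified listing")
    if not listing.license_active:
        failures.append("license")
    if jurisdiction not in listing.jurisdictions:
        failures.append("jurisdiction")
    if now - listing.checked_at > MAX_AGE:
        failures.append("freshness")
    if action not in listing.allowed_actions:
        failures.append("action")
    return failures


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
listing = Listing(
    name="example-law-directory",
    verified=True,
    license_active=True,
    jurisdictions=frozenset({"US-CA"}),
    allowed_actions=frozenset({"lookup"}),
    checked_at=datetime(2025, 1, 8, tzinfo=timezone.utc),
)

print(gate_failures(listing, "US-CA", "lookup", now))
# []
print(gate_failures(listing, "US-NY", "book_consult", now))
# ['jurisdiction', 'action']
```

The second call shows the gap the paragraph describes: the listing is verified and fresh, yet it is still not a valid route for that workflow.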
Where Rhumb fits
Rhumb should not try to be the loudest marketplace. The stronger wedge is workflow-level proof: resolve the capability, estimate the route, choose the credential rail, cap the budget, test the denied neighbor, and preserve the receipt.
Signals that are not enough
Have one server or workflow you want to promote? Prove the boundary first.
Send the repeat job, candidate server/provider, credential rail, expected volume, denied neighbor, and receipt fields you would need before letting an agent loop. The useful artifact is not another list; it is a proof path for one workflow.