Wrong-tool risk
A local helper and a high-side-effect remote system look equally available because semantic relevance was allowed to outrank authority.
Context tax
The model sees too many mixed-authority candidates and spends tokens exploring options that should have been filtered out upstream.
Auth-blind ranking
The top result may be impossible for the current caller to use safely because auth shape and principal mismatch were hidden until too late.
Freshness theater
A giant directory looks rich even when stale, dead, or auth-broken entries remain in the candidate pool.
The runtime discovery question is not “how many tools can the agent find?” It is “how many safe, relevant, caller-appropriate tools can the agent see before it starts choosing?”
1. Giant indexes feel like progress because they improve recall
The current MCP ecosystem really does have a discovery problem. There are too many demos, too many stale entries, and too many directories that make every surface look equally real. A giant index improves one important thing: recall.
If the right tool exists somewhere, broader coverage increases the odds that the agent can discover it. That matters. It is just not the whole problem.
Fresh curated-registry launch notes sharpen the same point. Curated is better than random, because editorial filtering can remove obvious dead ends before the model ever looks. It is still not runtime truth. A registry entry is inventory until the runtime re-applies caller visibility, trust class, auth viability, and freshness for the current lane.
The practical audit is whether curation hands the runtime a cleaner starting set, or whether it turns one editorial label into a false proof of current safety.
The harder problem is selection, and selection gets more dangerous as the candidate pool mixes more trust classes and side-effect profiles together.
A curated registry earns trust only when it preserves the separation between editorial review and runtime permission. Before an agent ranks a listed server, the discovery layer should be able to answer four questions without asking the model to infer them from prose.
Who decided this entry belongs in the registry, what evidence did they inspect, and when was that evidence last refreshed?
Does the advertised setup path still complete, and does the live handshake expose the same tools the listing describes?
Can this principal authenticate with the intended scope now, or is the listing only proving that some maintainer once connected it?
After policy, tenant, and trust-class filters run, is the server still visible to this caller — and are out-of-scope choices denied with typed evidence?
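The four questions above can be sketched as a per-entry check that runs before ranking. This is a minimal illustration, not a real registry schema: the `RegistryEntry` fields, the `open_questions` helper, and the 30-day evidence window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical registry record; every field name is an assumption."""
    name: str
    reviewed_by: Optional[str]                # who decided the entry belongs
    evidence_checked_at: Optional[datetime]   # when that evidence was last refreshed
    listed_tools: FrozenSet[str]              # tools the listing describes
    live_tools: Optional[FrozenSet[str]]      # tools from the latest handshake, None if it failed
    auth_ok_for: FrozenSet[str]               # principals that completed auth with the intended scope
    visible_to: FrozenSet[str]                # principals left after policy, tenant, and trust filters


def open_questions(entry: RegistryEntry, principal: str, now: datetime,
                   max_age: timedelta = timedelta(days=30)) -> List[str]:
    """Return the questions this entry cannot answer for this principal."""
    gaps = []
    # 1. Provenance: a named reviewer and evidence that is recent enough.
    if (entry.reviewed_by is None or entry.evidence_checked_at is None
            or now - entry.evidence_checked_at > max_age):
        gaps.append("provenance")
    # 2. Handshake: the live tool set must match what the listing advertises.
    if entry.live_tools is None or entry.live_tools != entry.listed_tools:
        gaps.append("handshake")
    # 3. Auth: this principal, not some maintainer, must be able to authenticate.
    if principal not in entry.auth_ok_for:
        gaps.append("auth")
    # 4. Visibility: the entry must survive the caller-specific filters.
    if principal not in entry.visible_to:
        gaps.append("visibility")
    return gaps


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
entry = RegistryEntry(
    name="crm-sync",
    reviewed_by="registry-team",
    evidence_checked_at=now - timedelta(days=90),   # stale review
    listed_tools=frozenset({"read", "write"}),
    live_tools=frozenset({"read"}),                 # handshake drifted from the listing
    auth_ok_for=frozenset({"alice"}),
    visible_to=frozenset({"alice"}),
)
print(open_questions(entry, "alice", now))  # ['provenance', 'handshake']
```

An entry with any open question stays inventory. Only an empty result lets it move on to ranking for that principal.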
2. Runtime discovery changes the problem from browsing to mediation
A human browsing a directory can apply judgment before clicking anything. They can notice that one tool is local and harmless while another is remote, stale, or broad enough to be dangerous.
An agent does not inherit that judgment by default. If the runtime exposes one giant mixed-authority pool, the model is being asked to solve relevance, availability, safety, and authority all at once.
That is the moment where discovery becomes part of the control plane. The runtime is no longer just describing what exists. It is shaping what choices the model is allowed to consider.
3. The wrong abstraction is “best search over the whole catalog”
Better embeddings or a smarter ranker do not fix the authority problem if the candidate pool is wrong. A local read-mostly helper and a high-side-effect remote business integration should not appear as interchangeable ranking candidates just because both match the same task description.
If the only safety layer is hoping the ranker prefers the harmless one, the runtime has already handed the model control-plane work that should have been solved upstream.
The real job of the discovery layer is to remove bad candidate classes before semantic ranking begins.
4. Trust filters belong before ranking
Trust class
Local helper, read-mostly surface, reversible write tool, high-side-effect execution surface, or shared remote integration. Ranking without this is ranking blast radius by accident.
Auth shape
Public, static key, delegated user auth, or tenant-bound runtime credential. A candidate the caller cannot safely authenticate to is not a real candidate.
Side-effect class
Inspect, write, execute, or egress. These need to be visible before the model starts reasoning, not after the call is already selected.
Caller-visible scope
Generated manifests and gateway policy layers only count when the runtime actually hides what this principal should not see now. Discovery truth is caller-visible scope, not global inventory plus a promise.
Freshness and viability
Handshake, auth viability, failure shape, and stale-entry suppression decide whether the candidate pool is still operational truth.
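One way to make these five filters explicit is to give each one a typed field, so the runtime can evaluate them without reading prose. The sketch below is illustrative only: the enum members mirror the categories named above, while `Candidate`, `CallerPolicy`, and `admissible` are hypothetical names rather than part of any MCP specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class TrustClass(Enum):
    LOCAL_HELPER = "local_helper"
    READ_MOSTLY = "read_mostly"
    REVERSIBLE_WRITE = "reversible_write"
    HIGH_SIDE_EFFECT = "high_side_effect"
    SHARED_REMOTE = "shared_remote"


class AuthShape(Enum):
    PUBLIC = "public"
    STATIC_KEY = "static_key"
    DELEGATED_USER = "delegated_user"
    TENANT_BOUND = "tenant_bound"


class SideEffect(Enum):
    INSPECT = "inspect"
    WRITE = "write"
    EXECUTE = "execute"
    EGRESS = "egress"


@dataclass(frozen=True)
class Candidate:
    name: str
    trust: TrustClass
    auth: AuthShape
    effects: FrozenSet[SideEffect]
    healthy: bool  # assumption: last handshake and auth probe both succeeded


@dataclass(frozen=True)
class CallerPolicy:
    visible_names: FrozenSet[str]          # caller-visible scope
    allowed_trust: FrozenSet[TrustClass]
    usable_auth: FrozenSet[AuthShape]
    allowed_effects: FrozenSet[SideEffect]


def admissible(c: Candidate, p: CallerPolicy) -> bool:
    """True only if the candidate passes all five pre-ranking filters."""
    return (
        c.name in p.visible_names
        and c.trust in p.allowed_trust
        and c.auth in p.usable_auth
        and c.effects <= p.allowed_effects   # every side effect must be allowed
        and c.healthy
    )


policy = CallerPolicy(
    visible_names=frozenset({"notes-search", "crm-bulk-update"}),
    allowed_trust=frozenset({TrustClass.LOCAL_HELPER, TrustClass.READ_MOSTLY}),
    usable_auth=frozenset({AuthShape.PUBLIC, AuthShape.DELEGATED_USER}),
    allowed_effects=frozenset({SideEffect.INSPECT}),
)
notes = Candidate("notes-search", TrustClass.READ_MOSTLY, AuthShape.DELEGATED_USER,
                  frozenset({SideEffect.INSPECT}), healthy=True)
crm = Candidate("crm-bulk-update", TrustClass.HIGH_SIDE_EFFECT, AuthShape.TENANT_BOUND,
                frozenset({SideEffect.WRITE, SideEffect.EGRESS}), healthy=True)
print(admissible(notes, policy), admissible(crm, policy))  # True False
```

Both tools are visible to the caller and might match the same task description, but only one is a real candidate once trust class, auth shape, and side-effect class are checked.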
5. The useful discovery surface is the smallest caller-safe subset
A good runtime discovery system should not say, “Here are 14,000 things, good luck.” It should say something closer to, “For this caller, in this environment, under this policy, here are the few candidates that are both relevant enough and safe enough to consider.”
That bounded candidate set lowers context pressure, lowers wrong-tool risk, and makes auditability cleaner because the pool itself reflects policy rather than only search quality.
Bigger catalogs are only better when the runtime gets stricter about what the model is allowed to see.
6. A better runtime-discovery ladder
- Discoverable: the service exists in an index.
- Caller-visible: this principal can actually see it right now.
- Trust-classed: side-effect and authority shape are explicit before selection.
- Auth-viable: the intended caller can complete auth with the expected scope.
- Rankable: only then should semantic search, rules, or LLM ranking choose among the remainder.
That ordering matters. If ranking happens before trust filtering, the system is asking the model to decide blast radius while it decides relevance.
Runtime mediation should optimize for bounded choice first, then better selection inside that bounded set.
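The ladder can be read as an ordered series of gates with ranking as the final step. The following sketch assumes tools and callers are plain dictionaries and that relevance scores come from some upstream ranker; the key names, `highest_rung`, and `shortlist` are hypothetical.

```python
def highest_rung(tool, caller):
    """Return the last ladder rung this tool clears for this caller.

    Anything present in the index is at least "discoverable". A tool that
    clears every gate, including auth viability, is reported as "rankable".
    """
    gates = [
        ("caller_visible", tool["name"] in caller["visible"]),
        ("trust_classed", tool.get("trust_class") is not None
                          and tool.get("side_effect") is not None),
        ("auth_viable", tool["name"] in caller["auth_viable"]),
    ]
    reached = "discoverable"
    for rung, passed in gates:
        if not passed:
            return reached
        reached = rung
    return "rankable"


def shortlist(index, caller, relevance, limit=3):
    """Rank only the tools that reached the top rung, best score first."""
    rankable = [t for t in index if highest_rung(t, caller) == "rankable"]
    rankable.sort(key=lambda t: relevance[t["name"]], reverse=True)
    return [t["name"] for t in rankable[:limit]]


index = [
    {"name": "notes-search", "trust_class": "read_mostly", "side_effect": "inspect"},
    {"name": "crm-bulk-update", "trust_class": "high_side_effect", "side_effect": "write"},
    {"name": "legacy-export"},  # listed, but never trust-classed
]
caller = {
    "visible": {"notes-search", "legacy-export"},
    "auth_viable": {"notes-search"},
}
# Illustrative scores: the most "relevant" tool is the one this caller must not see.
relevance = {"notes-search": 0.7, "crm-bulk-update": 0.9, "legacy-export": 0.8}

print([highest_rung(t, caller) for t in index])
# ['rankable', 'discoverable', 'caller_visible']
print(shortlist(index, caller, relevance))
# ['notes-search']
```

Ranking the whole index by score would have put `crm-bulk-update` first. Running the gates first means the relevance scores only ever compare tools that are already safe for this caller.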
7. What a useful evaluator should score here
The strongest evaluation questions are whether the system exposes caller-specific visibility, whether trust class and side-effect class are visible before selection, whether auth shape is legible, whether stale entries are suppressed, and whether the runtime bounds the pool before semantic ranking begins.
That means a useful discovery layer should separate “worth reviewing” from “safe for this caller right now.” If curated-registry inclusion, launch-week enthusiasm, and runtime availability collapse into one badge, the model mistakes editorial confidence for authorization.
That now includes a harder discovery-truth test with three parts. Do generated manifests or gateway policy layers actually narrow the live candidate set for this caller? Does an out-of-scope choice yield a typed denial instead of a vague failure? Can the runtime still explain which lane consumed shared quota or backend authority after the remote hop?
Those questions separate search quality from control quality. For agent systems, that separation is not optional.
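A typed denial, as opposed to a vague failure, might look like the sketch below. It is one possible shape under stated assumptions: the reason codes, the `Denial` record, the `ToolDenied` exception, and the `policy_id` field are all hypothetical and not taken from any existing gateway.

```python
from dataclasses import dataclass
from enum import Enum


class DenialReason(Enum):
    NOT_VISIBLE = "not_visible"
    TRUST_CLASS_BLOCKED = "trust_class_blocked"
    AUTH_NOT_VIABLE = "auth_not_viable"
    STALE = "stale"


@dataclass(frozen=True)
class Denial:
    """Machine-readable evidence of why a selection was refused."""
    tool: str
    principal: str
    reason: DenialReason
    policy_id: str  # assumption: identifies the rule that made the decision


class ToolDenied(Exception):
    def __init__(self, denial: Denial):
        super().__init__(
            f"{denial.tool} denied for {denial.principal}: {denial.reason.value}"
        )
        self.denial = denial


def select(tool: str, principal: str, visible: dict, policy_id: str = "policy-1") -> str:
    """Return the tool name if the principal may select it, else raise ToolDenied."""
    if tool not in visible.get(principal, set()):
        raise ToolDenied(Denial(tool, principal, DenialReason.NOT_VISIBLE, policy_id))
    return tool


visible = {"alice": {"notes-search"}}
try:
    select("crm-bulk-update", "alice", visible)
except ToolDenied as err:
    # The caller and the audit log both get a reason code, not a generic error string.
    print(err.denial.reason.value, err.denial.policy_id)  # not_visible policy-1
```

Because the denial carries the principal, the reason, and the deciding policy, an evaluator can check that out-of-scope choices are refused for the right reason rather than failing by accident.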
The fast production shortcut is the same one from "MCP has a security model": filter first by caller-visible scope, acting principal, and surviving evidence, then let relevance rank the smaller safe set.
Filter by trust and authority before the model ranks anything
If runtime discovery is part of the control plane, the first production lane should surface only the tools this caller can safely consider, not a giant mixed-authority marketplace.
If this article reframes discovery as mediation, these pages sharpen the operator model around the current manifest-and-governance signal too: the core security model, the auth-versus-authority split, workflow fit versus trust class, governed capability surfaces, and the checklist for real remote readiness.
Scope, principals, and evidence are the pre-ranking filters that keep giant catalogs from becoming mixed-authority traps.
A directory entry only gets safer when authentication narrows discovery and the backend authority still matches the caller after the remote hop.
The right shortlist starts by separating what job the server improves from what authority it carries.
The safer answer is not raw endpoint sprawl. It is a bounded capability surface with visible authority and policy.
Auth, scope, tenant isolation, governors, recovery, and auditability belong in one operator checklist.
The candidate pool stays useful only if the runtime stays narrow under load
Once trust filters narrow the pool, the next operator questions are what breaks in the loop, how shared provider budgets are contained, and how credentials stay narrow as more agents come online.
What actually breaks once retries, tool use, and unattended execution are live.
How shared provider budgets and retry windows turn discovery and execution into a fleet coordination problem.
Why bounded discovery still fails if the credential layer widens faster than the trust model.