12 May 2026
Agentic MCP server testing
Stax.sh is a CLI that recommends MCP servers to a developer based on what they are actually trying to do. Drop it into a project and it inspects the codebase and the recent conversation with your coding agent, then suggests the handful of servers worth installing for this specific job. The recommendations are only as good as the index behind them: per-server keywords, use-cases, and the natural-language phrases a user is likely to type (“i need up-to-date docs for X”, “help me query a Postgres”). Hand-writing that index for every server in the catalogue would have been weeks of work, and writing it without actually trying each server first would have produced README-flavoured copy instead of user-flavoured copy.
So I built a harness. An Opus 4.7 orchestrator, a fleet of Sonnet 4.5 sub-agents in parallel, all inside a single Docker container so the installs and tool calls had no access to my host machine. Each sub-agent installed its assigned server, used it for one believable real-world task, and wrote a structured JSON file: deterministic measurements, a senior-engineer verdict, the directory page copy, and the discovery entry the recommender now matches user prompts against. By the time I came back to my desk the result files were waiting.
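As a rough illustration of what one of those result files could hold, the sketch below groups the four parts named above into a single record. Only the four groupings come from the description; every field name, and the example values, are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ServerResult:
    # Field names are hypothetical; only the four groupings come from the post.
    server_id: str
    measurements: dict = field(default_factory=dict)  # deterministic numbers
    verdict: str = ""                                 # senior-engineer verdict
    page_copy: dict = field(default_factory=dict)     # directory page copy
    discovery: dict = field(default_factory=dict)     # entry the recommender matches


def to_json(result: ServerResult) -> str:
    """Serialise one sub-agent result to a stable, diff-friendly JSON string."""
    return json.dumps(asdict(result), indent=2, sort_keys=True)


# Illustrative values only, not real audit output.
example = ServerResult(
    server_id="example-server",
    measurements={"tool_count": 3},
    discovery={"problem_statements": ["i need up-to-date docs for x"]},
)
print(to_json(example))
```

Keeping the measurements apart from the free-text verdict means the numbers can be compared across servers while the prose stays prose.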
The interesting part is not that an LLM can run a test. It is that the unit of work is “install this thing and try to use it for what a real user would use it for,” and the same shape scales from one target to many without writing a test script per target.
The shape
The orchestrator is small. Opus 4.7 loads data/servers.json, builds a bundle per server (entry, assigned task, auth tier, required env vars), and dispatches one Sonnet 4.5 mcp-tester sub-agent per bundle. Claude Code runs sub-agents in parallel, each with its own context window, so the fleet is a set of fresh, isolated reasoning loops rather than one gigantic prompt. The whole thing runs inside a Docker container so a misbehaving server cannot reach the host filesystem, my SSH keys, or my real environment.
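The bundle step can be pictured roughly like this. It is a minimal sketch, not the orchestrator's actual code: the keys `id`, `auth_tier`, and `env`, and the separate `tasks` mapping, are assumed shapes for the catalogue data, which the post does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Bundle:
    entry: dict        # the raw catalogue entry for one server
    task: str          # the believable real-world task assigned to it
    auth_tier: str     # hypothetical label, e.g. "none" or "api-key"
    required_env: list = field(default_factory=list)


def build_bundles(servers: list, tasks: dict) -> list:
    """Pair every catalogue entry with its assigned task and auth details."""
    bundles = []
    for entry in servers:
        bundles.append(
            Bundle(
                entry=entry,
                task=tasks[entry["id"]],
                auth_tier=entry.get("auth_tier", "none"),
                required_env=list(entry.get("env", [])),
            )
        )
    return bundles


# Inline sample data stands in for the contents of data/servers.json.
servers = [
    {"id": "docs-server"},
    {"id": "db-server", "auth_tier": "api-key", "env": ["DB_URL"]},
]
tasks = {
    "docs-server": "Find up-to-date docs for a library.",
    "db-server": "Query a Postgres database.",
}
bundles = build_bundles(servers, tasks)
print([b.entry["id"] for b in bundles])
```

Each bundle is self-contained, so a sub-agent needs nothing from the orchestrator beyond the one bundle it is handed.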
Each sub-agent has three tools: Bash, Read, Write. That is enough to install a package, connect via the MCP Inspector CLI, attempt the assigned task, and write a JSON file. No browser, no special harness, no per-server fixtures. The sub-agent figures out the rest by reading the server’s README and the tool descriptions it gets back from tools/list.
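For a sense of what the Bash tool ends up running, here is a sketch that assembles MCP Inspector CLI invocations as argument lists. It assumes the Inspector's documented `--cli` mode with its `--method`, `--tool-name`, and `--tool-arg` flags. The server command, the tool name, and its argument are placeholders, and the commands the sub-agents actually ran may look different.

```python
import shlex

# Base invocation of the MCP Inspector in CLI mode.
INSPECTOR = ["npx", "@modelcontextprotocol/inspector", "--cli"]


def list_tools_cmd(server_cmd: list) -> list:
    """Build the argv that asks a server for its tool surface (tools/list)."""
    return INSPECTOR + list(server_cmd) + ["--method", "tools/list"]


def call_tool_cmd(server_cmd: list, tool: str, args: dict) -> list:
    """Build the argv that calls one tool with key=value arguments."""
    cmd = INSPECTOR + list(server_cmd) + ["--method", "tools/call", "--tool-name", tool]
    for key, value in args.items():
        cmd += ["--tool-arg", f"{key}={value}"]
    return cmd


# Placeholder server command; a real run would use the server's own launch command.
server = ["node", "build/index.js"]
print(shlex.join(list_tools_cmd(server)))
print(shlex.join(call_tool_cmd(server, "example_tool", {"query": "tailwind"})))
```

Building the argv as a list rather than a string avoids quoting mistakes when an argument contains spaces.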
What one sub-agent actually does
The brief the sub-agent receives is a goal, not a script. It says things like: “Find up-to-date docs for setting up Tailwind v4 with Next.js 16. Report the library IDs you resolved, at least one concrete config snippet, and whether the docs are current.” The agent decides which tools to use, in what order, with what arguments. If the obvious tool is missing it picks the closest by description and notes the substitution. If a call fails it decides whether to retry or treat the failure as a finding.
Two things make this work, and a fixed test script can do neither. First, the agent reads the natural-language tool descriptions and picks the right one even when the names are slightly different from what the brief expected. Second, when it gets stuck, the stuck point itself becomes signal: the agent writes down where it got blocked, in plain English, and that line ends up both on the directory page and as a watch-out the CLI shows next to its recommendation.
What came back
The interesting findings were ones I would never have written assertions for: GitHub advertises 26 tools totalling 7,032 tokens before any work is done; the Filesystem server’s search_files is a filename glob, not a content grep; and DBHub’s README install command silently fails because npx -y skips an optional native dep. Each finding became one piece of data: a sentence on a directory page, a watch-out the CLI surfaces when it recommends the server, or a problem-statement that boosts ranking when a user’s prompt looks like the one the sub-agent ran. A finished example is stax.sh/servers/context7: tagline, verdict, watch-outs, tool table, and the keywords behind the recommender all came out of the same sub-agent run.
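A tool-surface measurement like the tool and token counts above can be approximated from a tools/list response. The sketch below assumes the response carries a `tools` array of name, description, and inputSchema objects, as in the MCP specification. The characters-divided-by-four estimate is a crude stand-in for a real tokenizer, not the method behind the numbers quoted above, and the sample tools are invented.

```python
import json


def measure_tool_surface(tools_list_result: dict) -> dict:
    """Count advertised tools and roughly estimate their token footprint."""
    tools = tools_list_result.get("tools", [])
    # Serialise every tool definition the way a client would see it.
    text = "".join(json.dumps(tool, sort_keys=True) for tool in tools)
    # Heuristic only: about four characters per token for English-like text.
    return {"tool_count": len(tools), "approx_tokens": len(text) // 4}


# Invented sample response, shaped like a tools/list result.
sample = {
    "tools": [
        {
            "name": "search_files",
            "description": "Find files whose names match a glob pattern.",
            "inputSchema": {"type": "object", "properties": {"pattern": {"type": "string"}}},
        },
        {
            "name": "read_file",
            "description": "Read the contents of one file.",
            "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
        },
    ]
}
print(measure_tool_surface(sample))
```

Because the measurement is deterministic, it can sit in the result file next to the verdict without depending on the agent's judgment.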
Why this beats a fixed test suite
The sub-agent does not just test the server. In the same pass, it writes the directory page copy and the recommender entry that decides whether this server gets surfaced when a future user types “help me lookup the latest Tailwind docs” into their agent. The same attention that picked the right tool produced the tagline, the verdict, the watch-outs, and the lowercased problem-statements the CLI matches against. Splitting that into “run the test” and “write the copy” would have meant a second pass re-deriving the understanding the first one already had.
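To make the matching step concrete, here is one way a recommender could score a prompt against lowercased problem-statements. The post does not say how Stax.sh actually ranks servers, so the token-overlap scoring below is an assumption, and the server names and index entries are invented.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(prompt: str, problem_statements: list) -> float:
    """Best fraction of any one statement's words that appear in the prompt."""
    prompt_tokens = tokens(prompt)
    best = 0.0
    for statement in problem_statements:
        statement_tokens = tokens(statement)
        if statement_tokens:
            overlap = len(prompt_tokens & statement_tokens) / len(statement_tokens)
            best = max(best, overlap)
    return best


def recommend(prompt: str, index: dict, limit: int = 3) -> list:
    """Return server names ranked by score, dropping servers with no overlap."""
    ranked = sorted(((score(prompt, stmts), name) for name, stmts in index.items()), reverse=True)
    return [name for s, name in ranked if s > 0][:limit]


# Hypothetical index: server name -> problem-statements written by a sub-agent.
index = {
    "docs-server": ["i need up-to-date docs for x", "look up the latest tailwind docs"],
    "db-server": ["help me query a postgres"],
}
print(recommend("help me lookup the latest Tailwind docs", index))
```

The point of the example is the data flow: statements written during the test run are what a later prompt gets compared against.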
Failures are signal, not red lights. A traditional regression suite treats a failed assertion as a stop-the-line event. Here, a stuck point is just a known gotcha for the page, and a missing env var is a skipped outcome that still produces the directory page from the tool surface and README alone. The audit has no concept of red CI. It has measurements, verdicts, and an honest record of where each thing fell over.
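The absence of a red state can be shown in a few lines. This is a sketch under assumed names: the outcome labels, the `page_from` field, and the function itself are hypothetical, but the rules follow the paragraph above. A missing env var yields a skipped outcome that still produces a page, and stuck points become gotchas rather than failures.

```python
def classify(required_env: list, env: dict, stuck_points: list) -> dict:
    """Turn one run into a record; no branch here ever raises or fails the audit."""
    missing = [name for name in required_env if not env.get(name)]
    if missing:
        # No credentials: skip the task, but the page is still written
        # from the tool surface and README alone.
        return {
            "outcome": "skipped",
            "missing_env": missing,
            "gotchas": [],
            "page_from": "readme_and_tool_surface",
        }
    # The task ran: every stuck point is recorded as a known gotcha.
    return {
        "outcome": "completed",
        "missing_env": [],
        "gotchas": list(stuck_points),
        "page_from": "task_run",
    }


print(classify(["API_KEY"], {}, []))
print(classify([], {}, ["search_files matches filenames, not file contents"]))
```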
Because the brief is plain English, changing what gets tested is a prose edit. “Now also confirm the server supports the HTTP transport, not just stdio” is a one-line addition to the sub-agent prompt. No fixture, no helper, no new file.
The shape generalises
Nothing in this is specific to MCP. The pattern is: take a list of targets, write down what a real user would do with one of them, give an isolated agent a small toolbelt and a structured output schema, and fan out. The targets can be MCP servers, internal APIs, customer journeys, vendor integrations, or pages on a site. The toolbelt can be the MCP Inspector, a Playwright session, an HTTP client, or a Chrome DevTools MCP attached to a real browser. The output schema is whatever you need to render the results.
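Stripped of the MCP specifics, the pattern fits in a short fan-out helper. The sketch below uses a thread pool to stand in for parallel sub-agents; `run_agent` is a hypothetical callable that would wrap whatever actually drives an agent, and the record fields are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def run_one(run_agent, target):
    """Run one isolated agent and always return a structured record."""
    try:
        return {"target": target, "outcome": "completed", "result": run_agent(target)}
    except Exception as exc:
        # A crash is recorded as a finding instead of stopping the fleet.
        return {"target": target, "outcome": "failed", "error": str(exc)}


def fan_out(targets, run_agent, max_workers=8):
    """Dispatch one agent per target in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda target: run_one(run_agent, target), targets))


def fake_agent(target):
    # Stand-in for a real agent run, used only to exercise the helper.
    if target == "broken":
        raise RuntimeError("install failed")
    return f"tried {target}"


records = fan_out(["alpha", "broken", "gamma"], fake_agent)
print(records)
```

Swapping the target list and the body of `run_agent` is all it takes to point the same loop at APIs, pages, or vendor integrations.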
The win is not that the agent is smarter than a test script. It is that writing what a real user would do is cheap, and the agent absorbs all the small judgment calls that would otherwise be a thousand lines of fixtures, helpers, and brittle selectors. Every finding above came out of a sub-agent reading a README, looking at the tool surface in front of it, and deciding what made sense. That is the same work a human tester does. The difference is that one orchestrator can dispatch 101 of them at once and still hand back clean, structured results that drop straight into a product.