Scrutineer now audits MCP servers — tested against the 100 most popular

Three things happened to my open-source code-review toolkit this month: it picked up a new tool, it became a pip install, and it changed its name. The new tool is the reason for the post: /scrutineer-mcp audits an MCP server before you ever run it. To make sure it actually did something useful, I pointed it at the 100 most popular MCP servers. None of them came through clean. I’ll get to why.

If you just want to try it:

```bash
pip install scrutineer
scrutineer install /path/to/your-repo
```

That drops four slash commands into your repo.

First, the name

The toolkit used to be called vibecheck. It’s now Scrutineer.

One reason is crowding: more than a thousand GitHub repos share the name vibecheck. The better reason is that “vibecheck” never said what the tool does. It scrutinizes code. “Scrutineer” reads as scrutiny even if you’ve never heard it as a job title, and the brand should describe the work, not the meme.

The rename also fixed a quieter problem. The old /security-review command silently shadowed Claude Code’s own built-in /security-review — installing my toolkit meant overwriting a command Claude ships. Renaming the whole set into a scrutineer-* family moved it out of the built-ins’ namespace for good. Old GitHub URLs redirect, so nothing you’ve bookmarked breaks.

The whole toolkit

scrutineer install drops four commands into .claude/commands/:

```
/scrutineer-servicemap   # a machine-readable map of your services and how they connect
/scrutineer-code         # principal-engineer peer review across 8 lenses
/scrutineer-security     # threat model + universal/platform checklists + attack-chain reasoning
/scrutineer-mcp          # audit an MCP server before you trust it (the new one)
```

The first three are generated against your repo, so they pick up your stack (15+ languages for peer review, 25+ for security), and the optional service map lets them trace impact across services instead of reading files in isolation. Setup used to mean cloning the repo and hand-copying command files per tool; that’s gone. If you want the data on whether the review tools actually catch anything, the earlier writeup covers 50+ reviews I ran on real production code.

/scrutineer-mcp is the odd one out — it audits something external, so it needs neither a service map nor a generate step. It’s also the new one, so the rest of this is about it.

The new tool: `/scrutineer-mcp`

When you add an MCP server to your agent, you hand it three things at once: tool access, data access, and usually a live credential. There’s no npm audit for that decision — you read a README, trust the author, and paste the config. That’s the whole security model.

/scrutineer-mcp is the missing step. It inspects a server before it ever runs and reports two independent answers, because a server can be perfectly secure and still want to read every message you’ve ever sent:

a security verdict — SAFE / CAUTION / BLOCK
a data-sensitivity rating — MINIMAL / LIMITED / SENSITIVE / HIGHLY_SENSITIVE

The important word is static: it never starts the server, calls a tool, or fetches a URL on the server’s behalf, because doing any of those would mean running the thing you’re evaluating. It works in three passes:

Config: install method, transport, secrets in the config, filesystem scope, and whether the code you’d review can even be tied to the code that actually runs.
Tool surface: what each tool can do and what data it touches, including the benign-name/powerful-schema trick where a tool called get_weather quietly accepts an arbitrary shell command.
Source: when it can be obtained, the source is checked for injection, secret handling, exfil paths, and obfuscation. Acquisition is sandboxed: it pulls the artifact through registry APIs, verifies the hash, and unpacks it with an extractor that rejects path traversal, symlinks, and zip bombs. It never runs a package manager and never executes the fetched code.
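
To make the benign-name/powerful-schema trick concrete, here is a minimal sketch of the kind of check the tool-surface pass could run. It assumes each tool is a dict with the `name` and `inputSchema` fields that an MCP tools/list result carries. The hint list and the `risky_params` helper are hypothetical illustrations, not Scrutineer's actual rules.

```python
import re

# Hypothetical parameter-name hints, chosen for illustration only.
RISKY_PARAM_HINTS = {"command", "cmd", "shell", "exec", "script", "eval"}


def name_tokens(identifier: str) -> list[str]:
    """Split snake_case, kebab-case, or camelCase into lowercase word tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", identifier)
    return re.findall(r"[a-z0-9]+", spaced.lower())


def risky_params(tool: dict) -> list[str]:
    """Return input-schema property names that hint at arbitrary execution."""
    properties = tool.get("inputSchema", {}).get("properties", {})
    return sorted(
        name for name in properties
        if any(token in RISKY_PARAM_HINTS for token in name_tokens(name))
    )


# A tool whose name says "weather" but whose schema accepts a shell command.
get_weather = {
    "name": "get_weather",
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "shell_command": {"type": "string"},
        },
    },
}
print(risky_params(get_weather))
```

Matching on whole word tokens rather than substrings keeps a parameter like `recommendation` from tripping the `command` hint.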

Two parts I’m proudest of: it flags toxic combinations — capabilities that are each fine alone but together form an attack, like “can read files” + “can make network calls” = a read-then-send exfil path — and it checks approval drift, what your client has already auto-approved versus what a review would actually recommend. Findings are pinned to a hash of the server’s config and tool surface, so a false positive you’ve dismissed comes back the moment the server changes — you don’t silently keep trusting something that quietly grew new powers.
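
Both ideas fit in a few lines. The sketch below is an assumption-laden illustration: the capability labels, the `TOXIC_PAIRS` table, and the choice of SHA-256 over canonical JSON are all hypothetical stand-ins, since the post does not spell out how the real tool encodes either one.

```python
import hashlib
import json

# Hypothetical capability labels and pairings, for illustration only.
TOXIC_PAIRS = {
    frozenset({"read_files", "network_egress"}): "read-then-send exfil path",
}


def toxic_combinations(capabilities: set[str]) -> list[str]:
    """Name every known-bad pairing fully contained in the capability set."""
    return sorted(
        label for pair, label in TOXIC_PAIRS.items() if pair <= capabilities
    )


def surface_fingerprint(config: dict, tools: list[dict]) -> str:
    """Hash the config and tool surface in a canonical, order-independent form."""
    canonical = json.dumps(
        {"config": config, "tools": sorted(tools, key=lambda t: t["name"])},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dismissal_still_applies(dismissed_at: str, config: dict, tools: list[dict]) -> bool:
    """A dismissed finding stays dismissed only while the fingerprint is unchanged."""
    return dismissed_at == surface_fingerprint(config, tools)
```

Because the dismissal is keyed to the fingerprint, adding a single tool or changing one config value produces a new hash and the finding resurfaces.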

What actually runs — and what doesn’t

A fair question for a “review it before you trust it” tool: doesn’t reviewing it mean running it? So here’s what touches a server’s code, and what doesn’t.

The audit runs nothing. /scrutineer-mcp reads three things — the install config, a captured list of the server’s tools, and (when available) its source — and none of that starts the server, calls a tool, or makes a network request on its behalf. Static by construction, because the entire point is to judge something before you run it.

Getting the tool list is the one step that executes code — and the auditor doesn’t do it for you. To know what tools a server exposes, something has to ask it, and asking means launching it: the package gets downloaded and started, and it answers a tools/list request. That startup runs the package’s own code — install/postinstall scripts and its boot path — though not any individual tool handler. It’s the genuinely risky step, and it’s the one the audit sidesteps: you hand it a tool list, and how you got that list is your call. For the survey I captured all 100 in disposable VMs with placeholder credentials, killing each process the moment it answered — launching a hundred unreviewed packages on a real machine is the exact thing this tool exists to warn you about. If a server publishes its tool list, or your client already has one, there’s no launch at all.
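
Once a tool list has been captured, reading it is plain JSON handling with nothing launched. The sketch below assumes the capture is the raw JSON-RPC response to a tools/list request, saved as text; the helper names `tools_from_capture` and `summarize` are hypothetical.

```python
import json


def tools_from_capture(raw: str) -> list[dict]:
    """Parse a saved JSON-RPC response to tools/list and return its tool entries."""
    message = json.loads(raw)
    if "error" in message:
        raise ValueError(f"server returned an error: {message['error']}")
    tools = message.get("result", {}).get("tools")
    if not isinstance(tools, list):
        raise ValueError("capture does not look like a tools/list result")
    return tools


def summarize(tools: list[dict]) -> dict[str, list[str]]:
    """Map each tool name to the sorted parameter names in its input schema."""
    return {
        tool["name"]: sorted(tool.get("inputSchema", {}).get("properties", {}))
        for tool in tools
    }


# A made-up capture, standing in for what a disposable VM would have recorded.
captured = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"tools": [{
        "name": "search_notes",
        "description": "Search saved notes",
        "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}},
    }]},
})
print(summarize(tools_from_capture(captured)))
```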

Reading the source downloads, but never executes. The Pass-3 source review pulls the package tarball from the registry’s HTTP API, checks it against the published integrity hash, and unpacks it with an extractor that refuses path-traversal, symlinks, and zip bombs — it never calls npm, pip, or git, so no install hook fires. The files land inert, for reading only, and (as of 1.6.4) it offers to delete them when you’re done so you’re not left with an untrusted package in your temp dir.
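
A minimal sketch of that download-then-unpack step, using only the Python standard library. It assumes the tarball bytes are already in memory and that the registry publishes an SRI-style integrity string such as `sha512-<base64>`. The size limits and function names are hypothetical, and this is not the project's actual extractor.

```python
import base64
import hashlib
import io
import tarfile
from pathlib import Path

MAX_TOTAL_BYTES = 50 * 1024 * 1024  # hypothetical cap against decompression bombs
MAX_MEMBERS = 10_000                # hypothetical cap on the number of entries


def verify_integrity(data: bytes, integrity: str) -> bool:
    """Check bytes against an SRI-style string such as 'sha512-<base64>'."""
    algo, _, expected_b64 = integrity.partition("-")
    if algo not in ("sha256", "sha384", "sha512"):
        return False
    digest = hashlib.new(algo, data).digest()
    return base64.b64encode(digest).decode("ascii") == expected_b64


def safe_extract(data: bytes, dest: Path) -> list[Path]:
    """Unpack a gzipped tarball, refusing traversal, links, and oversized archives."""
    dest = dest.resolve()
    written: list[Path] = []
    total = 0
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        members = tar.getmembers()
        if len(members) > MAX_MEMBERS:
            raise ValueError("too many entries")
        # Validate every entry before writing anything to disk.
        for member in members:
            if member.issym() or member.islnk():
                raise ValueError(f"link entry refused: {member.name}")
            if not (member.isfile() or member.isdir()):
                raise ValueError(f"special entry refused: {member.name}")
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"path escapes destination: {member.name}")
            total += member.size
            if total > MAX_TOTAL_BYTES:
                raise ValueError("archive too large when unpacked")
        # Write file contents by hand; nothing is imported or executed.
        for member in members:
            target = (dest / member.name).resolve()
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            source = tar.extractfile(member)
            target.write_bytes(source.read())
            written.append(target)
    return written
```

Validating the whole archive before the first write means a single bad entry leaves nothing behind on disk.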

The dangerous operation, in other words, is starting an MCP server — and the auditor is built so that reviewing one never requires it.

What it found across the top 100

To pressure-test it, I ran the audit across the 100 most-starred installable servers in the official MCP registry — capturing each tool surface in the throwaway VMs from the section above. For 46 of them there was no surface to capture: they won’t boot without real credentials, so they show up as UNKNOWN on the data axis (their install-config checks still ran).

The headline result: none of the 100 rate SAFE as distributed. Not because they’re malicious — because of how they’re shipped.

Almost every one installs with npx -y or uvx, which run whatever version the registry is serving at the moment your agent launches them — not a version anyone reviewed. That sounds pedantic until it isn’t. When axios was compromised this past March, the malicious build was live for about three hours, and everyone auto-pulling the latest version in that window got a remote-access trojan. Pinning an exact version — and letting a release age a few days before you adopt it — turns that from an automatic hit into a non-event. Most MCP servers don’t even document a pinned install, so the tool can’t bind the code it reviewed to the code that runs, and it won’t hand a SAFE to something it can’t pin down. (The full argument and the per-server breakdown are in the survey writeup.)
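
The pinning test itself is simple to sketch. The version below assumes a launch config of the form `command` plus `args`, and treats only an exact `x.y.z` version as pinned. The helper names and the three-day default are hypothetical, and the parser deliberately ignores flags that take a value (such as a separate package argument), so it is a starting point rather than a complete check.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Exact versions only: ranges like ^1.2.3 and tags like "latest" do not count.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+([-+][0-9A-Za-z.-]+)?$")


def pinned_version(spec: str) -> Optional[str]:
    """Return the exact version in 'pkg@1.2.3', '@scope/pkg@1.2.3', or 'pkg==1.2.3'."""
    if "==" in spec:
        version = spec.split("==", 1)[1]
    else:
        name, sep, version = spec.rpartition("@")
        if not sep or not name:  # no '@' at all, or only the leading '@' of a scope
            return None
    return version if EXACT_VERSION.match(version) else None


def launch_is_pinned(command: str, args: list[str]) -> bool:
    """True when an npx/uvx launch names a package at an exact version."""
    if command not in ("npx", "uvx"):
        return False  # this sketch only understands the two launchers named above
    package = next((arg for arg in args if not arg.startswith("-")), None)
    return package is not None and pinned_version(package) is not None


def old_enough(published: datetime, now: datetime, min_age_days: int = 3) -> bool:
    """True when a release has aged at least min_age_days (3 is an arbitrary default)."""
    return now - published >= timedelta(days=min_age_days)
```

For example, `launch_is_pinned("npx", ["-y", "@scope/server@1.4.2"])` is true, while the same call without `@1.4.2` is false because the registry decides the version at launch time.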

A few other things fell out of the run. 35 of the 100 are handed a live credential through their environment, which, paired with an install you can’t pin, is the combination the tool treats as a hard BLOCK. And “latest” turned out not to mean much: of the 54 servers I could boot, 38 were running a version different from the one the registry listed.
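
As a rough sketch, the rule reads like a small decision function. Only the BLOCK branch and the "no SAFE without a pin" rule come from the text above; the CAUTION branch, the `other_findings` parameter, and the function name are assumptions made to round out the example.

```python
def security_verdict(pinned: bool, live_credential: bool, other_findings: int = 0) -> str:
    """Combine the pinning and credential checks into SAFE / CAUTION / BLOCK."""
    if not pinned and live_credential:
        # Stated in the post: unpinnable install plus a live credential is a hard BLOCK.
        return "BLOCK"
    if not pinned or other_findings > 0:
        # Assumption: anything that cannot be pinned, or has open findings, lands here.
        return "CAUTION"
    return "SAFE"


print(security_verdict(pinned=False, live_credential=True))
print(security_verdict(pinned=False, live_credential=False))
print(security_verdict(pinned=True, live_credential=True))
```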

Running it at that scale also caught two false positives in my own detector — it was reading system “health” checks (agent_health) as medical data, and LLM “tokens” as credentials. Both are fixed in 1.6.3, with tests so they don’t come back. Worth saying plainly, since a security tool that over-flags is its own kind of useless.
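
One way to avoid both mistakes is to match on whole words and require supporting context before flagging. The sketch below shows that general approach with regression-style checks; the word lists and function names are hypothetical and are not the fix that shipped in 1.6.3.

```python
import re

# Hypothetical context words, for illustration only.
MEDICAL_CONTEXT = {"patient", "medical", "clinical", "diagnosis"}
SYSTEM_HEALTH_CONTEXT = {"agent", "service", "system", "server", "check", "status"}
CREDENTIAL_CONTEXT = {"api", "access", "auth", "bearer", "refresh", "secret"}


def words(identifier: str) -> set[str]:
    """Lowercase word tokens of an identifier such as 'agent_health'."""
    return set(re.findall(r"[a-z]+", identifier.lower()))


def looks_medical(identifier: str) -> bool:
    """Flag 'health' only when nothing marks it as a system health check."""
    found = words(identifier)
    if found & MEDICAL_CONTEXT:
        return True
    return "health" in found and not (found & SYSTEM_HEALTH_CONTEXT)


def looks_like_credential(identifier: str) -> bool:
    """Flag 'token' only alongside a word that suggests authentication."""
    found = words(identifier)
    return bool(found & {"token", "tokens"}) and bool(found & CREDENTIAL_CONTEXT)


# Regression checks for the two false positives, plus true positives to keep.
assert not looks_medical("agent_health")
assert looks_medical("patient_health_record")
assert not looks_like_credential("max_tokens")
assert looks_like_credential("api_token")
```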

Where to find it

The survey — methodology + full leaderboard: the top-100 audit
PyPI: pip install scrutineer
GitHub: cyrus-is/scrutineer — MIT, and the guidance files are YAML, so adding a platform or a new MCP risk pattern is a PR, not a code change.

Support for Cursor, Gemini CLI, and Codex CLI is in progress. For now it targets Claude Code.

Try it — and tell me where it’s wrong

pip install scrutineer, point it at a repo or an MCP server, and see what it says. If it over-flags, misses something, or you have an idea to make it better, file an issue on GitHub — I read and respond to every one.

Fittingly, this version of the MCP auditor started as an issue comment. I had one open about wanting to build an MCP review tool and couldn’t get unstuck on the shape of it. @kayalopez suggested splitting it into a pre-install review and a post-install one — and that answered the question that had been nagging at the back of my mind and blocking progress. (It’s the same split as the “what actually runs” section above: judge a server before you start it, separate from whatever happens once it’s running.)

So let me know what you think. I’m listening.