I Tested AI Code Review Against Human-Written Review Skills. Neither Won.

Update — April 23, 2026: Vibecheck 1.1.0 is out. Short version below; original article follows.

1.1.0: false positive filtering, multi-repo service maps, C# support

Security reviews that don’t cry wolf. A false positive filtering system inspired by Claude Code’s built-in security review. It has 12 hard exclusion categories (including DoS, test-only code, and theoretical race conditions), precedent rules that calibrate severity to real-world exploitability, and a 0.8 confidence threshold that kills speculative findings before they reach the report. The filter is platform-aware: if your repo uses Go and React, the generated skill automatically knows not to flag memory safety issues or XSS in JSX.
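
To make the filtering idea concrete, here is a minimal sketch of how exclusions and a confidence threshold could combine. It is not Vibecheck's implementation. Only the 0.8 threshold and the example categories come from the release notes. The `Finding` shape, the slugs, the platform mapping, and the choice to keep findings at exactly 0.8 are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical names throughout. Only the 0.8 threshold and the example
# exclusion categories are taken from the release notes.
CONFIDENCE_THRESHOLD = 0.8
HARD_EXCLUSIONS = {"dos", "test-only-code", "theoretical-race-condition"}

# Categories a platform makes irrelevant (illustrative mapping).
PLATFORM_EXCLUSIONS = {
    "go": {"memory-safety"},
    "react": {"xss-in-jsx"},
}


@dataclass(frozen=True)
class Finding:
    category: str      # category slug
    confidence: float  # 0.0 to 1.0
    title: str


def filter_findings(findings, platforms):
    """Keep findings that survive the exclusions and the threshold."""
    excluded = set(HARD_EXCLUSIONS)
    for platform in platforms:
        excluded |= PLATFORM_EXCLUSIONS.get(platform, set())
    return [
        f for f in findings
        if f.category not in excluded
        and f.confidence >= CONFIDENCE_THRESHOLD  # assumed inclusive
    ]
```

The exclusions run before the threshold on purpose: a high-confidence finding in an excluded category is still noise for that repo.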

Service maps that span repos. The servicemap generator now supports multi-repo architectures. Crawl repo A, then point the same servicemap at repo B — it merges the new components without touching repo A’s data. Every component tracks its source repo, staleness is scoped per-repo, and nothing is ever deleted unless you explicitly ask.
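
The merge behavior described above can be sketched roughly like this. The component shape (dicts with "name" and "repo" keys) and the "stale" flag are assumptions for illustration, not the real servicemap format.

```python
def merge_service_map(existing, crawled, repo):
    """Merge components crawled from `repo` into an existing map.

    Assumed shape: each component is a dict with "name" and "repo" keys.
    Components from other repos are left untouched, and nothing is
    deleted: components from `repo` that the new crawl did not see are
    only marked stale.
    """
    merged = {(c["repo"], c["name"]): dict(c) for c in existing}
    seen = set()
    for comp in crawled:
        key = (repo, comp["name"])
        merged[key] = {**comp, "repo": repo, "stale": False}
        seen.add(key)
    for key, comp in merged.items():
        if key[0] == repo and key not in seen:
            comp["stale"] = True  # staleness is scoped to the crawled repo
    return list(merged.values())
```

Keying on (repo, name) is what keeps a crawl of repo B from overwriting a same-named component in repo A.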

C# and .NET support. Detection for .csproj, .sln, ASP.NET route attributes, Entity Framework, Dapper, HttpClient/IHttpClientFactory, appsettings.json, and NuGet package references. Plus a fallback discovery phase that reasons from project structure when it encounters an unknown stack.

Tighter output formats. Verdict at the top, not buried at the bottom. No-findings reviews are a single line, not a page of empty sections. Security findings require a category slug, confidence score, and concrete exploit scenario. GitHub issue filing is opt-in, not automatic.

I verified the service map output against two public repos — dotnet/eShop (29 components, 33 connections, zero phantoms) and bitwarden/server (48 components, 40 connections, zero phantoms) — with senior-engineer-persona agents checking every claim against source. That verification surfaced the bugs this release fixes: route prefix hallucination on ASP.NET controllers, Azure services mislabeled as AWS equivalents, inconsistent library connection tracking.

On the backlog: threat model generator, MCP security review, deeper cloud provider coverage, background service and messaging pattern detection. See the v1.1.0 release notes for specifics.

I’m releasing a set of open-source generators that build Claude Code skills designed for your repo — service map, peer code review, and security review skills that work together. They found 5 P1s sitting in production code that months of human review missed. But the humans caught things the AI never would. This is what I learned building them, and why I think the answer isn’t one or the other.

I’ve been building tools that generate code review and security audit skills for AI coding assistants. Not generic “scan this file for bugs” tools. Skills that are customized to your actual codebase, your tech stack, your architecture. You run them against your repo and they produce a review skill that knows your patterns.

I wanted to know: are they any good? Better than hand-written review checklists? Worse? Different?

So I tested them. 50+ reviews across 8 real production components. Backend services in Go, iOS app in Swift, Android app in Kotlin, infrastructure, the works. I ran the generated skills, ran hand-written skills I’d built over months, and also ran pure agentic review with zero guidance (just “review this code”). Then I compared everything.

The results were not what I expected.

The generated skills found bugs the humans missed

The AI-generated review skills found 23 unique issues. The hand-written skills found 21. Close enough to call it a draw on volume. But the types of bugs were completely different.

The generated skills caught a P1 data loss bug that months of hand-written reviews never surfaced. The iOS sync engine had link tables (person-to-house, pet-to-house, person-to-relationship) that were never synced. They had syncStatus fields. They looked like they were part of the sync system. They weren’t. If you edited a link offline and synced later, the edit was silently dropped. Gone. No error, no warning, no recovery path.

The generated skill found it because it was systematically checking every entity type against the sync pipeline. A human reviewer would have to know to look for that, and none of us did. We’d all just assumed the sync engine handled everything with a syncStatus field.
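
That systematic check boils down to a set comparison. Here is a small sketch of the idea, with hypothetical model names and a made-up registry. The real check ran against Swift code, so treat this as the shape of the reasoning only.

```python
def unsynced_entities(models, registered):
    """Return models that carry a syncStatus field but are not
    registered with the sync pipeline."""
    return sorted(
        name for name, fields in models.items()
        if "syncStatus" in fields and name not in registered
    )


# Hypothetical entity names, for illustration only.
models = {
    "Person": {"id", "name", "syncStatus"},
    "PersonHouseLink": {"personId", "houseId", "syncStatus"},
    "AppSettings": {"theme"},
}
registered = {"Person"}
print(unsynced_entities(models, registered))  # prints ['PersonHouseLink']
```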

It also found that the pending-delete purge routine was purging records whose server-side delete had failed. If the network request errored out, the local record was deleted anyway. The app was treating “I tried to delete this and couldn’t” the same as “I deleted this successfully.” That’s a data loss path that only shows up under bad network conditions, which is exactly when you’d expect sync operations to matter most.
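
The fix for this class of bug is to purge only what the server has confirmed. Here is a minimal sketch, assuming two hypothetical callables (`delete_on_server`, `delete_locally`) and assuming a network failure surfaces as `ConnectionError`.

```python
def purge_pending_deletes(pending_ids, delete_on_server, delete_locally):
    """Purge local records only after the server confirms the delete.

    `delete_on_server` and `delete_locally` are hypothetical callables.
    Returns the ids that are still pending so the next sync can retry.
    """
    still_pending = []
    for record_id in pending_ids:
        try:
            delete_on_server(record_id)
        except ConnectionError:
            # "Tried and failed" is not "deleted": keep the local record.
            still_pending.append(record_id)
            continue
        delete_locally(record_id)
    return still_pending
```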

Here’s what the iOS peer review comparison actually looked like:

Finding	Generated	Hand-Written
Sync engine drops fields during push (P1 data loss)	✅	—
Fatal error on model corruption (P2)	✅	—
Sync race: edits lost during active sync (P2)	✅	—
Fire-and-forget tasks race with logout (P3)	✅	—
Location service thread-safety	✅	✅
Capture flow empty author ID	✅	✅
Save guard not set during mutations (double-tap)	—	✅ (P2)
View model catches non-throwing function	—	✅ (P2)
Main-actor annotation blocks background task	—	✅
Cache race on re-login	—	✅
Overlapping sync guard not reentrant	—	✅

The generated review found a P1 the hand-written review missed. The hand-written review found 5 things the generated review missed. Different strengths, not a clean winner.

The hand-written skills caught things the AI never would

The hand-written reviews caught platform-specific edge cases that the generated skills didn’t have in their guidance files. An isSaving race condition in Swift where two rapid saves could interleave. A CancellationException in Kotlin that wasn’t being handled in a coroutine scope, which would silently swallow the cancellation and leave the operation in a half-finished state.

The Android comparison was even more lopsided toward the hand-written skills:

Finding	Generated	Hand-Written
Auth client missing timeouts (P2)	✅	—
Missing ProGuard keep rules (P2)	✅	—
Orphaned entity on partial failure (P3)	✅	—
Duplicate envelope unwrapping (P3)	✅	—
Cancellation exception swallowed in authenticator	✅	✅
Search with no debounce	✅	✅
Sequential API calls where parallel would work	✅	—
Dual token refresh race (P1)	—	✅
Retry method infinite recursion	—	✅
Error messages leak internals (P2)	—	✅
Two auth client instances with different configs	—	✅
Cancellation exception swallowed in view models (P1)	—	✅
Thundering herd (50 concurrent API calls)	—	✅
Push token not unregistered on logout (P2)	—	✅

The generated review found 5 unique issues. The hand-written review found 7, including two P1s: a dual token refresh race that caused random logouts, and coroutine cancellation exceptions being swallowed across the entire view model layer.

These are the kinds of bugs you learn about by shipping code in those languages for years. The generated skills didn’t know to look for them because nobody had added those patterns to the guidance files yet. A senior Android developer would catch the CancellationException pattern everywhere in a manual review. The generated skill flagged it in the authenticator but missed it across the view model layer, because nothing in its guidance marked the pattern as dangerous.
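
For readers who don’t write Kotlin: a broad catch (e: Exception) there also catches CancellationException, so the coroutine never learns it was cancelled. The sketch below is a rough Python analogue of the safe pattern, not the code from the app. The function and its arguments are hypothetical. The rule it illustrates is to handle cancellation explicitly and re-raise it.

```python
import asyncio


async def save_with_cleanup(work, log):
    """Run `work`, logging failures without swallowing cancellation."""
    try:
        await work()
    except asyncio.CancelledError:
        log.append("cancelled")
        raise  # re-raise so the caller sees the cancellation
    except Exception as exc:
        log.append(f"failed: {exc}")
```

Dropping the `raise` reproduces the bug class: the task reports a normal finish even though the operation stopped halfway.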

This was the first real lesson: generated skills are better at systematic, comprehensive coverage. Hand-written skills are better at platform-specific depth. Neither replaces the other.

Then I tried pure agentic review and everything got interesting

I ran a third experiment. No checklists, no pre-flight checks, no focus areas. Just pointed an AI agent at each component and said “review this.” No template. No guidance. Just reasoning.

It found a completely different class of bug.

The generated and hand-written skills both scan files. They look at code and check it against known patterns. The pure agentic review traced flows. It followed a request from entry point through service calls, database writes, cache updates, and response handling. And by doing that, it caught things that file-level analysis can’t see.

Finds from the agentic review. Apart from the first two, which the generated skills also surfaced, neither the generated nor the hand-written skills caught these:

The sync engine’s link tables (person-to-house, pet-to-house) were never registered with the sync pipeline. Offline link edits were lost permanently. No error, no warning.
The pending-delete routine purged records whose server-side delete had failed. Network error = local data gone.
A model context was being created on the wrong actor boundary — a crash risk on iOS 18.
An upsert operation dropped a deduplication ID on server pull, breaking contact matching.
One API flow had a batch endpoint for bulk operations. A parallel flow doing essentially the same thing made N sequential HTTP calls. Both worked. One was 10x slower.
Across 6 services, every single one had Row Level Security “enabled” but configured as USING (true) WITH CHECK (true) — allow everything. Worse than no RLS because it gives you false confidence.
A database connection URL was passed as a plain environment variable instead of a secrets reference — visible in the task definition JSON.
The admin IAM role used Resource: "*" for CloudWatch — could read any log group in the account.
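
The allow-everything RLS policy is easy to spot once you know to look for it. The sketch below is a rough text scan over migration SQL, assuming PostgreSQL-style CREATE POLICY statements. It is a regex, not a SQL parser, so it will miss quoted policy names containing spaces and other edge cases.

```python
import re

# Matches CREATE POLICY <name> ON <table> ... USING (true) within a
# single statement (the scan never crosses a semicolon).
ALLOW_ALL = re.compile(
    r"CREATE\s+POLICY\s+(\S+)\s+ON\s+(\S+)[^;]*?USING\s*\(\s*true\s*\)",
    re.IGNORECASE,
)


def allow_all_policies(sql):
    """Return (policy, table) pairs whose USING clause is just `true`."""
    return ALLOW_ALL.findall(sql)


migration = """
CREATE POLICY all_rows ON accounts FOR ALL USING (true) WITH CHECK (true);
CREATE POLICY own_rows ON notes USING (owner = current_user);
"""
print(allow_all_policies(migration))  # prints [('all_rows', 'accounts')]
```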

The agentic review caught these because it wasn’t scanning files for patterns. It was building a mental model of the system and asking “does this make sense?” That’s a fundamentally different operation.

The pattern was clear: skill-driven reviews scanned files, and the agentic review traced flows. The systemic bugs (data integrity issues, architectural inconsistencies, infrastructure security gaps) only showed up when an agent followed a request end-to-end across service boundaries.

The built-in review prompt beat all four custom tools on a PR

This one actually surprised me. I tested the built-in /review command against all four custom review tools (two generated, two hand-written) on the same pull request. The built-in prompt found 8 unique issues. The custom tools found 3 to 7 each.

Why? Because the custom tools are optimized for finding bugs in the codebase. The built-in prompt is optimized for reviewing changes. It compares what changed to what the PR says it’s doing. It caught that an Android function was bundled into a security fix PR but had nothing to do with security. It caught that a delete operation was logging “deleted” even when zero rows were affected. It caught partial failure paths in multi-step operations where step 2 failing would leave step 1’s side effects orphaned.

These are all change-quality issues, not code-quality issues. The custom review tools were looking at the code and asking “is this correct?” The built-in prompt was looking at the diff and asking “does this change make sense?” Different question, different results.
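
The delete-logging finding is a good example of how small these change-quality issues can be. Here is a minimal sketch using Python’s sqlite3 module, with a hypothetical `contacts` table. The real code was not Python, so this only illustrates the check that was missing.

```python
import sqlite3


def delete_contact(conn, contact_id):
    """Delete a row and report what actually happened.

    Returns True only if a row was removed, so the caller does not log
    "deleted" when nothing matched.
    """
    cur = conn.execute("DELETE FROM contacts WHERE id = ?", (contact_id,))
    conn.commit()
    return cur.rowcount > 0  # rowcount is the number of rows deleted
```

A caller can then log "deleted" on True and "not found" on False instead of always reporting success.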

Severity calibration was the quiet win

One thing I didn’t expect: the generated skills calibrated severity better than the hand-written ones.

Here’s what the iOS security review comparison looked like:

Finding	Generated	Hand-Written
Certificate pinning missing for production domain	✅ HIGH (BLOCK)	✅ MEDIUM (PASS)
PII in debug print statements	✅	✅
No background screenshot masking	✅	✅
Local data store not encrypted	✅	✅
Deep link parameter not validated	✅	✅
Sensitive ID stored in plain user defaults	✅	✅
Clipboard exposure analysis	—	✅
Server error messages exposed in UI	—	✅

Both found the certificate pinning gap. But the generated skill rated it HIGH/BLOCK because production users are affected and a MITM attack is possible. The hand-written checklist rated it MEDIUM/PASS. The generated skill got it right because it has a phase that considers blast radius before assigning severity. The hand-written checklist just said “check for cert pinning” with no context about who’s affected if it’s missing.
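
One way to picture a blast-radius phase is as a step that adjusts a base severity before the verdict is assigned. The sketch below is purely illustrative. The two factors, the one-step bump per factor, and the rule that only HIGH blocks are all assumptions, not the generated skill’s actual logic.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH"]


def calibrate_severity(base, affects_production, exploitable):
    """Raise severity one step per blast-radius factor (assumed rule)."""
    level = SEVERITY_ORDER.index(base)
    level += int(affects_production) + int(exploitable)
    severity = SEVERITY_ORDER[min(level, len(SEVERITY_ORDER) - 1)]
    verdict = "BLOCK" if severity == "HIGH" else "PASS"  # assumed policy
    return severity, verdict
```

Under these assumptions, a MEDIUM finding that affects production users ends up as HIGH/BLOCK, while the same finding in an internal-only tool stays MEDIUM/PASS.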

This matters because severity is what determines whether a finding gets fixed today or sits in a backlog for six months. Getting it wrong in either direction is bad. Too high and everything is urgent and nothing is. Too low and real issues get buried.

The numbers

Across 8 components, 40 skill-driven reviews (20 generated, 20 hand-written) and 4 agentic reviews produced this final tally:

30 actionable work items after deduplication and triage
5 P1s (data loss, security, crash)
11 P2s (data integrity, UX, correctness)
14 P3s (quality, performance, maintainability)
6 actively exploitable, 9 conditionally exploitable, 2 theoretical, 13 quality
32 delta patterns identified for improving the generated skills

The generated skills found more unique issues overall and calibrated severity better. The hand-written skills caught platform-specific patterns the generator didn’t cover. The agentic review found systemic bugs neither approach caught. The built-in /review prompt beat all custom tools on PR review specifically.

No single approach won. But together, they found 5 P1s that had been sitting in production code.

What I actually learned

Code review isn’t one thing. It’s at least four different things, and each finds bugs the others miss:

Pattern matching catches known-bad code fast. Missing error checks, SQL injection patterns, unsafe deserialization. High confidence, low cost, shallow.

Flow tracing catches systemic issues that pattern matching can’t see. Inconsistent architectures across similar endpoints, missing batch operations, false security configurations. Slow, requires reasoning across files, but finds the bugs that actually cause incidents.

Change review catches issues with the delta, not the code. Bundled unrelated changes, side effects that weren’t accounted for, missing test coverage for new behavior. Completely different skill from codebase review.

Platform expertise catches language-specific gotchas that only come from years of shipping code in that language. Race conditions, concurrency edge cases, framework-specific pitfalls. Hard to generate, easy to hand-code if you know where to look.

No single tool does all four well. And that’s the point. You don’t want one tool. You want different modes for different cadences.

So I built four modes into each skill

All of these findings went into the generated skills. I tested, iterated, and retested until the gaps closed. Both the peer review and the security review now have four modes, each designed for a different cadence and a different class of finding:

/peercodereview              # Current branch diff
/peercodereview 123          # Review PR #123
/peercodereview <component>  # Full review of a service/app
/peercodereview --deep       # Full repo review

/security-review             # Current branch diff
/security-review 123         # Audit PR #123
/security-review <component> # Full security audit of a service/app
/security-review --deep      # Full repo security audit

Branch diff and PR review are your every-PR tools. Fast, focused on the change, catches pattern-match bugs and change-quality issues. This is where platform-specific pre-flight checks and change-type signals live. Run it on every PR. It takes minutes.

Component review is the weekly or daily sweep depending on your tolerance. It reviews an entire service or app directory against all 8 evaluation lenses. This is where you catch drift. Things that are individually fine but collectively wrong. An endpoint that’s missing a guard that every other endpoint has. A service that’s grown a new dependency nobody discussed. Run it on your critical services regularly.

Deep review is the full agentic mode. No shortcuts. It traces flows across service boundaries, builds a mental model of the system, and looks for systemic issues. This is the mode that found the batch endpoint inconsistency, the fake RLS policies, the sync engine data loss. It’s slow and expensive but it catches the bugs that cause incidents. Run it weekly or monthly.

The idea is that each mode maps to what I learned about the different dimensions of review. PR mode handles change review and pattern matching. Component mode handles coverage and consistency. Deep mode handles flow tracing and architectural reasoning. Platform expertise lives in the YAML guidance files that feed all four modes.

Both the peer review and security review skills follow the same structure. The peer review evaluates across 8 lenses (production reliability, correctness, data integrity, etc.). The security review builds a threat model first, then runs universal and platform-specific checklists, then does agentic attack chain reasoning. Different goals, same four modes, same cadence logic.

You’re not supposed to run all four on every PR. You layer them. The fast checks run constantly. The deep checks run periodically. Together they cover a lot of ground.

But I want to be honest: they won’t catch everything. The data showed real variance across review approaches. The generated skills missed things the hand-written ones caught. The agentic review found things neither caught. The built-in prompt beat all of them on a specific PR. No single tool won across the board. So use these, but also experiment. Try other approaches. Tweak the guidance files. Run the deep mode and see what surprises you. The YAML is open for a reason. If your team’s senior iOS developer knows about a class of bug these tools miss, add it. The tools get better the more real-world expertise goes into them.

So I open-sourced the tools

The project is called vibecheck. It generates code review and security audit skills customized to your repo. You point it at your codebase, it scans your tech stack, and it produces a Claude Code skill with platform-specific pre-flight checks, focus areas, and the four review modes above. It supports 20+ platforms across backend, web, mobile, infrastructure, and API layers.

There’s also a service map generator that crawls your repo and builds a machine-readable topology of all your services, datastores, and connections. The review skills use it for cross-service analysis. Without it they still work. With it they can trace impacts across service boundaries. Component mode and deep mode get significantly better with a service map because they can reason about the architecture, not just the code.

The guidance files are YAML. Adding a new platform or a new check means editing YAML in a PR, not changing the generator’s code. The tools also report their own gaps: if they detect a platform in your repo that isn’t in the guidance file, they flag it.

The skills feel solid now and the data backs it up. But after you deploy them, test them against the built-in review and see where the gaps are for your environment. I plan to re-analyze mine every month or so and continually tweak them. As I do, I’ll update the generators so others can benefit from what I learn.

I don’t think this replaces senior engineers doing code review. I think it makes the review they’re already doing more systematic. The generated skills caught a P1 data loss bug that experienced humans missed for months. That’s not because the humans were bad at review. It’s because humans can’t hold an entire sync pipeline in their heads while also checking for 50 other things. The tools can.

And the places where the tools fell short, the platform-specific edge cases, the change-quality issues? Those are the things humans are genuinely good at. The combination is better than either one alone.

That’s the whole point. Not replacing review. Making it work.