AI-Powered Code Review: Our Internal Workflow
Six months ago, our code review process was a bottleneck. Senior engineers spent 6-8 hours per week reviewing pull requests. Reviews sat in queue for an average of 14 hours. Junior developers waited idle while their PRs collected dust.
We didn't replace human reviewers with AI. We gave our reviewers AI-powered tools that handle the routine checks so they can focus on architecture, logic, and mentorship.
Here's the workflow we built and what changed.
The Problem With Manual-Only Review
Our review process had three pain points:
- Style and formatting debates — 40% of review comments were about code style, naming conventions, and formatting. These are important but shouldn't require senior engineer attention.
- Missed edge cases — Human reviewers are great at architecture feedback but inconsistent at catching null checks, error handling gaps, and boundary conditions.
- Context switching — A reviewer needs 15-20 minutes to load a PR's context into their head. Interruptions reset this clock.
Our AI Review Pipeline
We built a three-layer review system:
Layer 1: Automated Checks (Pre-Review)
Before any human sees a PR, automated checks run:
- ESLint + TypeScript strict mode — Catches type errors, unused variables, import issues
- Prettier — Eliminates all formatting debates
- Codacy — Static analysis for security vulnerabilities, complexity, duplication
- Bundle size check — Flags PRs that increase the client bundle by more than 5KB
These run in CI. If any fail, the PR is blocked from review. This alone eliminated 40% of review comments.
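Most of these checks are off-the-shelf tools; the bundle size gate is the one we wrote ourselves, and its core is a simple byte-budget comparison. A minimal sketch (the function name, shape, and byte counts are illustrative; in CI we feed it the gzipped client bundle sizes from the base and PR builds):

```typescript
interface BundleCheck {
  pass: boolean
  deltaBytes: number
  message: string
}

// Compare base-branch and PR bundle sizes against a byte budget.
// The 5KB default mirrors the CI rule described above.
export const checkBundleSize = (
  baseBytes: number,
  prBytes: number,
  budgetBytes = 5 * 1024,
): BundleCheck => {
  const deltaBytes = prBytes - baseBytes
  const pass = deltaBytes <= budgetBytes
  return {
    pass,
    deltaBytes,
    message: pass
      ? `Bundle delta ${deltaBytes}B is within the ${budgetBytes}B budget`
      : `Bundle grew by ${deltaBytes}B, over the ${budgetBytes}B budget`,
  }
}
```

CI fails the job when `pass` is false, which blocks the PR from entering review.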
Layer 2: AI Analysis (Context-Aware)
This is where it gets interesting. When a PR passes automated checks, an AI analysis runs that understands our codebase:
What the AI reviews:
- Pattern consistency — "This component uses `useState` for form state, but our convention is `useForm` from react-hook-form. See `src/components/ContactForm.tsx` for reference."
- Error handling gaps — "This API route doesn't handle the case where `supabase.from('projects').select()` returns an error. Other routes in `src/app/api/` consistently check for errors."
- Performance concerns — "This `useEffect` dependency array includes `router`, which changes on every render. Consider using `useCallback` or moving the navigation logic."
- Security issues — "This route reads `req.body.userId` without validating it matches `auth.uid()`. Similar routes use RLS, but this one bypasses it."
What the AI doesn't review:
- Architecture decisions — those need human judgment
- Business logic correctness — the AI doesn't know the product requirements
- UX feedback — visual design and interaction design stay human
Layer 3: Human Review (Focused)
By the time a human reviewer sees the PR, the boring stuff is handled. Their review focuses on:
- Does this architecture decision make sense for the long term?
- Is this the right abstraction level?
- Will this be maintainable by the team in 6 months?
- Does the testing strategy cover the important paths?
Implementation Details
The AI Review Agent
We use Claude through the API for code analysis. The key is context: a raw AI review without codebase context produces generic suggestions. Our agent includes:
- Repository conventions — Extracted from our engineering playbook and stored as system prompt context
- Related files — The AI sees not just the changed files but related imports and tests
- Git history — Recent changes to the same files help the AI understand ongoing refactors
```typescript
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

const reviewPR = async (prDiff: string, changedFiles: string[]) => {
  // Gather context: related files, conventions, recent history
  const relatedFiles = await getRelatedFiles(changedFiles)
  const conventions = await getConventions()

  const analysis = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 4096,
    system: `You are a senior code reviewer for a Next.js 16 + Supabase project.
Follow these conventions: ${conventions}
Focus on: pattern consistency, error handling, performance, security.
Do NOT comment on: formatting (handled by Prettier), types (handled by TypeScript).
Be specific. Reference existing code patterns when suggesting changes.`,
    messages: [
      {
        role: 'user',
        content: `Review this PR diff. Here are the related files for context:

${relatedFiles}

PR Diff:
${prDiff}`,
      },
    ],
  })

  return analysis
}
```
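The `getRelatedFiles` and `getConventions` helpers are our own glue code. As a hedged sketch of what the related-files half can look like, here is a one-level import follower; the helper name matches the call above, but the single-level depth and extension list are illustrative, and the real version also pulls in tests and git history:

```typescript
import { promises as fs } from 'node:fs'
import path from 'node:path'

// Matches relative imports like: import { x } from './util'
const IMPORT_RE = /from\s+['"](\.{1,2}\/[^'"]+)['"]/g

// Pull the relative import specifiers out of a source file.
export const extractRelativeImports = (source: string): string[] =>
  [...source.matchAll(IMPORT_RE)].map((m) => m[1])

// Follow each relative import one level deep and inline its source, so the
// model sees the modules the changed files actually depend on.
export const getRelatedFiles = async (
  changedFiles: string[],
): Promise<string> => {
  const chunks: string[] = []
  for (const file of changedFiles) {
    const source = await fs.readFile(file, 'utf8')
    for (const specifier of extractRelativeImports(source)) {
      const base = path.resolve(path.dirname(file), specifier)
      for (const ext of ['.ts', '.tsx']) {
        try {
          chunks.push(`// ${base + ext}\n${await fs.readFile(base + ext, 'utf8')}`)
          break
        } catch {
          // file didn't exist with this extension; try the next one
        }
      }
    }
  }
  return chunks.join('\n\n')
}
```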
Reducing False Positives
The biggest risk with AI review is noise. If the AI generates 10 comments and 8 are unhelpful, developers ignore all of them.
We track a "signal-to-noise ratio" for every AI comment category:
- Comments that lead to code changes: signal
- Comments dismissed by the reviewer: noise
Categories below 60% signal rate get suppressed. We started at 45% signal and are now at 78% after three months of tuning.
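The suppression rule above is mechanical once the applied/dismissed counts exist. A sketch of the category filter (the stats shape and names are illustrative; we derive the counts from reviewer actions on past AI comments):

```typescript
// Per-category review stats: comments that led to a change vs. dismissed.
interface CategoryStats {
  applied: number   // signal
  dismissed: number // noise
}

const SIGNAL_THRESHOLD = 0.6 // categories below 60% signal get suppressed

// Decide which AI comment categories stay enabled for the next review run.
export const activeCategories = (
  stats: Record<string, CategoryStats>,
): string[] =>
  Object.entries(stats)
    .filter(([, s]) => {
      const total = s.applied + s.dismissed
      return total > 0 && s.applied / total >= SIGNAL_THRESHOLD
    })
    .map(([category]) => category)
```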
Integration With GitHub
AI review comments are posted by a bot account on the PR and are visually distinct from human comments:
- AI comments have a "🤖 AI Review" prefix
- They're posted as "suggestions" that can be applied with one click
- Each comment includes a confidence score (high/medium/low)
- Low-confidence comments are collapsed by default
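The comment formatting above can be sketched as a small pure function. The 🤖 prefix, confidence score, and collapsed-by-default behavior match the conventions listed; the function name is ours, and posting the result uses GitHub's standard pull-request review-comments API via the bot account:

```typescript
type Confidence = 'high' | 'medium' | 'low'

const FENCE = '`'.repeat(3) // a fenced `suggestion` block, built without literal nesting

// Render one AI finding as a GitHub comment body.
export const formatAIComment = (
  finding: string,
  confidence: Confidence,
  suggestion?: string,
): string => {
  const header = `🤖 AI Review (confidence: ${confidence})`
  // GitHub renders a fenced `suggestion` block as a one-click-applyable change.
  const suggestionBlock = suggestion
    ? `\n\n${FENCE}suggestion\n${suggestion}\n${FENCE}`
    : ''
  if (confidence === 'low') {
    // <details> makes GitHub render the comment collapsed by default.
    return `<details><summary>${header}</summary>\n\n${finding}${suggestionBlock}\n</details>`
  }
  return `${header}\n\n${finding}${suggestionBlock}`
}
```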
The Workflow in Practice
- Developer opens PR
- CI runs lint, typecheck, tests, bundle analysis (3 minutes)
- AI review runs in parallel, posts comments (2 minutes)
- Developer addresses AI feedback (self-serve, no review queue)
- Human reviewer is assigned (usually 1-2 hours wait)
- Human reviewer sees clean PR with AI findings already addressed
- Human focuses on architecture and logic (15-20 minutes average)
Measurable Results
After six months:
| Metric | Before | After | Change |
| --------------------------------- | ------------ | ----------- | ------ |
| Avg PR review queue time | 14 hours | 3.2 hours | -77% |
| Human review time per PR | 45 minutes | 18 minutes | -60% |
| Style/formatting comments | 40% of total | 2% of total | -95% |
| Bugs caught in review | 12/month | 23/month | +92% |
| Senior engineer review hours/week | 6-8 hours | 2-3 hours | -62% |
The most surprising result: bugs caught in review almost doubled. The AI is relentless about edge cases, null checks, and error handling that human reviewers inconsistently catch.
What Didn't Work
Fully Automated Approval
We tried having the AI auto-approve PRs that passed all checks and had no AI findings. Bad idea. Two problems:
- The AI misses architectural issues that only manifest over months
- Developers lost the mentorship opportunity that comes from human review
AI-Generated Code Fixes
We experimented with the AI not just identifying issues but generating fix PRs. The fixes were usually correct but carried none of the developer's context about why the original choices were made. They created merge conflicts and confusion.
Reviewing Test Files
The AI generated too many false positives on test files — suggesting refactors to test utilities, questioning mock patterns, and flagging test-specific code smells that were intentional.
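We now exclude test files from AI review before the diff reaches the model. A sketch of the filter (the patterns shown are illustrative; ours live in the agent config alongside the convention prompts):

```typescript
// Paths the AI review skips entirely: test files, specs, and mock directories.
const SKIP_PATTERNS = [
  /\.test\.tsx?$/,
  /\.spec\.tsx?$/,
  /(^|\/)__tests__\//,
  /(^|\/)__mocks__\//,
]

export const shouldReview = (filePath: string): boolean =>
  !SKIP_PATTERNS.some((pattern) => pattern.test(filePath))
```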
Recommendations
If you're building a similar workflow:
- Start with automated checks — Eliminate the easy wins before adding AI
- Context is everything — Generic AI review is noise. Codebase-specific review is valuable
- Track signal-to-noise — Measure which AI comment categories actually lead to changes
- Keep humans in the loop — AI handles the what (style, patterns, edge cases). Humans handle the why (architecture, design decisions)
- Make AI feedback actionable — One-click apply suggestions, specific file/line references, confidence scores
- Iterate weekly — Review dismissed AI comments and tune the prompt/rules
The goal isn't to replace reviewers. It's to make review time count.
Austin Coders
We build SaaS & AI apps that actually scale. React, Next.js, and AI-powered solutions for startups and enterprises.