Problem: The cognitive weight of waiting
You finish a feature, open a pull request (PR), and wait. The code is done, but the work isn't. It sits in your head, a small, persistent weight. You start the next task, but part of your attention is still on that review. Will you need to context-switch back when feedback arrives? Did you handle the edge cases? Are the tests convincing enough? You can't move on.
Meanwhile, your reviewers face the same problem from the other side. They spend significant time on routine PRs: documentation fixes, simple refactors, minor updates. These reviews follow predictable patterns yet still require attention, and the time they consume means complex architectural changes get less scrutiny than they need.
We wanted to use AI to handle the routine cases. Engineers could achieve cognitive closure faster. Reviewers could focus on what actually requires human judgment.
The challenge was encoding the contextual judgment that makes an experienced engineer approve one PR immediately and flag another for closer inspection. How do you systematize that pattern recognition?
Most AI code review tools stop at comments. They suggest improvements, flag issues, and point out problems. We built a system that approves PRs for merging.

Architecture: Three-layer decision system
Our system has three components:
Context Building: Builds the complete picture of each change.
Heuristic Evaluation: Applies decision rules: complexity assessment, risk analysis, quality evaluation, approval thresholds.
Review Decision: Approves changes within AI thresholds. Provides actionable feedback when needed and escalates beyond thresholds to human review.
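A rough sketch of how these three layers can fit together in code. The types and names here are illustrative placeholders, not the production implementation:

```go
package review

// Verdict is the outcome of the heuristic layer; names are hypothetical.
type Verdict int

const (
	Approve  Verdict = iota // within AI thresholds: approve for merge
	Comment                 // actionable feedback without approval
	Escalate                // beyond thresholds: route to human review
)

// ChangeContext is the "complete picture" the first layer assembles.
type ChangeContext struct {
	LinesChanged      int
	FilesModified     int
	BoundariesCrossed int
	RiskDomains       []string // e.g. "auth", "docs", "infra"
	QualitySignals    []string // e.g. "tests", "feature-flag", "rollback-plan"
}

// ContextBuilder is layer 1: build the complete picture of the change.
type ContextBuilder interface {
	Build(prID string) (ChangeContext, error)
}

// HeuristicEngine is layer 2: complexity, risk, quality, approval thresholds.
type HeuristicEngine interface {
	Evaluate(c ChangeContext) Verdict
}

// ReviewDecider is layer 3: approve, leave feedback, or escalate to a human.
type ReviewDecider interface {
	Act(prID string, v Verdict) error
}
```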

Review automation flow
Heuristics: How we encode engineering judgment
1. Progressive complexity model
The core insight is that complexity and quality are independent dimensions.
A 500-line refactor across 15 files has the same architectural scope whether or not it has tests. Tests increase confidence that the complex change is safe, but they don't reduce complexity.
The framework evaluates changes through three sequential phases. These phases build on each other:

Progressive complexity model: Three phases
Phase 1: Complexity assessment
Measure the architectural scope objectively. Lines changed, files modified, component boundaries crossed. This gives us a complexity classification.
Phase 2: Risk amplification
Amplify complexity based on what domains the change touches. Security changes (auth, encryption, permissions) carry higher risk than documentation updates. When a change touches multiple risk areas, we use the highest-risk domain.
Phase 3: Confidence calibration
Only after establishing complexity and risk do we evaluate quality signals. Test coverage, safety mechanisms (such as feature flags and rollback plans), observability, and documentation all increase confidence without altering the underlying complexity.
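A condensed sketch of how the three phases could compose in code. The thresholds, domain lists, and confidence weights below are placeholders for illustration, not our actual rules:

```go
package review

// Complexity is the output of Phase 1; values only ever move upward.
type Complexity int

const (
	Routine Complexity = iota
	Structural
	Critical
)

// Phase 1: objective architectural scope (lines, files, component boundaries).
func classifyComplexity(lines, files, boundaries int) Complexity {
	switch {
	case boundaries > 1 || files > 10 || lines > 400:
		return Critical
	case boundaries == 1 || files > 3 || lines > 100:
		return Structural
	default:
		return Routine
	}
}

// Phase 2: amplify by the single highest-risk domain the change touches.
func amplifyRisk(c Complexity, domains []string) Complexity {
	highRisk := false
	for _, d := range domains {
		if d == "auth" || d == "encryption" || d == "permissions" {
			highRisk = true
		}
	}
	if highRisk && c < Critical {
		c++ // one step up, driven by the riskiest domain only
	}
	return c
}

// Phase 3: quality signals raise confidence; they never lower complexity.
func confidence(signals []string) float64 {
	score := 0.3
	for _, s := range signals {
		switch s {
		case "tests", "feature-flag", "rollback-plan", "observability", "docs":
			score += 0.15
		}
	}
	if score > 1.0 {
		score = 1.0
	}
	return score
}
```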
The categories emerge naturally:
- Routine changes: Documentation, config files, minor updates. Low complexity, basic evidence is enough.
- Structural changes: Multi-component refactors, cross-boundary modifications. High complexity, needs strong evidence.
- Critical changes: Architectural changes, infrastructure modifications, breaking changes. Beyond the AI boundary, human judgment essential.
API endpoint changes illustrate this progression:

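To make that concrete with a hypothetical example: a comment-only tweak to an endpoint handler is routine and needs little evidence; adding an optional response field that touches the handler, serializer, and tests is structural and needs strong evidence such as test coverage and a rollback path; changing the endpoint's contract or its authentication behavior is critical and goes to a human.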
Complexity assessment classifies the change. But determining whether code is actually correct requires deeper analysis.
2. Semantic analysis: Catching what type checkers miss
Static analysis handles pattern matching well. Type checking, style enforcement, and basic code quality. But it misses semantic bugs that require understanding what the code actually does.
The kind of bug the system catches:
A developer builds a background job to process user activity data. The job needs to identify inactive users for cleanup and archival.
They create a workflow request using an existing domain object:
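A minimal sketch of what that code might look like. ArchiveRequest and buildArchiveRequest are hypothetical names; UserWithActivity is the existing domain object described below:

```go
// Illustrative sketch, not the actual source.
type ArchiveRequest struct {
	Users []*UserWithActivity `json:"users"`
}

func buildArchiveRequest(users []*UserWithActivity) ArchiveRequest {
	// Only the ID, activity count, and last-activity timestamp are needed,
	// and all three already live on UserWithActivity, so the whole object
	// is passed along as-is.
	return ArchiveRequest{Users: users}
}
```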
The developer needs user ID, activity count, and last activity timestamp. The UserWithActivity object has all three. Type-safe, clean code. Tests pass.
Here's what makes this interesting: The developer doesn't write any problematic code in isolation.
The issue is in how this object gets used. The code sends it to an external service. That service stores all incoming requests. Here's the new code:
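A hedged sketch of that call, where CleanupJob, archivalClient, and Submit are hypothetical names standing in for the real service client:

```go
// Illustrative sketch: the external service persists every request body it receives.
func (j *CleanupJob) archiveInactiveUsers(ctx context.Context, users []*UserWithActivity) error {
	req := buildArchiveRequest(users)
	return j.archivalClient.Submit(ctx, req) // serialized and stored outside our system
}
```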
What the developer doesn't realize: UserWithActivity embeds the complete User struct using struct composition:
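A sketch of the composition being described; the JSON tags and exact field list are illustrative:

```go
type User struct {
	ID        string `json:"id"`
	Email     string `json:"email"`
	FirstName string `json:"first_name"`
	LastName  string `json:"last_name"`
	// ...other sensitive profile fields
}

type UserWithActivity struct {
	*User                    // embedded: every User field is promoted and serialized with it
	ActivityCount  int       `json:"activity_count"`
	LastActivityAt time.Time `json:"last_activity_at"`
}
```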
That embedded struct contains Email, FirstName, LastName, and other sensitive fields. The service call serializes this entire composed object and stores it. This exposes unintended PII (personally identifiable information).
Our semantic analysis catches this by tracing the full data flow:
- Tracing struct composition from UserWithActivity to the embedded *User pointer
- Analyzing the User struct to identify all fields including PII (Email, FirstName, LastName)
- Understanding struct embedding semantics: the entire embedded object gets serialized
- Recognizing the external service call: request data gets stored outside our system
- Connecting the dots: embedded PII + serialization + external storage = compliance violation
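The embedding behavior itself is standard Go. A small, self-contained check (repeating the illustrative struct definitions from above so it runs on its own) shows why the full embedded object ends up in the serialized payload:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type User struct {
	ID        string `json:"id"`
	Email     string `json:"email"`
	FirstName string `json:"first_name"`
	LastName  string `json:"last_name"`
}

type UserWithActivity struct {
	*User
	ActivityCount  int       `json:"activity_count"`
	LastActivityAt time.Time `json:"last_activity_at"`
}

func main() {
	u := &UserWithActivity{
		User:           &User{ID: "u_123", Email: "jane@example.com", FirstName: "Jane", LastName: "Doe"},
		ActivityCount:  0,
		LastActivityAt: time.Now().AddDate(0, -6, 0),
	}
	b, _ := json.Marshal(u)
	// Output includes email, first_name, and last_name, even though the caller
	// only cared about the ID, activity count, and last activity timestamp.
	fmt.Println(string(b))
}
```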
When a system approves PRs instead of just commenting, it can't afford to miss bugs like this. The cost of a mistake isn't an annoying false comment. It's a production incident.
The system flagged this as a critical compliance violation and escalated for human review. Catches like this prevent data exposure before it reaches production. Here's what this looks like in practice:

User experience: Show only what matters
Output fatigue destroys engagement. We show only what's critical. Everything else is hidden.
We designed against a specific failure mode: verbose feedback that overwhelms. When reviews generate extensive output, developers stop reading carefully. They scan. They tune out. Worse, volume creates a false sense of thoroughness while reducing actual engagement. Our approach is ruthless selectivity. Only what's critical appears uncollapsed. This ensures developers actually engage.
Results: Speed and cognitive relief
The speed impact
For PRs that got AI approval, merge time dropped from 3.5 hours to 33 minutes. That's 6.4x faster.
The AI is approving over 60% of all PRs. For those PRs, engineers get feedback in minutes, not hours. They can close the mental loop and move on.
When the AI doesn't approve, that doesn't mean it found problems. These PRs simply need more evidence at their complexity level. The system recognizes changes beyond its approval thresholds and escalates them for human review.
| Scenario | Median time | Impact |
|---|---|---|
| Before AI review system | 3.5 hours | Baseline |
| AI-approved PRs | 33 minutes | 6.4x faster than baseline |
AI's cognitive impact
The speed numbers tell half the story. With AI approval, the mental loop closes immediately—no waiting, no context switching, no uncertainty. Engineers can fully engage with what's next.
The consistency matters. When the system approves something, engineers trust it. When it escalates for human review, they know there's a real reason.
Safety validation
The Progressive Complexity Model behaves exactly as designed. High-risk changes get scrutiny. Low-risk changes move fast.
High-risk PRs that the system flagged for human review took about a day to approve. That's longer than the baseline. This is correct behavior. These are architectural changes, infrastructure modifications, and breaking changes. They need time and human judgment.

500+ PRs across 3 months
Lessons learned: Building systems engineers trust
Approval requires different engineering than comments. When approvals are binding, mistakes become production incidents. We choose safety over approval rate.
Consistency beats perfection. The same rules apply to every PR, without variation due to fatigue or time pressure. Engineers trust the system because it's predictable, not because it's flawless. When it approves, they know why. When it escalates, they know there's a real reason.
Know your limits and show your work. Some changes are beyond automation: architectural decisions, complex trade-offs, and organizational implications. The system recognizes these boundaries and escalates. Every decision is traceable, making it easy to identify which heuristic needs adjustment when something seems wrong.
What's next?
We're evolving the architecture to bring this capability to local development. Using Anthropic's Claude Skills and Plugins, we're building review heuristics as reusable components that integrate directly with our development workflow.
The goal isn't perfect automation. It's intelligent triage that meets engineers where they work.
Want to innovate with us?
This project represents how we think about engineering at Customer.io: solving real problems with thoughtful systems that respect both the technology and the people using it. We're building the infrastructure that powers billions of customer messages, and we need engineers who care about making complex systems work reliably at scale. Check out our open roles and see where you might fit.
