<h1>Scaling AI-Powered Code Review: A Multi-Agent Architecture</h1>

<h2 id="intro">Introduction</h2> <p>Code review is a cornerstone of software quality, but it often becomes a bottleneck in engineering workflows. A typical merge request enters a queue, waits for a reviewer to context-switch, then cycles through nitpicks and corrections. At Cloudflare, the median time for the first review across internal projects was measured in hours. To address this, we explored AI-powered code review—not just as a helper, but as a core part of our CI/CD pipeline.</p><figure style="margin:20px 0"><img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3g2Vqql5biqvjvXwxhDb3b/b0c7fd707437eff2a7acb9d3172368e4/BLOG-3284_OG.png" alt="Scaling AI-Powered Code Review: A Multi-Agent Architecture" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: blog.cloudflare.com</figcaption></figure> <p>Our journey began with off-the-shelf AI code review tools. Many worked well and offered customization, but none provided the flexibility needed for an organization our size. The next step was a naive approach: feeding a git diff into a large language model (LLM) with a simple prompt. The results were noisy—vague suggestions, hallucinated syntax errors, and advice to add error handling to functions that already had it. It became clear that a generic summarization would not scale for complex codebases.</p> <h2 id="architecture">Building a Multi-Agent Orchestration System</h2> <p>Instead of building a monolithic agent, we created a CI-native orchestration system around <a href="https://github.com/opencode-ai/opencode" target="_blank"><strong>OpenCode</strong></a>, an open-source coding agent. When a Cloudflare engineer opens a merge request, the system launches up to seven specialized AI reviewers. Each focuses on a distinct domain: <strong>security</strong>, <strong>performance</strong>, <strong>code quality</strong>, <strong>documentation</strong>, <strong>release management</strong>, and compliance with our internal Engineering Codex. A <a href="#coordinator">coordinator agent</a> manages these specialists—deduplicating findings, assessing severity, and posting a single structured review comment.</p> <h3 id="coordinator">The Coordinator Agent</h3> <p>The coordinator is the brain of the operation. It aggregates outputs from all specialized agents, removes duplicates, and judges whether each issue is a genuine bug or a minor suggestion. Only critical and high-severity findings are flagged as blockers, while low-priority items are surfaced as recommendations. This prevents review fatigue and keeps the focus on real problems.</p> <h2 id="architecture-deep-dive">Architecture Deep Dive: Plugins and Flexibility</h2> <p>To support thousands of repositories with diverse tech stacks, our system had to be extensible. We built a plugin architecture that abstracts away version control systems, CI providers, and coding standards. Each specialized reviewer is a plugin, configurable per repository or team. 
The <a href="#plugins">plugin system</a> allows teams to add custom reviewers, adjust severity thresholds, or even disable certain checks for specific branches.</p><figure style="margin:20px 0"><img src="https://blog.cloudflare.com/cdn-cgi/image/format=auto,dpr=3,width=64,height=64,gravity=face,fit=crop,zoom=0.5/https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4veI2sDj3FhForbfne4tQB/a9e868ac9a0727780f404a6c9a37a9dc/IMG_0052_-_Cropped.jpg" alt="Scaling AI-Powered Code Review: A Multi-Agent Architecture" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: blog.cloudflare.com</figcaption></figure> <h3 id="plugins">Plugin System</h3> <p>Plugins are lightweight containers that implement a standard interface: they receive a diff, context (e.g., repository metadata), and return a list of findings with severity, category, and suggested fix. The coordinator merges these lists. This design made it easy to experiment—our team added a performance reviewer using a fine-tuned model, and later swapped it for a smaller, faster model without changing the pipeline.</p> <h2 id="results">Results: Thousands of Merge Requests, Minimal Noise</h2> <p>We have been running this system across tens of thousands of merge requests internally. It approves clean code, flags real bugs with impressive accuracy, and <strong>actively blocks merges</strong> when it detects genuine security vulnerabilities or serious errors. The false positive rate is well below 5%, thanks to the specialization and deduplication. Engineers report that the AI review is now a trusted second pair of eyes, not an annoyance.</p> <p>This initiative is part of our broader <a href="/code-orange"><strong>Code Orange: Fail Small</strong></a> program, which aims to improve engineering resiliency by catching issues early and reducing manual toil. By orchestrating multiple AI agents, we turned a bottleneck into a force multiplier.</p> <h2 id="conclusion">Conclusion</h2> <p>Building a single, monolithic AI code reviewer is tempting but often fails in practice. Our approach—using a <a href="#coordinator">coordinator</a> to manage <a href="#plugins">specialized plugins</a>—has scaled to thousands of repositories and tens of thousands of reviews. The key lessons: embrace specialization, invest in deduplication, and treat the AI as an integral part of your CI/CD pipeline, not an afterthought. The result is faster reviews, higher code quality, and happier engineers.</p>
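<p>A rough sketch of what that contract can look like, reusing the illustrative <code>Finding</code> type from the coordinator sketch above; the interface and field names here are our invention, not the real internal API:</p>

<pre><code class="language-typescript">// Illustrative plugin contract (names are assumptions, not the real
// internal interface). Each reviewer runs as a lightweight container
// that receives a diff plus context and returns findings.
interface ReviewContext {
  repo: string;                     // repository metadata
  baseBranch: string;
  config: Record<string, unknown>;  // per-repo / per-team overrides
}

interface ReviewerPlugin {
  name: string;
  review(diff: string, ctx: ReviewContext): Promise<Finding[]>;
}

// Toy specialist: flags eval() in added lines. A real reviewer would
// delegate to an LLM or a static analyzer rather than a regex check.
const evalReviewer: ReviewerPlugin = {
  name: "security",
  async review(diff) {
    const findings: Finding[] = [];
    diff.split("\n").forEach((line, i) => {
      if (line.startsWith("+") && line.includes("eval(")) {
        findings.push({
          agent: "security",
          file: "unknown",  // a real plugin resolves file/line from diff hunks
          line: i,
          category: "security",
          severity: "high",
          message: "Avoid eval(); it executes arbitrary strings as code.",
        });
      }
    });
    return findings;
  },
};
</code></pre>

<p>Because every reviewer satisfies the same contract, swapping the model behind a plugin, or disabling one for a branch, is a configuration change rather than a pipeline change.</p>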
<h2 id="results">Results: Thousands of Merge Requests, Minimal Noise</h2>

<p>We have been running this system across tens of thousands of merge requests internally. It approves clean code, flags real bugs with impressive accuracy, and <strong>actively blocks merges</strong> when it detects genuine security vulnerabilities or serious errors. Thanks to specialization and deduplication, the false positive rate is well below 5%. Engineers report that the AI review is now a trusted second pair of eyes, not an annoyance.</p>

<p>This initiative is part of our broader <a href="/code-orange"><strong>Code Orange: Fail Small</strong></a> program, which aims to improve engineering resiliency by catching issues early and reducing manual toil. By orchestrating multiple AI agents, we turned a bottleneck into a force multiplier.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Building a single, monolithic AI code reviewer is tempting, but it often fails in practice. Our approach—using a <a href="#coordinator">coordinator</a> to manage <a href="#plugins">specialized plugins</a>—has scaled to thousands of repositories and tens of thousands of reviews. The key lessons: embrace specialization, invest in deduplication, and treat the AI as an integral part of your CI/CD pipeline, not an afterthought. The result is faster reviews, higher code quality, and happier engineers.</p>