AI Code Review Benchmarks (2026)
Externally authored benchmark results for seven AI code review tools, evaluated on pull requests from production open source repositories.
Benchmark Bias
Can you trust a benchmark written by the winning tool?
Benchmark bias is common. Many benchmarks are authored by the same vendors they are meant to evaluate, which introduces bias in PR selection, labeling, and scoring.
To reduce this bias, we evaluated Propel and six other AI code review tools using an externally authored benchmark suite of pull requests drawn from production open source repositories.
The benchmark scores tools on precision, recall, and F-score.

Results Overview
Propel led with the highest F-score.
Across all evaluated repositories, Propel led with an F-score of 64%, followed by Cursor Bugbot at 49% and Greptile at 45%.
Codex Code Review matched Propel on precision at 68%, but had the lowest recall among the tools at 29%, indicating a bias toward precision at the cost of coverage.
| Tool | Precision | Recall | F-score |
|---|---|---|---|
| Propel | 68% | 61% | 64% |
| Cursor Bugbot | 60% | 41% | 49% |
| Greptile | 45% | 45% | 45% |
| Codex Code Review | 68% | 29% | 41% |
| CodeRabbit | 36% | 43% | 39% |
| Claude Code | 23% | 51% | 31% |
| GitHub Copilot | 20% | 34% | 25% |
Bugs by Severity
Higher-impact bugs are where recall matters most.
Benchmarks often treat all misses equally, but engineering teams do not. Missing a low-severity style issue is very different from missing a critical correctness or security bug.
Propel is strongest at catching higher-impact bugs, with 77.8% recall for critical bugs and 70.7% recall for high-severity bugs.
| Bug Severity | Recall |
|---|---|
| Critical | 77.8% |
| High | 70.7% |
| Medium | 59.6% |
| Low | 50.0% |
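Per-severity recall is the same recall calculation restricted to one severity bucket at a time: within each bucket, divide the issues a tool caught by the total issues labeled at that severity. Below is a minimal sketch of that calculation using made-up findings, since the benchmark's label format is not published here.

```python
from collections import Counter

# Hypothetical labeled findings as (severity, caught) pairs. These are
# illustrative values only, not data from the benchmark.
findings = [
    ("critical", True), ("critical", True), ("critical", False),
    ("high", True), ("high", False),
    ("medium", True), ("medium", False), ("medium", False),
    ("low", True), ("low", False),
]

# Count caught issues and total issues per severity bucket.
caught = Counter(sev for sev, hit in findings if hit)
total = Counter(sev for sev, _ in findings)

for severity in ("critical", "high", "medium", "low"):
    recall = caught[severity] / total[severity]
    print(f"{severity:>8}: recall = {recall:.1%}")
```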
Methodology
Externally authored data, out-of-the-box configuration.
Propel was evaluated on a benchmark suite authored by an external company. Propel did not influence repository selection, pull request selection, labeling, or the externally produced results for the other tools.
Propel was evaluated with no repository-specific tuning, custom rules, or historical learning. This reflects how teams experience Propel immediately after installation.
Independent benchmark data
Repository selection, pull request selection, labels, and baseline tool results came from the external benchmark.
Base configuration
Propel was tested without repository-specific tuning, custom rules, or historical learning.
Scoring
Precision, recall, and F-score balance correctness and coverage.
Precision measures how often reported findings are correct. Recall measures how many real issues are caught. The F-score is the harmonic mean of the two, combining correctness and coverage in a single number.
This formulation penalizes tools that optimize for precision at the cost of missing issues, as well as tools that maximize recall by generating excessive noise.
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F-score = 2 * (precision * recall) / (precision + recall)
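For readers who prefer code, here is a minimal sketch of how these three metrics combine. The true positive, false positive, and false negative counts below are placeholders chosen only to illustrate how a row like Propel's works out; the benchmark's underlying counts are not published here.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported findings that point at real issues."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of real issues that the tool caught."""
    return tp / (tp + fn)


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Placeholder counts, not published benchmark data.
p = precision(tp=68, fp=32)   # 0.68
r = recall(tp=68, fn=43)      # ~0.61
print(f"precision={p:.0%} recall={r:.0%} f_score={f_score(p, r):.0%}")
# precision=68% recall=61% f_score=64%
```

Because the F-score is a harmonic mean, it is dragged down by whichever of precision or recall is weaker, which is why a high-precision, low-recall tool can still land near the bottom of the table.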

Conclusion
Strong out-of-the-box results without giving up customization.
Propel delivers strong benchmark performance immediately, while also learning from your codebase, review patterns, and standards over time.