AI Code Review Needs Eval Provenance: How to Trust Agent-Run Benchmarks

Coding agents no longer stop at writing code. They run tests, generate benchmark tables, capture demo videos, compare models, and return with a neat summary that says the change is ready. That is useful progress, but it creates a new review problem. If a pull request includes AI-generated claims about correctness, performance, or quality, reviewers need to understand how those claims were produced. The diff is only half the work. The evaluation itself now needs provenance.
Key Takeaways
- Agent-generated benchmark claims are becoming first-class review artifacts.
- Without eval provenance, reviewers cannot tell reruns, cherry-picks, and demos apart.
- Provenance should capture the task, fixture, environment, retries, selection rule, and failures, not only the winning result.
- Demo videos and screenshots are useful, but they should complement logs and rerunnable evidence, not replace them.
- High-risk AI code review should separate generation from judging so the same model is not grading its own work in the dark.
TL;DR
Eval provenance is the review artifact that explains how an AI-generated claim was created. For any pull request that includes agent-run tests, benchmarks, or demo outputs, require a compact record of the objective, fixtures, environment, retries, human edits, and raw results. If reviewers cannot rerun or audit the claim, they should not trust it as merge evidence.
Why this topic is breaking out right now
Between January 8 and March 18, 2026, several of the engineering and AI feeds most relevant to software teams pointed at the same shift: agents are not only writing code. They are producing the evidence that humans use to decide whether that code is correct.
- On February 24, 2026, Cloudflare described how one engineer and an AI model rebuilt a Next.js-compatible framework in under a week, backed by 1,700+ Vitest tests, 380 Playwright tests, and AI-assisted review loops. That is impressive output, but it also means the review surface includes the testing and benchmark methodology, not only the implementation. How we rebuilt Next.js with AI in one week.
- On March 6, 2026, Simon Willison's agentic manual testing guide made the core operational point explicit: never assume LLM-generated code works until it has been executed.
- Also on March 6, 2026, Latent Space's coverage of Cursor cloud agents argued that testing, demo videos, and remote control matter because reviewing agent-written code is becoming the bottleneck. That turns generated evidence into part of the product, not an optional extra. Cursor's Third Era: Cloud Agents.
- On February 9, 2026, Interconnects framed model comparison as part of a post-benchmark era, where public scores are less important than how teams evaluate models in their own workflows. That same logic applies inside a PR. Reviewers need to know how the benchmark was run before they treat it as evidence. Opus 4.6, Codex 5.3, and the post-benchmark era.
- On January 8, 2026, ICML updated its peer-review rules to allow some LLM-assisted review while also tightening policies around abuse and disclosure. Academic peer review and code review are not the same, but the direction is similar: AI assistance is becoming normal, so auditability matters more, not less. What's New in ICML 2026 Peer Review.
Put together, these signals describe a real shift in the review stack. Agents are increasingly responsible for producing claims about what happened. Humans still own the merge decision, which means humans need a way to inspect the claim-generation process.
What eval provenance means in a pull request
Eval provenance is the compact artifact that explains how an AI-generated result was obtained. If session provenance tells you how an agent produced the diff, eval provenance tells you how the agent produced the evidence that the diff is safe.
This extends the ideas in our guides to session provenance, evidence-first AI code review, and post-benchmark evaluation. The difference is that the artifact is anchored to a specific claim inside a PR, such as “build time improved by 27%,” “all critical flows passed,” or “Model A beat Model B on this repository task set.”
| Claim type | What reviewers need | Common failure mode |
|---|---|---|
| Benchmark result | Baseline, fixture, trial count, selection rule, raw distribution | Best run presented as representative result |
| Agent-run test pass | Command, environment, seeded inputs, failures, rerun policy | Hidden retries or missing failed attempts |
| Video or screenshot demo | Source run, linked logs, exact branch, actions taken | Demo looks correct but underlying behavior is fragile or incomplete |
| Model comparison | Same prompt set, same tools, same judge, same cost and latency budget | Different harnesses make the comparison meaningless |
Why agent-run evals are easy to over-trust
The core risk is simple. AI systems are getting much better at producing plausible evidence. A clean markdown table, passing video, or chart screenshot can look persuasive long before it is trustworthy. Reviewers are already overloaded, so they naturally accept summarized evidence if it feels coherent. That is exactly why provenance needs to be structured and mandatory.
1. Best-of-N can hide behind a single “final result”
Many agent workflows implicitly sample, retry, or search. That is often the right design, but it changes what the result means. If an agent tested six variants and surfaced the best one, reviewers should know that. Without that context, the claim looks more stable than it really is. This is closely related to why model diversity and disagreement analysis matter in AI review.
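One way to make that visible is to report the selection rule and the full distribution together, not just the chosen number. A minimal sketch (the helper name and record shape are illustrative assumptions, not any particular tool's API):

```python
import statistics

def summarize_trials(trials: list[float], selection_rule: str = "median") -> dict:
    """Return a claim summary that keeps every attempt visible to reviewers."""
    successful = sorted(trials)
    if selection_rule == "median":
        selected = statistics.median(successful)
    else:  # "best": lowest time wins -- exactly the case reviewers need flagged
        selected = min(successful)
    return {
        "selection_rule": selection_rule,
        "selected_result": selected,
        "all_attempts": successful,  # the spread is part of the evidence
        "spread": round(max(successful) - min(successful), 2),
    }

summary = summarize_trials([41.2, 38.9, 44.0, 39.5, 40.1])
# The reported median (40.1) and the best run (38.9) are both visible,
# so a reviewer can tell which one the summary is actually quoting.
```

If the agent searched six variants, the envelope should say so, and the distribution makes the stability of the winning number obvious at a glance.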
2. Hidden retries change the interpretation of “pass”
A green result after four silent retries is not the same as a green result on the first run. For flaky tests or browser-based flows, retry policy is part of the evidence. If the retry count, timeout policy, or failure reason is missing, the summary is incomplete.
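Recording attempts is cheap. A hedged sketch of a retry wrapper that keeps the full history instead of discarding failed runs (function name and record shape are assumptions for illustration):

```python
import time

def run_with_recorded_retries(check, max_attempts: int = 4, delay_s: float = 0.0) -> dict:
    """Run a flaky check, recording every attempt rather than only the final outcome."""
    attempts = []
    for n in range(1, max_attempts + 1):
        try:
            check()
            attempts.append({"attempt": n, "outcome": "pass"})
            return {"result": "pass", "attempts": attempts}
        except Exception as exc:
            attempts.append({"attempt": n, "outcome": "fail", "reason": str(exc)})
            time.sleep(delay_s)
    return {"result": "fail", "attempts": attempts}

# Simulate a check that fails twice on timeouts, then passes:
outcomes = iter([Exception("timeout"), Exception("timeout"), None])
def flaky():
    exc = next(outcomes)
    if exc:
        raise exc

record = run_with_recorded_retries(flaky)
# record["result"] is "pass", but record["attempts"] shows two failures first,
# which is exactly the context a reviewer needs before trusting the green check.
```

The attempt list drops straight into the `retry_policy` and `artifacts` fields of a provenance record, so "pass after three tries" never collapses into a bare "pass".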
3. Environment drift makes small claims look bigger than they are
Benchmark claims are highly sensitive to runner type, cache state, seed handling, network conditions, and fixture size. The Cloudflare vinext post is careful to spell out parts of its methodology precisely because early benchmark numbers are easy to over-read. PR review should demand the same discipline for internal claims.
4. Demo videos are an entry point, not the ground truth
Latent Space highlighted why demo videos help: they make huge diffs easier to approach. That is true, and we expect video artifacts to become common. But a video is still a summary. It should point reviewers toward the run, not replace the run. For large AI-authored changes, combine video with the artifact discipline in our guide to AI rewrite review artifacts.
5. Self-grading is still a weak spot
If the same model generates a patch, writes the benchmark harness, scores the output, and summarizes the conclusion, you have a correlated failure path. That does not make the result useless, but it should raise the review threshold. High-risk claims deserve an independent verifier path, the same way production review deserves a verification layer.
The minimum eval provenance artifact
The artifact should be compact enough that teams will actually use it, but complete enough that a reviewer can rerun or challenge the claim. In practice, six fields do most of the work.
1. Objective and success rule
What exact question was the evaluation answering, and what counts as success?
2. Fixture and baseline
Which repo state, dataset, branch, user flow, or benchmark set was used for comparison?
3. Environment
Which runner, hardware, tool versions, caches, seeds, and external services were active?
4. Search and retry policy
How many attempts ran, what was retried, and which result was selected for reporting?
5. Human interventions
Any manual prompt changes, cherry-picks, deleted runs, or edited outputs should be disclosed.
6. Raw artifacts
Logs, traces, screenshots, videos, or result files that let a reviewer inspect the original evidence.
A simple JSON envelope is often enough. The important part is that it is standardized and linked from the PR.
```json
{
  "claim_id": "build-benchmark-2026-03-19",
  "claim_type": "performance",
  "objective": "Compare branch build time against main",
  "baseline_ref": "main@abc123",
  "candidate_ref": "pr@def456",
  "fixture": "33-route app-router benchmark",
  "environment": {
    "runner": "github-actions-ubuntu-24",
    "node": "22.7.0",
    "cache": "cold",
    "trials": 10
  },
  "selection_rule": "report median of all successful runs",
  "retry_policy": "retry only on infra timeout, keep all attempts",
  "human_interventions": [],
  "artifacts": [
    "artifacts/build-times.csv",
    "artifacts/runner-log.txt"
  ]
}
```

How to review an AI-generated benchmark claim
Once provenance exists, the review flow becomes much simpler. Reviewers stop debating screenshots and start checking whether the claim is decision-useful.
- Confirm that the objective matches the merge decision being requested.
- Check that baseline and candidate used the same harness and fixture.
- Inspect retries, failed runs, and selection policy before reading the summary.
- Verify that demos and screenshots link back to machine-readable artifacts.
- Rerun high-risk claims or route them through an independent verifier.
This is especially important for teams adopting agent-heavy workflows like the spec-to-PR model. The faster the generation loop gets, the more you need a predictable review contract for the evidence that accompanies it.
What should block merge
Blockers for high-risk claims
- No reproducible command, fixture, or baseline for the reported result
- Only the winning run is attached, with no failed attempts or distribution
- Video or screenshot evidence has no linked logs, traces, or terminal output
- The generating model also judged the result with no independent verification path
- Human prompt edits or cherry-picked comparisons are undisclosed
None of these blockers require perfect determinism. They require declared variance and honest methodology. Our guide to LLM nondeterminism explains why reproducibility is a spectrum. Provenance makes that spectrum visible.
How teams can adopt this without slowing down
Start narrow. You do not need provenance for every trivial copy change. You do need it wherever AI-generated evidence is being used to justify a merge.
Rollout plan
- Week 1: require eval provenance for benchmarks, model comparisons, and performance claims.
- Week 2: require provenance for agent-run browser tests and demo videos on medium-risk UI work.
- Week 3: add CI checks that reject missing artifacts and compare reruns to prior baselines.
- Week 4: measure how often AI-generated claims survive review unchanged versus being corrected, rerun, or discarded.
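The Week 3 rerun check can be as simple as comparing a rerun's median against the claimed result under the team's declared variance. A sketch, with the threshold and names as assumptions:

```python
import statistics

def rerun_agrees(claimed_median: float, rerun_trials: list[float],
                 declared_variance_pct: float = 10.0) -> bool:
    """Flag a claim when a fresh rerun drifts beyond the declared variance."""
    rerun_median = statistics.median(rerun_trials)
    drift_pct = abs(rerun_median - claimed_median) / claimed_median * 100
    return drift_pct <= declared_variance_pct

# Claimed 40.0s median build time, 10% declared variance:
assert rerun_agrees(40.0, [39.0, 40.5, 41.0]) is True   # ~1% drift: accept
assert rerun_agrees(40.0, [49.0, 50.5, 51.0]) is False  # ~26% drift: block
```

This is the "declared variance" idea made concrete: the claim does not need to reproduce exactly, it needs to reproduce within the envelope the author committed to.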
That final metric matters. If many claims fail review, the problem is not only the model; it is the provenance. Teams that track resolved outcomes instead of comment volume will recognize this pattern from resolution rate.
Where Propel fits
Propel is built for this operating model. The goal is not to attach more AI-generated markdown to every PR. The goal is to make AI evidence reviewable. That means standard artifacts, risk routing, independent verification on sensitive changes, and metrics that tell you whether the evidence changed the merge outcome for the better.
As more teams adopt coding agents, the durable advantage will not be who can generate the most benchmark claims. It will be who can trust those claims quickly. Eval provenance is one of the simplest ways to make that trust operational.
Frequently Asked Questions
Is eval provenance only for formal model benchmarks?
No. It also applies to build-time claims, browser demos, regression tests, migration checks, and any AI-generated artifact used as merge evidence.
Are demo videos enough for AI code review?
No. They are helpful as an entry point, especially for UI changes, but they should link to the run logs, branch, and validation steps behind the demo.
Do we need perfect reproducibility before we can use provenance?
No. Most teams only need declared variance, explicit retries, and enough context to rerun or challenge the result.
Should the same model generate and judge the result?
For low-risk tasks it can be acceptable, but for medium and high-risk claims an independent verifier path is safer and usually more informative.
Review the evidence behind every AI claim
Propel helps teams require provenance, compare reruns, and route risky AI-generated claims through a higher-signal review path.


