Best Practices
Artifact-First Coding Agents: Why Files Beat Chat Memory in Code Review
May 11, 2026

Coding agents are getting better at working for hours, touching more files, and returning with larger pull requests. That makes throughput look great. It also creates a new failure mode: the most important state is trapped inside a long chat transcript that nobody wants to read. Reviewers inherit a polished diff but not the agent's working memory, verification trail, or decision boundaries. The result is slower review, weaker trust, and higher merge risk.
Key Takeaways
- Chat history is useful for steering an agent, but weak as the long-term source of truth for review.
- Durable artifacts such as plans, checkpoints, HTML review pages, and provenance summaries scale better than giant transcripts.
- Recent agent workflows increasingly externalize state into files, visible outputs, and structured handoff artifacts.
- Reviewers need compact evidence they can diff, search, and verify, not a replay of every prompt turn.
- Artifact-first workflows reduce context rot and make async coding agents easier to supervise.
- Propel fits naturally into this stack by turning AI-generated changes into reviewable evidence and routing by risk.
TL;DR
If your coding agent only remembers work through an ever-growing chat, the review surface gets worse as the session gets longer. Put the important state in durable artifacts instead: a task brief, a scoped plan, checkpoint notes, test output, an HTML explainer when the diff is complex, and a compact provenance record in the PR. That gives reviewers something stable to inspect and gives teams a cleaner path to trustworthy AI code review.
Why this matters right now
The latest wave of agent news is not just about smarter models. It is about how teams are packaging long-running work so humans can still supervise it. Simon Willison highlighted Thariq Shihipar's case for asking Claude Code to produce rich HTML review artifacts instead of plain Markdown, because HTML can carry diagrams, navigation, inline annotations, and a clearer handoff for complex work.
Using Claude Code: The Unreasonable Effectiveness of HTML
At Anthropic's Code w/ Claude event on May 6, 2026, Simon also noted several workflow patterns that matter here: multiple async sessions, richer desktop outputs, code review built into the product, and the new research-preview "Dreaming" feature creating a persistent descent-playbook.md file from prior runs. OpenAI made a parallel point in its GPT-5.3-Codex launch: the best agents now give frequent progress updates and let humans steer without losing context while work is still in flight.
Live blog: Code w/ Claude 2026 and Introducing GPT-5.3-Codex
The common theme is simple: better agents need better artifacts. That is the part many teams still under-design.
Chat memory is not a strong review surface
A rolling conversation is convenient for the model, but awkward for humans. It mixes planning, exploration, dead ends, partial fixes, tool output, and outdated assumptions into one blob. As sessions get longer, the most important decisions become harder to locate, harder to compare, and easier to misread.
We see this repeatedly in coding workflows:
- The agent's best reasoning is buried between dozens of routine status turns.
- Early assumptions stay alive even after the repository reality changes.
- Verification steps are mentioned informally but not recorded as durable evidence.
- Reviewers receive summaries that sound plausible but are hard to audit.
This is closely related to the long-context problem. More history does not guarantee better understanding. In fact, longer and dirtier context often makes systems less reliable over time. If you want the broader framing, our guide on long context windows and context rot explains why larger working sets often degrade quality instead of improving it.
What artifact-first actually means
Artifact-first does not mean banning chat. It means treating chat as the coordination layer, not the durable system of record. The important state gets externalized into files and outputs that humans can inspect directly.
| Workflow need | Chat-heavy approach | Artifact-first approach |
|---|---|---|
| Task intent | Buried in a prompt thread | Short task brief with explicit scope and non-goals |
| Working memory | Growing conversational history | Plan file, checkpoint notes, issue links, and open questions |
| Review handoff | Paragraph summary in PR body | HTML explainer, test evidence, and scoped provenance record |
| Auditability | Search raw transcripts after the fact | Stable artifacts attached to the change request |
| Async collaboration | Ask someone to "read the thread" | Hand them the latest artifacts and checkpoints |
The minimum artifact stack for coding agents
You do not need a giant framework. Most teams can get real gains from five repeatable artifacts.
1. Task brief
Start with a compact brief that states the goal, constraints, and non-goals. This sounds obvious, but it is the easiest way to prevent the agent from expanding scope invisibly. Our agentic engineering guardrails post covers how small constraint errors become large review problems later.
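As a sketch of how small that brief can be (the file name, fields, and example values here are our own assumptions, not a required format), the agent or a thin wrapper script can render it from a few explicit fields so scope never lives only in a prompt:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class TaskBrief:
    goal: str
    constraints: list[str]
    non_goals: list[str]
    owner: str = "unassigned"
    risk_tier: str = "medium"  # e.g. low / medium / high

    def to_markdown(self) -> str:
        lines = [
            f"# Task brief: {self.goal}",
            f"- Owner: {self.owner}",
            f"- Risk tier: {self.risk_tier}",
            "",
            "## Constraints",
            *[f"- {c}" for c in self.constraints],
            "",
            "## Non-goals",
            *[f"- {n}" for n in self.non_goals],
        ]
        return "\n".join(lines) + "\n"

# Example values are invented for illustration.
brief = TaskBrief(
    goal="Migrate payment webhooks to the new retry queue",
    constraints=["No schema changes", "Keep p99 webhook latency under 200ms"],
    non_goals=["Refactoring unrelated webhook handlers"],
    owner="payments-team",
    risk_tier="high",
)
Path("TASK_BRIEF.md").write_text(brief.to_markdown())
```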
2. Plan or checkpoint file
Long-running work needs a place to store what the agent believes right now: the active plan, unresolved questions, failed attempts worth remembering, and which files or systems were touched. Anthropic's "Dreaming" example producing a .md playbook is directionally important here because it turns ephemeral context into something durable.
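A minimal sketch of that pattern, assuming a hypothetical CHECKPOINT.md at the repository root and helper names of our own choosing: the agent, or a wrapper around it, appends a timestamped entry after each milestone so the current plan and open questions survive the session.

```python
from datetime import datetime, timezone
from pathlib import Path

CHECKPOINT = Path("CHECKPOINT.md")  # hypothetical file name; any stable path in the repo works

def bullets(items: list[str]) -> list[str]:
    """Render a bullet list, keeping an explicit 'none' so gaps are visible."""
    return [f"- {item}" for item in items] if items else ["- none"]

def append_checkpoint(plan: str, open_questions: list[str],
                      touched: list[str], failed_attempts: list[str]) -> None:
    """Append a durable checkpoint entry instead of relying on chat history."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = [
        f"## Checkpoint {stamp}",
        f"**Current plan:** {plan}",
        "**Open questions:**",
        *bullets(open_questions),
        "**Files/systems touched:**",
        *bullets(touched),
        "**Failed attempts worth remembering:**",
        *bullets(failed_attempts),
        "",
    ]
    with CHECKPOINT.open("a") as fh:
        fh.write("\n".join(entry) + "\n")

# Example values are invented for illustration.
append_checkpoint(
    plan="Switch the webhook consumer to the retry queue, then backfill dead letters",
    open_questions=["Does staging share the same queue quota as production?"],
    touched=["services/webhooks/consumer.py", "infra/queues.tf"],
    failed_attempts=["Batch size 500 overloaded the staging broker"],
)
```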
3. Verification evidence
Test output, lint results, screenshots, traces, and failed checks should not live only inside a summary sentence. They should be first-class artifacts. Our evidence-first AI code review post makes the core argument: when code gets cheaper, proof gets more valuable.
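One way to make that concrete, sketched here with placeholder commands and paths rather than a prescribed layout: run each check, capture its output into an evidence directory, and write a small manifest the PR can link to.

```python
import json
import subprocess
from pathlib import Path

EVIDENCE = Path("review-evidence")  # assumed location; attach or commit it alongside the PR
EVIDENCE.mkdir(exist_ok=True)

# Placeholder commands; substitute whatever checks your project actually runs.
CHECKS = {
    "tests": ["pytest", "-q"],
    "lint": ["ruff", "check", "."],
}

manifest = {}
for name, cmd in CHECKS.items():
    result = subprocess.run(cmd, capture_output=True, text=True)
    log_path = EVIDENCE / f"{name}.log"
    log_path.write_text(result.stdout + result.stderr)
    manifest[name] = {
        "command": " ".join(cmd),
        "exit_code": result.returncode,
        "log": str(log_path),
    }

# The manifest is the compact artifact reviewers can diff and link from the PR body.
(EVIDENCE / "manifest.json").write_text(json.dumps(manifest, indent=2))
```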
4. Rich review artifact for complex diffs
Not every change needs this, but large or cross-cutting work often benefits from a richer output than Markdown. HTML review pages are compelling because they can combine structure, navigation, diagrams, annotated diffs, and embedded evidence in one place. That lines up with our earlier post on the artifacts reviewers need for AI rewrites.
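As a rough illustration of what "richer than Markdown" can mean, and building on the evidence manifest from the previous step, a short script can stitch a summary, risk notes, and evidence links into one navigable page; the section names and layout below are illustrative, not a prescribed format.

```python
import html
import json
from pathlib import Path

manifest = json.loads(Path("review-evidence/manifest.json").read_text())

# Illustrative sections; a real page might add diagrams or annotated diff excerpts.
sections = {
    "Summary": "Switch the webhook consumer to the retry queue; no schema changes.",
    "Riskiest decisions": "Retry backoff is now exponential; dead letters are replayed once.",
    "Evidence": "".join(
        f'<li>{html.escape(name)}: exit {info["exit_code"]} '
        f'(<a href="{html.escape(Path(info["log"]).name)}">log</a>)</li>'
        for name, info in manifest.items()
    ),
}

nav = " | ".join(
    f'<a href="#{key.lower().replace(" ", "-")}">{html.escape(key)}</a>' for key in sections
)
body = "".join(
    f'<section id="{key.lower().replace(" ", "-")}"><h2>{html.escape(key)}</h2>'
    + (f"<ul>{content}</ul>" if key == "Evidence" else f"<p>{html.escape(content)}</p>")
    + "</section>"
    for key, content in sections.items()
)
page = f"<!doctype html><html><body><nav>{nav}</nav>{body}</body></html>"

# The page lives next to the logs it links, so reviewers open one file and navigate from there.
Path("review-evidence/review.html").write_text(page)
```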
5. Compact session provenance
Reviewers do not need full chain-of-thought dumps, but they do need the minimum record of what happened: task intent, tools touched, guardrails, checkpoint outcomes, and human overrides. That is the exact gap addressed in our guide on session provenance for AI code review.
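A compact provenance record can be a handful of fields rendered into the PR description. The field names and values below are illustrative rather than a fixed schema:

```python
from pathlib import Path

# Illustrative fields only; trim or extend them to match your review policy.
provenance = {
    "Task intent": "Migrate payment webhooks to the new retry queue",
    "Agent session": "coding agent run on 2026-05-10, 3 checkpoints",
    "Tools touched": "git, pytest, terraform plan (read-only)",
    "Guardrails": "no schema changes; staging-only credentials",
    "Checkpoint outcomes": "backfill plan revised after a failed batch-size test",
    "Human overrides": "owner rejected the queue rename, kept the existing topic",
}

block = "## Session provenance\n" + "\n".join(
    f"- **{field}:** {value}" for field, value in provenance.items()
)
Path("PR_PROVENANCE.md").write_text(block + "\n")
print(block)  # paste or append into the PR description
```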
Why this pattern fits async and background agents
Async agents multiply the need for good artifacts because the supervising human was not present for every turn. Anthropic's keynote emphasized parallel sessions and waking up to PRs. Stripe's Minions story framed the same shift from interactive prompting to one-shot, end-to-end execution. Once agents work out of band, review artifacts become the interface between autonomy and trust.
Introducing Minions: Stripe's one-shot, end-to-end coding agents
If your team is moving more work into async lanes, pair this post with our guide to background agents in engineering and our piece on parallel coding agents and branch chaos. The workflow design problems start to look similar once many agents are operating at once.
Design the handoff for the reviewer, not the model
A common mistake is optimizing every artifact for what helps the agent most in the moment. Reviewers need something different. They need:
- A stable summary of intent and scope
- Evidence that important checks really ran
- A quick way to inspect the riskiest files or decisions
- A record of what the agent accessed or assumed
- A clean path to challenge or rerun questionable work
That is why HTML explainers, provenance snippets, and checkpoint files matter more than a huge transcript export. They reduce the cost of supervision without pretending supervision is optional.
A practical workflow you can adopt this month
- Create a task brief template with goal, constraints, owner, and risk tier.
- Require the agent to update one checkpoint artifact during long runs.
- Collect tests, logs, and screenshots into a small evidence bundle.
- Ask for an HTML review page when the diff spans multiple systems or exceeds your normal review size threshold.
- Attach compact provenance fields to medium- and high-risk PRs (a CI enforcement sketch follows this list).
- Route the final change through independent AI review before merge.
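To make this enforceable rather than aspirational, a small CI step can block agent-authored PRs that arrive without the expected artifacts. The label, environment variables, and file paths below are assumptions about your setup, not a fixed convention:

```python
import os
import sys
from pathlib import Path

# Assumed CI inputs: the PR body and labels exposed as environment variables.
pr_body = os.environ.get("PR_BODY", "")
labels = [label.strip() for label in os.environ.get("PR_LABELS", "").split(",")]

if "ai-authored" not in labels:  # hypothetical label for agent-generated PRs
    sys.exit(0)                  # human-authored PRs keep the normal review path

problems = []
if "## Session provenance" not in pr_body:
    problems.append("missing provenance block in the PR body")
if not Path("review-evidence/manifest.json").exists():
    problems.append("missing review-evidence/manifest.json")
if not Path("TASK_BRIEF.md").exists():
    problems.append("missing TASK_BRIEF.md")

if problems:
    print("Artifact check failed:\n- " + "\n- ".join(problems))
    sys.exit(1)
print("Artifact check passed.")
```

Run it as a required check so missing artifacts surface before a human spends time in the diff.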
This workflow works especially well when paired with AI review process improvement and a verification layer that tracks real resolution outcomes.
Where Propel fits
Propel is most valuable when your team already accepts that AI-generated code needs an independent review surface. Artifact-first workflows make that surface stronger. Instead of asking reviewers to trust a generator's chat history, you can feed Propel the diff, tests, evidence, and policy context and get a higher-signal review loop around the actual change.
If you are operationalizing this at team level, start with an evidence-first policy and connect it to your review queue. Our AI code review and development playbook is the broader operating model, and the Propel pricing page covers deployment options.
FAQ
Do we need to save full agent transcripts?
Usually no. Keep raw transcripts for incident response or compliance-heavy work. For normal engineering review, compact provenance plus durable artifacts is a better default.
Are HTML artifacts overkill for every pull request?
For most pull requests, yes. Reserve them for large diffs, cross-system changes, or complex debugging work where the reviewer benefits from navigation and annotated evidence.
Is this just another name for documentation?
Not exactly. Documentation is usually written for future readers in general. Artifact-first agent workflows are specifically about preserving the critical state that makes an AI-authored change reviewable right now.
What metric should we watch first?
Track review turnaround and rework rate on AI-authored PRs that include artifacts versus those that do not. If artifacts are helping, reviewers should reach the risky parts of the change faster and ask fewer context-rebuilding questions.


