Best Practices

Artifact-First Coding Agents: Why Files Beat Chat Memory in Code Review

May 11, 2026


Coding agents are getting better at working for hours, touching more files, and returning with larger pull requests. That makes throughput look great. It also creates a new failure mode: the most important state is trapped inside a long chat transcript that nobody wants to read. Reviewers inherit a polished diff but not the agent's working memory, verification trail, or decision boundaries. The result is slower review, weaker trust, and higher merge risk.

Key Takeaways

  • Chat history is useful for steering an agent, but weak as the long-term source of truth for review.
  • Durable artifacts such as plans, checkpoints, HTML review pages, and provenance summaries scale better than giant transcripts.
  • Recent agent workflows increasingly externalize state into files, visible outputs, and structured handoff artifacts.
  • Reviewers need compact evidence they can diff, search, and verify, not a replay of every prompt turn.
  • Artifact-first workflows reduce context rot and make async coding agents easier to supervise.
  • Propel fits naturally into this stack by turning AI-generated changes into reviewable evidence and routing by risk.

TL;DR

If your coding agent only remembers work through an ever-growing chat, the review surface gets worse as the session gets longer. Put the important state in durable artifacts instead: a task brief, a scoped plan, checkpoint notes, test output, an HTML explainer when the diff is complex, and a compact provenance record in the PR. That gives reviewers something stable to inspect and gives teams a cleaner path to trustworthy AI code review.

Why this matters right now

The latest wave of agent news is not just about smarter models. It is about how teams are packaging long-running work so humans can still supervise it. Simon Willison highlighted Thariq Shihipar's case for asking Claude Code to produce rich HTML review artifacts instead of plain Markdown, because HTML can carry diagrams, navigation, inline annotations, and a clearer handoff for complex work.

Using Claude Code: The Unreasonable Effectiveness of HTML

At Anthropic's Code w/ Claude event on May 6, 2026, Simon also noted several workflow patterns that matter here: multiple async sessions, richer desktop outputs, code review built into the product, and the new research-preview "Dreaming" feature creating a persistent descent-playbook.md file from prior runs. OpenAI made a parallel point in its GPT-5.3-Codex launch: the best agents now give frequent progress updates and let humans steer without losing context while work is still in flight.

Live blog: Code w/ Claude 2026 and Introducing GPT-5.3-Codex

The common theme is simple: better agents need better artifacts. That is the part many teams still under-design.

Chat memory is not a strong review surface

A rolling conversation is convenient for the model, but awkward for humans. It mixes planning, exploration, dead ends, partial fixes, tool output, and outdated assumptions into one blob. As sessions get longer, the most important decisions become harder to locate, harder to compare, and easier to misread.

We see this repeatedly in coding workflows:

  • The agent's best reasoning is buried between dozens of routine status turns.
  • Early assumptions stay alive even after the repository reality changes.
  • Verification steps are mentioned informally but not recorded as durable evidence.
  • Reviewers receive summaries that sound plausible but are hard to audit.

This is closely related to the long-context problem. More history does not guarantee better understanding. In fact, longer and dirtier context often makes systems less reliable over time. If you want the broader framing, our guide on long context windows and context rot explains why larger working sets often degrade quality instead of improving it.

What artifact-first actually means

Artifact-first does not mean banning chat. It means treating chat as the coordination layer, not the durable system of record. The important state gets externalized into files and outputs that humans can inspect directly.

| Workflow need | Chat-heavy approach | Artifact-first approach |
| --- | --- | --- |
| Task intent | Buried in a prompt thread | Short task brief with explicit scope and non-goals |
| Working memory | Growing conversational history | Plan file, checkpoint notes, issue links, and open questions |
| Review handoff | Paragraph summary in PR body | HTML explainer, test evidence, and scoped provenance record |
| Auditability | Search raw transcripts after the fact | Stable artifacts attached to the change request |
| Async collaboration | Ask someone to "read the thread" | Hand them the latest artifacts and checkpoints |

The minimum artifact stack for coding agents

You do not need a giant framework. Most teams can get real gains from five repeatable artifacts.

1. Task brief

Start with a compact brief that states the goal, constraints, and non-goals. This sounds obvious, but it is the easiest way to prevent the agent from expanding scope invisibly. Our agentic engineering guardrails post covers how small constraint errors become large review problems later.
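As a sketch, a task brief can be a dozen lines of structured text the agent reads before touching any code. The field names and example values below are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class TaskBrief:
    """Minimal task brief an agent reads before starting.
    Field names are illustrative, not a fixed standard."""
    goal: str
    constraints: list[str]
    non_goals: list[str]
    owner: str
    risk_tier: str = "low"  # e.g. low | medium | high

    def to_markdown(self) -> str:
        # Render the brief as a small file the agent (and reviewer) can diff.
        lines = [f"# Task brief: {self.goal}",
                 f"Owner: {self.owner}",
                 f"Risk tier: {self.risk_tier}",
                 "", "## Constraints"]
        lines += [f"- {c}" for c in self.constraints]
        lines += ["", "## Non-goals"]
        lines += [f"- {n}" for n in self.non_goals]
        return "\n".join(lines)

# Hypothetical example values for illustration.
brief = TaskBrief(
    goal="Migrate payment retries to the new queue",
    constraints=["No schema changes", "Keep p99 latency under 200ms"],
    non_goals=["Refactoring unrelated queue consumers"],
    owner="jane@example.com",
    risk_tier="medium",
)
print(brief.to_markdown())
```

The non-goals section is doing the real work: it is the part that stops invisible scope expansion.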

2. Plan or checkpoint file

Long-running work needs a place to store what the agent believes right now: the active plan, unresolved questions, failed attempts worth remembering, and which files or systems were touched. Anthropic's "Dreaming" example producing a .md playbook is directionally important here because it turns ephemeral context into something durable.
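One lightweight way to make this concrete is an append-only checkpoint file that the agent updates during long runs. A minimal sketch, with an assumed file name and entry layout:

```python
import datetime
from pathlib import Path

def append_checkpoint(path: Path, status: str,
                      open_questions: list[str],
                      files_touched: list[str]) -> None:
    """Append a timestamped checkpoint entry so the agent's current
    state lives in a file, not only in the chat transcript.
    The layout is a sketch, not a standard format."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
    entry = [f"## Checkpoint {stamp}", f"Status: {status}", "Open questions:"]
    entry += [f"- {q}" for q in open_questions] or ["- none"]
    entry += ["Files touched:"]
    entry += [f"- {f}" for f in files_touched]
    entry += [""]
    with path.open("a", encoding="utf-8") as fh:
        fh.write("\n".join(entry) + "\n")

# Hypothetical checkpoint after a partial migration step.
append_checkpoint(
    Path("CHECKPOINTS.md"),
    status="retry handler migrated; dead-letter path untested",
    open_questions=["Is the legacy queue still receiving traffic?"],
    files_touched=["payments/retry.py", "payments/queue.py"],
)
```

Because the file is append-only, failed attempts stay visible instead of being silently overwritten by the latest summary.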

3. Verification evidence

Test output, lint results, screenshots, traces, and failed checks should not live only inside a summary sentence. They should be first-class artifacts. Our evidence-first AI code review post makes the core argument: when code gets cheaper, proof gets more valuable.
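Collecting evidence can be as simple as running each verification command, saving its raw output, and writing a manifest that records what ran and how it exited. The directory layout and command names below are assumptions for illustration:

```python
import json
import subprocess
import sys
from pathlib import Path

def collect_evidence(bundle_dir: Path, commands: dict[str, list[str]]) -> Path:
    """Run each verification command, save its raw output to a log file,
    and write a manifest so reviewers can see exactly what ran and how
    it exited. Layout and naming are illustrative."""
    bundle_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for name, argv in commands.items():
        proc = subprocess.run(argv, capture_output=True, text=True)
        log_file = bundle_dir / f"{name}.log"
        log_file.write_text(proc.stdout + proc.stderr, encoding="utf-8")
        manifest[name] = {"argv": argv,
                          "exit_code": proc.returncode,
                          "log": log_file.name}
    manifest_path = bundle_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path

# Stand-in commands; in practice these would be your real test and lint runners.
collect_evidence(Path("evidence"), {
    "tests": [sys.executable, "-c", "print('4 passed')"],
    "lint": [sys.executable, "-c", "print('clean')"],
})
```

The manifest is the key piece: a reviewer can check exit codes at a glance without trusting a prose claim that "the tests passed."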

4. Rich review artifact for complex diffs

Not every change needs this, but large or cross-cutting work often benefits from a richer output than Markdown. HTML review pages are compelling because they can combine structure, navigation, diagrams, annotated diffs, and embedded evidence in one place. That lines up with our earlier post on the artifacts reviewers need for AI rewrites.
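A minimal sketch of such a page: an intent summary, a list of the riskiest files with per-file notes, and raw test evidence rendered together. The page structure and example content are assumptions, not a fixed template:

```python
import html
from pathlib import Path

def render_review_page(title: str, summary: str,
                       risky_files: dict[str, str],
                       test_output: str) -> str:
    """Render a single-file HTML review artifact combining intent,
    risky-file annotations, and raw evidence. A sketch only; real
    pages would add navigation, diagrams, and annotated diffs."""
    file_items = "\n".join(
        f"<li><code>{html.escape(path)}</code>: {html.escape(note)}</li>"
        for path, note in risky_files.items())
    return f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>
<body>
<h1>{html.escape(title)}</h1>
<p>{html.escape(summary)}</p>
<h2>Riskiest files</h2>
<ul>
{file_items}
</ul>
<h2>Test evidence</h2>
<pre>{html.escape(test_output)}</pre>
</body></html>"""

# Hypothetical change used for illustration.
page = render_review_page(
    title="Retry queue migration",
    summary="Moves payment retries to the new queue; no schema changes.",
    risky_files={"payments/retry.py": "backoff logic rewritten",
                 "payments/queue.py": "new dead-letter path"},
    test_output="42 passed, 0 failed")
Path("review.html").write_text(page, encoding="utf-8")
```

Even this bare version beats a transcript: the reviewer lands on the riskiest files first instead of scrolling for them.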

5. Compact session provenance

Reviewers do not need full chain-of-thought dumps, but they do need the minimum record of what happened: task intent, tools touched, guardrails, checkpoint outcomes, and human overrides. That is the exact gap addressed in our guide on session provenance for AI code review.
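Those five fields can be captured as a small structured block pasted into the PR description. The exact schema below is an assumption; the point is that it is compact, diffable, and machine-checkable:

```python
import json

def provenance_record(task: str, tools: list[str], guardrails: list[str],
                      checkpoints: list[dict], overrides: list[str]) -> str:
    """Build a compact provenance block for a PR description.
    Field names mirror the list above; the schema is illustrative."""
    record = {
        "task_intent": task,
        "tools_touched": tools,
        "guardrails": guardrails,
        "checkpoint_outcomes": checkpoints,
        "human_overrides": overrides,
    }
    return json.dumps(record, indent=2)

# Hypothetical session summary for illustration.
block = provenance_record(
    task="Migrate payment retries to the new queue",
    tools=["shell", "test runner", "read-only db access"],
    guardrails=["no schema changes", "no production credentials"],
    checkpoints=[{"id": 1, "result": "tests green"},
                 {"id": 2, "result": "dead-letter path verified"}],
    overrides=["human narrowed scope to the retry handler only"])
print(block)
```

Because it is JSON rather than prose, a CI step or review bot can verify the block exists and is well-formed before the PR enters the queue.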

Why this pattern fits async and background agents

Async agents multiply the need for good artifacts because the supervising human was not present for every turn. Anthropic's keynote emphasized parallel sessions and waking up to PRs. Stripe's Minions story framed the same shift from interactive prompting to one-shot, end-to-end execution. Once agents work out of band, review artifacts become the interface between autonomy and trust.

Introducing Minions: Stripe's one-shot, end-to-end coding agents

If your team is moving more work into async lanes, pair this post with our guide to background agents in engineering and our piece on parallel coding agents and branch chaos. The workflow design problems start to look similar once many agents are operating at once.

Design the handoff for the reviewer, not the model

A common mistake is optimizing every artifact for what helps the agent most in the moment. Reviewers need something different. They need:

  • A stable summary of intent and scope
  • Evidence that important checks really ran
  • A quick way to inspect the riskiest files or decisions
  • A record of what the agent accessed or assumed
  • A clean path to challenge or rerun questionable work

That is why HTML explainers, provenance snippets, and checkpoint files matter more than a huge transcript export. They reduce the cost of supervision without pretending supervision is optional.

A practical workflow you can adopt this month

  1. Create a task brief template with goal, constraints, owner, and risk tier.
  2. Require the agent to update one checkpoint artifact during long runs.
  3. Collect tests, logs, and screenshots into a small evidence bundle.
  4. Ask for an HTML review page when the diff spans multiple systems or exceeds your normal review size threshold.
  5. Attach compact provenance fields to medium and high-risk PRs.
  6. Route the final change through independent AI review before merge.
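The routing logic behind steps 1, 4, and 5 can be sketched as a small policy function. The thresholds and artifact names below are illustrative placeholders to be tuned per team, not recommended values:

```python
def risk_tier(files_changed: int, lines_changed: int,
              systems_touched: int, has_migration: bool) -> str:
    """Rough risk-tier heuristic for AI-authored PRs.
    Thresholds are illustrative; tune them per team."""
    if has_migration or systems_touched > 1:
        return "high"
    if files_changed > 10 or lines_changed > 400:
        return "medium"
    return "low"

def required_artifacts(tier: str) -> list[str]:
    """Map a risk tier to the artifacts the workflow above requires."""
    base = ["task brief", "checkpoint file", "evidence bundle",
            "independent AI review"]
    if tier in ("medium", "high"):
        base.append("provenance record")
    if tier == "high":
        base.append("HTML review page")
    return base

# Small, single-system change: light artifact load.
print(required_artifacts(risk_tier(3, 120, 1, has_migration=False)))
# Cross-system change: full artifact load.
print(required_artifacts(risk_tier(3, 120, 2, has_migration=False)))
```

Encoding the policy as code means the artifact requirements can be enforced by CI rather than remembered by reviewers.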

This workflow works especially well when paired with AI review process improvement and a verification layer that tracks real resolution outcomes.

Where Propel fits

Propel is most valuable when your team already accepts that AI-generated code needs an independent review surface. Artifact-first workflows make that surface stronger. Instead of asking reviewers to trust a generator's chat history, you can feed Propel the diff, tests, evidence, and policy context and get a higher-signal review loop around the actual change.

If you are operationalizing this at team level, start with an evidence-first policy and connect it to your review queue. Our AI code review and development playbook is the broader operating model, and the Propel pricing page covers deployment options.

FAQ

Do we need to save full agent transcripts?

Usually no. Keep raw transcripts for incident response or compliance-heavy work. For normal engineering review, compact provenance plus durable artifacts is a better default.

Are HTML artifacts overkill for every pull request?

Often, yes. Use them selectively for large diffs, cross-system changes, or complex debugging work where the reviewer benefits from navigation and annotated evidence.

Is this just another name for documentation?

Not exactly. Documentation is usually written for future readers in general. Artifact-first agent workflows are specifically about preserving the critical state that makes an AI-authored change reviewable right now.

What metric should we watch first?

Track review turnaround and rework rate on AI-authored PRs that include artifacts versus those that do not. If artifacts are helping, reviewers should reach the risky parts of the change faster and ask fewer context-rebuilding questions.
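That comparison can be computed from basic PR records. The record fields below (`turnaround_h`, `rework`, `has_artifacts`) are assumptions for illustration, not a real tracker's schema:

```python
from statistics import median

def compare_cohorts(prs: list[dict]) -> dict:
    """Compare review turnaround (hours) and rework rate between
    AI-authored PRs that shipped with artifacts and those that did not.
    The PR record fields are illustrative assumptions."""
    def summarize(cohort: list[dict]) -> dict:
        if not cohort:
            return {"median_turnaround_h": None, "rework_rate": None}
        return {
            "median_turnaround_h": median(p["turnaround_h"] for p in cohort),
            "rework_rate": sum(p["rework"] for p in cohort) / len(cohort),
        }
    with_artifacts = [p for p in prs if p["has_artifacts"]]
    without_artifacts = [p for p in prs if not p["has_artifacts"]]
    return {"with_artifacts": summarize(with_artifacts),
            "without_artifacts": summarize(without_artifacts)}

# Fabricated sample data purely to show the shape of the comparison.
sample = [
    {"has_artifacts": True, "turnaround_h": 4, "rework": 0},
    {"has_artifacts": True, "turnaround_h": 6, "rework": 1},
    {"has_artifacts": False, "turnaround_h": 12, "rework": 1},
    {"has_artifacts": False, "turnaround_h": 20, "rework": 1},
]
print(compare_cohorts(sample))
```

If the artifact cohort does not show faster turnaround or lower rework after a few weeks, that is a signal the artifacts are not the ones reviewers actually need.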

Code review you can trust.

Propel surfaces what matters so your team can ship with confidence. Built to scale code quality across your teams.

Book a Demo