AI-Resistant Technical Evaluations: How to Review Engineers in the Coding-Agent Era
May 26, 2026
Legacy take-homes assumed code was the scarce thing. In 2026, that assumption is gone. A candidate can show up with Claude Code, Codex, Cursor, or another agent that drafts, debugs, and rewrites faster than most humans can type. The question is no longer "can this person produce code without assistance?" It is "can this person steer AI toward the right solution, verify the result, and explain tradeoffs under review?" That is much closer to the real work modern engineering teams need done every day.
Quick answer
Modern technical evaluations should allow AI, then score how candidates frame the problem, set constraints, validate behavior, reject bad output, and simplify the final change. The best interview loops now look a lot like strong AI code review: explicit scope, evidence of execution, and a follow-up discussion that tests judgment instead of memorization.
Key Takeaways
- Old take-homes are losing signal because strong coding agents can now solve them inside the same time box as good human candidates.
- Banning AI is usually the wrong fix. It rewards concealment and ignores the actual way modern teams work.
- The best evaluations now measure steering, debugging, verification, and tradeoff judgment more than raw code production.
- A compact prompt contract and evidence bundle are more useful than a full chat transcript.
- Teams should align hiring review with production review so candidates and employees are judged on the same habits.
TL;DR
If candidates can use coding agents on the job, your interview loop should test how well they use them. Give a realistic problem, allow AI tools, require a short record of scope and validation, and spend your live interview on why-this-way questions rather than syntax trivia. The evaluation that still carries signal is the one that reveals whether the candidate can guide and critique machine output under real constraints.
Why this topic matters right now
Over the past few months, several engineering sources converged on the same operating reality: coding agents are not a side tool anymore. They are reshaping how engineering work gets produced and reviewed.
- Anthropic's January 21, 2026 post on AI-resistant technical evaluations explained that Claude Opus 4.5 matched the best human take-home performance within the same two-hour window, forcing the team to redesign the exercise again.
- The Pragmatic Engineer's AI Tooling for Software Engineers in 2026 survey found AI is mainstream: 95% of respondents use AI weekly, 75% use it for at least half of their work, and staff+ engineers are the heaviest agent users. Interview policies that assume "no AI" no longer match day-to-day engineering reality.
- Uber's AI development case study reported that more AI-generated code produced more review noise, enough that Uber built Code Inbox and uReview to keep routing and feedback high signal.
- Nolan Lawson's "Using AI to write better code more slowly" made the most useful counterpoint to the slop narrative: the leverage is not raw typing speed, it is prioritizing and validating what the models found.
- Latent Space's May 2026 roundup on coding agents "breaking containment" captured the broader shift. Once agents move from code assistance into planning and multi-step execution, the human value moves up the stack into scope, control, and verification.
The pattern is clear. Production engineering is shifting from pure authorship toward steering and verification. Hiring loops need to shift with it.
Why banning AI is usually the wrong default
The instinct to ban AI is understandable. It feels clean, enforceable, and comparable to older interview loops. In practice, it creates the wrong incentives.
- It selects for concealment. Candidates know AI is useful, so the rule often tests who is willing to hide usage rather than who uses tools well.
- It measures a workflow the job may not require anymore. Many teams already expect engineers to use agents in the IDE, terminal, or review pipeline.
- It over-weights code production and under-weights critique. The highest-value engineering judgment now often appears after the first draft, not before it.
- It makes the interview less representative of real work, which is the exact opposite of what a good evaluation should do.
Anthropic's write-up makes this especially clear. Their team explicitly did not want to ban AI assistance because people still play a vital role in the work. The hard part was designing a loop where humans could distinguish themselves with AI in the loop, the same way they would on the job.
What the evaluation should actually measure now
Once AI is allowed, the scoring target changes. You are no longer evaluating only whether the candidate can write code. You are evaluating whether they can use AI to reach a trustworthy result.
| Dimension | What good looks like | Red flag |
|---|---|---|
| Problem framing | Clarifies the goal, non-goals, edge cases, and likely failure modes. | Jumps straight into generation with no task model. |
| Agent steering | Uses bounded prompts, iterations, and corrective guidance. | Blind delegation followed by shallow acceptance. |
| Verification | Runs tests, reproduces behavior, and checks claims against evidence. | Relies on model confidence or a passing summary. |
| Code quality | Produces readable, scoped, maintainable code with stable defaults. | Large, noisy patch with accidental complexity. |
| Simplification | Removes weak ideas and improves the shape of the solution. | Ships the first working output even if it is awkward. |
| Review judgment | Can explain tradeoffs, risks, and what still needs human review. | Cannot defend why the final patch should merge. |
This is one reason our posts on prompt requests vs pull requests and evidence-first AI code review map so cleanly into hiring. The same artifacts that make an AI-authored pull request reviewable are the artifacts that make an AI-assisted candidate submission evaluable.
A take-home format that still carries signal
The best modern take-home is realistic, bounded, and explicit about allowed tooling. It should not try to pretend agents do not exist. It should force the candidate to show how they work with them.
- Use a realistic but safe problem. Pick a scoped codebase slice or simulator, not your production repo with secrets or sensitive business logic.
- Allow AI tools explicitly. State that the candidate may use coding agents, but must document how they constrained and validated the output.
- Require a compact submission package. Ask for the patch, a short prompt contract, evidence of execution, and a brief note on tradeoffs or open questions.
- Timebox the work to 90 to 180 minutes. Long enough for real steering and verification, short enough to compare approaches.
- Follow with a 30 to 45 minute review conversation. That is where you test whether the candidate actually owns the result.
- Prefer one meaningful problem over a laundry list. Depth reveals judgment far better than breadth.
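The format rules above can be written down as a small checklist that a hiring team runs against its own exercise before sending it out. The sketch below is only an illustration: `TakeHomeSpec`, its field names, and `spec_warnings` are hypothetical, and the thresholds are simply the numbers quoted in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TakeHomeSpec:
    """Hypothetical description of one take-home exercise."""
    timebox_minutes: int
    followup_minutes: int
    ai_tools_allowed: bool
    problem_count: int
    uses_production_repo: bool


def spec_warnings(spec: TakeHomeSpec) -> list[str]:
    """Return a warning for every guideline the exercise departs from."""
    warnings = []
    if spec.uses_production_repo:
        warnings.append("use a scoped slice or simulator, not the production repo")
    if not spec.ai_tools_allowed:
        warnings.append("state explicitly that AI tools are allowed")
    if not 90 <= spec.timebox_minutes <= 180:
        warnings.append("timebox should be 90 to 180 minutes")
    if not 30 <= spec.followup_minutes <= 45:
        warnings.append("follow-up conversation should be 30 to 45 minutes")
    if spec.problem_count != 1:
        warnings.append("prefer one meaningful problem over a list")
    return warnings


# Example: a four-hour exercise built on the production repo.
draft = TakeHomeSpec(
    timebox_minutes=240,
    followup_minutes=30,
    ai_tools_allowed=True,
    problem_count=1,
    uses_production_repo=True,
)
for warning in spec_warnings(draft):
    print(warning)
```

An empty list means the exercise matches the format described here; it says nothing about whether the problem itself is a good one.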
Suggested submission package
A compact package (the patch, a short prompt contract, evidence of execution, and a note on tradeoffs) is much more useful than demanding a full chat transcript. Reviewers do not need every prompt. They need enough structure to understand intent, boundaries, and proof.
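As a rough sketch of what such a package could look like when written down as structured data, the example below models the four parts named earlier: patch, prompt contract, evidence of execution, and tradeoff notes. The class and field names (`PromptContract`, `EvidenceItem`, `SubmissionPackage`, `missing_parts`) are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class PromptContract:
    """What the candidate told the agent to do and not to do (hypothetical shape)."""
    goal: str
    non_goals: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)


@dataclass
class EvidenceItem:
    """One thing the candidate actually ran, and what they observed."""
    command: str
    result: str


@dataclass
class SubmissionPackage:
    patch_path: str
    contract: PromptContract
    evidence: list[EvidenceItem] = field(default_factory=list)
    tradeoffs: str = ""

    def missing_parts(self) -> list[str]:
        """List the required parts that are empty, in a fixed order."""
        missing = []
        if not self.patch_path.strip():
            missing.append("patch")
        if not self.contract.goal.strip():
            missing.append("prompt contract")
        if not self.evidence:
            missing.append("evidence of execution")
        if not self.tradeoffs.strip():
            missing.append("tradeoff notes")
        return missing


# Example: a submission that describes its scope but proves nothing.
submission = SubmissionPackage(
    patch_path="fix-rate-limiter.patch",
    contract=PromptContract(
        goal="Fix the off-by-one in the rate limiter window",
        non_goals=["Rewriting the storage layer"],
    ),
    tradeoffs="Kept the fixed window; a sliding window would be more accurate.",
)
print(submission.missing_parts())
```

A completeness check like this only confirms that each part exists. Whether the evidence is convincing is still a reviewer's call.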
How to score AI-assisted submissions
A practical rubric should reward judgment more than volume. One workable split looks like this:
- 25% verification quality: did they prove the result, or just claim it?
- 20% problem framing: did they model the task before generating code?
- 20% code quality: is the patch readable, bounded, and maintainable?
- 15% agent steering: did they guide the tool well, or let it wander?
- 10% simplification: did they remove unnecessary complexity?
- 10% explanation: can they defend tradeoffs and remaining risk?
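To make the arithmetic concrete, here is a minimal sketch that combines per-dimension ratings into one weighted score. The weights are the ones listed above; the 0 to 5 rating scale, the dictionary keys, and the `weighted_score` function are assumptions made for illustration.

```python
# Weights taken from the rubric above; they sum to 1.0.
WEIGHTS = {
    "verification": 0.25,
    "problem_framing": 0.20,
    "code_quality": 0.20,
    "agent_steering": 0.15,
    "simplification": 0.10,
    "explanation": 0.10,
}


def weighted_score(ratings: dict[str, float], scale_max: float = 5.0) -> float:
    """Combine ratings on an assumed 0..scale_max scale into a 0-100 score."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        rating = ratings[name]
        if not 0 <= rating <= scale_max:
            raise ValueError(f"{name} rating {rating} is outside 0..{scale_max}")
        # Normalize each rating to 0..1 before applying its weight.
        total += weight * (rating / scale_max)
    return round(total * 100, 1)


# Example: strong framing and explanation, weak simplification.
example = {
    "verification": 4,
    "problem_framing": 5,
    "code_quality": 3,
    "agent_steering": 4,
    "simplification": 2,
    "explanation": 5,
}
print(weighted_score(example))  # prints 78.0
```

Because verification carries the largest weight, a candidate who rates 5 everywhere else but 0 on verification tops out at 75, which matches the intent of rewarding proof over volume.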
This is also a useful place to borrow from production review metrics. If your team already cares about signal quality, provenance, and resolution rate in live pull requests, those same instincts belong in candidate evaluation. Our guide to verification layers and resolution rate explains why those outcome-oriented measures beat raw comment volume.
What to do in the live follow-up interview
The follow-up discussion is where weak, over-delegated submissions usually break down. Instead of asking trivia, ask the candidate to review their own AI-assisted work the way a staff engineer would review a pull request.
- Ask them to explain one rejected path and why they abandoned it.
- Ask which test or manual check gave them the most confidence.
- Ask what they would tighten before merge if this landed in production.
- Ask them to identify one place the agent output was wrong or misleading.
- Ask what a human reviewer should still inspect even after all evidence passes.
These questions reveal whether the candidate can do the real senior work: turning machine output into a trustworthy engineering decision.
What not to optimize for
Many interview loops still over-index on the wrong signals.
- Do not optimize for who typed the most code manually.
- Do not optimize for who finished first if the output is under-validated.
- Do not optimize for raw syntax recall on work agents now do well.
- Do not confuse a polished final patch with strong engineering judgment unless the candidate can explain how it was verified.
Nolan Lawson's point is useful here: the problem is not finding bugs, it is prioritizing and validating them. That same distinction applies to hiring. Many candidates can generate plausible code quickly. Fewer can prove which version should survive review.
How this maps directly to AI code review
A strong hiring loop and a strong AI code review system should reinforce each other. In both cases, the best artifact set looks the same:
- Intent: what was the task and what was out of scope?
- Provenance: which tools or models were involved?
- Evidence: what ran, what passed, and what still looks risky?
- Judgment: why is this the right patch to keep?
If you want candidates to learn the habits your team actually values, build the loop around the same standards you expect in production. That means session context from AI code review provenance, scoped review policies from agentic engineering guardrails, and repo-specific outcome thinking from post-benchmark AI code review evals.
How Propel fits
Propel helps teams operationalize exactly this review model in production PRs. Instead of rewarding AI output for volume alone, it helps reviewers reason about risk, evidence, and whether findings lead to meaningful change. If you want your hiring loop and your delivery workflow to measure the same habits, that is the standard to aim for.
FAQ
Should we require full prompt or chat transcripts from candidates?
Usually no. Ask for a compact prompt contract plus evidence bundle instead. Full transcripts are noisy and easy to over-interpret. Reviewers mostly need scope, constraints, verification, and tradeoff notes.
Should we ban AI for junior roles?
Usually no. Junior engineers still need to learn how to guide tools, understand output, and verify behavior. If the job expects AI-assisted work, the interview should test those habits.
Can we use our real internal repository for take-homes?
Prefer a stripped-down or synthetic environment. Keep secrets, customer data, and sensitive internal logic out of the exercise. Candidates need realism, not access to your production risk surface.
How long should the take-home be?
For most roles, 90 to 180 minutes plus a short follow-up conversation is the best balance. It is enough time for steering and validation without turning the process into unpaid project work.
What is the fastest upgrade if we cannot redesign the whole loop right now?
Allow AI explicitly and add a required evidence section to the submission. Even that single change shifts the evaluation toward judgment and away from blind code generation.
Related Reading
Prompt requests vs pull requests
Evidence-first AI code review
Parallel coding agents and branch chaos
Why verification layers beat comment volume
What to store in every AI-authored PR


