Towards the end of my last role, I noticed a new phenomenon take shape in code reviews. They were drifting away from the old norm of combing through every changed or added line and turning into something closer to a quick check of the test signals and a rubber stamp.
Two things are worth noting about that.
- This was at a large tech company notorious for stringent code requirements, the kind of place where everyone's first PRs were expected to get carefully picked apart. The new rubber-stamp habit felt unusual as a result.
- It wasn't the result of any new rule about code reviews. It took shape as a response to organizational pressure to move faster using codegen tooling.
With the codegen push came longer pull requests, ones that could often span thousands of lines of code or touch hundreds of files. Usually the changes were simple on their own. Reformatting how SQL gets written across a directory of pipelines, that sort of thing. The problem was the volume. When one PR touches that many files, reviewers naturally miss things. Bugs would slip through. I caught some of them. Others would get past me and the other reviewers, turning into bugs we'd patch the next day after they had already hit prod.
The honest truth is that a pull request with a wide enough surface can't be reliably reviewed by a human. There aren't enough minutes in the day.
And the old fix doesn't apply anymore. A few years ago you'd just say split it up, narrow the scope, ship five small PRs instead of one big one. But the whole org is pushing to ship faster with codegen and your competitors are finding ways to outpace you. Deliberately slowing down to handcraft smaller PRs becomes friction nobody wants to own. Changes can still be split across PRs in some cases, but if someone is using generative AI to publish them, they are trusting it to read the interdependencies and split them in smart, testable ways.
So here is what actually happened, and I'll say the part most people won't. The realistic standard for reviewing a PR was five to maybe thirty minutes, because there is always another one waiting and your own work to get back to. A thousand-line, AI-generated PR would get a skim and a thumbs up. Everybody rubber stamps. Nobody admits to it. But I've found easy-to-spot problems sitting in PRs other senior devs had already approved. I've done the same skim and thumbs up myself, missing easy ones as well. So I had a decent hunch about what was going on.
The incentives steer you toward it. The reward for a thorough review is outweighed by the cost of being the bottleneck. Mention you need more time to review a PR properly and the next morning you'll get asked why it hasn't been approved or sent back to the author yet.
That's the setup for everything that follows. AI writes a large and growing share of the code now. At a place like that it was old news. In a lot of older, bigger companies I suspect it's still uncomfortable. That discomfort makes people want to slow it down or push back. But slowing down just hands the advantage to whoever didn't. Orgs that drag their feet fall behind competitors. Individuals who drag their feet lose out in the next stack ranking. That "refuse and fall behind" dynamic runs well past engineering, and it's most of what another post here is about. The interesting question isn't whether to accept it. It's whether the whole philosophy around how we build, starting with how we review, has to evolve to keep up.
Forget combing through the deltas line by line. Build gates instead.
If a human can no longer practically give the thorough review needed, it no longer makes sense to treat them as the final gate. That doesn't mean the review disappears. It means the review work moves upstream, from reading output to defining the constraints that output has to satisfy.
Concretely, in my own repos today, the load-bearing artifacts aren't just code. They're a layer of specs the codegen has to obey: a PRINCIPLES file, an INFRASTRUCTURE file, a LOGGING_SPEC, and the git rules that keep an agent from acting destructively. The model reads those and generates inside the lines they draw.
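To make the git rules a little more concrete, here is a minimal sketch of a guard that an agent's shell wrapper could call before running a command. The denylist and the `is_destructive` helper are hypothetical illustrations, not the actual rules in those files, and the check only understands simple `git <subcommand> <flags>` invocations.

```python
import shlex

# Hypothetical denylist: (git subcommand, flags that make it destructive).
# Deliberately incomplete; a real rules file would cover far more cases.
DESTRUCTIVE = [
    ("push", {"--force", "-f"}),
    ("reset", {"--hard"}),
    ("clean", {"-f", "-fd", "-fdx"}),
    ("branch", {"-D"}),
]


def is_destructive(command: str) -> bool:
    """Return True if a shell command looks like a destructive git call."""
    tokens = shlex.split(command)
    if len(tokens) < 2 or tokens[0] != "git":
        return False
    subcommand, flags = tokens[1], set(tokens[2:])
    # Flag the command when the subcommand matches and any risky flag is present.
    return any(subcommand == sub and flags & bad for sub, bad in DESTRUCTIVE)


if __name__ == "__main__":
    for cmd in ("git push origin main", "git push --force origin main"):
        verdict = "BLOCK" if is_destructive(cmd) else "allow"
        print(f"{verdict}: {cmd}")
```

A guard like this is a backstop, not the spec itself. The written rules still tell the agent what it may do, and the check catches the cases where it ignores them.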
On top of that sits the stuff that catches what slips: automated tests on anything that matters, hardened CI, and AI review passes that run before a human is ever pulled in, ideally specialized by area so the reviewer for the data layer isn't the same one squinting at the auth flow.
The human still reviews. They just focus on the gates and the genuinely risky deltas instead of every line of a thousand. You stop reading the code and start owning and delegating to the system that reads it for you.
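One way to picture the routing described above is a small planner that sorts changed files into review areas and surfaces only the risky ones for a human. This is a sketch under assumptions: the path patterns, the area names, and the choice of which areas count as high risk are all hypothetical and would come from your own repo layout.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from path patterns to a specialized review area.
AREA_PATTERNS = {
    "auth": ["src/auth/*", "*/permissions/*"],
    "data": ["pipelines/*", "*.sql"],
    "infra": ["infra/*", ".github/workflows/*"],
}

# Hypothetical policy: changes in these areas always get human eyes.
HIGH_RISK_AREAS = {"auth", "infra"}


def classify(path: str) -> str:
    """Return the first review area whose pattern matches the path."""
    for area, patterns in AREA_PATTERNS.items():
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            return area
    return "general"


def plan_review(changed_files: list[str]) -> dict[str, list[str]]:
    """Group changed files by the specialized reviewer that should see them."""
    plan: dict[str, list[str]] = {}
    for path in changed_files:
        plan.setdefault(classify(path), []).append(path)
    return plan


def needs_human(plan: dict[str, list[str]]) -> list[str]:
    """List the files in high-risk areas that a human should still read."""
    return sorted(p for area in HIGH_RISK_AREAS for p in plan.get(area, []))
```

Each key in the plan would be handed to its own automated review pass, and only the output of `needs_human` lands in front of a person.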
Use the smallest tool that does the job
Agents are worth reaching for when a task warrants them. They are not worth reaching for when the same task can be handled by something simpler.
A few things to consider:
- If you do need an agent, can a prebuilt skill or function do the job instead, in a way that's both more structured (its shape pinned down by its own arguments and parameters) and far cheaper on tokens? In my current work, agents reach for prebuilt functions for the simple things they do constantly, like posting to web pages, instead of working each one out from scratch. Consider the meme going around of people using agents to count the letters in a given word. I doubt anyone actually embeds logic like that into their processes, but it's an extreme example of what I'm trying to get at.
- Use the smallest model that's feasible for the task. Better yet, use an open source model locally or through on-prem resources. If you have a parent agent delegating to more specialized ones, the parent is usually the only one that needs the more expensive model to hold the wider scope.
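Both points can be sketched in a few lines. The dispatcher below tries a prebuilt function first and only falls back to a model when no function fits, picking the model tier by role. The registry, the task names, and the model names are placeholders invented for illustration, not real model identifiers or any particular framework's API.

```python
from typing import Callable


def count_letter(word: str, letter: str) -> int:
    """Deterministic tool: no tokens spent, same answer every time."""
    return word.lower().count(letter.lower())


# Hypothetical registry of prebuilt tools, keyed by task name.
TOOLS: dict[str, Callable[..., object]] = {"count_letter": count_letter}

# Hypothetical model tiers: only the parent gets the expensive model.
MODEL_FOR_ROLE = {"parent": "large-model", "worker": "small-local-model"}


def dispatch(task: str, role: str = "worker", **kwargs) -> tuple[str, object]:
    """Prefer a prebuilt tool; otherwise name the model tier that would run."""
    if task in TOOLS:
        return ("tool", TOOLS[task](**kwargs))
    return ("model", MODEL_FOR_ROLE[role])


if __name__ == "__main__":
    print(dispatch("count_letter", word="strawberry", letter="r"))
    print(dispatch("summarize_incident", role="parent"))
```

The function's signature does the structuring: the agent can only call it with a word and a letter, so there is nothing for it to improvise.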
This is the same instinct every data engineer already has. You throw together a query to eyeball a sample and you don't care that it's ugly or slow because it runs once over a sliver of the data. But before that logic goes anywhere near the full population or prod, you rewrite it. Because at scale the sloppy version costs real money and real time. Model choice is the same move.
Reach for whatever's most capable to prove the thing works, then before you scale it, pull as much as you can onto smaller, cheaper, ideally local models. Optimizing the model before you know the thing works is premature. Skipping it before you scale is how you wake up to a bill that catches you by surprise.
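The arithmetic behind that surprise bill is simple enough to run before you scale. Every number below is made up for illustration, including the per-token prices, so treat it as a template to fill with your own volumes and your provider's actual rates.

```python
def monthly_cost(
    calls_per_day: int,
    tokens_per_call: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend from call volume and a flat per-token price."""
    total_tokens = calls_per_day * tokens_per_call * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Illustrative figures only, not real prices.
    prototype = monthly_cost(50, 4_000, 10.0)
    scaled_same_model = monthly_cost(20_000, 4_000, 10.0)
    scaled_small_model = monthly_cost(20_000, 4_000, 0.5)
    print(f"prototype:           ${prototype:,.2f}")
    print(f"scaled, same model:  ${scaled_same_model:,.2f}")
    print(f"scaled, small model: ${scaled_small_model:,.2f}")
```

With these invented inputs the prototype costs $60 a month, the same workload at production volume costs $24,000, and moving it to the cheaper tier brings that down to $1,200. The model choice that didn't matter in the prototype is the dominant term at scale.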
There's a second reason to get off someone else's API before you scale, and it has nothing to do with your own efficiency. The cheap rate is a subsidy, not a promise. Anthropic has moved three times this year alone to claw back the discount that let programmatic agent usage run at near-interactive rates: an OAuth block in January that it reversed within days after backlash, an outright ban on third-party agents in April, and an Agent SDK credit split announced in May that it paused on the very day it was meant to take effect.[1] And it isn't only them. The same subsidy is being pulled across the entire category, one provider at a time.[2] Build your scaling plan on someone's subsidized rate and you're building on quicksand.
The codebase starts tending itself
This one is a different axis from the first. The gates are there to block harmful changes coming in. But codebases break down from the inside too, whether they're tended by humans or by bots. Dependencies move, bugs surface, and scaffolding has to change when a new feature can't be supported by the existing parts.
So to keep up with those recurring deltas that can easily slip through the cracks, maintenance itself should be orchestrated and run on a schedule. A good way to start is ad hoc audits seeded with deliberate deviations from your criteria, so you can see how thoroughly they actually scan your surfaces before you trust them to run on their own. Once that maintenance runs on its own, the failure mode worth fearing is no longer a bad change, since the gates catch those, but rather orchestration mistakes that loop and quietly burn money while nobody's watching. This is why it's important to keep the audits on a leash until you've watched them catch what they should. This is one of the principles I'm building into my platform, and there'll be more on that in the coming weeks.
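Both halves of that paragraph reduce to small checks. The sketch below scores an audit by how many planted deviations it found, and wraps a maintenance loop in a hard cap on steps and spend. The names `seeded_recall`, `Leash`, and `BudgetExceeded` are hypothetical, and the per-step cost is assumed to be something your orchestrator can report.

```python
from dataclasses import dataclass


def seeded_recall(seeded: set[str], reported: set[str]) -> float:
    """Fraction of deliberately planted deviations the audit reported."""
    if not seeded:
        raise ValueError("seed at least one deviation before scoring")
    return len(seeded & reported) / len(seeded)


class BudgetExceeded(RuntimeError):
    """Raised when a scheduled job runs past its step or spend limit."""


@dataclass
class Leash:
    max_usd: float
    max_steps: int
    spent_usd: float = 0.0
    steps: int = 0

    def charge(self, usd: float) -> None:
        """Record one orchestration step and stop the job if it overruns."""
        self.steps += 1
        self.spent_usd += usd
        if self.steps > self.max_steps or self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"stopped after {self.steps} steps and ${self.spent_usd:.2f}"
            )


if __name__ == "__main__":
    planted = {"missing-log-field", "raw-sql-string", "unpinned-dep", "no-test"}
    found = {"missing-log-field", "raw-sql-string", "unpinned-dep", "style-nit"}
    print(f"recall on seeded deviations: {seeded_recall(planted, found):.2f}")
```

A recall threshold on the seeded run decides when an audit graduates to running unattended, and the leash stays on after that so a looping job fails loudly instead of quietly.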
What all of this is really saying
Consider what the three of them have in common. In each one the human stops producing the implementation and stops reading it. Instead, they author the things that govern it: constraints, tests, audit policies, and budgets. The durable human artifact moves from the code to the spec around the code. It's like going from framing a house board by board to building a single-use mold and letting the concrete fill it, with each mold shaped for one build and broken after the pour.
So what has really changed in the game, if you asked me?
Focusing on systems and constraints rather than the lines themselves used to be the thing that defined someone as more senior. It came from years of grinding implementation work and making enough mistakes to develop the right instincts. That's the environment I learned in, as I'm sure is the case for most engineers reading this. In my last few years as a tech lead for a pod of data engineers, most of the actual work had already become spec, as a lot of it came down to advising teammates on how to approach new requirements while keeping things modular and blast radii small.
Codegen takes that skill and makes it the day-one floor for everyone. Generating code is table stakes now. Doing it in a way that doesn't blow other things up, cause long-term problems, or wall off the features you'll want later is the entire job. Which leaves a question I don't have a clean answer to.
I was recently helping onboard the first true gen-AI-native hire I've worked with. They were using codegen tools from day one, as they should have been and were encouraged to, but one thing kept nagging at me. If design judgment is now required of everyone on day one, and the way people used to develop that judgment was years in the same trenches we're working to automate away, then where is the judgment supposed to come from now? The learning process itself starts to feel like a chicken-or-the-egg problem.
The only fix I can think of is frontloading it. Teaching systems-level thinking from day one, drilling in that no change is an isolated change, and putting most of the early emphasis on the gates themselves before anything else. If I onboarded someone tomorrow who was completely new to development, the gates would be the very first conversation rather than something they pick up naturally through years of reps.
Honestly though, that answer feels insufficient. The tension doesn't resolve cleanly and I'd be skeptical of anyone who says it does. The early data isn't reassuring either. Entry-level software roles are already contracting,[3] and the obvious risk is that we erase the pipeline that turns juniors into the seniors who'd know how and what kind of gates to build in the first place.
References
[1] The three moves this year: a January block on subscription OAuth tokens for third-party tools, reversed within days after developer backlash; an April ban on third-party agents using subscription credentials; and the Agent SDK credit split announced May 14 and paused June 15, the day it was due to take effect. Most authoritative for the last one is Anthropic's own Help Center article (the support docs were updated to confirm the pause); secondary coverage at The New Stack (announcement: https://thenewstack.io/anthropic-agent-sdk-credits, pause: https://thenewstack.io/anthropic-pauses-claude-agent-sdk-subscription-change).
[2] GitHub withdrew a comparable subsidy on June 1 with its AI Credits change (announcement: https://github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing).
[3] Stanford HAI data on software-developer employment for workers under 26 falling sharply since 2024, plus the broader concern about eroding the entry-level pipeline (Stanford HAI 2026 AI Index: https://hai.stanford.edu/news/inside-the-ai-index-12-takeaways-from-the-2026-report).