Faster drafts, denser days: the burnout risk on AI-heavy teams / Field Notes

AI coding tools are usually sold as time savers. In one narrow sense, that is true. GitHub’s early Copilot study reported a 55.8% speedup on a bounded task, Google measured roughly a 21% reduction in time-on-task in an enterprise RCT, and a multi-company field paper found a 26.08% increase in completed tasks (Peng et al., Paradis et al., Cui et al.).

But “faster” is not the same thing as “lighter.” The better question is whether AI changes the weight and texture of the workday. My view is that it often does. AI compresses some drafting and lookup friction, then shifts more of the remaining day toward verification, integration, review, prompt steering, and accountability for the merge (Lee et al., DORA).

That distinction matters because reduced keystrokes do not necessarily mean reduced strain. On AI-heavy teams, faster drafting can shift more of the day into review, judgment, supervision, and exception handling. If that is true, leaders may need to design for more recovery, not less.

Software engineering was already cognitively heavy before AI. The work has long been dominated by understanding code, not just typing it, so “AI replaced the boring part with the hard part” is too simple (Feitelson). What AI really changes is the texture of the leftover work. When the draft appears instantly, the human job becomes deciding whether the thing is correct, safe, locally appropriate, and worth shipping at all (Bainbridge, Lee et al.).

Here is the loop I keep coming back to:

[Figure: flow chart showing how AI-assisted coding can turn local drafting gains into denser supervisory work, heavier verification, and burnout risk if teams absorb the savings as more scope instead of more recovery. Caption: One plausible path from faster drafting to denser work.]

The problem is the almost-right output

The expensive failure mode is not obviously wrong code. It is plausible code. Wrong output can be rejected quickly. “Almost right” output has to be read line by line, checked against local conventions, run through edge cases, and tested against system behavior that only the team actually knows.

DORA has a good phrase for this: the verification tax. Time saved in generation gets repaid in auditing, prompt refinement, review, and rework (DORA). The 2025 Stack Overflow survey points in the same direction: 66% of developers said AI answers that are “almost right” are frustrating, and 45.2% said debugging AI-generated code takes more time (Stack Overflow 2025).

There is also direct workload evidence here. A study on validating and repairing LLM-generated code found that when developers knew code was AI-generated, they performed better checks but also showed higher cognitive workload (Tang et al.). That matches the lived experience pretty closely: the tool removes some keystrokes, but it can leave you with more supervisory attention per hour.

Productivity is real, but it is not uniform

The mixed productivity literature is not a contradiction. It is a clue. AI looks strongest when the task is self-contained and the success condition is clear. It looks weaker, or even negative, when the task depends on repository history, tacit conventions, and careful integration.

The most useful counterweight to the upbeat productivity story is METR’s RCT with experienced open-source maintainers working in their own repositories. Those developers were 19% slower with early-2025 frontier tools even though they believed AI had helped (Becker et al.). That does not erase the positive findings from Copilot, Google, or the field experiments. It tells you the gains are conditional, and that supervision costs can be large enough to flip the sign in realistic work.

That is also why I do not find self-reported productivity persuasive on its own. The work can feel smoother while still becoming denser. Generating a first draft faster is not the same thing as reducing end-to-end effort once review, integration, and defect cleanup are included (Becker et al., DORA).

The burnout argument should be made carefully

I do not think the evidence supports the lazy claim that AI simply causes burnout. DORA’s work found higher flow, higher satisfaction, and less burnout among heavier AI users, which is real counterevidence (Storer).

But that is not the whole picture. Berkeley researchers following AI use inside an actual company found a different dynamic: expanded scope, more simultaneous threads, fewer natural stopping points, and work seepage into lunch, evenings, and other recovery time (Ye and Ranganathan).

The clean way to reconcile those findings is to stop asking whether AI is good or bad in the abstract. The better question is what the organization does with the local speedup. If faster drafting gets translated into bigger PRs, more scope, and machine-paced expectations without redesigning review and recovery, the day gets denser even if some people feel more productive in the short run (DORA, Ye and Ranganathan).

I want to be careful not to overclaim here. I am not saying the research is settled, or that I can prove AI causes burnout. I am saying that after more than 25 years running engineering teams across different industries, I recognize the shape of work intensification when I see it. On AI-heavy teams, the hours on the calendar do not always go up, but more of the day gets spent in the most mentally expensive mode: judgment, review, supervision, and exception handling. People end the day more drained. Breaks get squeezed. Focus gets chopped up. Work-life balance feels worse even when the timesheet looks roughly the same. That is not proof, but it is a signal strong enough that leaders should redesign the work before the damage is obvious.

If the workday gets denser, shorter schedules make more sense

This is why I think shorter schedules deserve more serious attention than “keep the same hours and supervise at machine tempo.” Long hours are associated with worse cognitive outcomes in the Whitehall II cohort, and newer diary research suggests longer days can help same-day performance while hurting next-day performance through worse sleep and lower morning resilience (Virtanen et al., ten Brummelhuis et al.).

The strongest modern evidence points to real hour reduction, not compression. A large six-country four-day-week study found improvements in burnout, mental health, physical health, and job satisfaction (Fan et al.). The case for a six-hour day is still directionally positive, but the evidence is thinner and older (Akerstedt et al., Schiller et al.).

If AI is increasing the cognitive density of engineering work, then reducing hours is not just a perk. It is one way to keep the gains from turning into continuous supervision with no recovery buffer.

What I would change to protect judgment and recovery on an AI-heavy team

The first thing I would change is the assumption that local coding speed should automatically become end-to-end schedule compression. That is where I think a lot of teams are going to get this wrong. If a project used to take two to three weeks of design, one to two weeks of coding, and one week of review, I would not try to turn that into one week of design, three days of coding, and one week of review just because the model can draft faster. I would meet in the middle. I would keep the total timeline roughly similar, maybe two weeks for design and implementation together and one week for team review.

That might sound like leaving productivity on the table. I think it is the opposite. The gain is not just calendar math. It is better design quality, more time for the author to read their own work with fresh eyes, and less pressure to jam review into lunch, evenings, and context-switched gaps. The draft may appear faster, but judgment does not. The real constraint moves from typing to thinking, and teams should plan accordingly.

I would also make author review a first-class part of the schedule. AI can produce a lot of plausible code quickly, but the person closest to the change still has the best chance of catching local mistakes before they hit the rest of the team. That means the author should have time blocked for a real self-review pass, not just generation followed by immediate PR creation. For larger changes, I would normalize a cooling-off period between drafting and review so the author can come back with fresher judgment instead of merging on momentum.

Reviewer capacity has to be protected the same way on-call capacity is protected. Human review does not scale at the same rate as machine generation. So I would push harder on smaller batches, tighter PR boundaries, explicit evidence requirements, and clearer ownership. A reviewer should not have to reconstruct intent from a wall of generated code. The PR should say what changed, why it changed, what the risky parts are, what tests were run, and where the reviewer should spend attention. That does not remove review work, but it lowers the cognitive tax.

I would also deliberately add recovery time back into the system. No lunch-hour review expectations. No quiet assumption that people will finish the thinking work at night because the coding part got faster. No back-to-back days packed with deep review, meetings, and prompt steering. On an AI-heavy team, recovery is not a nice extra. It is part of keeping judgment sharp enough to be trustworthy. That can mean fewer simultaneous threads, protected focus blocks, review rotations, quiet hours, or even shorter schedules if more of the day is now spent in high-attention work.

I would resist the urge to turn every local efficiency gain into more scope. This is one of the easiest management mistakes to make. The model saves a few days on drafting, so the organization quietly fills those days with more tickets, more PRs, or more parallel work. That is how denser work gets normalized without anyone saying it out loud. A healthier choice is to spend some of that gain on better design, more complete testing, smaller changes, and actual breathing room between cognitively expensive tasks.

Finally, I would measure whether the system is getting healthier, not just faster. Review time, PR size, rework, after-hours activity, reopened work, and escaped defects tell you far more than prompt counts or raw merge volume. The real question is not whether the model wrote code quickly. The real question is whether the team can absorb that speed without turning every day into continuous supervision.
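As a rough illustration of what measuring health instead of speed could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the PullRequest record and its field names are hypothetical, not any real tool's API, and "after hours" is arbitrarily defined as weekends or weekdays outside 09:00 to 18:00 local time. A real team would pull these numbers from its own review system and pick its own thresholds.

```python
from dataclasses import dataclass, field
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    """Hypothetical PR record; field names are illustrative, not a real API."""
    lines_changed: int
    review_minutes: float   # total human review time spent on the PR
    reopened: bool          # True if the work was reopened after merge
    # Local timestamps of human review activity (comments, approvals).
    review_events: list[datetime] = field(default_factory=list)


def is_after_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Assumed definition: weekends, or weekdays outside [start_hour, end_hour)."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def health_summary(prs: list[PullRequest]) -> dict[str, float]:
    """Summarize review load and seepage rather than output volume."""
    if not prs:
        raise ValueError("need at least one pull request")
    events = [ts for pr in prs for ts in pr.review_events]
    after_hours = sum(is_after_hours(ts) for ts in events)
    return {
        "median_pr_size": median(pr.lines_changed for pr in prs),
        "median_review_minutes": median(pr.review_minutes for pr in prs),
        "reopened_rate": sum(pr.reopened for pr in prs) / len(prs),
        "after_hours_share": after_hours / len(events) if events else 0.0,
    }
```

The point of a summary like this is the trend, not the absolute values: if median PR size and the after-hours share both climb after AI adoption, the speedup is probably being absorbed as density rather than recovery.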

Some practical changes fall out of that pretty quickly:

Require authors to include a risk note and test summary for larger AI-assisted PRs.
Put a soft cap on how many AI-heavy reviews one person is expected to do in a day.
Normalize overnight or half-day cooldowns before opening large PRs.
Protect at least one meeting-light block each day for deep review or decompression.
Ban the quiet cultural habit of “just review it over lunch” or “take one more look tonight.”
Spend some AI time savings on design reviews and edge-case thinking instead of immediately increasing scope.
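The first two items on that list are simple enough to automate. The sketch below is one hypothetical way to do it in Python, not a description of any existing tool: the ReviewRequest record, the section titles, the 400-line threshold for a "larger" PR, and the cap of three reviews per day are all placeholder assumptions a team would replace with its own conventions.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed conventions: section titles, size threshold, and cap are placeholders.
REQUIRED_SECTIONS = ("Risk note", "Test summary")
LARGE_PR_LINES = 400
DAILY_REVIEW_SOFT_CAP = 3


@dataclass
class ReviewRequest:
    """Hypothetical record of one review request assigned for a given day."""
    reviewer: str
    lines_changed: int
    description: str
    ai_assisted: bool


def missing_sections(req: ReviewRequest) -> list[str]:
    """Return required sections absent from a large AI-assisted PR description."""
    if not req.ai_assisted or req.lines_changed < LARGE_PR_LINES:
        return []
    text = req.description.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in text]


def reviewers_over_cap(requests: list[ReviewRequest]) -> dict[str, int]:
    """Map each reviewer above the soft cap to their AI-assisted review count."""
    counts = Counter(r.reviewer for r in requests if r.ai_assisted)
    return {name: n for name, n in counts.items() if n > DAILY_REVIEW_SOFT_CAP}
```

Because the cap is soft, a check like this is better used to prompt a conversation about rebalancing than to block anyone's merge.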

My view is not that AI makes software work easier or harder in general. It makes the work different. It removes some friction and concentrates more of the remaining day in supervision, exception handling, and judgment. That can become a real productivity gain, or a faster path to mental saturation. The difference is whether teams redesign the system around human review and human recovery instead of pretending people should work at machine tempo.

The management mistake is to treat faster drafting as permission to squeeze the rest of the system. That is backwards. The faster the draft appears, the more deliberate teams have to be about design quality, review load, and recovery time. Otherwise AI does not lighten the work. It just removes the pauses that used to keep people from operating at continuous supervisory intensity.
