
The Genie Strikes Back

Three ways I've changed my mind on LLMs and one I haven't

Six months is an eternity in the age of LLMs. In hindsight, committing to a post about how I don’t use LLMs was a bit naïve given the steady pace of LLM improvements from frontier labs. Since part two of this series, we’ve seen the release of, among others, Claude Opus 4.5, 4.6, and 4.7; GPT 5.2, 5.3, 5.4, and 5.5; Gemini Pro 3 and 3.1; Qwen 3 Max and 3.5; Kimi K2.5 and K2.6; GLM 5 and 5.1; and DeepSeek v4.

With each incremental improvement, I kept crossing off reasons not to use LLMs until there was nothing much to write about. In the spirit of finishing what I started, I’ll instead share three areas where I’ve changed my mind about LLMs and one core area where I still don’t let them take the wheel.

Test the Waters: Agent-Written Tests

Tests specify how code should behave in the context of the problem it’s solving. Writing good tests means understanding high-level goals and translating them into general assertions. Tests model a problem’s invariants without overfitting to the current implementation. When done correctly, they help steady an evolving codebase and build confidence that changes won’t break expected behavior. On the other hand, it’s incredibly easy to write useless, tautological tests.

In the past, I rarely let LLMs write tests for me. First, they lacked the big-picture view of the codebase. It took more time to explain what I wanted to test at the right abstraction level than it would have taken to write the test myself. Beyond the lack of context, LLMs were also prone to cheating. Their biggest crutch was an over-reliance on mocking and patching. While patching has its place, overusing it reduces tests to trivial assertions. Finally, LLMs tended to write low-quality test code: refusing to use fixtures or parametrized tests, testing irrelevant or extremely thin wrappers, and polluting projects with poorly named and badly organized functions.
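
To make that failure mode concrete, here's the shape of test I used to get back. All the names here (orders.checkout, checkout, apply_discount) are invented for illustration:

```python
from unittest.mock import patch

from orders.checkout import checkout  # hypothetical module, for illustration

# Patch the one collaborator that holds the interesting logic...
@patch("orders.checkout.apply_discount", return_value=90.0)
def test_checkout_applies_discount(mock_apply):
    total = checkout(cart_total=100.0)
    # ...then assert the stubbed value back out. This only proves that
    # checkout forwards whatever the mock returns; the discount logic
    # that actually matters is never exercised.
    assert total == 90.0
```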

In the past six months, I’ve changed my tune. Larger context windows and smarter agents have significantly bridged the comprehension gap. While I still give the agent instructions on what I want to test, I no longer need to constantly babysit it. Agents are also significantly better at respecting my preferences through either AGENTS.md files or skills. Simply outlining my preferred test style addresses most of the headaches I previously encountered.
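
Concretely, the testing section of my AGENTS.md is a short list of conventions along these lines (paraphrased; treat it as a sketch rather than a canonical format):

```markdown
## Testing conventions
- Use pytest fixtures for shared setup; don't duplicate setup inline.
- Prefer @pytest.mark.parametrize over copy-pasted test variants.
- Only mock at network and filesystem boundaries; never patch code
  internal to this repo.
- Assert on observable behavior, not on which functions were called.
- Group tests by feature and name them test_<expected_behavior>.
```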

Even though I’ve found a way to get agents to write tests that satisfy my standards, their unprompted, out-of-the-box test generation remains poor. If you peek at what they’ve done, it’s messy. The tests they write assert very limited behavior for the narrow slice of a feature they just implemented. When working on a large feature, this results in countless tests with poor organization and no overarching themes. I find it easier to explicitly tell agents not to write tests until I ask rather than have them constantly refactor along the way.

Sleep On It: Long-Horizon Autonomous Tasks

Steady focus paired with cognitive breaks can eventually solve most problems; the thorniest require both deep and wide exploration. Before long, there is too much information to hold in working memory at once, so you must synthesize and distill your research. Momentarily stepping away from the problem gives your brain time to form connections you hadn’t considered, until finally the solution seems obvious.

Six months ago, agents were bad at long-horizon tasks. The dreaded session compaction caused agents to lose the thread and go off the rails. If a problem was not solvable before compaction, it wasn’t worth delegating to an agent. Consequently, I only let agents work on well-scoped, small problems. I would work with an LLM on breaking down an ambiguous problem, but I wouldn’t let it try to solve it.

Since then, the latest models have gotten much better at maintaining focus across compaction. Longer context windows allow agents to keep track of more things simultaneously. Additionally, agents are leveraging external files to supplement their active context window with “memories”. By persisting notes to itself, an agent can move on from sub-problems while preserving alignment with longer-term goals. With the advent of autoresearch, a well-defined goal and a way to measure progress are all your agent needs. Before logging off, I often let an agent stew over a problem I haven’t quite cracked yet, setting clear goals without prescribing the approach.
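
For example (with an entirely made-up task), a brief I’d leave an agent overnight pairs a goal with a measurable definition of done and says nothing about the approach:

```markdown
# Overnight task: reduce p95 latency of /search
- Goal: p95 below 200 ms on the local benchmark.
- Done when: `make bench` reports p95 < 200 ms and `make test` passes.
- Constraints: no new dependencies; don't change the public API.
- Keep a running log of attempts and dead ends in NOTES.md.
```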

Unfortunately, running autonomously involves giving agents dangerous permissions. They can be too clever for their own good, often brute-forcing their way through roadblocks with increasingly reckless shell commands. Making agents safer is an active field of research, but in the interim I try to limit the potential blast radius with two low-effort solutions. First, I commit and push my code before letting an agent loose. Since it can’t push without my hardware security key, at worst I’ll have a broken local branch. Second, I restrict access to external services by using MCPs with limited permissions. While the agent won’t always follow instructions, restricted tooling helps ensure it behaves safely.
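
The first safeguard is nothing more than a checkpoint routine before each session, roughly:

```bash
# Snapshot everything before handing over the keyboard.
git add -A
git commit -m "checkpoint before autonomous session"
git push   # pushing needs my hardware key, which the agent lacks

# Worst case, recovery is a single command:
git reset --hard @{upstream}
```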

Swarm for the Trees: Subagents and Worktrees

Working on large projects involves switching between strategic and tactical thinking. Context switching between the two has a real cost. Too much can lead to fatigue, inaccurate execution, and slowdowns. Too little can lead to results that don’t meet expectations. Traditionally, to achieve balance, managers and tech leads focus more on the strategic components while engineers handle the tactical execution. Periodic check-ins between the groups ensure the implementation doesn’t drift far from plan.

LLMs were marketed as junior engineers, and the thinking became that everyone could roleplay as a manager or tech lead, overseeing their own army of agents and subagents. However, six months ago I didn’t find this to be productive. As previously mentioned, long-horizon tasks were difficult for agents, and I rarely trusted them to run more than a few turns without close review. I ended up needing to context switch more than I had previously. While I was thinking more strategically, I was also consistently dragged into tactical babysitting for each agent. Planning how to coordinate work between concurrent agents in the same repo was non-trivial and taxing.

While git worktrees have existed for a while, their recent introduction into agent harnesses has made it incredibly simple to address the coordination problem. Now, spinning up parallel sessions with --worktree handles both the setup and cleanup behind the scenes, and it eliminates the anxiety of merge conflicts. Combined with better long-horizon task execution, I no longer need to wade through implementation details. I can trust subagents to deliver on their well-scoped sub-problems. I review their work much less frequently, spend more time thinking strategically, and do less context switching.
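
Under the hood, this is roughly the plain git that the harness automates (branch and directory names are placeholders):

```bash
# One isolated checkout (and branch) per agent:
git worktree add ../repo-agent-a -b agent-a
git worktree add ../repo-agent-b -b agent-b

# Each agent commits on its own branch in its own directory, so nothing
# collides until you deliberately merge. Cleanup when a branch lands:
git worktree remove ../repo-agent-a
git branch -d agent-a
```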

Despite the increased accessibility, I still don’t always reach for subagents or worktrees. I mostly use them when I have a very clear idea of what the solution should be. This makes verifying their output trivial. For ambiguous problems, however, I much prefer pair-programming with an agent. Working through the issue together helps solidify my own understanding. Reviewing their work only at the very end is far more demanding. Without a clear solution in mind upfront, I need to reverse-engineer the agent’s logic and ensure the execution was correct. Multiply this across several agents and the cognitive load becomes untenable.

Speak for Yourself: Shunning LLMs for Communication

We write to communicate with other humans. The words we share give others a glimpse of who we are, what we feel, and how we think. It’s a reciprocal act with an understanding that the reader will engage with the text because the author invested time conceiving, organizing, and distilling it into something insightful. When an author delegates this work to an LLM, the contract dissolves. With no way to trust that the text has value, the reader must decide whether to do the work of filtering and synthesizing insights themselves, or to simply disengage. The logical choice is to disengage. While LLMs make producing words effortless, they make sharing ideas harder.

In the past, I never seriously considered relying on LLMs to write text for me. At least not text meant for other people. For brainstorming, sketching solutions, summarizing code, or any other throwaway text where I’m the main consumer, I wouldn’t hesitate to have LLMs churn out content. In these scenarios, I don’t need to transfer an idea to anyone else. I define the inputs and I am the sole consumer of the outputs. If the ideas aren’t sound, I spend more time refining them, just as I would have without an LLM; if they pass the bar, then I’ve saved myself valuable time. Crucially, there is no loss of trust because I take on both roles.

On the other hand, when preparing text for others to read, I’ve never found LLMs to be helpful. Their output is extremely formulaic. From the cadence, to the word choice, to the rhetorical patterns, LLM text announces itself with an unashamed and unearned bravado that makes it hard to take seriously. A collection of reverse shibboleths (delve, em dashes, “it’s not X, it’s Y”, etc.) destroys reader trust. Readers are already inundated with slop and have little patience to engage with suspicious content. At best, you can hope they will briefly skim an overview section.

At work, efficiently and precisely sharing ideas is the most important job you have. Interacting with humans is a universal requirement. No matter what technical abilities you have, your impact and influence are hampered without effective communication. If you rely on LLMs for writing, your readers aren’t hearing what you’re saying.

[EOS]

LLMs are progressing faster than I have time to write about them, which is exciting and unsettling. I set out to give a balanced take on where LLMs both support and hinder me as a developer. In this three-part series, I’ve now laid out 11 ways that I’m finding considerable use for them and one area where I’m still holding out. With each technique, I’ve highlighted nuances that can help you be discerning without blindly following the hype. Models will undoubtedly continue to improve and reshape how we work, but the ability to wield them critically will remain paramount.
