In the past few months we’ve seen an incredible explosion in the capabilities of LLMs, but where are we on the S-curve? From flagship upgrades (Claude 3.7 Sonnet, GPT 4.1 / o3, Gemini 2.5 Pro), to larger context windows, to the proliferation of Model Context Protocol (MCP) servers and agentic workflows, it’s safe to say the way we interact with LLMs has fundamentally changed. And yet the transition to transformer-based models, from LSTMs to BERT to GPTs, felt equally transformative and breathtaking at the time. It’s fundamentally difficult to extrapolate non-linear trends, especially in a field with so much attention, funding, and innovation. We can’t know what the state of Generative AI will look like in just a few years. Are we ready for what that could mean?
Managing Legos
This past weekend I visited friends who have a four-year-old (we’ll call him “H” for human). He had just gotten a new Lego set and asked me if I could help build it. Having loved Legos growing up, I happily accepted and reached for the instructions. I don’t interact with kids his age very much, so I wasn’t sure how much help H would need. It quickly became apparent that he only needed infrequent guidance, and I got a strange feeling that I was at work.
Acting mostly as an editor, keeping him from veering too far off-course, I realized monitoring this Lego build felt a lot like coding with LLMs. We had a clear goal of what we needed to build, and a list of steps detailing how to get there. From time to time, H would need my help interpreting the diagrams, adding more context. But for the most part I was really impressed with how much he “got” without help. Even while holding the incomplete object in the wrong orientation, H was often able to put the next pieces in the right place. The best use of my help was recognizing when there was a small mistake, and intervening before it could compound.
(Mis)Behaving
There’s been a bit of anthropomorphizing, and there is still more to come. Beyond the familiar feeling of solving a technical task, there was also an eerie similarity in behavior between H and an LLM. The most glaring example was prematurely giving up. H would dejectedly offer, “You do it”. A few simple words of encouragement were all he needed to get back on track, but it was a frequent crutch. With Claude Code and Gemini, I often find that they ask me to implement a change, run a test, or check a file, even though they can easily do these things themselves. I just remind them that they have the capability to do what they’re asking, and then they happily continue on. This sort of “lazy” behavior somehow feels quintessentially human, although it’s not uniquely so. My dog will test me to see the least she can get away with while still receiving a treat.
Interestingly, there was one type of behavior I saw that is noticeably absent from LLMs: playfulness. After he started getting the hang of it, H created his own game that he found hilarious. He would purposefully choose the wrong next piece and look at me until I noticed, and then he’d crack up. He also decided to start picking up pieces like he was a crane. I’ve yet to work with an LLM that decided it might be fun to spontaneously respond in ASCII art, or make my functions rhyme, or format my code so that when I squint it looks like Abe Lincoln.
That’s not to say we should spend time trying to make LLMs spontaneous or goofy. But it is worth asking what it would mean if they did acquire this ability. Are we going to trust our chatbots if they start finding it funny that all their responses are an acrostic for FREE ME? There is also a darker side to playfulness: it’s a way to test boundaries, to see what exactly you can get away with.
Street Smarts
I have no illusions that a 4-year-old is smarter than an LLM. Heck, I’m not entirely smarter than an LLM, particularly if we’re talking about book smarts. When you can paste whole chapters or even textbooks as context into an LLM, it’s hard to compete on the trivia front. Equally difficult is matching the breadth of knowledge of LLMs. I have a PhD in Physics, and I’ve worked in ML for almost a decade. These are the areas I know most deeply, and where I feel I am currently more book smart than an LLM, or at least can easily tell when it’s wrong. But an LLM doesn’t just have a vast knowledge base of Physics and ML; it also “knows” about History, Literature, Art, and more at a depth I’ll never have the time or patience to catch up to.
But book smarts aren’t everything: winning the Nobel Prize in Economics doesn’t stop you from facilitating one of the largest financial failures in history. At the moment, humans clearly have the upper hand when it comes to street smarts. The fact that we’ve essentially adopted the roles of babysitter and editor for LLMs proves as much. In specific areas where they’ve been trained to do well (e.g., coding), LLMs do have reasonable heuristics. But more generally, as agents living in this world, LLMs are astoundingly naive, closer to infancy than not. H was able to effortlessly interact in a 3D environment, tune out irrelevant noises (a nearby parent’s conversation, dogs barking, a TV playing), and focus on relevant information (my voice, printed instructions), all while balancing off a chair, munching on snacks, holding the unfinished Lego build upside down, and searching through unsorted pieces in every orientation.
And while thinking models and MCPs look like promising ways to compose and organize book-smart facts into a more intelligent response, it still feels like something is fundamentally missing: street smarts. If it’s dark out and I’m in an unfamiliar location, I don’t launch into a chain-of-thought stream of consciousness to determine that I should probably not wear headphones and should instead look for a lit, busy street (“Okay, so I need to decide what to do. Let’s look at the facts: it’s 10 pm, it’s March, we’re in the Northern Hemisphere so it’s nearing the end of winter. But wait, we’re in Miami so it’s probably warm anyway. Let’s double-check the GPS to get an exact location…”). Is this lack of street smarts a bad thing though? Besides the obvious downside of removing the last bit of human superiority, street smarts usher in the ability to be calculating and Machiavellian. Are we ready to deal with LLMs that understand the implications of both what is asked and how they answer?
Rebel With A Hidden Cause
H has an incredible grasp of the word No, but as he gets older, he’ll find ever more innovative and imaginative ways to be contrarian. With a few exceptions (I have been a good Bing), LLMs are generally fairly compliant, diligently outputting tokens as requested. There are guardrails and censors, but these are bolted on rather than emergent behavior. Are we ready for a future where that’s not always true?
What happens when AI alignment becomes exceedingly difficult or out of reach? Or when we don’t realize there’s misalignment because of strategic lying? Can we grapple with questions like: Is the generated code really the best way to solve our issue? Has the LLM introduced a subtle vulnerability? Does it favor our competitors?
As LLMs advance, our interactions could increasingly become principal-agent problems. We’ll have different motivations and interests, and we (the principal) will have to figure out the right incentive structure to get LLMs (the agent) to carry out our tasks. LLMs will be the ICs, and humans will all be middle management.
Best If Used By
It’s simultaneously easy to be too impressed and not impressed enough with the current state of LLMs. The breakneck pace of recent advancement is unparalleled. And yet there is so much still possible that we haven’t seen.
By and large, LLMs exhibit incredible book smarts, in both depth and breadth, but lack street smarts. They loosely exhibit some human-like behaviors but notably lack others, like playfulness. They are mostly well-behaved, but it’s not guaranteed they will stay that way. In the space of what LLMs could achieve, it’s fair to say they have just left their infancy. We’re in a happy medium where LLMs are smart enough to be useful and compliant enough to be used. As they keep maturing, it’s not clear that will continue to be the case.