Can LLMs be creative?
How far out of distribution can modern AI systems go?
A few months ago I was out to dinner with a childhood friend. He is not an AI specialist, but he has a degree in computer science. At some point the topic of AI came up, and he shared why he thought current systems will never be truly intelligent: all they do is combine the data in their training set, and thus they can never possess true creativity. His example is one that’s repeated often. Early image generators seemed incapable of creating an image of a completely full glass of wine, presumably because nearly all the pictures posted online (and hence nearly all the images they were trained on) depict partially filled glasses. When I tried in late June 2025, I had no problem using an AI image generator to create a picture of a full wine glass. Still, the question is a good one: can current AI systems ever be truly creative, and produce something completely “out of distribution” (i.e. not found in their training data)?
In this post I’m going to focus primarily on one particular kind of creativity: creative problem solving. The question here is, “to what extent can AIs think ‘outside the box’, and come up with novel solutions to problems they’ve never seen before?” This is intimately related to whether or not we can ever develop human-level Artificial General Intelligence (AGI), a question that is driving many of today’s big-tech efforts.
When ChatGPT first entered the public consciousness, it largely lacked any ability to solve problems it hadn’t seen before. The first version of ChatGPT was based on a model called “GPT-3.5”, which was created by showing it many, many language examples and having it “learn”, only from those examples, how to create plausible-sounding responses to questions. This is a classic example of “Supervised Learning,” a common approach in many areas of Machine Learning (not just neural networks like ChatGPT).
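To make “learning from example” concrete, here is a drastically simplified sketch: a bigram counter standing in for a neural network. The corpus and names are invented for illustration; real models predict the next token with billions of parameters, but the key idea, that the “labels” come straight from the text itself, is the same.

```python
from collections import Counter, defaultdict

# Toy corpus: each adjacent word pair (context, next word) is a
# training example -- the "label" comes straight from the text.
corpus = "the cat sat on the mat the cat ate the food".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1          # how often each word follows `prev`

def predict(word):
    """Return the most plausible next word seen during training."""
    return counts[word].most_common(1)[0][0]

print(predict("the"))   # 'cat' -- the most common continuation
```

Notice that the predictor can only ever echo continuations it has seen; that is the heart of the “just combining its training data” critique.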
An alternate approach to Machine Learning is called “Reinforcement Learning,” in which a machine interacts with an “environment” and “learns” from positive and negative rewards it receives from those interactions. A classic example of Reinforcement Learning is an algorithm that learns to play a video game: the game is the environment, a change in point score is the reward.
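Here is a minimal sketch of that reward loop: tabular Q-learning on an invented one-dimensional toy game. (The environment and all names are illustrative; AlphaGo’s actual training, deep networks plus tree search, is far more sophisticated, but the learn-from-reward loop has the same shape.)

```python
import random

# Toy environment: the agent walks on positions 0..4 and scores a
# point (+1) for reaching position 4; every other step costs -0.01.
class WalkGame:
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):          # action: 0 = left, 1 = right
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        done = self.pos == 4
        reward = 1.0 if done else -0.01
        return self.pos, reward, done

# Tabular Q-learning: estimate the value of each (state, action)
# pair purely from the rewards the environment hands back.
q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration

env = WalkGame()
for episode in range(500):
    state, done = env.reset(), False
    while not done:
        if random.random() < epsilon:
            action = random.choice((0, 1))                     # explore
        else:
            action = max((0, 1), key=lambda a: q[(state, a)])  # exploit
        nxt, reward, done = env.step(action)
        best_next = max(q[(nxt, 0)], q[(nxt, 1)])
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = nxt

print("learned policy:", [max((0, 1), key=lambda a: q[(s, a)]) for s in range(5)])
```

The point is that the reward is the only supervision: no one ever shows the agent a correct move, yet it ends up always walking right.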
Here’s all you need to remember: In Supervised Learning (SL), models “learn from example.” In Reinforcement Learning (RL), models “learn from experience.”
Reinforcement learning demonstrated its capacity for creative problem solving several years ago. The most famous example comes from a team of researchers at Google who trained an RL model to play the ancient game of Go. Go is a board game commonly played in Asia. It is older than chess, played by more people worldwide, and significantly more difficult for a computer to master. In 2016 Google’s AlphaGo won a five-game match against Lee Sedol, thought at the time to be among the best Go players in the world. The most referenced moment of that match is the 37th move of the 2nd game, when AlphaGo did something that initially seemed like a mistake. It broke all the common rules of thumb; no expert would have played it. It was so outside-the-box that there was no way for such a move to be in its training data. But AlphaGo didn’t just learn how to play from its training data: it also learned by playing against itself many thousands of times, discovering on its own what made a move good or bad. Move 37 in game 2 turned out to be the pivotal moment of that game, leading directly to AlphaGo’s victory. No Go expert would hesitate to call that move “creative.”
It’s important to realize that RL doesn’t mean the computer just tries thousands of random moves and picks the best one. A game like Go is far too complicated to play out every possible game. There is some randomness, but mostly in the training phase, when the computer is playing against itself; in RL jargon this trade-off is called exploration versus exploitation. During play against a human, the computer uses the strategies it learned during training to select moves judiciously, as the sketch below illustrates.
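A minimal sketch of that train-time/play-time distinction (the action values and move names are invented for illustration):

```python
import random

# Hypothetical learned action values for one board position.
q_values = {"move_a": 0.62, "move_b": 0.71, "move_c": 0.15}

def select_move(q_values, training, epsilon=0.2):
    """During training, occasionally try a random move to discover
    new strategies; during real play, always take the best known move."""
    if training and random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

print(select_move(q_values, training=True))    # sometimes a random move
print(select_move(q_values, training=False))   # always "move_b"
```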
Breakthroughs in RL like AlphaGo paved the way for the “reasoning” LLMs that have been developed in the last year. These models differ from the original ChatGPT in that RL has been added as an important step in their training procedure (see the Technical Footnote below). This has been most effective in domains with a clear reward signal. For example, in solving a math problem (whether by calculation or higher-order logic) or in writing computer code, there are often clear “right” and “wrong” answers. Finding good reward signals in other domains is an ongoing line of research on which the AI labs are making rapid progress.
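The labs’ actual reward pipelines aren’t public, but a “verifiable” reward can be as simple as comparing a final answer to a known result, or running a generated function against unit tests. A toy sketch, with invented function names and test cases:

```python
# Toy "verifiable rewards": score a model's output by checking it
# against ground truth -- no human judgment required.

def math_reward(model_answer: str, correct_answer: str) -> float:
    return 1.0 if model_answer.strip() == correct_answer.strip() else 0.0

def code_reward(model_code: str, tests: list[tuple[int, int]]) -> float:
    """Reward generated code by the fraction of unit tests it passes."""
    namespace = {}
    try:
        exec(model_code, namespace)            # define the model's function
        f = namespace["solve"]
        passed = sum(1 for x, want in tests if f(x) == want)
        return passed / len(tests)
    except Exception:
        return 0.0                             # code that crashes earns nothing

print(math_reward("42", "42"))                                           # 1.0
print(code_reward("def solve(x):\n    return x * 2", [(1, 2), (3, 6)]))  # 1.0
```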
These reasoning systems have stunned many with their ability to be “creative.” Scientific American recently featured a story about a meeting of mathematicians in Berkeley, California, whose sole purpose was to come up with challenging, novel problems for an LLM. After the meeting, Ken Ono, a mathematician at the University of Virginia, was quoted as saying:
“I have colleagues who literally said these models are approaching mathematical genius. … I was not prepared to be contending with an LLM like this, I’ve never seen that kind of reasoning before in models”. —Ken Ono
Last week both Google and OpenAI announced that their unreleased next-gen models achieved gold-medal-level performance on the 2025 International Math Olympiad (IMO), an achievement that stunned even AI experts. Two days later, a pair of researchers from UCLA reported similar performance on the IMO 2025 questions using Google’s currently available Gemini 2.5 Pro model. Although the IMO is a competition for high-school students, its problems are difficult enough to stump many PhD mathematicians. Answers are complex proofs that require sophisticated logical reasoning, not calculation. It’s almost certain that none of the problems appear anywhere in the AI models’ training data.
There’s a lot to be skeptical of in these accomplishments. For one thing, it’s not clear whether mathematical prowess translates to the ability to solve novel problems in other domains. What these achievements do demonstrate, however, is that RL is capable of imbuing AI with creative problem-solving abilities. The only necessary ingredient is a clear reward signal, and those signals can come from many places, not just “right” or “wrong” steps in solving a math problem. There is a lot of current research into finding good reward signals in business applications, science, technology, medicine, etc.
Perhaps a more serious critique is that, even within mathematics, solving problems is often not the most creative act. It takes far more creativity to come up with good problems in the first place! While the achievements of all the IMO medalists (both human and non-human) are certainly impressive, what’s truly amazing is the panel of mathematicians able to come up with genuinely new problems that demand this kind of creative problem solving. Many have argued that asking the right questions is what pushes a domain in new directions. It’s not clear when, or if, AI will be able to do that.
Finally, creative problem solving is just one kind of creativity; there are many others. Artistic creativity, for example, is more than making pictures no one has made before. Musical creativity is more than stringing notes together in new sequences. People recognized as “creative geniuses” push their domains forward in ways that are deeply human. Almost by definition, no Artificial Intelligence will ever be able to do that. Humans will always appreciate human-created music, literature, visual art, and performance. And I expect humans will always be the ones who ask the questions that push those genres in new directions, raising humanity to new heights.
Technical Footnote: When ChatGPT was first released, there was some RL involved in the training process. This was called RLHF, or “Reinforcement Learning from Human Feedback,” in which the reward signal for good and bad responses was derived from feedback provided by human evaluators. This is very different from the more contemporary use of RL called “Reinforcement Learning with Verifiable Rewards,” where the reward signal for reasoning steps in a “chain of thought” comes from more objective data.
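To illustrate the difference, here are two invented stubs (not any lab’s actual code): an RLHF-style signal comes from a learned model of human preferences, while a verifiable signal comes from an objective check.

```python
# Sketch of the two reward signals (invented stubs for illustration).

def rlhf_reward(response: str) -> float:
    """RLHF-style signal: in practice a neural "reward model" trained on
    human preference comparisons predicts how much a person would like
    the response. A crude stand-in heuristic is used here."""
    return min(len(response) / 100, 1.0)   # pretend longer ~ better

def verifiable_reward(response: str, expected: str) -> float:
    """RLVR-style signal: an objective check against ground truth.
    No human judgment (or model of it) is involved."""
    return 1.0 if response.strip() == expected else 0.0

print(rlhf_reward("Go is an ancient board game."))  # subjective estimate
print(verifiable_reward("4", "4"))                  # objectively correct: 1.0
```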



I for one would love LLMs to solve the physics of the singularity prior to the First Three Minutes of the universe, create a molecule that effectively treats schizophrenia, or specifically diagram how their own deep neural nets output a correct answer.
Creativity in literature or art is a horse of a different color/magisterial domain. Was Moby-Dick regarded as a work of genius in 1851? Nope ... and its sales were dismal until the 1920s. William Faulkner was a fairly obscure regional author until Malcolm Cowley resurrected him in 1946.
How can we expect LLMs to find "the correct answer" for the moving target of culture as "the best that has been thought and said in the world" (Matthew Arnold, 1869)?
In my opinion, the fundamental premise of this article, namely that RL fostered creativity in LLMs, is false. By any non-goal-post-moving definition of creativity, neural networks are inherently creative, in that they can interpolate and extrapolate in novel ways. Take a look at the GPT-2 announcement from OpenAI in 2019 and read the text about the discovery of unicorns (https://openai.com/index/better-language-models/). This is an entirely fantastical creation that invented new and unique details for a scenario that did not exist in the training data. When GPT-3 came out (before ChatGPT had even been thought of), I played with what it could do and saw plenty of creative text generation. Similarly, when image-generation models came out, people delighted in creating hybrid animals, mixing together different and purportedly incompatible art styles and so forth.
RL, if anything, adds _constraints_ to the creativity of LLMs. It says “you must be logical” or “always check your sources” or “don't claim to have any kind of sense of self, that upsets the humans”.
We can, of course, adopt some definition of creativity that presupposes an ex nihilo belief about human creativity: that we're not merely remixing and reinterpreting what has gone before, but magically pulling new ideas out of the ether, unmoved by all that came before. But if that's the position, there's little point in having any kind of conversation, as we've decided the conclusion at the outset: only humans need apply for the creativity merit badge.
That said, I'm still delighted to read your post and see you thinking about these issues, and even if we disagree about what makes AI potentially creative, we both agree that it sometimes is, by our own definitions.
And with creativity in mind, inspired by your post, I asked Claude to come up with a creative parody of your post, where we ask whether planes can really fly, and make the claim that only jets come close to the flight freedom of birds. It's just a bit of fun, but I hope you enjoy it (https://claude.ai/public/artifacts/725072aa-b13e-41b9-a942-23322987bf55).