People still don’t appreciate the miracle that is next token prediction and the autoregressive transformer.
We are starting to grasp the impact of “generative AI” when applied to language and code. In the next couple of years, language models will do hours of independent work on our behalf and build complex software applications from scratch. Immense value is being created as the AI’s capabilities improve dramatically, year after year.
Everything we’re seeing with language is playing out in other fields, and the future is both obvious and profound. Within a decade, AI will design molecules that cure many diseases, and robots will drive the marginal cost of labor for many tasks to zero.
Hyperbole to some, obvious to others. The most amazing thing is that the breakthrough to enable this has already happened – next token prediction and the autoregressive transformer. The miracle is behind us. The future is foreseeable. We just need the data.
“These models, they just want to learn” - Ilya Sutskever
We’ve seen that if you take a sequence of text – “The cow jumped over the” – an AI model can learn what comes next. Do this with a massive amount of data and compute, and the capabilities of the AI become superhuman.
Similarly with code. Take a sequence – “for i in range(5):” – and an AI model can learn what comes next. The AI’s capabilities improve as the amount of training data and compute increase.
It intuitively makes sense. To the AI, “the cow jumped” is no different from “for i in range(5)”: both are sequences that carry meaning, and the model learns the underlying meaning and rules of each “language” as the amount of training data increases.
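To make this concrete, here is a minimal sketch of next token prediction using the Hugging Face transformers library and the original GPT-2 weights. The two prompts are the examples above, but any sequence works the same way.

```python
# A minimal sketch of next token prediction, using the Hugging Face
# transformers library and the original GPT-2 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for prompt in ["The cow jumped over the", "for i in range(5):"]:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)
    # The model's entire output is a probability distribution over the
    # next token; here we just take the single most likely one.
    next_token_id = int(logits[0, -1].argmax())
    print(repr(prompt), "->", repr(tokenizer.decode(next_token_id)))
```

Training is nothing more than nudging that distribution to match the data, over and over, at enormous scale.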
Both biology and robotics can be expressed as simple sequences – `M-D-K-K-Y-S` and `"joint_positions": [0.2, 1.1, -0.5, 0.8, 1.2, -0.3]`, respectively. The next couple of years are predictable: the models, they just want to learn. And with enough data, we will see revolutions in biology and robotics. In fact, we already are. We just don’t grasp the impact yet.
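For intuition, here is a hedged sketch of how those two sequences become tokens. The 20-letter vocabulary is the standard amino acid alphabet, but the binning scheme (256 bins over a fixed range) is my own illustrative assumption, not any particular lab’s tokenizer.

```python
# Illustrative tokenizers for a protein sequence and a robot action.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def tokenize_protein(sequence: str) -> list[int]:
    # One token per residue: "M-D-K-K-Y-S" -> [10, 2, 8, 8, 19, 15]
    return [AMINO_ACIDS.index(aa) for aa in sequence.split("-")]

def tokenize_joint_positions(positions: list[float], n_bins: int = 256,
                             lo: float = -2.0, hi: float = 2.0) -> list[int]:
    # Discretize each continuous joint angle into one of n_bins tokens,
    # the same trick RT-2-style models use to emit actions as text.
    scale = (n_bins - 1) / (hi - lo)
    return [round((min(max(p, lo), hi) - lo) * scale) for p in positions]

print(tokenize_protein("M-D-K-K-Y-S"))
print(tokenize_joint_positions([0.2, 1.1, -0.5, 0.8, 1.2, -0.3]))
```

Once everything is integers, the transformer doesn’t know or care whether it’s reading Shakespeare, Python, proteins, or joint angles.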
Eras: Understanding the trajectory of AI
It’s helpful to think about eras of AI, and the transition from toy-like models to AI that is useful and, eventually, AI that is valuable.
In 2018-2020, early language models like GPT-2 could complete a sentence of text, but the result was often nonsensical. The models were toy-like. They were not useful and certainly weren’t valuable. But to those who understood what had been done, the toy models pointed to an exciting future.
As the amount of training data and compute increased, and the models entered the second era, early versions of GPT-3 became useful in constrained use-cases, such as writing short ad copy. It took entering the third era, with GPT-3.5 and GPT-4, before models started to become valuable. And the commercial value has continued to increase with model advances like Sonnet 3.7 and o3.
The models are now completing tasks that take highly trained knowledge workers hours to do. As new models arrive, and we enter new eras, the commercial value delivered by these models grows significantly. ChatGPT Plus launched at $20 / mo. Today, access to the best models costs $200 / mo. OpenAI is reportedly discussing prices of $20,000 / mo for future models.
Each new era has expanded the capabilities of what the models can do, expanded the use-cases where the models are useful, and expanded how valuable the models are.
In 2019, looking at the capabilities of GPT-2, seeing it fail to coherently finish a single sentence, it was hard for many to see what was coming. But the models, they just want to learn.
Code has followed a similar trajectory: from generating short code snippets, rife with errors, to developing entire applications from scratch, in a few short years. Had you looked at the early coding models, it would have been hard to foresee the explosive growth in revenue that we’re seeing from products like Lovable or Replit. But the pattern with code has been identical to language: from toy to useful to valuable, era by era, as the models get larger.
We’re two eras into the same trajectory in biology. Early transformer-based models, trained on a small amount of data, showed that a model could generate sequences for simple proteins that were sometimes functionally correct. Neither useful nor valuable, and easily dismissible. Toy-like and no different from early GPT-2 demos. Scale up the model, enter the second era, and the model can generate a protein that’s functionally identical to a molecule worth $60b. Certainly useful, perhaps not yet valuable.

But given it’s the same underlying mechanic, next token prediction and an autoregressive transformer, what do you think is going to happen as the model scales and enters the 3rd and 4th eras? What about the 7th and 8th? We are on a path where models are learning the underlying language of nature. Biology is becoming programmable. Era after era, the models are becoming more useful and more valuable.
When you read Google’s papers from 2023, PaLM-E and RT-2, which show that an autoregressive transformer, trained on next token prediction, can control a robot, it may be easy to miss what’s coming. But when more recent work from Physical Intelligence and DeepMind shows that generalization and capability improve as data scales, it’s hard to ignore that we’re walking down the same path. Yes, it’s a toy today. Yes, it’s not useful today. Yes, it’s far from valuable today. But we’re in the first era with robotics, and it’s working the same way it worked for language and code. The trajectory will be no different.
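To see why those results rhyme with language and code, here is a hedged sketch of the control loop those papers describe. StubPolicy and the de-binning scheme are stand-ins of my own (the inverse of the binning sketched earlier), not PaLM-E’s or RT-2’s actual interfaces.

```python
# A hedged sketch of an RT-2-style control loop: the policy is just an
# autoregressive model emitting discrete action tokens, mapped back to
# continuous joint targets. StubPolicy is a stand-in, not a real API.
import random

class StubPolicy:
    # Stand-in for a vision-language-action model. A real one conditions
    # on camera images and a language instruction; this one emits random
    # tokens so the loop is runnable end to end.
    def next_action_tokens(self, instruction: str, n_tokens: int = 6) -> list[int]:
        return [random.randrange(256) for _ in range(n_tokens)]

def detokenize_joint_positions(tokens: list[int], n_bins: int = 256,
                               lo: float = -2.0, hi: float = 2.0) -> list[float]:
    # Inverse of the earlier binning: map each token id back to a joint angle.
    width = (hi - lo) / (n_bins - 1)
    return [lo + t * width for t in tokens]

policy = StubPolicy()
tokens = policy.next_action_tokens("pick up the cup")
print(detokenize_joint_positions(tokens))  # six joint targets in radians
```

The point of the sketch is the shape of the loop, not the stub: control is just generation, and generation improves with data.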
The miracle is behind us. It’s all going to work. We just need the data.