Entropy

Generalist - Scaling AI in Robotics

Fraser — Wed, 05 Nov 2025 19:46:16 GMT

We have seen what happens when the scaling hypothesis is applied to language. In 2019, AI could complete a sentence. By 2020 it was finishing a paragraph. Today, it’s working in complex codebases for hours and helping prove theorems in quantum complexity theory.

We are seeing what happens when the scaling hypothesis is applied to biology. First, AI could generate functional proteins that were of little use. Now it’s generating functional CRISPR systems completely from scratch. Soon, AI will make biology programmable.

The same trajectory is unfolding in robotics. It continues to amaze me that the transformer, when paired with enough data, works across domains. As Ilya said, “These models, they just want to learn”.

In late 2022 and early 2023 Google published a series of influential robotics papers. PaLM-E was impossible to ignore. The paper showed that with the right data a generalist model beats a specialist model. It suggested that the transformer would take us to a generalized robotics foundation model.

Pete Florence and Andy Zeng were co-authors of PaLM-E and together with Andy Barry, Pete’s PhD classmate at MIT, are the co-founders of Generalist, a startup advancing the frontier of AI in robotics.

Yesterday, Generalist shared GEN-0. It is the highest-dexterity model in the world, the beginning of an era where capabilities predictably scale with data, and the first step toward realizing their mission of making general-purpose robots a reality. Along with the model they are sharing the first scaling laws in robotics, a new model architecture, and news of over 270,000 hours of real-world manipulation data from 1,000s of homes, warehouses, and workplaces.

Pete, Andy, and Andy have built a team that feels equivalent in talent and raw ambition to early OpenAI. And their progress reflects this. At OpenAI I consistently had my hair blown back by the engineering challenges that were solved in service of unblocking scale. Generalist has been no different. The stories are already legendary, including building custom hardware and infrastructure – such as laying new dedicated internet lines – to handle the scale of the data volume they’re collecting for their models.

On a personal note, it’s a joy to be reunited with my friend Evan Morikawa, part of the founding team at Generalist and my partner in building the product development org at OpenAI.

I am beyond thrilled to be involved with Andy, Andy, Pete, and the entire team at Generalist. As part of Spark’s investment I have joined the board. Generalist is well on their way to enabling a future where we all benefit from robots in our homes and work.

The Miracle Is Behind Us: Thinking about Eras of AI to Understand What's Coming

Fraser — Tue, 01 Apr 2025 16:17:15 GMT

People still don’t appreciate the miracle that is next token prediction and the autoregressive transformer.

We are starting to grasp the impact of “generative ai” when applied to language and code. In the next couple of years, language models will do hours of independent work on our behalf and build complex software applications from scratch. Immense value is being created as the AI’s capabilities improve dramatically, year after year.

Everything we’re seeing with language is playing out in other fields and the future is both obvious and profound. Within a decade, AI will design molecules that cure many diseases and robots will drive the marginal cost of labor for many tasks to zero.

Hyperbole to some, obvious to others. The most amazing thing is that the breakthrough to enable this has already happened – next token prediction and the autoregressive transformer. The miracle is behind us. The future is foreseeable. We just need the data.

“These models, they just want to learn” - Ilya Sutskever

We’ve seen that if you take a sequence of text – “The cow jumped over the” – an AI model can learn what comes next. Do this with a massive amount of data and compute and the capabilities of the AI become superhuman.

Similarly with code. Take a sequence – “for i in range(5):” – and an AI model can learn what comes next. The AI’s capabilities improve as the amount of training data and compute increase.

It intuitively makes sense. To the AI, “the cow jumped” is no different from “for i in range(5)”, in the sense that both are sequences that have meaning and the model learns the underlying meaning and rules of the “language” as the amount of training data increases.

Both biology and robotics can be expressed as simple sequences – `M-D-K-K-Y-S` and `"joint_positions": [0.2, 1.1, -0.5, 0.8, 1.2, -0.3]`, respectively. The next couple of years are predictable: the models, they just want to learn. And with enough data, we will see revolutions in biology and robotics. In fact, we already are. We just don’t grasp the impact yet.
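To make the "it's all just sequences" point concrete, here is a minimal sketch. It assumes a toy bigram counter standing in for a transformer, and the tiny corpora are made up for illustration. The only claim is that the same training code runs unchanged on text, code, and protein tokens. Robot joint positions would first have to be turned into discrete tokens, which this sketch does not attempt.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count, for every token, which tokens follow it in the training data."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent successor of `token`, or None if unseen."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]


# Toy corpora (hypothetical): one sequence per "language".
text = [["the", "cow", "jumped", "over", "the", "moon"]]
code = [["for", "i", "in", "range", "(", "5", ")", ":"]]
protein = [list("MDKKYS")]

text_model = train_bigram(text)
code_model = train_bigram(code)
protein_model = train_bigram(protein)

print(predict_next(text_model, "cow"))
print(predict_next(code_model, "in"))
print(predict_next(protein_model, "M"))
```

A real model replaces the counting with a learned network and far longer context, but the interface is the same: sequence in, next token out.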

Eras: Understanding the trajectory of AI

It’s helpful to think about eras of AI, and the transition from toy-like models to AI that is useful and, eventually, AI that is valuable.

In 2018-2020, early language models like GPT-2 could complete a sentence of text, but it was often nonsensical. The models were toy-like. They were not useful and certainly weren’t valuable. But to those who understood what had been done, the toy models pointed to an exciting future.

As the amount of training data and compute increased, and the models entered the second era, early versions of GPT-3 became useful in constrained use-cases, such as writing short ad copy. It took entering the third era, with GPT-3.5 and GPT-4, before models started to become valuable. And the commercial value has continued to increase with model advances like Sonnet 3.7 and o3.

The models are now completing tasks that take hours of time for highly trained knowledge workers to do. As new models arrive, and we enter new eras, the commercial value delivered by these models grows significantly. At launch, ChatGPT was priced at $20 / mo. Today, access to the best models costs $200 / mo. OpenAI is discussing prices of $20,000 / mo for future models.

Each new era has expanded the capabilities of what the models can do, expanded the use-cases where the models are useful, and expanded how valuable the models are.

In 2019, looking at the capabilities of GPT-2, seeing it fail to coherently finish a single sentence, it was hard for many to see what was coming. But the models, they just want to learn.

Code has followed a similar trajectory. From generating short code snippets, rife with errors, to developing entire applications from scratch in a few short years. Had you looked at the early coding models it would have been hard to foresee the explosive growth in revenue that we’re seeing from products like Lovable or Replit. But the pattern with code has been identical to language: from toy to useful to valuable, era by era as the models get larger.

We’re two eras into the same trajectory in biology. Early transformer-based models, trained on a small amount of data, showed that a model could generate sequences for simple proteins that were sometimes functionally correct. Neither useful nor valuable and easily dismissible. Toy-like and no different from early GPT-2 demos. Scale up the model, enter the second era, and the model can generate a protein that’s functionally identical to a molecule worth $60b. Certainly usable, perhaps not yet valuable. But given it’s the same underlying mechanic, next token prediction and an autoregressive transformer, what do you think is going to happen as the model scales and enters the 3rd and 4th eras? What about the 7th and 8th era? We are on a path where models are learning the underlying language of nature. Biology is becoming programmable. Era after era, the models are becoming more useful and more valuable.

When you read Google’s papers from 2023, PaLM-E and RT-2, where they show that an autoregressive transformer, trained on next token prediction, can control a robot, it may be easy to miss what’s coming. But when more recent work, from Physical Intelligence and DeepMind, shows that generalization and capability improve as data scales, it’s hard to ignore that we’re walking down the same path. Yes, it’s a toy today. Yes, it’s not useful today. Yes, it’s far from valuable today. But we’re in the first era with robotics and it’s working the same way it worked for language and code. The trajectory will be no different.

The miracle is behind us. It’s all going to work. We just need the data.

Elicit - Automating Research

Fraser — Wed, 26 Feb 2025 14:40:15 GMT

Einstein said the important thing is to never stop questioning, a truth that fuels scientific discovery, the heartbeat of progress. The potential for AI to positively shape the future is perhaps greatest in accelerating research itself – enhancing our ability to ask better questions, find relevant information faster, and draw connections that might otherwise remain hidden.

Elicit is building an AI Research Assistant that is used by over 400,000 researchers each month to accelerate their research and automate time-consuming tasks like summarizing papers, extracting data, and synthesizing findings.

Today, Elicit is introducing new features that use AI to automatically generate Systematic Reviews and Research Reports, delivering in minutes what has traditionally taken months of manual work. For a given research question, Elicit searches for potentially relevant papers from a corpus of 125 million, writes its own screening criteria and filters the studies based on the inclusion criteria, extracts key metrics and data from the filtered set of papers, and then synthesizes the findings into a report. All automatically via AI.

Automating the ability to understand what is known is just the first step towards Elicit’s mission to radically increase good reasoning in the world. The plan is to help automate discovery and the scientific method while generalizing scientific rigor to many more domains.

The future envisioned by Elicit's co-founders, Andreas Stuhlmüller and Jungwon Byun, sees Elicit being used by medical researchers to quickly synthesize findings across thousands of studies to identify promising treatments. By policy makers who rapidly assess the evidence base for interventions. By scientists who can navigate vast literature landscapes to discover overlooked connections and accelerate breakthroughs. By removing the friction from the research process, and guiding humans to better decisions, Elicit transforms how knowledge accumulates and flows, turning what was once a painstaking process of discovery into something that unfolds in real time, with positive implications for human progress.

To bring this vision to reality, Andreas and Jungwon have built a talented team across engineering, machine learning, and product design. We were impressed by the quality of everything Elicit does. From their product to their job descriptions, the team ships quality, quickly.

Spark is proud to support Andreas, Jungwon, and the team at Elicit as they work to automate research and build the infrastructure for good reasoning at scale. Together with Footwork we are co-leading a $22M investment in Elicit joined by Fifty Years, Basis Set, and Mythos.

With this funding Elicit will expand beyond academic research, bringing systematic, evidence-based AI decision-making to organizations across healthcare, policy-making, scientific research, and many other domains. In a world increasingly shaped by artificial intelligence, Elicit’s approach – using AI to help radically improve reasoning – represents the kind of thoughtful innovation we need. We’re excited to join Elicit on this journey.

Hallway Chat

Fraser — Fri, 31 May 2024 16:07:18 GMT

A decade ago or so Nabeel and Bijan started recording informal conversations about startups, technology, and products. It was authentic and unpolished, their passion and personalities a joy to listen to.

A few months ago Nabeel and I revived Hallway Chat. Our goal is simple - to share some of our discussions as we try to make sense of this moment and explore products and technologies.

Here are some of my favorite episodes:

Launch stories of ChatGPT where I shared some stories on the one year anniversary.

Exploring how AI will impact software development.

Thinking through AI models and when there may be differentiation vs commodification.

Profluent - Scaling AI in Biology

Fraser — Thu, 16 May 2024 15:12:22 GMT

In 2019 I sent a rather cold email to Ilya Sutskever. In hindsight it’s rather humorous — I thought he’d be interested in reading Rich Sutton’s The Bitter Lesson along with my thoughts on how dramatically I believed the world was going to change as people scaled AI. He responded warmly asking to meet at Stable Cafe.

This was part of a two week stretch that is among the most electric of my career. I would read research papers late into the evening only stopping to eagerly tell my wife about the elegance of masked language modeling and what it pointed to. I tinkered with fine-tuning BERT, joyously taunting the best ML Engineer I knew that he was going to be unemployed soon (he’s now at OpenAI).

We have seen what happens when the scaling hypothesis is applied to language.

Last summer I had the joy of experiencing this feeling again when I met Ali Madani and learned what’s happening as the scaling hypothesis is pursued in biology.

In January 2023 Ali and his team published research detailing for the first time how large AI models trained on biological data can generate complete proteins that don’t exist in nature yet function as well as proteins that evolved over millions of years. This model was equivalent in size to GPT-2, a model that was released in 2019.

Large AI models in biology exhibit similar traits to what has been observed in language models — the base model can be fine-tuned to improve performance in specific domains and new capabilities emerge with scale.

Last month the New York Times covered a recent release from Ali’s company, Profluent. Profluent has trained an LLM on a massive amount of biological data and has an AI model that can design gene editors that do not exist in nature. Profluent has now shown that a gene editor completely designed from scratch by AI can successfully edit human DNA. And they are open sourcing an AI-generated gene editor for free use. This is truly remarkable for many reasons, but particularly for the future it points to as the scaling hypothesis plays out in biology.

I am so delighted to be involved with Ali and Profluent as they advance the frontier of AI research in biology and work to make biology programmable. You can read more about our investment on Spark’s site.

The case for open AI

Fraser — Fri, 05 May 2023 17:24:43 GMT

The memo that leaked from Google (which is real) makes the case for an open future for AI. I have increasing conviction that aspects of this future will unfold but I disagree with the conclusion. I believe we are heading to a world where there’s an oligopoly for the largest, most capable models accompanied by a thriving open ecosystem of smaller models.

Let’s get this out of the way upfront: I’m an OpenAI shareholder; I’m now actively investing into the future I foresee as a VC at Spark. I’m biased, have conflict etc etc.

AGI or Bust

I don’t see how there’s anything but a small number of groups providing the largest, most capable models. At this point it seems clear that it will be OpenAI and Anthropic and then tbd on whether anyone else will join them.

The window to recreate the dynamics that have led to these groups emerging is closing. These are technical and research problems on the scale of history’s largest projects, requiring the capital of such projects, while being executed at the velocity of a ruthless startup.

Even in terms of “recreating the dynamics” it’s worth remembering that OpenAI and Anthropic are essentially the same, evolving via a mitosis-like event (what a scene this moment will make when the book is eventually written). Which is to say, despite an oligopoly forming, there remains zero evidence that anyone is able to recreate the required dynamics. Further, the talent required to train models of this nature is becoming scarcer, as the flow of talent accelerates toward a small number of orgs.

On some cadence - tick, tick, tick - OpenAI and other AGI or Bust companies will release the world’s largest models, capabilities will step forward, and the world will gasp as AI accomplishes previously impossible feats.

There’s no chance open source will move in lock-step, let alone surpass these private labs when it comes to training the largest models. But open source will still play a critical role in the future of AI.

Open AI

Today, the state of open source AI is rather dismal, in the sense that while performance on benchmarks against comparable models is good, performance in production settings is not. This will all get fixed, and relatively quickly. Here’s why I’m optimistic.

Open source efforts are just getting started and accelerating quickly.

Llama was released just two months ago. Pipelines, tools, data sets, etc are all evolving quickly. Performance looks good on academic benchmarks because these models originated from research-motivated researchers and evals for production aren’t really a thing (yet). But product-motivated researchers and developers will continue to iteratively move the ecosystem toward solutions that work well in production.

The world is the R&D lab for open source.

Talent may be flowing toward a few private labs, but they will only ever have a sliver of the total talent available across the world. The permissionless innovation that open source enables means that professors at Berkeley, alongside pseudo-anonymous students in Bangladesh, are all advancing the state of the ecosystem. There will be an explosion of experiments, creating toolsets and datasets, but also integrations with the weights themselves, that tailor open models for very specific use-cases.

Many product use-cases will have a ceiling in terms of the AI capabilities they require.

As GPT-5, -6, -7, … arrive, new use-cases will be unlocked by the advancing capabilities. But with every generational step forward, some use-cases will no longer be able to absorb the increase in capability and other measures will dictate which solution is optimal.

We’re already seeing this. While GPT-4 performs breathtakingly well on standard exams, it wouldn’t make sense to use this model in production at scale for a simple classification task. Replit is showing us that the same may already be true for more complex tasks such as code completion. Sometimes it will make sense to pay up (in every sense) to use the most general, most capable model. Many times it won’t, as your needs will be best met by a narrow model that’s tailored to your specific use-case.

AI-enabled products will provide the optimal user experience by going ‘full stack’.

History has shown us that delivering the best user experience matters. And the best user experience is the result of every product detail being considered and constructed with intention. In AI-enabled products, this means that control, and the ability to make opinionated decisions across all aspects of the model, will be important for delivering the optimal product experience. As the open ecosystem evolves, it will increasingly be easy for product teams to go “full stack” with their AI model to deliver the best experience.

For example, the nature of the training data - such as form, style, length, etc - impacts what the AI generates and what the AI generates is increasingly a large part of the user experience. Product teams will start to care deeply about the training data because it will have a material impact on the user experience.

The likely future

We’re heading toward a world where a small number of players have a defensible oligopoly on the most capable, most general models. Product teams will use these models when required. Adjacent to this will be a vibrant, open ecosystem of smaller models, tailored and customizable to specific product needs.

There’s a likely future where the aggregate volume from open models is orders of magnitude larger than the private ones. What’s unclear is how much of the value will accrue to the open ecosystem.

Misc

I joined Spark Capital and am leading investments in seed and series a startups. I love entrepreneurship and technology. I love that a small team can ship and change the world. As a former founder, I’m excited to support founders building remarkable products.
I’m helping Yohei with BabyAGI, an open source effort to build AI agents. I believe in a future where AI assistants autonomously support us across a broad and diverse set of tasks and I want to help support the open ecosystem working to realize this future.
We’re hiring a hacker in residence at Spark. Come build with me.

Novelty vs utility, science vs belief, and where we are with AI-first products

Fraser — Wed, 01 Feb 2023 19:02:15 GMT

The AI Crowd is Mad by Tim Daubenschütz is worth a read. I agree that the LLM discussion could benefit from more nuance. I agree that costs need to come down (but of course they will). I find the contemplated future for AI odd. And, in terms of startups and products, I want to use the post to frame where I believe we are with modern AI and what we should expect.

Novelty vs Utility.

Tim argues that we’re collectively experiencing survivorship bias: seeing the algorithmically promoted highlights of ChatGPT on social media that are “truly stunning” is not giving us a reasonable sense of performance.

I agree. Further, much of what we’re seeing is novelty rather than utility. But even if the highlight reel is predominantly novelty, it is indeed stunning.

However, scroll through a twitter search for what ChatGPT is doing with code, or appreciate what Intercom did with ChatGPT in less than 8 weeks, and it’s clear that there’s meaningful value adjacent to the fun.

The degree to which the highlight reel is stunning and the mix of novelty:utility matters, but what matters more is the rate of change for both and how much further the technology can improve.

Belief vs Science.

In November 2019 OpenAI released the full version of GPT-2. I remember playing with Write With Transformer, an accessible UI for interacting with GPT-2, and using Colab to fine-tune the model to create toy-like experiences. It was pure novelty and not very compelling. But I was stunned. The model was completing sentences that - sort of? - made sense. It didn’t take an overly active imagination to see what could come next.

Tim’s post highlights this tweet from Sam, taking issue with Sam’s use of ‘trust’, and stating “it is clear to me that these opinions are not anchored in reproducible observations”.

One may look at the progress over the past 3 years, from GPT-2 to ChatGPT, and decide that they need more evidence before they’ll extrapolate. Fortunately, that evidence exists in the form of scaling laws (initially published by OpenAI in 2020, revised by DeepMind in early 2022). Much like other technology trends, things are going to get better in forecastable ways for a while.
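As a rough illustration of why scaling laws let you extrapolate, here is a minimal sketch. It assumes the simplest possible form, loss = a * size^(-alpha), and fits it as a straight line in log-log space. The data points and constants are invented for the example; the published laws have more terms and carefully measured constants.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    alpha = -slope
    a = math.exp(mean_y - slope * mean_x)
    return a, alpha


def predict_loss(a, alpha, size):
    """Extrapolate the fitted curve to a model size not yet trained."""
    return a * size ** (-alpha)


# Synthetic measurements generated from loss = 10 * size^-0.1 (made-up numbers).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.1 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
forecast = predict_loss(a, alpha, 1e10)
print(f"alpha={alpha:.3f}, forecast loss at 1e10 parameters={forecast:.3f}")
```

The point is the shape rather than the numbers: once a few small models sit on a straight line in log-log space, the loss of a much larger model can be forecast before anyone trains it.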

Where We Are and What’s to Come

Tim references Carlota Perez’s framework for technological revolutions and financial capital, which is useful as a way to think about our place in time and what’s to come.

I believe we’re in the midst of the irruption phase and that the frenzy is still to come.

Apparently half of the current YC class is building with ChatGPT, which is exactly what we should expect at this point (as Replit’s Amjad rightly argues). We’re still figuring out what’s possible with this technology and developing the tools and infrastructure that make it easier to build with it. All while the underlying technology is rapidly advancing. Which is to say, it’s still early.

AI-enabled products will dramatically change the world, but not as soon as some think. While most applications built in the short-term will be inane, some will become transformative companies. AI is already good enough to deliver meaningful utility with certain product experiences. And while the degree of novelty remains high today… trust the exponential.

Soundtrack while writing.

Follow me on Twitter: @fraser

Join 1,400 other readers of Entropy and follow along as we try to make sense of modern AI and AI-enabled products.

Startups and the technique behind ChatGPT

Fraser — Thu, 26 Jan 2023 19:05:21 GMT

Hello to all 1,151 of you and welcome to the 166 who signed up in the past two days! If you haven’t yet, subscribe to follow my efforts to tell the story of modern AI.

🎵Soundtrack while writing

Tomorrow is the anniversary of OpenAI’s “Aligning Language Models to Follow Instructions”, a seminal paper that I co-authored (just kidding – I had no reason to be listed as an author but Ryan is as kind as he is intelligent).

The technique pioneered by Ryan, Long, and team was used to create ChatGPT and understanding what was done is useful in appreciating the opportunity for startups building differentiated product experiences with AI.

What was done

First, they basically did GPT Cosplay. Humans pretended to be the AI and wrote “good” replies to different ways people may want to interact with the model. For example, a human would be shown a news article and asked to summarize it. That person would read the article and write a summary. All of these examples were then used to fine-tune a model.

They then had people look at multiple AI-generated replies for a given query and rate them from best to worst. “Here’s one way AI could respond, and another, and another, … which do you like best?” These ratings were used to train a second model, which predicts how “good” the AI-generated content is.

One model generates content and, almost like a coach, the other model evaluates how well it performed (“you did pretty good on this one!”). This feedback is then used to improve the model that generates content.
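The second step, turning human rankings into a model that scores replies, can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: it assumes a linear scorer over two hand-made, hypothetical features (`on_topic`, `concise`) and trains it with a pairwise logistic loss so that preferred replies score higher than rejected ones. The last step, updating the generator from these scores, is not shown.

```python
import math


def reward(weights, features):
    """Score a reply: higher should mean 'better' according to human ratings."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_loss(weights, preferred, rejected):
    """Logistic loss that is small when the preferred reply scores higher."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log(1.0 + math.exp(-margin))


def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights by gradient descent on (preferred, rejected) feature pairs."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of log(1 + exp(-margin)) with respect to margin.
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights


# Hypothetical ratings: each reply is [on_topic, concise]; first item was preferred.
comparisons = [
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 0.0]),
    ([1.0, 1.0], [1.0, 0.0]),
]
weights = train_reward_model(comparisons, n_features=2)
print(weights)
```

After training, the scorer can rank replies it has never seen, which is what lets a relatively small set of human ratings act as the "coach" for a much larger volume of generated text.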

Results

The proof is in the pudding with ChatGPT. But the results from the original paper are profound: a model trained this way outperforms a model that’s 100X larger.

Why this matters

The amount of data used to achieve the results in the paper was relatively small. They had people write ~10,000 “good” responses and make ~30,000 ratings. And since the data was spread across a range of use-cases – from copywriting to Q&A, summarization to classification and others – there was an even smaller amount of data for any given use-case. This technique is obtainable for startups.

A product is the result of many opinionated decisions that need to be considered together to create the optimal user experience. This is one area where the difference between research and product is felt. What does a “good” response mean, absent from the product experience it’s enabling? Certainly a good summary for an app targeted at medical professionals would differ from what’s best for an app targeted at readers of the news. And the best summary for either would account for the product’s UI. Startups that care deeply about the user experience will care deeply about the data used to fine-tune their model.

Scale is already offering the infrastructure and human contractors to do this. Inevitably there will be robust open source tools for these pipelines. A small amount of data, distributed across a range of use-cases, let a model outperform one that’s 100X larger. As startups embrace this technique, in a focused and opinionated way, we’re going to see transformative products emerge.

Follow me on Twitter: @fraser

If you’re building a startup with AI, email me!

Testing by Vibes & Control Over the UX

Fraser — Tue, 24 Jan 2023 19:40:28 GMT

Hello to the 985 of you who have signed up! As mentioned, Entropy is an experiment and I’ll be iterating based on what I like and what I hear from you all. Please do share feedback.

🎵Soundtrack while writing

Testing by Vibes

Even more than AI safety, the state of benchmarking for AI-enabled products feels akin to the practice of bloodletting: something we’ll look back on and wonder what we were collectively thinking.

Here’s a passage from a discussion last week between W&B’s CEO Lukas Biewald and Replit’s CEO Amjad Masad, moderated by Replit’s Ornella Altunyan. It’s lengthy, but it’s worth the read as it touches on another important topic – the role of open source in AI-enabled products.

Ornella: Something that is so interesting about this boom that we’re having right now is that open source is keeping pace with the large research universities and Googles of the world. Why do you think that is?
Lukas: ML, unlike other software development, has this incredibly strong academic lineage where open source is the default. So much of the ML is coming directly from research and everyone in research is opening up their stuff… This is a fantastic thing. It makes everything available to everyone in the world.
Amjad: I agree with that. This generation of models lends itself really nicely to open source. They're essentially stateless programs, so you give them input and you get an output. They're very portable, they can run on any machine. Especially diffusion models are very small, so that really lends itself nicely to open source. I think it's been hard to replicate the really large models. Some of them are doing well on benchmarks but then when you try to use them they're actually not very good.
Lukas: Which ones are you talking about?
Amjad: All of them! GLM, Bloom. All of them. I respect everything they’ve done but they’re really not that great. Maybe they’re still iterating on them. Maybe here the data advantage starts to show at very large parameters. That being said, I’m pretty sure they’ll catch up pretty soon.
Lukas: I also think there's a lot to make a model between one that runs well on a benchmark versus one that does something actually useful for a real human. That's not a trivial undertaking.
…
Ornella: Speaking of open source, I was wondering if Amjad could walk us through the approach to leveraging open source to build tools like Ghostwriter.
Amjad: When we started working on it, it wasn't really clear that there was going to be this big open source movement that would help startups like ours build Copilot-like experiences. One of the main reasons we wanted to self-host it, and not just rely on an API, is because we really want to control the latency and really want to control the quality, want to be able to improve it over time, want to be able to collect data and feedback, and iterate on it.
We felt like for some use cases, very large models or chat-type use cases, it's fine to use an API. But for something this close to the user … it was very important for us to own the experience end-to-end.
At first we literally tried every model out there. And we found Codegen from Salesforce. When you’re evaluating some technologies they’re never perfect for your use-case but you see a glimpse, “ok I can see this working”. Then it was like two weeks of manic work trying to make it faster. … and it started feeling like something we could use. Eventually we did some fine-tuning and figured out how to improve the model on the fly based on user feedback. Now we’re training a model from scratch. But that’s after we shipped. We shipped a product, people love it, we’re collecting a lot more data, we know what’s missing, we know what features we want, and we have our first model trained from scratch, and we’ll have something to announce in a month or two.
[discussion on the sweet spot model size for Replit’s product experience, balancing capabilities and latency]
Amjad: The tricky thing is again what we talked about with benchmarking is that you could do well on benchmarks and then go try in the real world and it's actually made the model performance worse. We had one case, for example, where it was doing better on python after some fine-tuning but then it forgot how to write jsx. We call it testing by vibes.
Lukas: [laughing rather hard] Oh no! That doesn’t sound like a best practice, Amjad.
Amjad: We kind of check the vibes of the model and if it passes the vibe check it goes into an a/b test to check the acceptance rate. If it inches upward, we’re doing something well. If it’s neutral, maybe we didn’t do something useful. Would love to get more objective about it but we haven’t found a way, other than building up more and more benchmarks over time.

From a benchmarking perspective, there are a number of interesting things here. It’s rather wild to consider that state of the art is “testing by vibes”, yet this is true. You sort of poke around to get a ~feeling~ and once you’ve spelunked enough you yolo into production.

Another observation is that academic benchmarks don’t adequately measure performance for a given product experience. And even within a single product experience, performance of a new model can vary. With one fine-tuned model, Replit saw performance improve in one area, python, but diminish in another, jsx.
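One way to catch the "better at python, worse at jsx" problem is to break the A/B acceptance rate down by segment instead of looking at a single number. The following is a minimal sketch under stated assumptions: the event format, the model names `baseline` and `candidate`, and the counts are all hypothetical, and nothing here describes Replit's actual pipeline.

```python
from collections import defaultdict


def acceptance_by_segment(events):
    """events: iterable of (model, segment, accepted) tuples from an A/B test."""
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for model, segment, was_accepted in events:
        shown[(model, segment)] += 1
        accepted[(model, segment)] += int(was_accepted)
    return {key: accepted[key] / shown[key] for key in shown}


def regressions(rates, baseline, candidate, tolerance=0.0):
    """Segments where the candidate's acceptance rate fell below the baseline's."""
    segments = {seg for (model, seg) in rates if model == baseline}
    return sorted(
        seg
        for seg in segments
        if (candidate, seg) in rates
        and rates[(candidate, seg)] < rates[(baseline, seg)] - tolerance
    )


# Made-up events: the candidate improves on python but gets worse on jsx.
events = (
    [("baseline", "python", True)] * 2 + [("baseline", "python", False)] * 2
    + [("candidate", "python", True)] * 3 + [("candidate", "python", False)] * 1
    + [("baseline", "jsx", True)] * 3 + [("baseline", "jsx", False)] * 1
    + [("candidate", "jsx", True)] * 1 + [("candidate", "jsx", False)] * 3
)

rates = acceptance_by_segment(events)
print(regressions(rates, "baseline", "candidate"))
```

In this toy data both models have the same overall acceptance rate, so a single aggregate number would hide the jsx regression entirely. A real test would also need enough traffic per segment for the differences to be meaningful.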

Finally, an important aspect highlighted by Amjad is the idea that capability is only one measure of performance that matters when building a product. Latency and cost matter too and the optimal capability/latency tradeoff is specific to the product experience being delivered.

All of this is to say that evaluating a model’s performance today is a nuanced, labour-intensive, error-prone process. The incentive to solve this problem is high because evaluating performance is critical to controlling the product experience.

Control Over the Product Experience

“It was very important for us to own the experience end-to-end.” - Amjad

To craft the user experience exactly as they wanted and to iterate and improve the ux over time, Replit self-hosted an open source model (amusingly, an open source model from Salesforce). They shipped this product, gathered feedback, insights, and, importantly, data. And now they’ve trained their first model from scratch. While the degree of complexity has undoubtedly increased, the degrees of freedom that they have in controlling the product experience have too. To paraphrase Alan Kay, perhaps people who are really serious about product should make their own models. The open source world is increasingly enabling this for those who want to take on the cost and complexity.

Building in the open

I was happily surprised by the response to my Tweets yesterday. As of now, there have been 347,000 views, 985 subscribers on Substack, and 192 people who have expressed interest in investing with me in founders who are building with AI. My plan is to tell the story of modern AI and those who are building ambitious products with it. By doing so, I hope to better understand where the world is going and to help create a better future. Thanks for your support!

Coming soon

Fraser — Mon, 23 Jan 2023 13:36:02 GMT

This is Entropy.
