Hello to the 985 of you who have signed up! As mentioned, Entropy is an experiment and I’ll be iterating based on what I like and what I hear from you all. Please do share feedback.
Testing by Vibes
Even more than AI safety, the state of benchmarking for AI-enabled products feels akin to the practice of bloodletting: something we’ll look back on and wonder what we were collectively thinking.
Here’s a passage from a discussion last week between W&B’s CEO Lukas Biewald and Replit’s CEO Amjad Masad, moderated by Replit’s Ornella Altunyan. It’s lengthy, but it’s worth the read as it touches on another important topic – the role of open source in AI-enabled products.
Ornella: Something that is so interesting about this boom that we’re having right now is that open source is keeping pace with the large research universities and Googles of the world. Why do you think that is?
Lukas: ML, unlike other software development, has this incredibly strong academic lineage where open source is the default. So much of ML is coming directly from research, and everyone in research is opening up their stuff… This is a fantastic thing. It makes everything available to everyone in the world.
Amjad: I agree with that. This generation of models lends itself really nicely to open source. They're essentially stateless programs: you give them an input and you get an output. They're very portable; they can run on any machine. Diffusion models especially are very small, so that really lends itself nicely to open source. I think it's been hard to replicate the really large models. Some of them are doing well on benchmarks, but then when you try to use them they're actually not very good.
Lukas: Which ones are you talking about?
Amjad: All of them! GLM, Bloom. All of them. I respect everything they’ve done but they’re really not that great. Maybe they’re still iterating on them. Maybe here the data advantage starts to show at very large parameters. That being said, I’m pretty sure they’ll catch up pretty soon.
Lukas: I also think there's a lot that goes into making a model that does something actually useful for a real human, versus one that just runs well on a benchmark. That's not a trivial undertaking.
…
Ornella: Speaking of open source, I was wondering if Amjad could walk us through the approach to leveraging open source to build tools like Ghostwriter.
Amjad: When we started working on it, it wasn't really clear that there was going to be this big open source movement that would help startups like ours build Copilot-like experiences. One of the main reasons we wanted to self-host it, and not just rely on an API, is because we really want to control the latency and really want to control the quality, want to be able to improve it over time, want to be able to collect data and feedback, and iterate on it.
We felt like for some use cases, very large models or chat-type use cases, it's fine to use an API. But for something this close to the user … it was very important for us to own the experience end-to-end.
At first we literally tried every model out there, and we found CodeGen from Salesforce. When you’re evaluating some technologies they’re never perfect for your use case, but you see a glimpse: “OK, I can see this working.” Then it was like two weeks of manic work trying to make it faster … and it started feeling like something we could use. Eventually we did some fine-tuning and figured out how to improve the model on the fly based on user feedback. Now we’re training a model from scratch. But that’s after we shipped. We shipped a product, people love it, we’re collecting a lot more data, we know what’s missing, we know what features we want, and we have our first model trained from scratch, and we’ll have something to announce in a month or two.
[discussion on the sweet spot model size for Replit’s product experience, balancing capabilities and latency]
Amjad: The tricky thing is, again, what we talked about with benchmarking: you can do well on benchmarks, then go try it in the real world and find the model's performance has actually gotten worse. We had one case, for example, where the model was doing better on Python after some fine-tuning, but then it forgot how to write JSX. We call it testing by vibes.
Lukas: [laughing rather hard] Oh no! That doesn’t sound like a best practice, Amjad.
Amjad: We kind of check the vibes of the model, and if it passes the vibe check it goes into an A/B test to check the acceptance rate. If it inches upward, we’re doing something well. If it’s neutral, maybe we didn’t do something useful. We’d love to get more objective about it, but we haven’t found a way other than building up more and more benchmarks over time.
From a benchmarking perspective, there are a number of interesting things here. It’s rather wild to consider that the state of the art is “testing by vibes”, yet this is true. You sort of poke around to get a ~feeling~ and, once you’ve spelunked enough, you yolo into production.
Another observation is that academic benchmarks don’t adequately measure performance for a given product experience. And even within a single product experience, a new model’s performance can vary across tasks: with one fine-tuned model, Replit saw performance improve in one area (Python) but diminish in another (JSX).
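To make that Python-up, JSX-down failure mode concrete, here is a rough sketch of the kind of per-language regression check a team could run before shipping a fine-tune. Everything in it is hypothetical (the test cases, the `generate` callables, the suites), and it is not Replit's actual harness; it just shows how an aggregate score can hide a regression in one language.

```python
# Hypothetical per-language eval: run the same completion tests against the
# base model and the fine-tuned model, then compare pass rates per language
# rather than a single aggregate score.

from typing import Callable, Dict, List, Tuple

# Each case is (prompt, check); check returns True if the completion is acceptable.
TestCase = Tuple[str, Callable[[str], bool]]

SUITES: Dict[str, List[TestCase]] = {
    "python": [
        ("def add(a, b):\n    return ", lambda out: "a + b" in out),
    ],
    "jsx": [
        ("const Button = () => ", lambda out: "<button" in out.lower()),
    ],
}

def pass_rate(generate: Callable[[str], str], suite: List[TestCase]) -> float:
    passed = sum(1 for prompt, check in suite if check(generate(prompt)))
    return passed / len(suite)

def compare(base: Callable[[str], str], tuned: Callable[[str], str]) -> None:
    for lang, suite in SUITES.items():
        before, after = pass_rate(base, suite), pass_rate(tuned, suite)
        flag = "REGRESSION" if after < before else "ok"
        print(f"{lang}: {before:.0%} -> {after:.0%} [{flag}]")

# Usage (with hypothetical model objects):
# compare(base_model.complete, finetuned_model.complete)
```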
Finally, an important point Amjad highlights is that capability is only one measure of performance that matters when building a product. Latency and cost matter too, and the optimal capability/latency/cost tradeoff is specific to the product experience being delivered.
All of this is to say that evaluating a model’s performance today is a nuanced, labour-intensive, error-prone process. The incentive to solve this problem is high because evaluating performance is critical to controlling the product experience.
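As for getting “more objective” about the A/B stage Amjad describes, where a candidate ships only if acceptance rate genuinely inches upward rather than staying neutral, a plain two-proportion z-test is one simple option. A minimal sketch, assuming acceptance rate is just accepted suggestions divided by shown suggestions, with made-up counts:

```python
# Hypothetical A/B check on suggestion acceptance rate: did the candidate
# model's rate actually move, or is the difference just noise?

from math import erf, sqrt

def two_proportion_z(accepted_a: int, shown_a: int,
                     accepted_b: int, shown_b: int) -> tuple[float, float]:
    p_a, p_b = accepted_a / shown_a, accepted_b / shown_b
    pooled = (accepted_a + accepted_b) / (shown_a + shown_b)
    se = sqrt(pooled * (1 - pooled) * (1 / shown_a + 1 / shown_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Control model (A) vs candidate model (B), with made-up counts.
z, p = two_proportion_z(accepted_a=1_800, shown_a=10_000,
                        accepted_b=1_950, shown_b=10_000)
print(f"z={z:.2f}, p={p:.3f}")  # positive z and small p: acceptance really inched up
```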
Control Over the Product Experience
“It was very important for us to own the experience end-to-end.” - Amjad
To craft the user experience exactly as they wanted, and to iterate and improve the UX over time, Replit self-hosted an open source model (amusingly, an open source model from Salesforce). They shipped this product, gathered feedback, insights, and, importantly, data. And now they’ve trained their first model from scratch. While the degree of complexity has undoubtedly increased, the degrees of freedom they have in controlling the product experience have too. To paraphrase Alan Kay, perhaps people who are really serious about product should make their own models. The open source world is increasingly enabling this for those who want to take on the cost and complexity.
Building in the open
I was happily surprised by the response to my Tweets yesterday. As of now, there have been 347,000 views, 985 subscribers on Substack, and 192 people who have expressed interest in investing with me in founders who are building with AI. My plan is to tell the story of modern AI and those who are building ambitious products with it. By doing so, I hope to better understand where the world is going and to help create a better future. Thanks for your support!