Turning AI into “Electricity”: Interpreting, Evaluating, and Making LLMs Easy to Use
Introduction:
Welcome to the podcast version of Entrepreneurs of Life, dialogues with distinguished immigrants in Silicon Valley, along with AI insights. This is our inaugural episode. With two exceptional guests, we'll discuss a common challenge for AI builders today – understanding how transformer models work, evaluating them, and optimizing their performance while minimizing costs through something called model routing. We'll also explore the broader AI ecosystem and imagine its future.
While this is a technical topic, we'll break things down and keep them digestible for everyone.
Our two guests today are both experts in AI and data, with experience across North America and Asia:
Our first guest is Yuzheng Sun, data scientist at StatSig, the leading platform for A/B testing backed by Sequoia and serving clients including OpenAI and Character.ai. Previously, Yuzheng held various data science and economist roles at Tencent, Meta, and Amazon after earning his Economics PhD from Cornell. Yuzheng is also a prolific blogger on tech, data, career, and personal growth, with 200,000 followers across platforms. For our Chinese listeners, you can find him online as "课代表立正".
Our second guest is Jason Hu, founding engineer at Martian, a fast-growing AI startup backed by NEA. Their first product, an LLM router launched last year, is already adopted by developers from 300+ companies, including OpenAI and Amazon. Before Martian, Jason did research at the Chicago Human+AI Lab and ByteDance and graduated from the University of Chicago. He leads one of the largest AI communities in the Bay Area, aligns.ai, with speakers and community members from OpenAI, NVIDIA, Stanford, and more.
Let's dive into today's topic: LLM interpretability, evaluation, and routing.
(You can either watch the YouTube video below, listen to the podcast on Apple or Spotify, or keep reading the transcript, slightly edited for conciseness and clarity.)
I. Why We Should Understand Transformers, and How
Sophie:
As we know, AI models based on deep learning and neural networks are notorious for being black boxes. It’s difficult to understand how they make particular decisions or predictions. This became increasingly concerning as state-of-the-art transformer models like GPT acquired emergent capabilities such as reasoning.
As we approach a future with artificial general intelligence or AGI, if we cannot interpret model behaviors, it raises questions about whether we can guarantee the models are working in humans' best interests. Major AI labs, including OpenAI, Anthropic, and Google, view model interpretability as crucial and are actively working on it. Jason's company, Martian, shares this goal. In fact, that's how you chose the name Martian to begin with.
So Jason, why don't we start there? Why the name Martian, and how does that tie to your goal of making models interpretable?
Jason:
Thanks Sophie, it's a pleasure to be here.
The name Martian comes from a group of Hungarian-American scientists known as the "Martians." This group includes luminaries like John von Neumann, the prolific mathematician Paul Erdős, and Leo Szilard, who first conceived the nuclear chain reaction. Interestingly, all of them were born and raised in the same small community in Budapest.
At the beginning of our company, the cofounders – Etan and Shriyash – were curious about this phenomenon – why was so much intelligence concentrated in such a small circle of people? This is closely tied to our main research and business goals: making intelligence understandable and utilizing it effectively. That's the origin of the name Martian.
Our first product is an LLM router, a middle layer that allows users to intelligently select and use multiple LLMs simultaneously. Happy to dive into it later.
Sophie:
Great, we'll get there in just a second. I noted that you tied together interpreting models and leveraging their power.
Model interpretability is a general term with many layers to it. My next question is for both of you. What does model interpretability mean to you? And specifically, what do we most need to understand about transformers? Let's start with Yuzheng.
Yuzheng:
Starting with the first question: what does model interpretability mean to me?
My training was in economics, and economists are obsessed with model interpretability because we care about cause and effect. That's why we usually use the simplest models possible, like linear regressions instead of machine learning, whenever we can.
However, with deep learning, we lose all interpretability; and as a specific architecture within deep learning, transformers are no exception. You can get pieces of information here and there, but we don't understand how they work.
ChatGPT's emergent ability to reason is even more puzzling. It prompts me to think about humans – do we understand why we are conscious and how our brains work? We know little about human intelligence either.
So I have to think about “interpretability” at a pragmatic level. For example, this is the first time the three of us have talked, right? I don’t assume I know how your minds work. I wouldn't even assume whether you're humans, aliens, or language models. But we’re having this exchange of information, and I can assess whether the exchange is informative, be it a product of our intelligence or just a result of the words used. I don't need to worry about interpretability the way we worry about model interpretability.
On the second question: what do we most need to understand about transformers? I hope someday I can decrypt how they work and why they produce certain information. I’d love for that to shed more light on how our brains work as well – probably hard, but that would be the ultimate goal. So far, the more we learn about machine intelligence, the more we realize how little we know about human intelligence.
The Guardian once published a list of 20 big unanswered questions in science, like "Why do we dream?", "What is consciousness?", or "Why are we fundamentally different from monkeys?" These remain unsolved mysteries.
Sophie:
Jason, what does model interpretability mean to you? What elements does it involve?
Jason:
I also studied economics in college, so I can totally relate to what Yuzheng said. There's always a tension between a model’s interpretability and power. For example, in theory we could build an extremely complex model with trillions of parameters to perfectly predict or describe some economic phenomena. But this complex model doesn't tell us anything about the fundamental laws of the market.
What really impressed me as a student of economics was how simple many classic models are, like a one-line equation for humans optimizing their utility. Yet by deriving from this simple mechanism, we can explain or even predict many complex economic behaviors.
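[Host note: the kind of one-line equation Jason alludes to is, for example, the textbook consumer problem – choose a consumption bundle to maximize utility subject to a budget constraint. Purely illustrative:]

```latex
% Textbook consumer problem: pick a bundle x to maximize utility U(x),
% given prices p and income m.
\max_{x} \; U(x) \quad \text{s.t.} \quad p \cdot x \le m
```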
So far in AI, facing the dichotomy between model interpretability and power, most people have favored power. For example, the "Bitter Lesson" argues that the biggest driver for AI development is increased compute and data. Basically, the idea of "大力出奇迹" in Chinese (Great Effort Leads to Miracles) – we don't care if the model is interpretable; we just need more compute and more data, and trust that magic will happen. This is basically what we've seen. Even Ilya Sutskever has said that how deep learning works is a miracle beyond human comprehension.
Before ChatGPT, everything was okay, because AGI felt like a distant dream – we couldn't imagine a model better than humans in general. But after ChatGPT, we realize that ignoring interpretability can create huge issues. We can have incredibly powerful and knowledgeable models that eventually could do anything with increased compute and data. Interpretability is key to controlling these super-intelligent entities and preventing them from destroying humanity or something else terrible. In other words, interpretability is crucial for aligning AI and ensuring its safety.
What do we need to understand about how transformers work? We understand perfectly well their mathematical formulation at the bottom level. We also understand their top-level performance – they can act and talk like humans. But there's a missing middle layer connecting top-level intelligence to the bottom level: the individual neurons and computations within the neural network.
One academic field dedicated to bridging this gap is called "mechanistic interpretability." Researchers at MIT, Anthropic, and elsewhere have developed methods to identify neurons responsible for certain words, causal reasoning, and so on.
At Martian, we're aiming for something more systematic and comprehensive. We want to understand the whole transformer system, not just individual neurons. We want to turn them into code, into computer programs, so that they are instantly interpretable as a whole.
II. Evaluating and Choosing Foundational Models
Sophie:
Given how little we know about what happens inside those large models, it is hard to assess their individual strengths and weaknesses. That is problematic for millions of developers building on top of large models today, because they have too many options. The number of models hosted on Hugging Face is pushing 500,000, almost a 5x increase over just the past year.
While only a handful of models are considered “state-of-the-art”, they aren't always the right choice for an application builder given costs and/or latency. So choosing the right foundational model has become a headache for many AI developers – an issue too important to ignore, as it directly affects application performance and economic viability.
Yuzheng, you speak to many AI builders. From what you’ve heard, how are they approaching this choice today? Is there a difference between those at large enterprises versus startups?
Yuzheng:
It's complicated, because the big players not only want to access good models, they also want to own one. To them, it’s not all about choosing the one with the lowest cost or best performance, but also about being strategic.
Speaking of evaluating and choosing LLMs, fun fact: if you visit llmevaluation.com, it takes you to my YouTube channel. I bought this domain over a year ago because I believed LLM evaluation would become an important issue, and it did. I'm glad companies are trying to solve this. If you're interested in this domain, let me know!
Sophie:
You were ahead of the game thinking about this. The popularity of the Google search term "LLM evaluation" went up 30 times year over year as of January.
Yuzheng:
Evaluating LLMs is more complicated than most people think. Traditional ML evaluation is relatively simple – model performance is defined by the loss function you want to optimize, which comes down to a single number. Then there's the gap between model performance and production performance, requiring us to also look at user behavior metrics; for example, does this model actually increase click-through rates? You can validate those questions with A/B testing.
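[Host note: as a concrete illustration of the A/B-testing step Yuzheng mentions, here is a minimal two-proportion z-test for a click-through-rate comparison in Python; the counts are made up for illustration.]

```python
# Minimal sketch: did the new model lift click-through rate versus the control?
# Counts below are made up for illustration.
from math import sqrt

def two_proportion_z(clicks_a: int, views_a: int, clicks_b: int, views_b: int) -> float:
    """Two-proportion z-test statistic; |z| > 1.96 is significant at the 5% level."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    return (p_b - p_a) / se

z = two_proportion_z(clicks_a=1_200, views_a=50_000, clicks_b=1_350, views_b=50_000)
print(f"z = {z:.2f}")  # ≈ 3.0 here, so this lift would be statistically significant
```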
But LLM evaluation is almost like evaluating humans – you can't judge them based on simple outputs alone. That's the challenge. I'm glad Martian is tackling it head-on. If Martian solves this problem with an excellent router, then we should just trust the router, because it tries to optimize model performance given a certain business goal. Even if we can't fundamentally interpret model behaviors, we can still evaluate them as long as business goals and costs can be defined.
There's another layer to this. In the long run, my personal view is that AI builders shouldn’t worry too much about evaluating and choosing LLMs. We're at the beginning of the exponential growth curve for generative AI, and practitioners shouldn’t anchor on where OpenAI is today – because they are also growing, probably faster than most companies. This means that everyone else has a moving target that's becoming harder and harder to catch up with. Sora is an example of this challenge.
So my personal, biased opinion is: just go with OpenAI. When GPT-5 is released, everyone will realize they should have built their ecosystem on top of ChatGPT all along. But I do recognize that LLM evaluation is a tough challenge.
Sophie:
We love unfiltered opinions on this show! There's a consensus that GPT-4 and a few other models are leading in performance. The situation gets murkier when we look at other factors, such as cost and latency.
Even nontechnical users of GPT probably noticed that GPT-4 is much better than 3.5, but also much slower. It becomes a trade-off between performance on one hand and cost and speed on the other.
While we're still on model evaluation, how is it typically done? As we know, there's both human involvement and automation, and sometimes even AI evaluating AIs. How do automation and human effort each play a part in model evaluation today?
Jason:
As Yuzheng mentioned, evaluation is quite tricky. I’ll keep using the analogy between evaluating LLMs and evaluating humans. Evaluation methods vary greatly depending on the nature of the work.
For basic tasks, like producing simple goods in a factory, we can use size, weight, or other easy-to-measure metrics to evaluate outputs systematically and automatically.
When the tasks become more complex, evaluation can get very subtle, even involving politics - like code reviews or performance reviews. Evaluating LLMs is similar.
For example, one of the simplest and most widely adopted evaluation sets is MMLU (Massive Multitask Language Understanding) – it's basically a multiple-choice test for models. MMLU has roughly fifteen thousand such questions covering 57 subjects, so it's a comprehensive, fast, and easy way to evaluate model reasoning capabilities and knowledge across fields.
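[Host note: a minimal sketch of what MMLU-style multiple-choice scoring boils down to – ask for a letter, compare to the answer key, and report accuracy. The `ask_model` function is a hypothetical stand-in for whatever LLM API you use, and the sample question is invented.]

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
questions = [
    {"q": "Which planet is known as the Red Planet?",
     "choices": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"},
     "answer": "B"},
    # ...thousands more questions across dozens of subjects
]

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in: should return the model's chosen letter, e.g. 'B'."""
    raise NotImplementedError("replace with a real LLM call")

def multiple_choice_accuracy(items) -> float:
    correct = 0
    for item in items:
        prompt = (
            item["q"] + "\n"
            + "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
            + "\nAnswer with a single letter."
        )
        if ask_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)
```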
There are also slightly more complex methods. For example, we can evaluate a model's ability to do mathematical reasoning and calculations by checking whether the number it outputs is correct. Or we can have the model output code and then execute the code to see if the output is what we expect.
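[Host note: and a compact sketch of the execute-and-check approach Jason describes for code generation; `generate_code` is again a hypothetical stand-in for an LLM call, and in practice the generated code should be run in a sandbox.]

```python
# Compact sketch: evaluate generated code by executing it and checking the result.
def generate_code(task: str) -> str:
    """Hypothetical stand-in: should return Python source defining a `solve` function."""
    raise NotImplementedError("replace with a real LLM call")

def passes(task: str, test_input, expected) -> bool:
    namespace: dict = {}
    exec(generate_code(task), namespace)  # run in a sandbox in production
    return namespace["solve"](test_input) == expected

# e.g. passes("Write solve(n) returning the n-th Fibonacci number", 10, 55)
```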
But these methods don’t apply to open-ended tasks such as article summarization or writing. They’re trickier to evaluate, as we don't have enough human labor to do so. Therefore, one common alternative is using an LLM as a judge. Let's say we want to compare GPT-3.5 versus LLaMA-7B on a summarization task. We give the outputs to GPT-4 and tell GPT-4 the criteria for a good summary, such as being concise and covering all the key points. Given those criteria, GPT-4 can tell us which output is better. With various statistical techniques, we can make the LLM judge's verdicts robust and statistically meaningful, reducing noise.
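[Host note: a minimal sketch of the LLM-as-judge pattern using the OpenAI Python SDK; the criteria and model names are illustrative, and in practice you would randomize the order of the two answers and aggregate over many samples to reduce position bias and noise.]

```python
# Minimal sketch of LLM-as-judge: ask a strong model which of two summaries is better.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(article: str, summary_a: str, summary_b: str) -> str:
    prompt = (
        "You are judging two summaries of the same article.\n"
        "Criteria: concise, accurate, and covers all the key points.\n\n"
        f"Article:\n{article}\n\nSummary A:\n{summary_a}\n\nSummary B:\n{summary_b}\n\n"
        "Reply with exactly 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```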
And of course, we can also collaborate with data labeling service providers like Scale AI, or label data in-house. In short, evaluating LLMs in production is nuanced and complex. There's a whole array of different methods we can use.
Sophie:
In production, what are some of the common challenges that AI developers encounter when choosing the best foundational model for their needs?
Yuzheng:
Because all the LLMs are essentially still in beta mode, I don't think we have a mature ecosystem yet. The common challenges are similar for industries in their infancy – the landscape is evolving so fast that solutions can quickly become obsolete.
If you want to use LLMs well, you first need a robust data pipeline to supply your model with the correct knowledge. You also need strong engineering around the model. I remember everyone in the industry was talking about vector databases at one point, and then the interest moved on to RAG (Retrieval Augmented Generation). Now we hear people claiming that long context windows are the real solution that would make RAG unnecessary. This evolution in thinking shows that there's still a long way to go to unleash existing models’ full potential. But models themselves are also improving fast. I believe we'll see the whole ecosystem play catch-up with frontier model capabilities for quite a while, before we converge on a mature and sustainable pipeline for building with AI.
Sophie:
Jason, I know you guys work with a lot of developers struggling to choose between models. What pain points are you seeing? Any patterns or differences you're noticing between large enterprises and AI-native startups?
Jason:
I totally agree with what Yuzheng said: the speed of progress in AI is imposing real challenges on developers. For example, right now, one of the best open-source models is Mistral's Mixtral 8x7B model. But many developers simply don't know that it can achieve performance as good as GPT-3.5 Turbo in many scenarios, at half or even a third of the cost. It's a new model from a company not as well-known as OpenAI or Anthropic – some people don't even know about it.
And models like this are popping up every single day, literally. It's impossible for developers to stay up to speed on all the latest developments and keep their AI pipelines updated.
A more nuanced challenge is that different models have subtle differences in their capabilities and knowledge areas. For example, our test found that Anthropic's Claude Instant model is better than GPT-4 at certain foreign language understanding tasks. It's possible that Anthropic has trained on more relevant data for those tasks than OpenAI did with GPT-4.
These relative strengths and weaknesses are subtle and task-specific. It's hard for application builders to know these nitty-gritty details and choose the best model for their use cases.
Yuzheng:
Jason just reminded me of one point: Right now, we're in a transition period. All our existing development infrastructure was built around accuracy. If you had one extra space in your code, the whole program could break down. That’s just how traditional programming works.
But we now have this new way of understanding and producing fuzziness. Large language models can handle fuzzy inputs; on the flip side, they are not designed to produce accurate results.
The transition is going to take a while and present many challenges. But I think the direction is clear – once we have a whole set of new, AI-native infrastructure, where functions can take fuzzy inputs and produce fuzzy (but still helpful) outputs, then we can unlock countless new use cases.
III. Optimizing Outcome with Model Routers
Sophie:
We're in the early innings of making models more reliable, steerable, and aligned. I know we've talked a lot about the challenges, but let's switch to talking about solutions.
Jason – you and your team at Martian have built an LLM router to take over the thorny task of evaluating and choosing the right models for your clients. In the simplest terms, what is a model router, and what does it do?
Jason:
At a high level, a router analyzes a given prompt and routes it to the optimal LLM for execution, based on cost, performance, latency, and other relevant metrics.
One common source of confusion – we're different from some other "router" companies like OpenRouter, which provide what should really be called a “model gateway.” A model gateway is basically a unified API that allows you to call different LLMs through one endpoint. You can use OpenRouter to call GPT-4, Claude, or some other model.
Martian has our own gateway, but we're building on top of that. We're building what we call a dynamic router. So it's not like "for this API, I'll always use GPT-4, while for that I'll always use GPT-3.5." When users call our API endpoint, they don't know which LLM we'll route them to. We build an algorithm to intelligently analyze a prompt and, based on our understanding of the user's objective, automatically route each prompt to different LLMs.
For example, let's say the user is building a text summarization application. When summarizing something simple and straightforward, we don't need GPT-4, which is slow and expensive. Smaller models like Claude Instant can do the job well and cost less. But when our client needs to summarize a complex math paper, for example, we recognize that you actually need GPT-4-level reasoning capabilities to do a good job, so we would route to GPT-4.
Basically, we build task-specific routers to help users achieve better latency, lower cost, and better performance that surpasses any individual large language model.
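[Host note: an illustrative sketch of the dynamic-routing idea – not Martian's actual algorithm. A toy heuristic estimates how hard a prompt is and routes it to a cheap or a strong model; the model names, thresholds, and scoring are placeholders.]

```python
# Illustrative sketch of a dynamic router: estimate prompt difficulty, then pick a model per request.
CHEAP_MODEL = "claude-instant-1"
STRONG_MODEL = "gpt-4"

def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic: long prompts with math/code markers score as 'harder'."""
    score = min(len(prompt) / 4000, 1.0)
    if any(marker in prompt for marker in ("\\begin{", "theorem", "def ", "prove")):
        score += 0.7
    return min(score, 1.0)

def route(prompt: str) -> str:
    """Unlike a static gateway ('always call GPT-4'), the model is chosen per prompt."""
    return STRONG_MODEL if estimate_difficulty(prompt) > 0.6 else CHEAP_MODEL

print(route("Summarize this tweet: ..."))                     # -> cheap model
print(route("Summarize this proof of the theorem that ..."))  # -> strong model
```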
Sophie:
I love the dynamic approach; the optimal model differs not just across applications, but sometimes even across specific tasks within the same application. I also understand that for a complicated task involving multiple steps, it's possible to route each step to a different model best suited for that step, and then chain the interim results together for the best end-to-end execution. Is that right?
Jason:
Exactly.
Sophie:
That's a super cool concept – like having the Avengers team of superheroes and dynamically calling their different superpowers.
In a recent paper, you and your co-authors, including Professor Kurt Keutzer from UC Berkeley, showed that even simple model routers, much simpler than what you're using for Martian, can achieve meaningful improvements on benchmarks compared to individual LLMs. Can you tell us more about your work there and your findings?
Jason:
Absolutely. We basically proposed a benchmark for evaluating LLM systems. You can plot any LLM on a set of x-y axes, with cost on the x-axis and performance on the y-axis. We always want lower cost and higher performance, which would be the upper left corner of the graph. But in reality, we usually see models that are either more expensive and more performant like GPT-4, or cheaper but less performant like many open-source or smaller models.
What we found in the paper is that, first of all, a theoretical "Oracle Router" that has perfect knowledge of which LLM is best at what task would be very powerful. It would achieve performance as good as the best LLM and cost as low as the cheapest LLM. This means the potential benefit from routing is huge.
We then constructed a few simple, illustrative routers using KNN or MLP algorithms. Although for some complex tasks they weren't that impressive, for other tasks such as the RAG dataset, the simple routers were able to beat even the best LLMs out there.
Keep in mind that we were only demonstrating very basic routing algorithms, and there's a huge design space for potential routers, so it's a very promising field. We built a benchmark so other people can build new routers and test them against it.
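[Host note: a minimal sketch of the kind of KNN router the paper describes – embed the incoming prompt, look up the most similar prompts with known per-model quality scores, and pick the model with the best quality-minus-cost trade-off on those neighbors. The embedding function, training records, and cost numbers are hypothetical placeholders.]

```python
# Minimal sketch of a KNN-style router over a quality/cost trade-off.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in: replace with a real embedding model."""
    raise NotImplementedError

# Each training record: (prompt embedding, {model_name: observed quality score in [0, 1]})
train: list[tuple[np.ndarray, dict[str, float]]] = []

COST = {"gpt-4": 1.0, "mixtral-8x7b": 0.1}  # relative cost per call, illustrative

def knn_route(prompt: str, k: int = 10, cost_weight: float = 0.2) -> str:
    query = embed(prompt)
    neighbors = sorted(train, key=lambda rec: -float(np.dot(query, rec[0])))[:k]
    best_model, best_utility = None, float("-inf")
    for model, cost in COST.items():
        quality = float(np.mean([rec[1][model] for rec in neighbors]))
        utility = quality - cost_weight * cost  # trade quality off against cost
        if utility > best_utility:
            best_model, best_utility = model, utility
    return best_model
```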
Sophie:
It's exciting to think about the potential here. Jason, let's talk about the specific router you built at Martian. According to your website, the Martian router beats GPT-4 more than nine times out of ten on a set of evaluation benchmarks put out by OpenAI – performing at least as well but at lower cost. And on average, your router saves 20% in cost even when optimizing purely for performance. That is impressive!
At a high level, what can you tell us about how Martian picks the best model(s) for a given client request? Like what input do you need from the client, and how does Martian then do the optimization? How do you evaluate model performance?
Jason:
Some caveats – the numbers cited on our website were achieved on OpenAI's evaluation benchmarks. Those static sets are a bit easier to optimize for; in production, the situation is more complex. So it's not to say that we can achieve the same results in every use case; but in most scenarios, we can give our clients a significant value prop.
I can share a bit on how. One key thing: we don't do a generalized router – it's not like applying any single model in any scenario can magically produce great results. Model selection is customized for each scenario, because each client has different cost, performance, latency, and compliance objectives.
Some clients really care about performance – they want performance better than GPT-4. So the router we build for them will be very different from the router for clients who don't care as much about performance but want really low cost and low latency. We need to build a whole pipeline – from model selection, sometimes even fine-tuning and quantization, to building and training a good router, and even modifying prompts. It's a holistic engineering effort to optimize every part of the stack.
What’s more, a surprisingly high percentage of the time, people don't know what they're optimizing for – they don't really have performance metrics. Like Yuzheng said earlier, people are still adapting to the new “fuzzy input, fuzzy output” paradigm. We do not yet have standardized quantitative metrics.
So we often need to work with our clients to define such metrics first – like what output is considered good or bad – and then optimize against those metrics.
Sophie:
I like how you help clients figure out which direction they should aim before they fire, which sounds crucial.
My next question is about how you choose your candidate models that your router dynamically calls. I know you have a curated set of several hundred models, including both open-source and closed-source ones. Given the vast number of options out there – such as the 500K Hugging Face models we mentioned earlier – I'm curious how you came up with that curated set. In addition, as models evolve and new ones come out, how do you update your set?
Jason:
First of all, most of those 500,000 models are irrelevant – they're either tiny, non-transformer-based, or old models such as fine-tuned versions of BERT, in each case not really worth considering. But still, there are at least several hundred relevant LLMs out there, like different fine-tuned versions of LLaMA.
How exactly do we select them? That's honestly a big challenge, and there's no silver bullet. The initial sourcing is more experience-based – you need to live in the AI world to keep a finger on the pulse of model advancements. We are well connected with model providers and other industry players that can provide us with the latest insights.
When we have an initial list, we use a more quantitative method to evaluate the candidates. We recently released what we call a "provider dashboard" at leaderboard.withmartian.com, where we basically build evaluation metrics for different model hosting services for open-source models, such as Replicate, Fireworks, Anyscale, or Latent AI. We assess them based on cost, rate limits, throughput, and other metrics people care about. That's one example we've open-sourced to the public.
We also have a number of internal benchmarks that we constantly run different models against. This way, we form a high-level understanding of each model’s strengths, cost, and other characteristics. Whenever a new model is released, we develop a pretty good, quantitative understanding of it, and then we integrate it into our router.
For a specific client, it's a more tailored process - we run custom benchmarks for their tasks, analyze throughput versus cost trade-offs, and so on. There's no general equation that applies everywhere.
Sophie:
Sounds like it takes a lot to figure out the best practice, and once somebody like you guys has figured it out, it doesn't make economic sense for all development teams to reinvent the wheel.
IV. Making Intelligence A Commodity
Yuzheng:
Great work, I have a follow-up question.
Do you have a "main quest" for this routing and evaluation effort? I mean, do you think everything is going to evolve towards one direction, or is it still too early to tell?
OpenAI, for example, has a main quest for AGI, and they believe the path to AGI is the "shortest program" that Ilya often talks about, achieved through the Scaling Law. The idea is that, once you find the simplest program to represent data, that program is also the most generalizable. [Host note: What Ilya meant by the “shortest program” is that, by striving to create AI models that can perform given tasks using the minimum amount of computational resources and complexity, researchers can develop more streamlined, robust, and generalizable AI systems.]
So when they (OpenAI) looked at transformers and self-attention, they immediately realized, "Okay, this will help us find the shortest program." As a result, they chose language as the initial focus area and leveraged a gigantic amount of text data to find the most generalizable program.
So I think the “shortest program” is OpenAI’s core technical insight all along. Do you think there's something similar with model evaluation and routing? Is there a "main quest"?
Jason:
If our main goal is what you're asking about, the "end game" here is to commoditize intelligence.
What does that mean? A commodity is something interchangeable, like water, electricity, coal, or oil – standardized products where the different providers don't really matter. It's easy for you to access those commodities, because they have a standardized “interface”.
Right now, LLMs are not commoditized. There are many different and new ones, and using the intelligence from LLMs is complex. Ultimately, we want to make LLMs as easy to use as electricity.
Routing is one means to that end – not the end itself. Routing is crucial for model selection to optimize the whole pipeline. But there are many other things we can also do – add new LLMs, modify them one way or another, improve the prompting layer, etc. We're also researching model interpretability to develop insights into models that will allow us to build better LLM pipelines.
Our goal is to create a world in which, unlike today, LLM application builders don't need to spend half a year first learning to use LLMs or optimize their pipeline before they can build a product. It should be simple and straightforward. Artificial intelligence should become a commodity.
Yuzheng:
This reminds me of John D. Rockefeller – he owned oil refineries but also railways and pipelines for transporting oil. He was both the largest resource producer and, effectively, the monopolist of the infrastructure.
Jason:
Yeah. [The LLM industry’s value chain] is a fascinating topic – we're already seeing different layers emerge, such as chip providers, LLM trainers, LLM inference providers, middleware services like routing, and end applications.
It's fascinating to think about who will accrue the most value. There is a very high chance that organizations like OpenAI would end up with a big chunk, but I don’t think it's 100% guaranteed. Even if there are a million models, I think OpenAI will likely always own the SOTA (state-of-the-art) model; but via routing, we are actually driving a lot of traffic to smaller but cheaper models, which in many use cases might be a better choice than the SOTA model. By serving as an interface that helps people access various models in a smart way, the middleware layer can potentially accrue a lot of value.
One good analogy is Google: it is an intermediary between a highly decentralized internet of content and everyone who wants access to such content. Google Search is essentially the routing layer that connects the two. With this two-sided network effect, Google managed to accrue tremendous value. I think it’s possible for AI middleware to achieve something similar.
Another interesting trend is vertical integration. For example, OpenAI is going to build their own chips. We might see rivalry between leading vertical and horizontal players. Drawing analogies from the past: Apple does everything from hardware to software, whereas Intel provides a standardized, commoditized chip layer to all hardware builders. We might see similar dynamics over time for generative AI.
It's not yet clear what the future LLM industry will look like, but it's very interesting to think about.
V. Compound AI Systems
Sophie:
I agree. I recently wrote a blog post on four big questions about the future of the AI stack, from hardware to infrastructure to applications. It was inspired by an illuminating article from Elad Gil discussing his own observations and questions. There are so many moving pieces.
This is a perfect segue into our final two questions today, on where the AI industry is heading.
One trend we're seeing is a shift of focus from just the foundational model towards compound AI systems. It is of course important to prompt, fine-tune, and retrain LLMs to push the frontier capabilities; but meanwhile, AI developers are increasingly using a carefully composed system – including both LLMs and other tools – to achieve the best end goal.
For example, Google DeepMind's AlphaCode uses multiple calls to LLMs to generate a large number of code samples and then select the best one. Model routing is also a type of compound AI system, where you try to dynamically choose from different models to accomplish each task optimally.
My question for you is, why is this trend catching on? Why are compound AI systems often preferred, or in some cases even necessary? Let's start with Yuzheng.
Yuzheng:
I think it's a temporary phenomenon. Today’s LLMs are very new – it’s only been a year and a half since the release of ChatGPT. GUIs and traditional ML models are still the most accurate, reliable, and predictable way to accomplish most specific tasks today. But they are more or less as good as they can get, while LLMs are like toddlers that are only going to get better.
I like using the analogy of mountains versus oceans. The current ML models and functions are powerful like towering mountains. But large language models, adaptable and versatile, are like the ocean. When the ocean rises, the mountains will gradually become islands. And eventually, there’s nothing left but water.
So I think over time, we'll see an increasing number of complex tasks being handled end-to-end by large language models alone. The compound AI system approach makes sense right now, because AI alone is far from perfect; and I think in the next few years, compound AI systems will become even more prevalent as AI becomes more capable. But eventually, I think it will make more sense to just use larger language models to handle things in general.
Sophie:
Fascinating debate. Are tools and methods in the compound AI systems just crutches for nascent LLMs before they become sufficiently powerful? Or could they be fixtures for a long time? If so, how long? Jason, what's your take on this?
Jason:
I love this discussion too, but I would respectfully disagree with Yuzheng. I think compound AI systems are here to stay, and eventually a large percentage of the industry’s value will accrue to the layers on top of LLMs.
Some context: the term "compound AI system" comes from a recent blog post by the founders of Databricks and a few professors at Berkeley. They argue that the best performance, without exception, requires additional layers on top of raw LLM APIs, whether it's something simple like chain-of-thought prompting or few-shot prompting, or more complex combinations like AlphaCode or AlphaGeometry. These are what we call "neuro-symbolic systems", where you connect an LLM to a more symbolic, traditional inference machine. The interplay between the two can often yield better results than the output from either component individually.
We could think of LLMs as a genius human being – let's say von Neumann, one of the Martians. If I wanted to hire von Neumann to work at Martian, I would need to train him for a while – he's super smart, but he doesn't have the specific context for modern-day AI, nor does he know how to code in Python or collaborate effectively with us.
So OpenAI is producing “genius humans” in a commoditized fashion – every user could have access to a “von Neumann”. But the genius does not have the specific knowledge, context, or incentive structure necessary to perform well in a modern-day company. The compound AI system is basically building the structure and giving that “von Neumann” everything needed to unleash his potential. When the "von Neumanns" get increasingly commoditized and cheap, what complements them will become more important.
My biased view is that we should strive for a world where all the layers on top are more important than the foundational LLMs. Otherwise, OpenAI could one day take over something like 90% of the world's GDP, and we could end up in a dystopia controlled by a tech giant. I don't want that to happen.
But if, let's say, 50% of the value accrues to compound AI systems that build specialized versions of AI tailored for different use cases, the profit sharing will be more decentralized, and the world will be a more open and equitable space.
VI. The Future of AI
Sophie:
This is an intriguing question. Will OpenAI (or whoever is the first to achieve AGI) "eat the world" for better or for worse? Or are we going to see a thousand flowers bloom for both models and other pieces in a compound AI system? Perhaps 5-10 years down the road, we'll see which way we're heading; maybe there’s a third direction that none of us are thinking about yet.
My final question for you both: what do you think the future of the AI stack will look like? Will most general LLMs be abstracted away by intermediaries, whether model routers or some other form of unified model interface? What is your prediction on this? Let's start with Jason.
Jason:
I think about this constantly. Is there similarity between foundational AI labs, like OpenAI or Anthropic, and chip producers? They're making general-purpose products, increasingly standardized, that can do any sort of computing and which other companies are building on top of.
But I do think players like OpenAI and Anthropic will probably exceed Intel in size, though not necessarily Nvidia.
I think this is exactly the industry dynamic – whether it's models, systems, or applications, which layer captures the most value will depend on its relative economies of scale. We have already seen that LLMs can have incredible scale advantages – when you invest a hundred billion in a model, it can be orders of magnitude better.
But one caveat: based on what we’ve seen, the scaling law works logarithmically. Basically, when you put in one more billion on top of the 100 billion, the performance improvement won’t be as big as what the first billion got you. If this pattern persists, it should be easier for other folks to catch up with OpenAI over time.
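[Host note: a purely illustrative way to state the diminishing-returns point – not a fitted scaling law:]

```latex
% If performance grows roughly logarithmically in total spend C, P(C) \approx a + b \log C, then
P(C + \Delta) - P(C) \;=\; b \log\!\left(1 + \frac{\Delta}{C}\right),
% which shrinks toward zero as C grows: the billion after the first hundred buys far less than the first billion did.
```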
On the other hand, potential economies of scale aren’t yet clear for the application and middleware layers. For example, if we manage to create a data flywheel just like Google did, where more customers and more data flowing through our system keeps making our router better, the economy of scale could be huge – we could potentially be even bigger than OpenAI, just like how Google is bigger than any individual internet content provider.
Realistically, I don't think the potential scale benefit here is quite as strong as Google's. It certainly depends in part on future industry dynamics. So we'll see where we end up. Yuzheng, what do you think?
Yuzheng:
I want to challenge this a little bit. A router is a way to make LLMs a commodity. But from the end users’ perspective, they also need to choose between different routers, right? They probably won't trust a single router.
Jason:
You’re correct. Although one way we differentiate ourselves is through building custom versions of the router. It's not a homogeneous product for everybody.
Yuzheng:
Speaking of Google, this reminds me of a discussion I had with Howie Xu. [Host note: Howie is a 20+ year Silicon Valley veteran entrepreneur and educator. He is currently the Senior Vice President of Engineering AI/ML at Palo Alto Networks, a global leader in cybersecurity.] When Howie came across Google for the first time some 20 years ago, his initial reaction, surprisingly, was "Another search engine? The last thing we needed was another search engine." There were already so many search engines on the market that nobody expected Google to be any different when it came out. But look at where they are today!
Similarly, if Martian becomes the next Google one day, it wouldn’t be simply because it is a model router. It'd be because Martian has done something marvelous, fulfilling real user needs, winning customers’ love, and growing constantly.
To that extent, I don't think there’s any guarantee for any layer to “dominate everything,” whether it’s LLMs, middleware, or super apps. Rather, it comes down to whoever can build killer products with the strongest customer loyalty and growth momentum.
Jason:
To your point, I once read about how Jeff Dean [Host note: Chief Scientist of Google DeepMind and Google Research] and early Google engineers handled a highly unusual bug while building Google Search. They debugged tirelessly, delving all the way down to the chip level. Eventually, they uncovered an extremely rare, “supernova explosion”-style event that had caused a floating-point calculation error in the chips; that error, in turn, had produced the incorrect search results.
To me, this demonstrates Google’s greatness – its sheer density of talent and its capacity for this kind of no-stone-unturned, hardcore engineering were instrumental to its success, in addition to the huge economies of scale inherent to Google's business model.
Sophie:
I love that we have a healthy range of opinions here. One fascinating thing about AI is that we can’t yet tell who will emerge as the biggest winner at the end of the day. When ChatGPT first came out, the predominant expectation was OpenAI. But since then, views have diverged – the future shape of the industry has become harder to predict.
Wonderful observations today on LLM interpretability, evaluations, routers, and the future of AI. Thank you both for joining the show!
Jason:
Thank you, Sophie!
Yuzheng:
Thank you!