AI Code Generation: Past, Present, Future
Key Points
- The episode frames code generation as the year’s biggest AI story, noting rapid shifts in software engineering driven by tools like Cursor, Windsurf, and Claude Code, and the rise of vibe coding.
- Adoption has moved beyond early adopters; even former skeptics now rely on AI for project kick‑offs, and hiring processes are beginning to assess candidates’ proficiency with AI tooling.
- Standardization efforts, such as the use of agents.md files, are emerging to give projects a consistent way for AI systems to interpret and act on codebases.
- Despite growing usage, panelists stress that current models remain “lazy” and untrustworthy for end‑to‑end tasks, with notable limitations and occasional catastrophic failures.
- The future is seen as a blend of maturation—where AI becomes a staple in daily workflows—and ongoing turbulence as the technology continues to grapple with reliability and capability gaps.
Sections
- [00:00:00](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=0s) Lazy AI Models & Code Generation - In this Mixture of Experts episode, Tim Hwang and three AI engineers examine the evolution, shortcomings, and future impact of AI‑driven code generation, from model “laziness” and context‑window limits to its reshaping of software development.
- [00:03:03](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=183s) AI Tools Bridge Advanced Coding Gap - The speakers discuss how generative AI, exemplified by Claude 4.5 Opus, is moving beyond day‑to‑day or junior coding tasks to solve deep, performance‑critical problems in low‑level systems like llama.cpp.
- [00:06:38](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=398s) Learning the Limits of AI Tools - The speaker reflects on their difficulty forming accurate intuitions about what AI systems can and cannot handle, emphasizing that the gap is rooted more in engineering culture and evolving norms than in inherent technical shortcomings, and that this understanding will improve with experience.
- [00:10:36](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=636s) Model Differentiation in Code Generation - The speaker explores whether models from labs such as OpenAI and Anthropic are developing distinct code‑generation behaviors, and how any emerging differences might influence programmers’ roles, tool preferences, and community identities.
- [00:14:39](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=879s) Claude Code Simplifies Optimization Workflow - The speaker praises Claude Code for automating file selection and parallel execution, emphasizing a powerful tooling layer that outperforms manual, mathematically driven attempts at optimization.
- [00:18:15](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1095s) Debating Claude Code's Minimal Toolset - The speakers discuss how Claude Code operates efficiently with very few built‑in tools, questioning its design choices, context‑window usage, and potential future enhancements.
- [00:21:45](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1305s) Autonomous Deep‑T Expert Agents - The speaker predicts that as AI models become more capable, specialized “deep‑T” expert agents will require minimal oversight, enabling parallel deployment without constant supervision.
- [00:24:59](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1499s) Challenges Integrating Open‑Weight Models - The speakers explain that despite the power of open‑weight models, most code‑generation tools don’t support seamless plug‑and‑play use, instead relying on hybrid agent architectures, and cite Continue as an example that works well with some models (e.g., Gemini) but poorly with others (e.g., Granite).
- [00:28:39](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1719s) Open‑Source Integration Challenges and Optimism - The speakers contend that open‑source solutions demand significant configuration and vertical integration, limiting their plug‑and‑play appeal, yet they remain confident that community effort can ultimately overcome these constraints.
- [00:31:47](https://www.youtube.com/watch?v=oRRDAZtJLmk&t=1907s) Subscription Lock‑in vs Inference Costs - The speaker highlights how subscription‑only AI coding tools force users to pay per‑token API fees, arguing that only smaller, locally‑runnable models could break this costly lock‑in.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=oRRDAZtJLmk](https://www.youtube.com/watch?v=oRRDAZtJLmk)
**Duration:** 00:35:09
The models are really lazy. Here's my
favorite Claude Code one at the moment:
"Due to context window limitations, I'm
stopping right now." You're like, dude,
come on, try harder. Do you know what I
mean?
>> We were right in the middle of it.
>> Imagine a junior engineer just went,
"No, it's it's 4:30 in the afternoon.
I'm going to knock off in 30 minutes, so
there's no point me looking at this. I'm
going home now."
>> All that and more on today's Mixture of
Experts.
I'm Tim Hwang and welcome to Mixture of
Experts. Each week, MoE brings together
a panel of the smartest minds in
technology to distill down what's
important in the crazy world of
artificial intelligence. Joining us
today are three incredible panelists.
We've got Chris Hay, distinguished
engineer, Olivia Boozek, lead developer
advocate for AI, and Gabe Goodhart,
chief architect, AI open innovation. Um,
this is going to be a fun episode. Uh
it's one of our end-of-year episodes.
So we are basically departing from our
usual uh news story format. And I wanted
to get this group together specifically
to talk about the past, present and
future of code generation.
I think in my opinion code generation is
basically one of the biggest stories of
the year for AI, right? starting from
January to now like the work of
engineering has changed in a very
significant way and you know from Cursor
and Windsurf to Claude Code and sort of
the rise of vibe coding as a thing. Um
really if you're like where's the most
salient impact of AI happening right now
it's it's kind of in software
engineering and code generation. Um and
so I guess Olivia maybe I'll start with
you. I think the question is like what
do you think comes next right? Are we
like now entering a mature space or is
next year going to be kind of like as
crazy and tumultuous as 2025 was in your
opinion?
>> I think it's a little bit of both. So
what I've seen over the last year is
that even the AI skeptics are starting
to use uh AI in their work almost every
day. Um so when you're starting a
project, you definitely use AI to to get
your things off the ground. Um we have
the evolution of things like the
agents.md files that people are putting
in their projects so that you have a uh
a standard way that your particular
project will be interpreted by the AI
and you know in hiring processes we're
seeing people actually checking to see
whether or not you understand how to use
the AI tools. So all of that points in a strong
direction of this thing is here it's
here to stay. Um at the same time I
think we see a lot of limitations as
well. So um so far I have yet to hear
anybody say oh yeah this is as capable
as a human. I hand off all sorts of
tasks to it. I literally just tell it to
look at my board and take take off the
next task and take care of things for me
because it's just not there yet and it's
just not that trustworthy yet. Um and we
have seen a few catastrophic failures
over the course of the year.
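As a concrete illustration of that convention: an agents.md (often AGENTS.md) is just a markdown file at the repository root that tells coding agents how to work within the project. The snippet below is a hypothetical sketch of what one might contain, not any real project's file:

```markdown
# AGENTS.md

## Project overview
A CLI that wraps an internal REST API (hypothetical example project).

## Build and test
- `make build` compiles the project
- `make test` runs the full suite; run it before proposing any change

## Conventions
- Do not mock the API client in integration tests
- Follow the existing error-handling style; keep functions short
```

Because it is plain markdown, any agent that can read files can consume it, which is what makes it a plausible standardization point.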
>> Right. Yeah. Yeah, and I did want to get
into that as kind of like it's almost a
little bit barbell shaped in my mind,
right? I think Gabe, one of the things I
wanted to talk to you about was, you
know, I'm not a day in dayout coder. I I
used to be, but I'm terrible at coding,
so I like stopped doing it and moved
to like a different profession uh
podcast host. And uh I guess the
question for you is like at least for
me, these tools have kind of
revolutionized the game because I can
just sit down and start having fun. Um,
but I guess the question to Olivia's
point is like do we still feel like
there's kind of a gap to using this at
like the frontier, the most complex, the
hardest software applications? I guess
the question is whether or not like this
is still more in the realm of like yeah,
if you're doing day in dayout coding
work or you're kind of a junior
engineer, that's where most of the
action is happening. Do do you buy that
as a premise?
>> I would have said yes last week. Uh,
what changed? I used Claude 4.5 Opus to
crack a problem that I have been trying
to crack for months and it nailed it in
under an hour. Um, and this is a problem
that is deep in the guts of llama.cpp.
Trying to get better performance out of
the uh recurrent models, optimizing the
metal kernels, understanding the shape
of uh, you know, grid layouts and thread
group dynamics and SIMD group memory
sharing and just the gnarliest corners
of gorpy bits. And let me just be clear,
the official internet documentation for
Apple Metal programming is a 2,000-page
PDF. That's it. Uh there there is no
like you go to CUDA and the internet is
full of good information. I expect AI
models to nail CUDA, but for metal like
I was blown away at how strong it was.
So to that end, I will say I think the
barbell-shaped analogy or something
something shaped that is not nice and
uniform, whatever physical shape you
choose, is exactly my experience right
now. These models can do some amazing
stuff and they can fall really, really,
really flat in what should be really
simple use cases. So the opposite end of
the spectrum, the reason I would have
said this a week ago was I also heavily
used Claude Code to try to build out a
CLI for a pretty straightforward REST
API. And it did a fantastic job of
cranking out a beautiful CLI with lots
of nice pretty colors and inline JSON
highlighting and all sorts of awesome
stuff that was 100% code coverage, too.
Like the tests were great. They all just
mocked everything and didn't actually
test anything and nothing worked, but it
was so cool. And then I spent a week
chiseling. Somebody, I think from the
Continue team, coined this phrase of
chiseling. Um so you you basically like
use it to create the rough block and
then you have to chisel out the shape of
what you actually want your statue to
look like under the underneath the the
very rough block that you just like
splatted out with a code assistant. So
this was my experience. It actually
probably took much longer. Now the the
end product might be prettier. I
probably wouldn't have come up with all
the the latest coolest um CLI libraries
myself. But the process of actually like
fixing all of the stuff that it just
mocked away in the unit tests and said
sure if I pretend that this is the right
answer then I got the right answer when
I did this in my code uh was really
frustrating. So it's it's in some ways I
would have expected exactly the opposite
experience that generating a CLI against
a well-defined REST API is bread and
butter like that should just be for
point and shoot forget about it. Uh, and
then, you know, deep in the gnarly weeds
of metal optimization is where it would
fall over and have no idea what to do.
But it so
>> it's actually like the reverse. Yeah. I
think it's right.
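The failure mode Gabe describes, full coverage from tests that mock away the very behavior they claim to verify, can be shown in a small Python sketch (fetch_status and its endpoint are hypothetical):

```python
import unittest
from unittest import mock

def fetch_status(client, job_id):
    # The logic under test: call the API and summarize the result.
    payload = client.get(f"/jobs/{job_id}")
    return f"{payload['id']}: {payload['state']}"

class MockedAwayTest(unittest.TestCase):
    # This test gives fetch_status 100% line coverage but verifies almost
    # nothing: the mock supplies the payload, so the assertion can only
    # confirm that the mock returned what we told it to return.
    def test_passes_even_if_the_api_contract_is_wrong(self):
        client = mock.Mock()
        client.get.return_value = {"id": 7, "state": "done"}
        self.assertEqual(fetch_status(client, 7), "7: done")

    # A slightly more honest version at least pins down the request the
    # code makes, so a change to the URL scheme fails the test.
    def test_pins_the_request_shape(self):
        client = mock.Mock()
        client.get.return_value = {"id": 7, "state": "done"}
        fetch_status(client, 7)
        client.get.assert_called_once_with("/jobs/7")
```

Both tests pass and both count as coverage; only the second would catch a change to the endpoint, and neither exercises a real server, which is exactly the trap being described.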
>> Yeah. You're pointing out like the
intuitions are flipped, right? You're
like, how could you not get this right?
>> For me personally, what it points to is
that I just don't quite have a good
intuition about what it's going to be
good at and what it's not going to be
good at. And I've tried to build that
intuition which means that there's still
some
misalignment between the capabilities of
these tools and sort of the day-to-day
sort of mental model that at least I as
a developer have around the complexity
of a task and the difficulty of a task.
So there's still some learning on
my part to do.
>> Well, and I do want to go
to that point, right? Because I think
Chris, typically when we've seen these
like high-profile failures, I think
people are like, "Haha, look at the
terrible AI." And I'm kind of of the
view, it's just like maybe those kind of
get ironed out over time as engineers
understand like what these systems are
good and bad at. So, it's actually it's
less of a technical problem and actually
more of like an engineering culture and
norms and understanding problem. Like,
we don't actually we've got this hammer,
but we're still not really sure what
hammers are good for yet. And so we're
kind of swinging it around being like,
"Oh, it wasn't really good for that."
Um, and like maybe those problems kind
of like disappear with time as we kind
of get a little bit more mature on how
to use these
applications. Um, do you buy that?
>> Yeah, I think so. I think one of the
questions I like to ask myself with the
with the coding models is who is the
architect?
And and and I asked that question for a
second because if if the if the coding
assistant is the architect, then it's
going to choose the framework. It's
going to choose which libraries. It's
going to choose whether it's going to
mock or not mock, etc. Right? You're
putting all of the decisions onto the
model. And and that is okay. I I mean if
you don't really know a language or you
don't know the frameworks or you you're
not a UI person or whatever then you
don't really have much of a choice right
so you're saying actually I don't quite
know what I'm doing here so I'm more in
the vibing world so go do that and and I
think therefore it can make bad
decisions in that sense and and and Gabe
to your point the models are really lazy
right if they think they can get away
with just mocking something up or they
can just go ah I can't you know here's
my favorite Claude Code one at the moment
uh you know uh due to context window
limitations you know I'm I'm stopping
right now you're like dude come on try
harder do you know what I mean in the
middle of it
>> imagine a junior engineer just went no
it's it's 4:30 in the afternoon I'm
going to knock off in 30 minutes so
there's no point me looking at this I'm
going home now you know you'll be like
ah you're fired you know but um so I I
think there's a fair point but but
asking who is the architect in this
case? And sometimes it's okay not
to be the architect, right? You you're
vibing, you're prototyping, you're doing
whatever. Um, but I think you're in a
different paradigm when you want to
start productionizing and that's where
you have to really use things like the
rules. If you're in Claude Code, that's
your CLAUDE.md or your AGENTS.md; if
you're in Cursor, you need to use rules
or Cline or whatever to
really guide the model and say this is
the architecture that I expect these are
the standards that I want you to follow
and uh and therefore you probably have
to put as much effort into architecting
um as you would normally do with
architecture. So I I think maybe I think
that is the big paradigm shift that is
probably happening which is architecture
is going to become more important but
actually being able to write your
architecture which is AI friendly agent
friendly as opposed to in a word
document or a UML diagram sitting
somewhere on the cloud. Right? It is it
is really about um orchestrating with
the AI and then you're going to get
really fast feedback loops.
>> Well, and I think that's actually one of
the things I I do want to talk a little
bit more about this kind of like the
evolving role of like the engineer or
the programmer in all this. Um and you
know I think one of the things I'm
really interested in is how all these
models are kind of differentiating with
time, right? Like we live in a world I
think Gabe you might have made this
comment on a previous episode. you were
like we live in like model abundance.
There's like all these models and
they're all really really good. Um and
you know I guess I'm kind of interested
in like if you're seeing differentiation
and living maybe I'll toss this question
back to you is you know just take OpenAI
and Anthropic for a second like do
you feel like these models are
approaching codegen differently? Like is
the kind of code they're producing
different in flavor, or is everybody
kind of converging on the same kind of
code generation with time? And I
asked that a little bit because it's
kind of like you can imagine being like,
"Oh, I really understand what OpenAI is
good and bad at, but I have no idea what
Anthropic is good and bad at." And that
really has big implications for for how
almost these models become a kind of
like programming language over time,
right? Like that it's almost like a
tribe that you say like, "Oh, I'm a I'm
a Pythonista." Um, I'm wondering if that
kind of thing is on its way or or if
you're not seeing that.
>> Currently I'm seeing a lot of people just in an
experimental phase with a whole bunch of
different ones because um I don't know
that we have solved that
characterization. That characterization
may evolve more over time. I also think
it's in some ways less about the models
themselves and more about the agent
architecture that's underlying those
code assistants and I think that is
making a much larger difference. Um so
for example when I'm playing with a a
truly when I'm just playing with the
model itself on something like Continue
um in my uh in my IDE then I'm not
getting that agent experience and so
because it doesn't have very many agentic
pieces to it like it almost doesn't matter what
what model I throw at a particular
problem it can only do so much. Um where
there you see a huge difference though
is in uh the actual planning for a task.
So like in one uh assistant it'll be
like oh I I you know my planning tends
to be focused more around like security
and optimization problems and so it'll
get stuck on that part and another agent
will be more interested in like I really
want to do this mocking thing that Gabe
is talking about. Um, and so you'll see
like tendencies because of the agent
architecture that's underlying it, which
is of course completely opaque to the
user other than um the way that you sort
of start sort of start feeling it out.
>> Yeah, that's actually really
interesting. You're you're almost saying
that this is like less of a function of
the model, but just kind of agent
orchestration is producing these like
differentiations over time.
>> Gabe, I guess you're you're nodding
shaking your head if you want to jump
in. Yeah, I'm I'm doing this weird nod
shake head thing at the same time
because uh Olivia when you said that
that's exactly the comment I wanted to
make here as well is that I've said this
on many episodes but
>> the user experience of any one of these
AI uh tools is a combination of the
quality of the model and the quality of
the system that is built around the
model. And in this case, I have seen
tools, literally multiple different
tools using the Claude family of models
behave extremely differently with
exactly the same flavor of problem
thrown at them. And it comes down to the
implementation of simple things like
context compaction. What do you do when
you get a 20,000-line C++ file thrown at
you? Do you just explode or do you
carefully read it in chunks of 100 lines
and keep going? What do you do when uh
you know you are unable to find an
answer on the internet or when you try
an experiment and it fails? How do you
back up and try again? So these things
are all at that orchestration layer. And
I think this is where the actual
individual tools are going to
differentiate themselves. It's why I
keep coming back to Claude Code because I
think of all of the tools I've tried,
they have this experience of "it just
works" nailed. Everything else has
just required so much more finagling
from me and sort of babysitting from me.
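The chunk-by-chunk reading strategy mentioned above (100-line pieces of a 20,000-line file) can be sketched in a few lines of Python; the chunk size and the note-taking step are illustrative assumptions, not any tool's actual implementation:

```python
def iter_chunks(lines, chunk_size=100):
    """Yield (start_line, chunk) pairs so a huge file can be fed to a
    model in pieces that fit its context window, instead of exploding
    on the whole thing at once."""
    for start in range(0, len(lines), chunk_size):
        yield start + 1, lines[start:start + chunk_size]

def scan_large_file(lines, note_fn, chunk_size=100):
    """Walk a file chunk by chunk, keeping only compact notes about
    each chunk rather than the raw text: a naive form of context
    compaction."""
    notes = []
    for start, chunk in iter_chunks(lines, chunk_size):
        notes.append(note_fn(start, chunk))
    return notes

# Example: index which chunks of a 20,000-line file mention "kernel".
lines = [f"line {i}" for i in range(1, 20001)]
lines[12345] = "void kernel_main() {"
hits = scan_large_file(
    lines,
    lambda start, chunk: (start, any("kernel" in ln for ln in chunk)),
)
```

The point is only that the orchestration layer, not the model, decides whether a giant file becomes 200 manageable notes or one context-blowing blob.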
Whereas with Claude Code, I don't have to
select what mode it's in. I don't have
to, you know, carefully choose uh, you
know, oh, I I'm going to only send you
files that have context that I know you
can handle. I don't have to do any of
that stuff. I just point it at files on
the internet. I point it at files on my
local machine. It asks me at the right
times when to do what operations and it
goes to town. Um, so that's my personal
favorite these days. Um, and I really
think this uh, you know, this is the the
tooling layer is really important. That
said, the reason I was doing the funky
shake the head thing is that uh, you
know, again using this example from the
last couple of days, you know, I did
this work on the metal optimization with
Claude Code and 4.5 Opus and was blown away.
Literally in parallel, I had
>> You did it, Gabe? I mean, you just said
you knew nothing about it and it was nice,
Opus figured it out for you. How much did you
actually do?
>> Let Okay. All right. We we'll unpack
that one. So, so actually yes. Uh and
this was actually to to your comment
Chris. I love your framing of you know
who's the architect here, right? So, I
have been banging my head against this
problem for months now. I've been trying
to tackle this from the mathematical
perspective of reformulating the SSM
scan operation as SSD following the
Mamba paper blah blah blah blah and
turns out I was looking in the wrong
place. The right place to look was the
very inefficient SSM conv implementation
that didn't take advantage of thread
grouping. Who knew? So I actually was
the one that figured that out myself by
carefully commenting out chunks of code
and realizing that if I took away the
SSM conv operation, I got double the
performance, which was the light bulb
for me that said, "Ah, shoot. I've been
looking in the wrong place." Then I went
over to and I've I've read this kernel
many times myself. I have not seen
anything that says this is clearly a
problem because I don't know the ins and
outs of how the metal GPU is
architected. So, I got all the way to
the point of I found the problem, but I
don't know what to do with it. I pointed
Claude at that and said, "Claude, here's
what I'm experiencing. Here are
literally the commands I've been running
to isolate this. Here's the line I had
to comment out to get to this point of
discovery. Please take it from here."
And it was able to say, "Oh, I read that
code. Thank you for the pointers. The
problem is right there, this line." And
that was where we got to. So, I did a
lot of the work to get there. Claude did
the work to actually solve the problem.
And actually, Gabe, I I think to your
point, I think you're sort of really
stating where we are just now, which is
if you don't know what you're doing at
all, you will get so far. Um, it won't
be the most maintainable code. It will
be a bit muddy, whatever, etc. But
today, you still need the human part of
that loop, right? you you know you need
to guide Claude and and and actually to
your point in Claude code um it really
does deal in a couple hundred lines at a
time. So it's a kind of very very narrow
window and yeah you can you can direct
and push it to different places. So if
you need that broader view and you need
it to look at the larger context, you're
either having to do some thinking
yourself, or you're going to go to
the Claude web interface and you're
typing in there going, think a little
bit further for me. But but the point is
um you do need to do that thinking um
today. I I'm not so sure that is going
to be so necessary in the future if I'm
on
>> what I was going to ask is like I mean
is the hundred line thing is that that's
a design choice by Anthropic, right?
>> Yeah. Yeah. Yeah.
>> So I guess how do we read that right
like I think I mean the most generous
interpretation is they actually want you
to do some thinking but uh I don't know
if you would read it that way.
>> I think they just don't want you to burn
your context windows. Do you know what I
mean? I I I really think it's as simple
as that. But it is actually
>> remarkably
efficient at it, right? I mean, if you
look at the tools that Claude Code has,
it actually has very few tools,
right? It's got fetch, it's
got grep, it's got bash.
>> I mean, what more do you need? What more
do you need?
>> I'm going to be honest, that that's
actually good enough. So,
>> yeah, exactly.
>> Said,
>> yeah, it Well, you get them via Bash,
Gabe, so you know.
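As a rough sketch of how small that tool surface is, here are toy versions of the three primitives; these wrappers are illustrative assumptions, not Claude Code's actual implementation:

```python
import subprocess
import urllib.request

def tool_bash(command, timeout=30):
    """Run a shell command and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

def tool_grep(pattern, path):
    """Search files for a pattern by delegating to the system grep,
    which is itself just a bash invocation."""
    return tool_bash(f"grep -rn {pattern!r} {path!r}")

def tool_fetch(url):
    """Download a URL's contents as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# The whole toolbox: three entry points, everything else via bash.
TOOLS = {"bash": tool_bash, "grep": tool_grep, "fetch": tool_fetch}
```

Note that grep is implemented on top of bash, which is Chris's point: with a capable model driving it, a shell plus fetch covers a surprising amount of ground.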
>> That's true. That's true. Yes. So, so in
reality, it actually has very few
tools, but it it is incredible in the
way that it's able to execute. So, I I
sometimes question my lifestyle choices
in in building MCP servers every so
often going am I wasting my time here
because uh you know, Cloud Code does so
well with so little tools. Um but I I I
I just think that it reflects kind of
where we are today though. Um but I I do
believe in the future that that the
tools are going to get more efficient.
They're not only going to be using grep;
you know if you look at things
like cursor for example they are
indexing your code base. I'm sure um
Claude Code is on that path already. In
fact I think they released something uh
recently and therefore I think a lot of
those constraints we're talking
about are going to go away. Ultimately I do
think though that you are still going to
be part of that loop. you still want to
be that architect. Um but but I I I
think the progression is actually just:
treat it like another compiler. I hate
to say it this way but um you know we
went from punch cards to assembly to C
to C++ to Java, and then Python to
JavaScript to TypeScript, etc. So you've
gone up and up the stack,
but do you I mean apart from Gabe and
his story about looking at Apple Metal
there, but do we really go and look at
the assembly code that often now? No,
because we kind of trust the tools to do
that job. And I and I I just wonder, in
fact, I don't wonder. I'm pretty sure
we're in that kind of paradigm shift.
And um but it it's you still need to
know what's going on below the hood,
right? But I but I think we're at
another higher level abstraction going
forward.
>> Well, yeah. And to to wrap up that you
know where we are today point you know
what I was going to say is in parallel
to doing this with 4.5 Opus I did a
simpler task with 4.5 Sonnet and it
needed way more oversight than than what
I had to give opus. Opus actually did a
great job of looking at git history
looking at adjacent files looking at uh
all the pointers I gave it both on the
web and locally. um and needed
ultimately very little oversight in
solving a complex problem. Uh
sonnet on the other hand uh I pointed it
at a gnarly problem that's very hard to
test because it crashed my terminal
every time it it it uh was triggered. Uh
but but literally the whole terminal app
just died which was a pain. Um but uh
Sonnet claimed success like three or
four times and I had to keep going back
and saying no I'm pretty sure that's not
right. I'm pretty sure that's not right.
So there is a model capability here and
I think the difference we're going to
see and I think I I haven't tried this
against um Gemini 3 or the latest
versions of the OpenAI GPT-5 Codex models
um or any of the other latest gen ones.
But I suspect that that's the capability
difference we're going to see is
essentially how much oversight do you
have to give this sort of deep-T
expert that you're pointing at a
specific problem. the the thing that I
think is going to be really interesting
for next year um is to see if that
individual task-oriented deep-T expert.
When I say deep-T, I'm referring to
the deep part of T-shaped skill sets, right,
like I think what we're seeing right now
is that if you give one of these models
a well-researched problem in a domain
where you are not yourself a deep-T
expert who could go into it in great
depth, um it can actually do a very
very good job of of solving that, but
the more capable the model, the less you
have to supervise that solution. I think
going forward, we're going to see this
paradigm that we see peeking out from
under the covers with Google's
Antigravity of I've hit a point where I
can actually reliably trust that that
deep-T expert is going to get the
problem correct. So now I can start
launching a bunch of these in parallel
and not babysit them. I think that's the
the holy grail to get to next year.
We're definitely not there yet. From
what I hear from Antigravity users,
from other attempts at fleets of agents
and becoming an agent manager, I don't
think we're there yet. Um, but I just
smell test based on the capability gap
from 45 sonnet to 45 opus. I'm curious
if we will actually I I feel at least
some optimism that we will get to that
point next year where you can actually
queue up large quantities of tasks with
independent operation on them and
basically only uh supervise them when
they come back and tell you they're
completed.
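The fan-out-and-review workflow described here can be sketched in a few lines. This is a minimal illustration of the dispatch pattern, not any real agent API; `run_agent` is a hypothetical stand-in for launching one of these independent experts.

```python
# Sketch of the "fleet of agents" pattern: launch independent tasks in
# parallel, then review each one only when it reports completion.
# run_agent is a hypothetical stand-in for pointing a real coding agent
# at a task; here it is a trivial stub.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent(task: str) -> str:
    # Placeholder for "point a deep-T expert at a specific problem".
    return f"completed: {task}"

def dispatch_fleet(tasks: list[str], workers: int = 4) -> list[str]:
    """Run tasks in parallel; supervision happens only on completion."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_agent, t) for t in tasks]
        for fut in as_completed(futures):
            # This is the only point where the "agent manager" steps in.
            results.append(fut.result())
    return results

print(dispatch_fleet(["fix the flaky test", "port the build script"]))
```

The point of the pattern is that the human reviews results at the `as_completed` boundary rather than babysitting each task as it runs.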
>> I'm going to move us on to a final topic, one that's particularly fitting given the folks on this panel: I want to take the last few minutes of this episode to talk about open source. One of the big meta-narratives of 2025 is that open source is continuing to catch up. It used to be, give it a few months and open source will have what the state of the art had. Now it feels like it's basically at par with, or even getting ahead of, the proprietary models. Olivia, to hear from you on your experience with this, the question I have is whether we're going to see that pattern also play out in the codegen space, where these open models start to be able to do code generation at, say, a Claude level. Is that in the offing? Why or why not?
>> Yeah. So I think we have to make a little bit of a distinction here between open weights models and the open-source frameworks being used to do codegen. I mentioned Continue, which is an open-source framework you can use for codegen, though as I said, they haven't really leaned into the agentic pieces yet, so you're kind of on your own in making the model highly performant. But then a lot of the models we're talking about are in fact open weights models, where you really can download the weights and put them behind a whole bunch of different things. What we're not seeing yet, I think, is an openness within the most common tools to use just any open weights model on the open market. We're not seeing every code generation tool say, you can pop in whatever open weights model you want. They end up doing a hybrid synthesis of an open weights model combined with an agent architecture designed for that particular model. So I think we're still seeing those combinations be more successful than the open weights models on their own. That doesn't mean the open weights models aren't powerful; it just means they need a lot more guidance than being usable off the shelf.
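Part of what makes "pop in any open weights model" plausible in principle is that most open serving stacks (llama.cpp's server, vLLM, Ollama) expose an OpenAI-compatible HTTP API, so swapping models is largely a matter of changing a base URL and a model name. A minimal sketch with Python's standard library; the endpoint address and model identifier below are hypothetical placeholders:

```python
# Sketch: addressing a locally served open weights model through an
# OpenAI-compatible chat completions endpoint. The base URL and model
# name are hypothetical; substitute whatever your serving stack exposes.
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Assemble an OpenAI-style chat completion request for a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request(
    "http://localhost:8000/v1",   # wherever your local server listens
    "my-open-weights-coder",      # hypothetical model identifier
    "Reverse a string in Python.",
)
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

Swapping models means changing only the two placeholder strings; the tooling-side agent architecture, which is where the panel says the real coupling lives, stays the same.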
>> The one delta I would add is that Continue actually has leaned heavily into agents. But Continue, like many open-source tooling layers, is trying to split the difference between running against local models and against hosted closed models. Their agents work great if you plug in Claude or Gemini. I spent a bunch of time last week trying to get it to work with Granite 4 Small, and it does not work very well. There are others out there, like OpenCode, which I also tried extensively with Granite 4 Small, to similar effect. Now, part of this could simply be the size of these models; I haven't tried running against a really large, frontier-level open model, because I can't run that on my dev box. But I also think there is an inherent advantage for closed ecosystems in being able to co-evolve the model and the tooling together, so that you're not trying to maintain this level of separation between the model's capabilities and the actual agentic patterns around it. All of these agentic tooling layers, for coding or otherwise, involve a great deal of prompt engineering and a great deal of manual tuning: I've seen that it tends to fail in this corner case, so I need to code my way or prompt my way out of that corner case. It's just really hard to do that in a model-agnostic way, so I think that's one of the big advantages. I haven't personally tried Qwen's own Qwen 2.5 Coder local CLI. I probably should give that one a shot, because I think it's an example of an open ecosystem trying to do this, with a model-specific open tooling layer. The one I have tried is pointing OpenAI's Codex at GPT-OSS-120B, and I would say that's a solid step up from running Continue or OpenCode against Granite 4 Small. Again, model size is a big element here, but so is the pairing of model capabilities with the agent side. So I don't have a clear, decisive answer, but I do think you're spot on to point out that this really has to do with the software layer, and that's probably where the most catch-up needs to happen on the open side, relative to the model capabilities. Because of the loose coupling in open source, though, I think it's going to be a little harder to reach those peak performance capabilities.
>> Yeah, Olivia, this almost feels like a story of vertical integration, where maybe there's a structural advantage. The dream of open source is that you take a bunch of components off the shelf, click them together, and with a little spit and polish it works. But it feels like here, the amount of work it still takes to get the model and the software working together is something open source almost structurally has a problem with. It's strong at other things, but in this particular case it might have some limitations. Does that mean we should be a little pessimistic about fully open ecosystems for codegen, or do you feel there are things the community will do to deal with this?
>> So I'm still heavily optimistic about it. I just don't think we can draw the conclusion that open source is ever an off-the-shelf solution. I think this is true in every space: if I look at, say, an Ubuntu GUI, I can do anything, but I have to know what I'm doing. Even to this day, if I'm using Linux for something, it's going to require more configuration, but I can configure the heck out of it and get exactly what I want from it. I think we'll see more of that. So imagine a world where you have a lot more control over the agent architecture and you also get to choose your open weights model. Then you can say, well, 90% of my work looks like this particular type of task, and it turns out Claude Code doesn't necessarily get me there, or Codex doesn't necessarily get me there, but I'm doing this particular type of task all the time. I'm not going to speculate about exactly which tasks those would be, but I believe they're going to exist, and these open ecosystems are going to enable that. I also think the development of this open ecosystem allows rapid innovation sharing and keeps us always working at the state of the art, which is not something you get when everybody is fully closed. In a world where everybody is completely closed, we can never tease apart whether the model is actually making the difference or the agent architecture is making the difference. Once it's that tightly held together, you can only look at whether the two together are succeeding. You can never say, okay, I'm just going to swap out the open weights model underneath; I'm going to change Opus to Sonnet, and then I'm going to check out GPT-OSS. If you're completely unable to make those comparisons, you'll never know: is this caused by my agent architecture, or is this caused by my model?
>> Cool. And I think the biggest problem in my mind is the cost of inference. If we really analyze this for a second: the folks using Claude Code are all sitting on our various plans. I sit on my Max plan, or Pro, or whatever plan I'm on, and therefore I'm never worrying about the cost of tokens. I would not be paying the API cost in a million years. So I'm using Claude Code, and the only way I can use my Max plan is through Claude Code. I can't use an open-source tool to connect to it; that's not allowed. And they're not the only ones that do that. Gemini is the same. Codex is the same. Even Qwen and Kimi K2 offer similar plans. But you are buying a subscription, and you can only go through those tools, so you are locked in to that tool. Now, you can go and talk to the other models, but you're going to be paying the API costs, and that is a problematic element. That's why the Cursors and Windsurfs of the world are all developing their own models: they need something that can satisfy the subscription plan, because people don't want to pay per use; they want to pay per subscription. So when does this change in reality? The models need to be much, much smaller. If you have a coding model that is 3 billion parameters, or 7 billion at a maximum, that can run on your machine and is as capable as Opus 4.5 is today, then at that point all of the open-source tools can go wild. But until then, given the cost of tokens, I think you're stuck in that vertical stack, and it's hard to mix and match. Technically you can do it, but economically it just doesn't make a lot of sense. Now, Gabe, to cover your points about the capability of the models, the good news is I have played with a bunch of those models. I'm a model connoisseur; I love playing with different models. When I did my Kimi K2 video, I was genuinely surprised at how good that model was. I was like, whoa. I mean, it's not at Claude 4.5 Sonnet or Opus levels, but it was pretty darn good. And I would say the same about DeepSeek V3.2: I was playing with the reasoner model at the weekend, and again, an incredible model. But do you know what? It's an open weights model, but can I run it on my machine? No. I can download it, sure, after a few days, but I've got nothing to run it on. So I think inference needs to be sorted out, and until then, we're going to be sitting on these vertical stacks.
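The subscription-versus-API economics behind that lock-in can be made concrete with a back-of-the-envelope comparison. Every figure below is a hypothetical illustration, not a real price list:

```python
# Rough comparison of a flat subscription vs. per-token API pricing.
# All numbers are hypothetical illustrations, not actual vendor prices.
subscription_per_month = 200.00     # flat "Max"-style plan, USD

price_per_million_tokens = 15.00    # blended input/output API rate, USD
tokens_per_day = 5_000_000          # heavy agentic coding usage
working_days = 22

api_cost_per_month = (
    tokens_per_day * working_days * price_per_million_tokens / 1_000_000
)

print(f"API cost/month: ${api_cost_per_month:,.2f}")   # $1,650.00
print(f"Subscription:   ${subscription_per_month:,.2f}")
print(f"API premium:    {api_cost_per_month / subscription_per_month:.2f}x")
```

At these assumed volumes, paying per token costs several multiples of the flat plan, which is exactly the incentive that keeps heavy users inside each vendor's own tool.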
>> Yeah, absolutely.
>> 100% agree.
>> Yeah, same.
>> Well, on that note of unanimity: Gabe, Olivia, Chris, this panel is fire. I wish I could bring it together once a quarter. It's amazing to have you all on the show. And that's all the time we have for today. Thank you to all our listeners. If you enjoyed what you heard, you can find us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we'll see you next week on Mixture of Experts.