Gemini 3 Launch and AI Hallucinations
Key Points
- Gemini 3 was unveiled with dramatically higher benchmark scores—especially on Humanity’s Last Exam and the ARC‑AGI tests—signaling a major performance leap for Google’s model.
- Early user feedback notes that Gemini 3 still tends to “hallucinate” and prefers to give an answer rather than admit uncertainty, though it appears less aggressive about making false claims than earlier versions.
- This week’s AI roundup highlighted big moves: Microsoft and Nvidia teaming with Anthropic on a $15 billion infrastructure pact, CMU researchers finding AI agents fail ~70% of real‑world corporate tasks, IBM launching a live‑alert AI platform for UFC events, and OpenAI releasing a ChatGPT variant for K‑12 teachers.
- The expert panel—Marina Danilevsky, Gabe Goodhart, and newcomer Merve Unuvar—discussed the implications of Gemini 3’s capabilities and its lingering hallucination issue, reflecting a mix of excitement and caution about the model’s real‑world reliability.
Sections
- AI News Roundup & Expert Panel - The Mixture of Experts podcast introduces its guest panel and reviews the week’s AI headlines, covering Gemini 3, a Claude attack, the Microsoft‑Nvidia‑Anthropic $15 billion deal, CMU’s finding that AI agents fail 70% of corporate tasks, and IBM’s AI‑driven UFC live‑alert platform.
- Assessing Gemini 3’s Ecosystem Edge - The speaker sees Gemini 3 as Google’s move to strengthen its AI moat by offering novel ecosystem tools—such as the Antigravity agentic development platform and a management‑of‑agents framework—rather than just a superior model, and remains uncertain about recommending it.
- AI Creates Custom Workout Dashboard - The speaker describes using Gemini on the Antigravity platform to quickly generate a personalized workout plan and an interactive Streamlit dashboard, highlighting the model’s multimodal code, UI, and artifact generation capabilities.
- Balancing Generalist and Specialist AI - The speaker discusses the trade‑offs between a single all‑purpose model and specialized agents, noting precision‑recall tension, the value of multimodality, and how task‑specific automation may reshape AI deployment.
- IBM's Kuga Enterprise Agent Development - The speakers discuss IBM's newly announced Kuga generalist agent, outlining its progression from simple domain-specific bots to a multi‑agent, task‑decomposing architecture designed for enterprise readiness.
- Agents Becoming the New API - The speaker predicts that by 2025, building AI agents will be as routine and standardized as creating REST API services, with open, configurable tools and built‑in management handling deployment, security, and scalability.
- Spokes and Hubs Metaphor - The speaker argues that even with accelerated technology, effective problem‑solving still relies on a hub‑and‑spoke structure of specialists and managers, and outlines future AI agent initiatives (Kuga, ALTK) aimed at benchmarks like WebArena.
- AI Benchmarking and Economic Impact - The speaker directs listeners to Kuga’s resources, then explains that discussions about AI’s economic impact have moved from alarmist job‑loss predictions to systematic evaluation, highlighting OpenAI’s GDPval benchmark that tests AI against human experts on real‑world professional tasks.
- Human vs AI in Benchmarking - The speaker critiques AI benchmarks that rely on approximations, human graders, and proxy models, highlighting the paradox of using humans to evaluate AI while aiming to eliminate human labeling.
- Evaluating AI Benchmarks and Real-World Tasks - The speaker critiques headline‑driven metrics, urges readers to examine paper appendices and prompt‑based evaluations, discusses the complexity of multi‑model assessments, and highlights the need for deeper analysis of AI’s impact on jobs.
- Measuring AI Impact in Clinical Conversations - The speaker highlights the difficulty of quantifying benefits from extracting insights in doctor‑patient dialogues—such as best‑practice identification, supply‑chain optimization, and personalized follow‑ups—and stresses the need for realistic benchmarks, human evaluation, and KPI‑driven ROI tracking.
- Observability as Security for LLM Agents - The speaker argues that, since perfect alignment is unlikely, embedding robust telemetry and monitoring into LLM systems can provide transparency, enable rollback, and build trust—especially in controlled enterprise settings.
- AI-Driven Exploit Scaling - The speakers explain that AI can rapidly automate attacks on known, unpatched vulnerabilities, underscoring the urgent need for tighter patch timelines and built‑in safeguards in defensive AI systems.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=7T_TjH6P8CE](https://www.youtube.com/watch?v=7T_TjH6P8CE) · **Duration:** 00:46:32

- [00:00:00](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=0s) AI News Roundup & Expert Panel
- [00:03:14](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=194s) Assessing Gemini 3’s Ecosystem Edge
- [00:06:33](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=393s) AI Creates Custom Workout Dashboard
- [00:09:40](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=580s) Balancing Generalist and Specialist AI
- [00:12:50](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=770s) IBM's Kuga Enterprise Agent Development
- [00:16:34](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=994s) Agents Becoming the New API
- [00:20:07](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=1207s) Spokes and Hubs Metaphor
- [00:23:39](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=1419s) AI Benchmarking and Economic Impact
- [00:27:12](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=1632s) Human vs AI in Benchmarking
- [00:32:04](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=1924s) Evaluating AI Benchmarks and Real-World Tasks
- [00:35:18](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=2118s) Measuring AI Impact in Clinical Conversations
- [00:39:19](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=2359s) Observability as Security for LLM Agents
- [00:43:59](https://www.youtube.com/watch?v=7T_TjH6P8CE&t=2639s) AI-Driven Exploit Scaling
It was interesting to me that a couple of people
reported that it's still hallucinating and it still really likes
to give answers rather than say that it doesn't know
the answers, although it's not so sycophantic about it. But
it still really likes to give answers. So it's an
interesting combo. All that and more on today's Mixture of
Experts. I'm Tim Hwang and welcome to Mixture of Experts.
Each week MOE brings together a panel of the finest
minds of technology to distill down what's important in
the latest news in artificial intelligence. Joining us today are
three incredible panelists. We've got Marina Danilevsky, Senior Research Scientist,
Gabe Goodhart, Chief Architect, AI Open Innovation. And joining us
for the very first time is Merve Unuvar, Director, Agentic
Middleware and Applications Research, AI. All right, lots of interesting
topics for today's episode. We're going to talk a little
bit about, of course, the drop of Gemini 3, some
attack using Claude. But first we've got Illy with the
news. Hi, I'm Illy McConnell, a tech news writer for
IBM Think. Here are this week's AI headlines. Microsoft and
chipmaker Nvidia will partner with AI startup Anthropic in a
$15 billion AI infrastructure deal. Carnegie Mellon researchers discovered that
AI agents fail a shocking 70% of the time
on corporate real world tasks. IBM and the Ultimate Fighting
Championship have launched an AI driven live alert platform that
delivers real time records and milestones during UFC events to
viewers. OpenAI announced ChatGPT for teachers, a version of its
popular chatbot for K-12 educators. For more, subscribe to the
Think newsletter linked in the show notes. And now let's
see what our experts think of Google's Gemini 3.0. First,
I want to start with the big news of the
week which is the launch of Gemini 3. So long
rumored, long teased, but finally out. And it's a remarkable
model. I mean from some of the benchmarks Google is
reporting explosively good performance on what I think has been
considered some of the most difficult evals and benchmarks that
are out there. So huge leaps on Humanity's Last Exam,
really big jumps on ARC-AGI. But I guess maybe
let's just kind of start with like the vibe check
I guess. Marina, have you had a chance to play
with the model yet? I'm curious about what you think
about it and if it feels like substantively very different
from what came before. I haven't played with it. I've
looked at a few digests about it. It does seem
like there's a lot of interest in making the more
complicated benchmarks be something that's handleable. It was interesting to
me that a couple of people reported that it's still
hallucinating and it still really likes to give answers rather
than say that it doesn't know the answers, although it's
not so sycophantic about it. But it still really likes
to give answers. Right. It's still making mistakes. Yeah. It
doesn't like to admit that it doesn't know something and
maybe that's saying something about this new set of models.
Yeah, for sure. And Gabe, quick question. Would you recommend
Gemini 3? Have you played around with it yet? Yeah,
I've played just a little bit with it this morning
and I don't know. I think my take on this
is we're really starting to see sort of the ecosystem
moats evolving. Like I think this is a necessary step
for Google in their AI ecosystem to have a model
that is at par or better than all of their
competitors so that they can truly claim to be running
ahead in the front of offerings here. But what really
struck me about the announcement was that they actually took
a swing at a piece of differentiation because frankly a
really great model is not that differentiated anymore. For what
I want to use a model for, and what most people want to use a
model for, we don't need something better than what we've
already got. I thought the thing about the Antigravity
editing platform was really interesting because it actually looked like
something novel that they're adding to their ecosystem that you
can't get anywhere else. And in particular, you know, the
idea of an agentic IDE is not at all new.
There are well known startups out there doing that. There
are open source ways to pull that together on your
own. The part that I think was novel here was
the intentional transition to framing it as a management of
agents problem and tying this all back to would I
recommend the model itself? I haven't played with this capability,
but a colleague of mine has already. And the idea
of being able to launch a fleet of delegate worker
agents that can all work on separate tasks in parallel
and you can manage them is something that I think
has some real compelling chops. So if the model can
actually hold up to that level of independence and you
know, parallel analysis, it could be a real breakthrough there
on a net new capability that you couldn't get with
any other ecosystem. So I'm really excited to try that
out and to see where it goes. But from a
pure model perspective, I'll play with it. Yeah, for sure.
Yeah, I think that's kind of one of the really
interesting things coming out of all this is like, you
know, I think we used to marvel even earlier this
year. We're like oh my God, new model, incredible benchmarks,
like look at all this progress. But here we are,
you know, sitting in November of 2025 and we're like
eh, it's the benchmarks, whatever, awesome. The science is amazing
and I can accomplish the same set of tasks that
I had before with a much, much smaller model running
on my laptop. So Merve, do you want to talk
a little bit about Antigravity? I did think that this
was sort of like the big interesting differentiator on the
announcement was to say, hey, you know, we acquired Windsurf,
we're going to do kind of our ide and it's
going to be an agentic ide. What's your take? I
mean, where is this all going? And I guess if
you want to give our listeners an intuition of like
why Google is even investing in that kind of differentiation,
I think Gabe alluded a little bit to it, right.
The ecosystem play. But I think this, I haven't played
with Antigravity. I played with the model and I'll tell
or share my experience. But I think they're aiming for
the advanced tool use so the whole agentic applications and
making tool calls more robust and also increase the modalities.
Right. Like they're claiming you can do editor, terminal, web
browser and like many different execution modes. And this means
you can plan code, execute, verify different tasks more autonomously.
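The plan, execute, verify loop described here can be sketched as a small control structure. This is an illustrative sketch only; the `plan`, `execute`, and `verify` functions below are toy stand-ins invented for the example, not Antigravity's actual machinery:

```python
# Illustrative plan -> execute -> verify loop, loosely modeled on the
# agentic workflow described above. All functions are toy stand-ins.
def plan(goal: str) -> list[str]:
    # A real agent would ask a model to decompose the goal into steps.
    return [f"{goal}: step {i}" for i in (1, 2, 3)]

def execute(step: str) -> str:
    # A real agent would run code, a terminal command, or a browser action.
    return f"done({step})"

def verify(result: str) -> bool:
    # A real agent would check tests, screenshots, or a model's judgment.
    return result.startswith("done(")

def run(goal: str, max_retries: int = 2) -> list[str]:
    results = []
    for step in plan(goal):
        for _ in range(max_retries + 1):
            result = execute(step)
            if verify(result):  # only keep results that pass verification
                results.append(result)
                break
        else:
            raise RuntimeError(f"step failed after retries: {step}")
    return results

print(run("build dashboard"))
```

The retry-until-verified inner loop is the part that makes such agents feel more autonomous: failures are caught and retried instead of surfacing to the user.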
And I think agents in Antigravity will also generate
artifacts. That's what they claim. So this can be task lists,
plans, screenshots, I think browser recordings, which help
when you want to take agents to the next stage beyond
academic benchmarks and put them out there in reality.
So it's quite promising. But I did play with
the model, I didn't play with the Antigravity platform yet.
So, as Marina said, it's not hallucinating fully, but just
like other big models, it's really, really good with
the initial prompt, like the way you describe your first
set of things. And they have a build section. Right.
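For a sense of what such a build flow produces, a small workout dashboard of the kind described in this episode rests on a simple data layer. Here is a hedged sketch in plain Python; the field names and numbers are invented for illustration, and in a Streamlit app these summary values would feed calls like `st.metric`:

```python
# Toy data layer for a workout-tracking dashboard like the one described
# in this episode. All fields and numbers are invented for illustration.
from dataclasses import dataclass

@dataclass
class Session:
    day: str
    minutes: int
    calories: int

def summarize(sessions: list[Session]) -> dict[str, float]:
    """Aggregate the stats a dashboard would display."""
    total_min = sum(s.minutes for s in sessions)
    return {
        "sessions": len(sessions),
        "total_minutes": total_min,
        "avg_minutes": total_min / len(sessions) if sessions else 0.0,
        "total_calories": sum(s.calories for s in sessions),
    }

log = [Session("Mon", 30, 220), Session("Wed", 45, 340)]
print(summarize(log))
# In Streamlit, rendering is one more layer, e.g.:
#   st.metric("Total minutes", summarize(log)["total_minutes"])
```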
Like you can build artifacts, you can build UI elements.
So I asked Gemini to create me a workout plan
and an interactive dashboard to track my workout sessions. And
I told my weight, height, age to customize for myself.
And the very first UI was really nice. Literally,
while I was multitasking in a meeting, I was able
to build it locally in Streamlit in
less than two minutes. Then I asked it to add
some more, you know, personalized pictures, like motivation pictures, customize
with my name. And then I realized it added a
reminder section saying that I should eat high nutrition food
after the workout to grow. To grow. I'm a mother
with two children. If this was for my kids, I
think it would make more sense. But for me, I'm
not growing. It totally messed that part up. It already
knew my age, so it knew I'm way
past growing age. So again, the overall performance I
think on benchmarks are quite impressive and the claims they
are making. And I think this excited me because I
think this is the largest capability jump we've seen in
a few months. Right. Like, it's nice, but it has
some flaws which I personally experienced in my first UI
dashboard that I built with it. Marina, hopefully you
can help me square this circle here because
I think it's really interesting that your first reaction to
this model is still hallucinating. And we kind of have
this very funny sort of split screen experience of these
models where they are just kind of performing better and
better against these benchmarks, yet our kind of everyday
experience doesn't match that. Is it true that the more powerful
models just simply hallucinate more? I'm just kind
of curious. They're seemingly really strong
in some things, but remain kind of like amazingly weak
in other domains. Right. So I'm going to agree with
what Gabe said, which is, well, for
plenty of tasks, I don't need this really large model
and I can do a better thing with a smaller
model, because hopefully we're finally getting beyond this
idea that we're going to have one model to rule
them all. This never should have been a goal. It's
never going to work for an LLM with the architecture
that we have. What you want is a suite. Either
you give them different instructions or different preferences or whatever.
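The "suite" idea can be made concrete with a trivial router: instead of one model for everything, a dispatcher sends each task type to a model configured with its own instructions. The model names, task labels, and instructions below are all invented for illustration:

```python
# Toy task router: one entry point, specialized models (each with its own
# instructions) behind it. All names here are invented for illustration.
ROUTES = {
    "code": ("small-code-model", "Answer with working code."),
    "summarize": ("fast-summarizer", "Be brief and factual."),
    "research": ("big-reasoning-model", "Think step by step; cite sources."),
}

def route(task_type: str) -> tuple[str, str]:
    # Fall back to a generalist when no specialist is registered.
    return ROUTES.get(task_type, ("generalist-model", "Be helpful."))

print(route("code"))
print(route("poetry"))
```

The point is that "one model to rule them all" becomes a routing policy, not an architectural commitment.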
If you wanted to do better on these very
complicated benchmarks which really want the model to think through
a lot of things, generate a lot of thoughts and
attempts and whatever. Then you don't want it to be
reticent and sit there saying, I don't know anything. I'm
going to twiddle my thumbs because I don't have a
citation to offer you. These are different tasks. This is
the same thing as if you were to have the
statistical tension between precision and recall. You're going to do
better in one, you're going to do worse in the
other. This is going to be a consistent thing. Use
them for different things. And the fact that the more
interesting thing now is automation, the more interesting thing now
is really the multimodality. Yeah. Lean into that because that
really is more interesting. Why would having one model always
do the best job at giving you the right
information? Review your Karl Marx, review your division of
labor. Normal civilization runs on people working together. That's
going to be more effective. It always will be. Yeah,
yeah. And I think it's kind of funny where this
is all resolving too, because I agree, one of the
things that was like, oh man, this is going to
change everything was one model to rule them all. But
if we end up in a world where, I don't
know, we have very specialized agents for very specialized kinds
of tasks, are we back to app land again? Are we
back to software again? In some ways we're kind of
reconstructing applications like specialized software again, which I guess is
like everything old is new again in some sense. I
think it's a healthy tension. I mean, I think frankly,
one of the reasons we're in the AI moment. We
are, is that the pendulum really swung with the introduction
of Transformers and suddenly you didn't need a complicated suite
of software to get 80% of the solution. And that
was a real game changer. Right. I think what we're
seeing here is, I mean, that the general purpose populace
is still going to use one chat window. Right. They're
going to jump to a chat window and they're going
to enter some things. Now if that chat window becomes
an increasingly complicated software machine under the hood, the user
doesn't need to know. So the interface change of one
model to rule them all I think is sticky. I
don't think that's going anywhere but the actual implementation behind
the scenes. I think we've already seen that with the
GPT 5 series. I think we will almost certainly see
it with other Frontier model offerings, let's call them that,
and I think we'll see the open equivalent of a
software stack emerging that allows you to ensemble models for
specific parts of your workflow and specific elements of how
you want this all to work together, exposing that nice
single entry point that users want to interact with. So
I think it's a healthy tension. The nice thing here
is that as a software architect you get an abstract interface which
is your chat box and then you get to implement
it however you want. So we'll iterate on that implementation
because we're software folks and we'd like to do that.
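The "abstract interface, swap the implementation" point maps directly onto a standard software pattern. A minimal sketch, with class and method names invented for illustration:

```python
# The chat box as an abstract interface; the implementation behind it can
# be a single model today and an ensemble tomorrow. Names are illustrative.
from abc import ABC, abstractmethod

class ChatBackend(ABC):
    @abstractmethod
    def reply(self, message: str) -> str: ...

class SingleModel(ChatBackend):
    def reply(self, message: str) -> str:
        return f"[one-model] {message}"

class Ensemble(ChatBackend):
    def __init__(self, backends: list[ChatBackend]):
        self.backends = backends

    def reply(self, message: str) -> str:
        # Trivial policy: first backend wins; a real system would route.
        return self.backends[0].reply(message)

def chat(backend: ChatBackend, message: str) -> str:
    # The user-facing entry point never changes when the backend does.
    return backend.reply(message)

print(chat(SingleModel(), "hello"))
print(chat(Ensemble([SingleModel()]), "hello"))
```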
But I think we'll swing back and forth a little
bit on the complexity behind the scenes. Yeah, for sure.
I think it'll be so funny if we end up
in a few years people being like, well rather than
a chat window, what if we had like a desktop
with like icons you can click on and it's just
like we'll be back to where we were. So. Yep.
Two very interesting announcements out of IBM recently about agents,
really in the last few months, and I think particularly
on Gabe's last comment. I understand one of the announcements
is around a project called Kuga (CUGA),
which is billed as an enterprise-ready generalist agent. So
do you want to talk a little bit about kind
of like what the team was working on and thinking
about for this, for this launch? Sure. Happy to. As
you said, this has been my life since the launch
and it's been quite a ride. I think we got good feedback
as well. We're trying to build an enterprise-ready generalist agent,
and it's not an easy, easy task to take on.
But where we started is like everybody else who starts
to build enterprise ready agents, you start from some simple
traditional ways, build a domain specific agent. So maybe ReAct
or CodeAct, the simple pattern that you take, and then you
start evolving it to oh, my task is too complex
and my single agent cannot handle this. So let me
go and build a task decomposer on top. Oh, this
is now becoming this multi agent architecture where you have
the layer up top that picks the right sub
agent to do the work. It's, I mean, classical engineering design
principles, because it's easier to distribute to the sub agents.
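The progression described here, a decomposer plus a supervisory layer that picks the right sub-agent, can be sketched as follows. The keyword-matching dispatch and agent names are invented for the example; Kuga's real architecture is considerably more sophisticated:

```python
# Toy supervisor: decompose a task, dispatch each piece to a sub-agent.
# The keyword dispatch and agent names are invented for illustration.
def decompose(task: str) -> list[str]:
    # A real decomposer would use a model; here we just split on " and ".
    return [t.strip() for t in task.split(" and ")]

SUB_AGENTS = {
    "search": lambda t: f"search-agent handled: {t}",
    "email": lambda t: f"email-agent handled: {t}",
}

def supervisor(task: str) -> list[str]:
    results = []
    for sub in decompose(task):
        # Pick the first sub-agent whose keyword appears in the subtask.
        agent = next(
            (fn for kw, fn in SUB_AGENTS.items() if kw in sub),
            lambda t: f"fallback-agent handled: {t}",
        )
        results.append(agent(sub))
    return results

print(supervisor("search the docs and email the summary"))
```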
We believe it's going to work faster. And then what
we realized is we're not the only one that does
this. Like my peer groups in IBM Research, when they
build sophisticated agents they go through this experience as well.
Let's start simple and then it all of a sudden
becomes very complex. Exactly. And then we stepped back
and we thought like maybe we can create a generalized
version of this where people can jumpstart using the Kuga
architecture rather than building all these things by themselves. Right.
So we can give Kuga, which is this multi agent
supervisory layer already embedded and a multi agent architecture and
people can configure it for their own domain and users.
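The "bring your own domain" configuration step might look something like this in spirit. To be clear, this is a hypothetical configuration shape invented for illustration, not Kuga's actual schema:

```python
# Hypothetical domain configuration for a generalist agent. This is an
# illustrative shape only, NOT Kuga's real configuration schema.
domain_config = {
    "domain": "it-support",
    "tools": ["ticket_lookup", "kb_search", "reset_password"],
    "benchmarks": ["custom-itsm-eval"],
    "policies": {"max_steps": 20, "require_approval": ["reset_password"]},
}

def validate(config: dict) -> bool:
    # Minimal sanity check before deploying: required keys, nonempty tools.
    required = {"domain", "tools", "benchmarks", "policies"}
    return required <= config.keys() and bool(config["tools"])

print(validate(domain_config))
```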
So rather than the traditional way, like build a domain specific
agent, evolve it, do some custom benchmarks, you
bring your own domain, onboard your own tools, and configure
your own domain to do your own benchmarks and then
deploy. So that's our vision. It's open, it's out in
the open now for people to try and give us
feedback and see if it works for their domain. So
we're very excited that we launched in the open so
we can capture if what we experimented in research actually
can be mimicked in real-world application use. Yeah,
that's really exciting. And I think one of the things
that we've been watching really closely here at MOE is
what I love about the kind of agent competition world
is right now we're very much in the world of
norm setting. We're doing it this way. We hope you
do it this way as well. And I think there's
various projects that are more or less successful at attempting
to build those standards. I think what's really intriguing, and
Gabe, I'm curious if you want to talk about this
in the context of Kuga. It seems, what's
really intriguing, Merve, if I'm hearing you right, is that everybody
starts by building an agent and they all discover exactly
the same problems over and over again. And everybody's going
through that process of rediscovery right now. I guess. Gabe,
that's pretty promising from the point of view of. Okay,
let me just shortcut this. Here's a standard framework. Yeah,
I've been also thinking a lot about this, having a
lot of conversations with different teams building different components. And
one of the things that really seems to be true
is that there are emerging slots for an abstract architecture
for agents in open source and presumably in closed source.
But we don't know how those tools are implemented necessarily.
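One way to read the "slots" idea: the architecture defines roles as interfaces, and open or closed implementations fill them interchangeably. A hedged sketch using Python protocols; the role names follow the discussion, but the code shape is invented:

```python
# "Slots" in an abstract agent architecture as structural interfaces.
# Role names follow the discussion; the code shape is illustrative only.
from typing import Protocol

class GeneralistAgent(Protocol):
    def solve(self, task: str) -> str: ...

class ToolManager(Protocol):
    def list_tools(self) -> list[str]: ...

class OpenSourceAgent:
    """One possible implementation filling the generalist-agent slot."""
    def solve(self, task: str) -> str:
        return f"solved: {task}"

class StaticTools:
    """One possible implementation filling the tool-manager slot."""
    def __init__(self, tools: list[str]):
        self._tools = tools

    def list_tools(self) -> list[str]:
        return list(self._tools)

def run_stack(agent: GeneralistAgent, tools: ToolManager, task: str) -> str:
    # The stack depends only on the slots, never on concrete classes.
    return agent.solve(f"{task} (tools: {', '.join(tools.list_tools())})")

print(run_stack(OpenSourceAgent(), StaticTools(["search"]), "triage bug"))
```

Because the protocols are structural, a permissively licensed open implementation and a closed one can fill the same slot without either knowing about the other.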
The generalist agent is absolutely one of those slots. And
having an open offering that's configurable and permissively licensed is
a really awesome place for people to start collaborating and
building on top of this. Tool management is another big
piece of this and it just seems to be coming
up over and over again that this sort of emergent
architecture is there and to the point about refining and
iterating on the actual agentic architecture itself. The analogy that
I keep coming to when I talk about this stuff
is if I asked anybody out there working at a,
well, really any company, please go build me a REST
API server for X, Y and Z. I wouldn't have
to tell them the architecture of that thing. I wouldn't
have to tell them what programming language to use. I
wouldn't have to tell them like it's just kind of
a well established pattern that everybody knows how to do
if they've ever touched cloud software. Agents aren't there yet,
but as many people have said, you know, 2025, the
year of the agent. I think by the end of
2025 we're going to be close to actually hitting that
point where we can just say, hey, build an agent
for this. And everybody just knows what you mean. And
as you exactly described it, Merve, I think the decomposition
is exactly that step from I got it running in
Flask on my local machine with HTTP to now I've
got a server that has middleware for authentication and serves
TLS and can actually be horizontally replicated. Those are the
steps you take when you're building, building a microservice after
you get your demo app running. The same thing you're
describing with Kuga is exactly what people are hitting after
they get their first ReAct agent off the ground. You
mentioned like, oh, here is the agent, go use it.
We're really trying not to push people to, like, okay,
Kuga is this and you have to use
it. It's also flexible. The configuration piece makes it, I
think easier for people to say, okay, I need this, but
I can configure it this way. And also what we
did is, which is I think maybe it's a good
time to introduce ALTK, which is the Agent Lifecycle
Development Toolkit that we also released in the open. We
componentize Kuga and we build different components to support Kuga's
different capabilities like memory guardrails and other things that makes
Kuga function in the real world. But some people may
not want to start from Kuga. They may still have
their own sophisticated agent that they built and they don't
want to move it to Kuga. So they can reuse
these components under this Agent Lifecycle toolkit and if they
want the memory piece and they can take it and
apply it to their agent. And this is, again, like democratizing
and not really pushing people to use it, like: this is what
I have, it works, use it. No,
it's a flexible design. You can take the different components and
apply to your current agentic implementation if you want to
improve certain aspects of your agent. Marina, I want to
go back to kind of the comment that you made
a little bit earlier when we were talking about Gemini
3 and kind of this movement to sort of like
more specialized agents. Over time it kind of strikes me
that we will almost reproduce human org structures in agents
because generally this agent is kind of like
the middle manager. Right. Its role is to manage other
agents. Do you think that's kind of where we're headed
ultimately is like we're moving away from one agent to
rule them all. But there will still be these kind
of generalist agents and really their role will be sort
of that middle manager, I guess, in the org chart?
From biology to software, you have this combination of hubs
and spokes. There's a real reason that you end up
settling. And maybe it's going to be a different number
of spokes, more hubs, fewer hubs. But sort of like
as you figure out the way to solve a particular
problem, that's still the shape you settle into. You need
some specialists and you need somebody doing the managing and
the planning. So yeah, what's interesting about this era is
that we are able to go faster, further than we
thought. But if you take that 10,000 foot view, it's
still a: all right, I've got a task I maybe
can do. I'll think through the spokes part a lot faster.
But you still somewhere in there need a hub where
you say, okay, this is what you do next. This
is how you know that you're done. This is what
you try. It's very natural and very correct. The technology
to get us to go faster is great and it's
very exciting. But yeah, this is the normal pattern of
problem solving. Yeah. Like the future is kind of like
figuring out how you staff your project with different kinds
of agents, it almost feels like. And maybe a couple
of people in there just to keep an eye on things,
some actual humans in there. So I guess maybe
a last point, Merve, where are you headed next with
all this? I know you had said this is your
life since the launch, but where does Kuga, where does
Altk go next? So just like Gemini 3 benchmark results.
So we started with Kuga and then we went out
there and found the most challenging and most representative benchmark
that we can go after, which was WebArena and AppWorld.
We were number one on both of them for a long time, and we kept thinking, let's keep our position as number one. But it's very different to keep our position on benchmarks versus putting it out there and hearing directly from users where it breaks. Latency, for example, is a problem right now. When we built CUGA, we really focused on accuracy and on how well CUGA follows the task. But latency turned out to be one of the requirements that came from real users when we launched publicly, who told us: okay, this is too slow for me to use.
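Latency complaints like this are easy to start measuring: wrap the agent call and record wall-clock time per request. A generic sketch, not CUGA's actual instrumentation — the wrapper and the stand-in agent are my own illustration:

```python
# Minimal latency instrumentation for an agent call: record wall-clock
# time per request, then summarize with percentiles.
import time
from statistics import quantiles

def timed(fn, log):
    """Wrap fn so every call appends its duration (seconds) to log."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            log.append(time.perf_counter() - start)
    return wrapper

latencies: list = []
agent = timed(lambda prompt: prompt.upper(), latencies)  # stand-in for a real agent

for p in ["book a flight", "summarize report", "file a ticket"]:
    agent(p)

p50 = quantiles(latencies, n=100)[49]
p95 = quantiles(latencies, n=100)[94]
print(f"n={len(latencies)} p50={p50:.6f}s p95={p95:.6f}s")
```

Tracking p95 rather than the mean is the usual choice here, since user complaints about slowness come from the tail.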
So we have a bunch of things we've captured from the community that we'd like to incorporate. But also, when I mentioned ALTK — the core components that help agents, or agent builders, boost their agent's performance — there are a couple of things we're working on actively. One of them is the memory I mentioned. And I'm not talking about storage and data structures; I'm talking about what you can make out of that memory: what you want to remember, what you want to forget, and what middle ground you want to keep learning from. Because some tool combinations may never work, and if you've already tried them and that trajectory is saved in memory, can you bring it up and do some self-learning for CUGA or other agents? The other one is consistency. This is extremely important, and there isn't a single definition of consistency in the literature — people sometimes define it as repeatability — but in the enterprise setting, and also in consumer setups, it matters. You don't want your agent to do something one way one day and a very different way another day. So how can you bring this consistency to real-world agents, so that they're consistent in their behavior within their own world and don't throw out really ridiculous answers one day when you ask the same question? So these are the two main topics we're working on now, and we're excited that we're making progress toward getting real feedback from the community and advancing CUGA's capabilities with these components. That's great. Well, we'll have to track the project. How do people find out more about it if they want to keep up with your work?
Sure. There's cuga.dev — that's the CUGA website, where you can get to the GitHub repo, the blog posts, and other things. And we have altk.ai, which covers the individual components that constitute ALTK and help CUGA perform better. So if they go to those two websites, they're all good. Nice. Yeah, those are solid TLDs right there.
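Merve's consistency point — the same question shouldn't get wildly different behavior on different days — suggests a simple repeatability probe: run the agent several times on one input and measure agreement. A sketch only; the normalization step and the agreement metric are my own assumptions, not ALTK's:

```python
# Repeatability probe: ask the same question N times and measure how
# often the (normalized) answers agree with the modal answer.
from collections import Counter

def normalize(answer: str) -> str:
    """Crude canonical form so trivial formatting differences don't count."""
    return " ".join(answer.lower().split())

def agreement_rate(agent, question: str, n: int = 5) -> float:
    """Fraction of runs matching the most common answer; 1.0 = perfectly repeatable."""
    answers = [normalize(agent(question)) for _ in range(n)]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / n

# A deterministic stand-in agent is perfectly consistent:
stable = lambda q: "Paris"
print(agreement_rate(stable, "Capital of France?"))  # 1.0
```

Against a real sampling-based agent, a rate well below 1.0 on factual questions would be exactly the day-to-day inconsistency described above.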
Well, I'm going to move us on to our next topic. This has been a recurring theme for us in 2025 — an interesting, ongoing set of discussions about how AI will impact the economy. Overall, I'd say the discussion has matured over the course of the last 12 months. At the beginning of the year we were still very much in "all the jobs are going to disappear because of AI" mode, and now we're getting more to "well, let's run an eval on that" — approaching it in a very machine-learning way. So OpenAI announced a benchmark they call GDPval. Essentially, what they're trying to do is say: we have all sorts of benchmarks for evaluating AI capabilities, but one knock against a lot of them is that they don't tend to be very realistic — it's not often that anyone's actual job is to solve complex math theory problems. What they do instead is curate a set of tasks from a number of actual professions and evaluate whether AI can produce outputs on par with a human expert. And they run this as a way of trying to assess what the effects of AI will be on the economy, particularly on these economically valuable tasks.
And there are some interesting results. The big headline is that even though it's an OpenAI benchmark, they discovered that Claude Opus 4.1 is the strongest performer on these tasks, and in some cases it's able to reach near parity with human experts. So I guess, Gabe, maybe I'll turn it to you. I'm curious what you read from these kinds of results. Are we still back where we were in December 2024 — oh God, AI is going to replace all the jobs? These are certainly really impressive results.
But how do you parse through them? Yeah, I mean, I think in professional settings the promise of AI has been: take away the stuff I don't want to do so I can spend more time on the stuff I do want to do. It's really easy to poke holes in benchmarks, because measuring things is really difficult, so I want to say up front that this is a really good stab at a new aspect of benchmarking, and the reliance on human experts is especially important in this space. The holes I saw immediately were that they're still doing this as basically one-shot artifact creation. I don't know about you guys, but most of the time I spend at my job doing things I don't want to do involves investigation, asynchronous this-that-and-the-other. In fact, when it comes time to create artifacts, that's the stuff I do want to do, right? Writing code is my happy place, even if I'm using an assistant for it. What's less happy is walking around trying to find the correct way to handle some little corner case on the Internet, or trying to look through a giant pile of corporate docs to figure out the right officially approved tool for a certain piece of my job. Every benchmark makes approximations of the space so that it can translate from a fuzzy human space into math — that's just the nature of benchmarking — and in this case they've made some approximations that, while valuable, still have some holes in them. The other part I found interesting is that they used human graders, at least as the gold standard for evaluating. Because even if you put aside the fact that these are canned problems — very well curated, but still canned — it's very hard to say "is this answer right?" At the end of the day you need a thumbs up or a thumbs down, or at best a value between 0.0 and 1.0, to say how well the thing did, and that gets much harder the more complex the task. So using a human grader is in one sense the right approach. But the whole reason we're in this GenAI boom is that we figured out ways to not have humans labeling data, and this sounds like we're back to humans evaluating data. They also created a proxy for this with another model that could approximate what their human graders did — but now you've got AI evaluating AI, and you're in a kind of recursive loop. So I think it's really interesting to see this try to tackle real-world problems. The other piece they didn't really articulate well: among the classes of problems a given profession has to tackle, is this tackling the hardest ones or the simplest ones? Responding to email, responding to Slack is not the most mentally challenging thing I do, but it's a lot of what I do all day, and I'm sure the same is true for a lawyer or a doctor or a nurse — anyone in a professional capacity spends a lot of time in the long tail of less mentally taxing work. So I'd be curious whether there's any way they evaluate these tasks on whether it's the low lift or the high lift being measured. Marina, you're smiling through Gabe's explication of GDPval — curious about your take.
What I'm hearing from Gabe is "better, but maybe not good enough," right? Assessing what is ultimately a very complex thing — what is a job? So yeah, I really found it interesting diving into this. First of all, props to whoever on their comms or marketing team thought several months ahead about what the write-up headlines were going to be, because the headline "AI can now do half of our jobs" is not what you're supposed to take from this — but, fantastic. How they chose the jobs — the fact that they actually went to the BLS occupation categories — and how they went about it: a very, very nice choice. So I like what they're trying to do very much. If you actually go and read through the data points, which they made available on Hugging Face — very good on transparency — it's mostly planning tasks, summarization tasks, the kinds of things LLMs are actually pretty good at. And reading between the lines, it seemed to me that they had a lot of different submissions and narrowed them down, and narrowed them down, until they had a set of maybe similar-looking tasks, even though it spanned a number of jobs — and we're still only talking about a couple hundred tasks that really got made. So that's the first point: there are only going to be so many, and I completely agree with Gabe that some tasks will be harder and some easier. When you look at the prompts themselves, as I read them, they looked very detailed, very refined — somebody with a very clear idea of what they want. So a lot of pre-work already goes in before you ask: hey, write me this summary, write me this schedule; I have prepared some files for you, some reference Excel sheets, some sites I need you to go to. There's a lot of pre-planning, and then this does the "all right, fine, put it all together" part. That probably still helps a good amount. And if anything, it gives an example to people who don't understand what this means — what kinds of jobs can the AI be decent at? Look at these: read these 200 tasks and see what kind of thing it's actually pretty good at. So I like that part of it. Now, as far as the evaluation goes — look, it's pairwise comparisons. Is it the most sophisticated, detailed thing you could dive into? No. If I had to guess, having done evaluations for years and years: they probably tried a variety of other things and got a lot of noise, because the artifact you get out of each of these tasks is going to be very detailed and very noisy, and very difficult even for a human to score — "this part is really better, this part is not that much better." So they finally just ended up going with a win rate. So I wouldn't really listen to the headlines.
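The pairwise win rate Marina mentions is simple mechanically: each AI deliverable is compared head to head against the human expert's, and you count how often the AI's is preferred. A sketch of the arithmetic; the judgment data is made up, and counting a tie as half a win is one common convention, not necessarily GDPval's exact formula:

```python
# Win rate from pairwise grader judgments: each judgment says whether
# the model's deliverable beat, tied, or lost to the human expert's.
def win_rate(judgments):
    """'win' = model preferred, 'tie' = no preference, 'loss' = human preferred.
    Ties count as half a win here -- one common convention."""
    score = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0
                for j in judgments)
    return score / len(judgments)

# Hypothetical grader output for eight tasks:
graded = ["win", "loss", "tie", "win", "loss", "loss", "tie", "win"]
print(win_rate(graded))  # 3 wins + 2 half-ties over 8 tasks -> 0.5
```

The appeal of a win rate is exactly what Marina describes: graders only have to pick a winner, not assign a fine-grained score to a long, noisy artifact.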
I would look with interest at the paper, especially the appendices, where they really go into the process of how they did the evaluations — and again, what the automatic grader does and doesn't capture versus the human graders. I'll add one more thing: these are all prompts. There's not a lot of breaking it down — exactly what Gabe said, and exactly what we were just talking about with Merve — into "first do this, then do this; here's an agentic plan; here's this kind of thing we can do." This is a prompt, and it primarily asks the model to present something akin to a plan that perhaps you would then execute downstream. So again, this is a very particular type of task. It may be interesting in the future to say: actually, let's throw multiple models at this, with multiple capabilities, maybe even specialized models, et cetera. Is evaluating that going to be exponentially harder? Yes — which is probably why their path to this benchmark was the way it was. It was very intelligently done: ready for headlines, ready for write-ups, ready for all of that.
But I do like the interest in real-world tasks and the lens this can hopefully put on, when we talk about AI taking our jobs, which jobs. I wish there were more write-ups that dove into the actual data; I find that more interesting than the final evaluation. Yeah, absolutely — that was a great analysis. Merve, maybe a final question for you. It strikes me that there's a really interesting tension in what Marina said, which is that you want the eval to measure real-world tasks, but it turns out the real world is really, really messy, and so methodologically it's just very hard to do a good eval here. Part of me thinks we're going to spend a lot of time over the next few years developing increasingly sophisticated evaluations to measure the economic impact of AI — but the economic impact of AI is also just going to happen to us. And I'm curious what you think about the value of this kind of exercise. There's a point of view that says we can spend a lot of time creating better and better proxies, but we're also in the middle of it right now, and the effect of AI on the economy is something we're just about to see. So I guess the question is: do you think these evals become more important with time, or less important, as a research area? Where should we be going with this kind of work? Well, when we were discussing this, it reminded me of a client conversation — the client is actually a friend of mine — last week in Istanbul. He is the CEO of one of the leading hospitals in Istanbul, and he told me it makes him very uncomfortable that each doctor has a medical secretary attending patient visits. This is one of those 44 occupations they include in GDPval, and it's exactly the profile of people who are ready to adopt LLMs, because for certain tasks this is a perfect use of an LLM: it will listen to the conversation and summarize it. Summarization is a very popular use case. And if you just take that task, fine — we can benchmark and evaluate it. But there are implications for other things, other added benefits that a benchmark would find almost impossible to measure. If you consider the other aspects of this particular example: you can derive insights from doctor-patient conversations. You can extract best practices — one doctor may be doing something that works better for certain diseases, or for a subset of patients. You can do better process optimization, like supply planning for a surgery, if you know what's happening in the conversations, or produce customized follow-up content. There are many different added benefits you can pursue, but each of them is very difficult to measure — to Marina's point, even non-human evaluations are not that easy, and now we're adding this human component. And that's just one occupation I gave as an example, but it's real: I had this conversation last week with a person looking to implement this in their hospital. So it comes down to where the real world is heading versus what and how we benchmark. When you implement these systems in the real world, you can also track the added benefits as ROI — how much better you are at supply management for your surgeries, and so forth — so you can have KPIs that help you understand the added benefit. But on paper, purely scientifically trying to map out the combination of things these tasks can lead to, and then benchmarking against all of it with data, whether human-annotated or AI-annotated, is going to be an extremely difficult and overwhelming exercise. That said, I do like this, because it's helping us get out of our academic mindset when we compare models, toward real-world examples and the places where industries are actually starting to adapt and change. Yeah, I love that. It's almost like the eval is also just a useful exercise in asking: how do we decompose this task anyway? What do we do with our days? That's an important part of it. Great — I'm going to move us on to our last topic of the day, which, now that I think of it, is weirdly related to what we were just talking about. The final news item is a really interesting story that was actually tackled in more depth on the Security Intelligence podcast.
Chris Hay, a frequent MoE panelist, was on there debriefing it. I'll give the summary of the story, and then I'd love to link it to what we were just discussing. Anthropic disclosed that they discovered an actor — which they believe to be a state actor — misusing Claude to launch a sophisticated cyberattack. They have a long and very interesting blog post breaking down what they discovered about the attack. But the thing that really stood out to me was this quote, which I'll just read: "The threat actor was able to use AI to perform 80 to 90% of the campaign, with human intervention required only sporadically, perhaps four to six critical decision points per hacking campaign." As far as I know, this is the first real dead-to-rights example of what people have been theorizing as "vibe hacking" — the idea that all of this agentic technology, which at some point gets used for good stuff, also gets used for bad stuff. And to link it to what we were talking about earlier — Merve, maybe I'll kick it back to you as our agent expert here. It does feel like, if we had a GDPval for evil tasks, AI really is making a real impact on cyberattack operations right now. I guess the question to ask is: do you think agents are going to benefit illegitimate use cases faster than legitimate ones? Well, even though I work on agentic systems, I'm not a security expert. But I did chat about this with one of my coworkers, Ian Molloy, who is leading our agent security work, and in his own words: it will be extremely difficult, maybe impossible, to prevent malicious use of these agents while preserving their legitimate use, because they're designed that way.
Right? We want them to be flexible, we want them to follow instructions, we want them to do what we tell them to do — including, unfortunately, when we tell them malicious things. With the current alignment approaches, I think fully preventing that is going to be impossible. Security researchers have been predicting this since the beginning of the LLM era, even before agents, because you can also mount these attacks on models without agents. But what I love is that Anthropic had, I think, the right telemetry and monitoring to be able to talk about this and show what happened. We can instrument our systems with observability layers and monitoring capabilities that transparently show what is happening in the system, so you can revert, or talk about it and learn from what happened. To me, we can't fully control these things, but we can build additional components for security — authentication, among other things — and in enterprise settings that might even be easier than in broad consumer use, because you can add controls when you implement agents in an enterprise, whereas in broad use the agent just does what the user says. But if we instrument these systems with such components — observability and robust telemetry and monitoring among them — that's what I think can bring trust to users, so they feel comfortable using these very powerful models, or small models, whatever they use. Gabe, I was joking with a friend that there's probably a team at OpenAI that is relieved they weren't involved in this attack, but also kind of jealous that the attackers chose Claude over OpenAI. I have an interesting question here, which is: why not use open source? It feels incredibly risky for an actor of this kind to go with a model that's provided through the cloud and monitored the way Claude is monitored. Do you have a hypothesis on why that's the case, versus saying, we're going to run our own on-premise solution to run this attack? My answer is that they probably are — this is just the one that got caught. And to your point about OpenAI feeling bummed that they weren't part of it: maybe they just don't have the telemetry to notice. We've seen one example of this exposed to the sunlight, but that does not mean it's the only one out there — in fact, I strongly suspect, and they mention in the article, that it is not. So I do think the frontier models are generally closed, and even now that we have extremely capable open models, like the recent Kimi K2 Thinking and MiniMax models, they're extremely hard to run at scale. Depending on the division of labor, if you've got a team of experts on the cyber-hacking side of things, they probably aren't experts at running the expensive GPU rigs these extremely large models require. What you get by running all of this through a frontier model is the hands-off nature of it, versus trying to put it together piecemeal. The interesting thing, too, is that in the cybersecurity domain the attackers always have the advantage, unless they're targeting one very specific needle in the haystack. If their goal is just to go steal what's available, there's no penalty for screwing up — you just keep trying. So this same set of actors may very well also be banging on an open-source version, and one using a GPT model — just try them all, why not? What's the downside, other than it costing money? But the defenders have the much more difficult task of catching everything that slips through; you have to have ironclad practices to avoid ending up in the spotlight of these things, so screwing up has huge penalties. The thing that did strike me as interesting in all of this — to take the defender view — is that even though the agent was doing a lot of the decision making, the scripting, and the basics of how to craft the attack, it still all exploited standard vulnerabilities, right?
It seemed to me that most of what they were doing was looking for systems running exposed, vulnerable versions of software with known exploits, and then exploiting them.
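The defender-side counterpart — knowing whether you're running a version with a known exploit — is mostly bookkeeping. A toy inventory check; the service names and vulnerable-version data here are made up for illustration, and real tooling would pull live CVE feeds:

```python
# Toy patch audit: flag deployed services whose versions appear in a
# (hypothetical) known-vulnerable list.
KNOWN_VULNERABLE = {            # service -> vulnerable versions (made up)
    "webfrontd": {"2.1.0", "2.1.1"},
    "authsvc":   {"0.9.4"},
}

deployed = {"webfrontd": "2.1.1", "authsvc": "1.0.0", "cached": "5.2"}

def audit(deployed, vuln_db):
    """Return the services currently running a version on the vulnerable list."""
    return sorted(
        name for name, version in deployed.items()
        if version in vuln_db.get(name, set())
    )

print(audit(deployed, KNOWN_VULNERABLE))  # ['webfrontd']
```

An empty result from a check like this is exactly the "do the basics" posture the panel keeps coming back to.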
So if anything, this just puts a finer point on enterprises needing to stay up to date with their CVE patch fixes and all the best security practices that have been hammered into us — really do them, do them for real, like someone is going to find the gap if you don't. So on the defensive side there's the AI element of the defense — how in the world do we catch these things, and what do we put into our AI systems as we build them to make sure they can't be exploited? — and then there's just good old-fashioned cybersecurity: patch your software, people, do it for real or you're going to get hacked. So I think it's an all-of-the-above strategy on the defense side, but if anything, this makes it more urgent that the patch-fix timeline tightens up. I love that we're landing in a place where, at the end of the day, these are not very complex tasks being carried out. Marina, I'll give you the last word if you have any hot takes before we close the episode today.
Nah, I agree with Gabe. Again, I'm not a cybersecurity expert, but it seemed like the story here wasn't creativity — it was scale. It was just trying the same well-known things, but a lot faster than a human could try them. One thing I did like was that the Anthropic team slipped in that they actually used Claude to analyze the logs of what was going on. So something that comes to mind: should we maybe be thinking that the models themselves know what they would use if they were turned to evil, so they're more likely to catch themselves than a human would be? And different models have different biases, so maybe you could have OpenAI models check the Anthropic models, and vice versa. But at the end of the day, I agree with Gabe: just do the basics, because now your basics can be broken a lot faster. So you really need to do this.
You need to get the basics right — please, please. Well, with that bit of very good advice, we'll close the episode today. Marina, Gabe, Merve, thank you for joining us for the show today. That's all the time we have, and thanks for joining, all you listeners. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we'll see you all next week on Mixture of Experts.