Claude 3.5, Text-to‑SQL Benchmark, AI Content
Key Points
- The episode introduces three main AI industry updates: the launch of Claude 3.5 Sonnet, the new BIRD bench text‑to‑SQL benchmark, and the current state and future of AI‑generated content.
- Hosts and guests debate how quickly enterprise clients can adopt the rapid stream of new models, questioning whether they constantly update APIs or stick with existing solutions despite frequent leaderboard churn.
- Claude 3.5 Sonnet is highlighted for topping major leaderboards, but the conversation notes uncertainty about its concrete ROI and use‑case suitability for most clients.
- The BIRD benchmark reveals which models excel or lag in translating natural language queries to SQL, underscoring both the promise of AI‑assisted analytics and the persistent data‑quality challenges.
- The segment on AI‑generated content examines what’s working today, what falls short, and the desired improvements, while humorously noting the difficulty of assessing “funny” outputs compared to more straightforward technical metrics.
# Claude 3.5, Text-to‑SQL Benchmark, AI Content

**Source:** [https://www.youtube.com/watch?v=FjppatnFxsE](https://www.youtube.com/watch?v=FjppatnFxsE)
**Duration:** 00:38:53

## Sections

- [00:00:00](https://www.youtube.com/watch?v=FjppatnFxsE&t=0s) **AI Podcast Kickoff: Claude, Benchmarks, Content** - In this opening segment of the "Mixture of Experts" show, host Brian Casey outlines the episode’s three focus areas: Claude 3.5 Sonnet, a new text‑to‑SQL benchmark, and the state of AI‑generated content, while introducing the expert panelists.

## Full Transcript
welcome everyone to this week's mixture
of experts I'm your host Brian Casey
every week we cover the top stories from
around the industry in artificial
intelligence today we're excited to
bring you three of the top stories of
across the industry we'll be kicking
things off today with Claude 3.5 Sonnet
so even though I have this beautiful
model that does this really well and I'm
super excited that finally we're able to
do this I couldn't see the actual use
case in the ROI to recommend that to one
of my clients second on the agenda today
we'll talk about the new BIRD bench
text-to-SQL benchmark we'll talk about which
models are doing well which ones aren't
the keys to success and what the future
might hold for AI-driven analysts in
the enterprise we have models that are
pretty good at code especially with a
human in the loop they can you know
already add value today um but there's a
lot of challenge with data third on the
agenda today we'll talk about AI content
the present the future what's working
well right now what's not and what we're
hoping to see so most of the things
we've been talking about here are a
little bit harder to understand where
whereas you see a picture of shrimp Jesus
and it's funny and you're like all right
that was
funny joining us this week to cover all
the most important topics in AI we have
Shobhit Varshney VP and Senior Partner gen AI
in the Americas we have Marina Danilevsky
senior research scientist and we have
Michael Glass an AI
[Music]
researcher we've been debating renaming
the show from Mixture of Experts
to America's Next Top Model uh because
it seems that every week there is an
announcement of yet another model that
is on top of yet another leaderboard
uh and today we're going to talk about
two of them we'll talk about Claude
we'll talk about the BIRD SQL
leaderboard we have some other topics
but we're going to jump right in um and
talk about the release of Claude 3.5
sonnet uh which recently came out and is
topping a lot of the uh the big leader
boards and to kind of kick things off
Shobhit wanted to turn it over to you
and just you know
one of the things that I noticed is that
one of these things happens like every
week and I'm just curious like how
Enterprise clients are ingesting this
stuff are they you know waiting on pins
and needles being like when's the next
model coming just so I can update all my
API calls to a new one or are they like
H I just I just did this like do I need
another one like how does this stuff
land for them and like are they actually
able to ingest the pace at which things
are coming out right now Brian what a
world we live in man like from GPT-4o
and now with Claude 3.5
just the pace of innovation and what my
10-year-old kid has access to is just
insane like just we've got to wait and
appreciate how far AI has come so
quickly so I'm big huge fan of the big
models that we're seeing and the
capabilities are just absolutely stunning
from an Enterprise perspective majority
of our clients are still users of AI so
they will go sign up for a large SAS
platform an Oracle or a Copilot with
Azure and so and so forth and they'll be
consuming AI through that particular
vendor in the SAS format in which case
now now the vendor is choosing Which
models to use so so for there's a small
subset of teams that are actually
building their own models or they are
configuring their own LLM apps end to end
kind of made some choices on going to
bring in so in this case AWS investing 4
billion plus in anthropic gives people
access to 3.5 so if you have an
Enterprise that has already built entire
end to end contract analytics on Azure
they are not going to be able to use the
3.5 goodness that comes from Claude
continue to use their GPT-4o's and 3.5s
and so on so forth so there's a little bit
of in the Enterprise space the coolest
model is not what you really get access
to there's a lot of things around what
happens to the liability and
indemnification and does it really I've
spent enough energy in making it
understand my domain so Enterprises
don't quite switch models the way we do
I dropped my uh my ChatGPT Plus
and I switched over to Claude to
experiment more with it right so as a
individual user I'm going to keep
switching around for the best model and
no longer use my Gemini subscription as
well right so I may follow more than
what an Enterprise would that makes a lot
of sense and Marina Michael question for
both of you like the announcement I
thought was interesting because it was a
mix of improved capabilities across a
lot of the dimensions that were
accustomed to people talking about you
know just benchmarks Vision um but it
was also product announcements where
they were talking more about um about
artifacts and projects and actually like
in some ways show it to your point about
like people like ingesting SAS tools
they went from like Claude is like going
a little bit away from just like that
like standard chat interface to like
behaving a little bit more like a SAS
tool um in some of these areas so like
I'm curious if you all thought that the
more noteworthy piece with some of the
capabilities or actually just like where
they're nudging the product from you
know like a like you know the end result
of all technology B2B SAS on on some
level but if like which one of those
dimensions um that you found kind of
more interesting about the technology
and just like where it mean or what it
means in terms of like where things are
going well for me yeah the the business
part of it um maybe didn't land as as
strongly uh I was interested in the
the scientific progress angle um so you
know the thing with these closed models
you don't necessarily know what they did
that got the performance increase but
you do see the performance increase um
so it it's a it's a interesting mystery
what what's uh what's behind all of that
I think that speaks to maybe a very
slightly different work that Michael and
I do when it comes to the research side
he's certainly more on the science
theory side I end up swimming sometimes
more in the business application both of
which are really valid I actually was
interested in the the artifacts side and
their claims I haven't played around
with it yet myself but the claims of how
it would be used for potentially for for
work collaboration or how to create
things that are richer because I'm
really interested in how these models
actually end up being useful rather than
maybe how they work on particular
benchmarks there though I would like to
know the the science behind why is this
one according to anthropic going to be
better or more useful is it merely
because of the workflow ux that they're
enabling or there something to this
model versus other models and actually
that's something I'm really interested
to dive deeper into so so t on the just
the core capabilities though uh I do
feel that the the artifacts having a
window on the side where they're now
going to be enabling some team
collaboration that's going to go quite a
bit a long way today each data scientist
has their own space and we continue to
work in that so I'm pretty excited about
what we can do just I was working with
my 8-year-old daughter and we were
trying to figure out how to create a
game that she was playing on her iPad
say let's let's replicate that with
Claud and absolutely stunning how well
it was able to do that and we were able to
iterate through it and see things moving
on the right side and you could see that
that uh thing upgrading we have uh when
we look at benchmarks across these uh
different vendors as soon as the
announcement comes in we're like oh rah
champagne open the champagne it's such a
great Benchmark right but in the real
Enterprise world we we have to go deploy
that in our own use cases and see if
it's working and stuff right so I would
like to share one example um I was
recently at our in our Bangalore Center
in Bangalore we have our industry Labs
where we have set up different sections
to each industry cpg and utilities and
so on so forth in the utilities U Wing
we had actual dials of uh machines right
so you had these analog dials and
there's a whole graduation and the needle is
pointing towards the reading right so we
work closely with a lot of our utilities
and Manufacturing clients with Boston
Dynamics and whatnot so we get a bunch of
images coming off of those and the image
is of a dial that's at an angle it is
reflecting some sun and a lot of the
markings over time have actually rubbed
off so there'll be marking at one at 3
four five so the middle two is missing
and then most of these dials also have a
green marker that says hey this is the
safe zone for this particular dial
I've been very frustrated so far that
OpenAI's GPT-4o the best-in-class model
if I give it those images that I'm I'm
working with today was not able to go
look at the reading and I was pleasantly
surprised to see 3.5 Claude take that
image and was able to correctly annotate
and I the way I had asked the question
was figure out all the major ticks and
all the smaller ticks after that and
then figure out where the needle is and
then tell me if it's in the green zone
or not it absolutely nailed it and I was
very pleasantly surprised it didn't get
the green Zone part correct but
comparing Gemini 1.5 Pro to 4o from
GPT and then now Claude 3.5 I could see
that but then we started to do the math
on it my team does this when we are
doing millions of these images every day
for our clients and started to do math
our pipeline is a is a classic machine
learning YOLO-based model uh and we have
different versions we've done the whole
thing end to end we're using some
segment model from meta to figure out
what's where in the image and stuff the
accuracy that I'm getting there is more
consistent but the price point of
delivering that for this utility is a
fraction of what I'm getting from Claude
3.5 so even though I have this beautiful
model that does this really well and I'm
super excited that finally we're able to
do this I couldn't see the actual use
case in the ROI to recommend that to one
of my
clients what was the actual difference
was it a like a factor of 10 or a factor
of two so out of out of say I put
through about 20 images uh through it
and Claude got about 18 of them correct
that was pretty high and GPT-4 and Gemini
were actually not doing very
well on those particular images and
stuff GPT-4 was doing a bit better but
getting 18 out of 20 correct that's
quite a bit unfortunately even 18 is not
high enough for any of my clients right
they need to have very high accuracy if
you're talking about plants with like
power plants power Productions happening
you have to be pretty spot-on you can't
just have 18 out of 20 correct do our clients
have a sense of like which use
cases are close you know it's like
the we almost have capable enough models
to unlock you know these sorts of Di and
we're just waiting and like once we get
there like this thing's going to be able
uh we're going to be able to use these
things I know some of the emergence is
often times like unpredictable um and so
you don't actually know when capability
is going to show up um but do they have
like good feel for the things that are
are close versus the things that are
like you know way way farther away in
terms of you know real use cases um I
think that you're actually uh pointing
at something interesting so a lot of the
clients are still not quite at the point
where they know what it takes to trust
these systems even in something that's
text and not only not multimodal but
it's just a single modal uh there's a
feeling of yeah I play with it for a
little bit and it it seems to pass the
initial sniff test but I can still
consistently break it consistently break
it and the risks just continue to be too
high exactly what Shobhit just said of it
messing up one time having either a
viral negative PR moment or a lawsuit or
you know anything else of that so
because there's still this feeling of
actually I I don't get the sense that
there is that level of comfort that uh
where they're ready to you know
really completely put it into production
unless it's fully wrapped by something
else an amount of guard rails and amount
of anything that's able to catch it so
uh certainly not there yet at least from
from my particular experience I don't
know show but if you've seen something
different yeah so there are uh certain
use cases like uh code completion like
any code development that's the biggest
progress we've seen in the industry
from across all vendors we have done
some amazing work with our Granite code
models that are small tiny understand
the domain beautifully and they can help
with code completion endend life cycle
so far we've been most impressed with
the accuracy that we get out of code
completion especially for the more
popular popular languages right the
second has been around customer care
there are internal facing use cases in
Customer Care not external you don't
want uh to sell a car for a dollar to
end users you want to make make sure
that you're doing something in where
it's internal facing looking for
insights looking for transcribing uh
Speech and figuring out what's H
happening things of that nature creating
test data sets training data sets those
are working out really well agent facing
workflows so those two I would say we
are close enough where we have great
deployments and they're being done at
scale across a large set of AI vendors I
think the the the tip over point is when
it starts to look at data really well so
so far we're talking about text the big
Chasm it needs to cross is structured
data sets uh this is where we're looking
at your data is in sap or in Salesforce
oracles things of that nature and now
all of a sudden my in the middle of my
workflow there's unstructured documents
it's getting there where it can give you
lineage tell you where the answer came
from but when it comes to structured
data sets llms don't inherently
understand data really well so most of
the attempts that have been made have been can I go
from natural language into making an API
call or a SQL statement things of that
nature and then I'll go grab some data
that's the area that's going to unlock a
significant amount of productivity once
we understand how to pull in structured
data sets into our natural language
[Music]
conversation I wonder if that's a great
pivot to the the second topic which is
to talk a little bit about structured
data and even just the um the BIRD
SQL um benchmark uh do you want to
maybe like kick us off there and just
talk a little bit about um what that is
and why it's important yeah I think it
it goes directly to what Shobhit was saying
that you know we have models that are
pretty good at code especially with a
human in the loop they can you know
already add value today um but there's a
lot of challenge with data um so you can
take their strength uh to address that
weakness and create a text to SQL model
so that people can interact with their
data you know using natural language um
and but with a human in the loop you can
examine the the SQL statement uh ideally
with some explanation uh also from the
llm about what it's doing and why uh to
gain confidence that that is really
selecting the right results set I that's
that is the data I wanted to get out
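The workflow Michael describes, generating a SQL statement plus an explanation and letting a human approve both before anything runs, might look like this in miniature. Note that `generate_sql` is a hypothetical stand-in for a real LLM call, with a canned answer so the sketch runs offline, and the table is invented:

```python
import sqlite3

def generate_sql(question: str, schema: str) -> dict:
    """Stand-in for an LLM call; a real system would prompt a model with the
    schema and the question. The canned answer below is purely illustrative."""
    return {
        "sql": "SELECT region, COUNT(*) FROM customers GROUP BY region ORDER BY region",
        "explanation": "Counts rows in customers, grouped by the region column.",
    }

def answer_with_review(conn, question, schema, approve):
    draft = generate_sql(question, schema)
    # Human in the loop: the analyst sees both the SQL and the model's
    # explanation and must approve before the query touches the database.
    if not approve(draft["sql"], draft["explanation"]):
        raise RuntimeError("query rejected by reviewer")
    return conn.execute(draft["sql"]).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'AMER'), (3, 'EMEA');
""")
rows = answer_with_review(conn, "How many customers per region?",
                          "customers(id, region)", approve=lambda sql, why: True)
# rows == [('AMER', 1), ('EMEA', 2)]
```

The point of the `approve` callback is exactly the confidence-building step described above: the reviewer can reject a statement whose explanation does not match the question.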
that makes sense and I think there's
like is there's a few benchmarks in this
space that are actually there's
benchmarks for everything um and we
could probably also retitle this show
Just The Benchmark show um on some level
but I think um there's a few benchmarks
I want to say in the text-to-SQL
area and um I I believe BIRD is one of
them do you do you want to talk it all
about I read a little I read the paper
um that they published on that but do
you want to maybe just take a minute and
talk a little bit about uh the approach
that they're taking why it's maybe a
little bit different from from some of
the other benchmarks that are out there
in the space sure yeah I think uh for a
long time Spider um Spider 1.0 I
should clarify now in case 2.0 comes out
soon uh that that Benchmark was very
popular um drove a lot of research but
it reached a point where people were
getting 90 plus percent on it uh some
you know the hard questions weren't that
hard uh and didn't necessarily challenge
state-of-the-art llms anymore so with
bird bench you know the uh
organizers still made a variety of
difficulty levels but the the
challenging questions in BIRD are much
more challenging than the extra hard
questions in spider uh so it it really
moved uh the difficulty level um also
yeah the the databases in spider were
created uh all databases all database
schemas are created and and the ones in
spider were invented by database
students uh but the uh databases in
BIRD were gathered from real world um
databases and and sets of uh data sets
uh so they're considerably more messy um
which is going to be the case in in in
the real world you're going to have to
deal with databases that are not not
very clean not well normalized have uh
missing and messy data um so there's a
both of those challenges the
the difficulty and reality of the the
schemas uh as well as the difficulty of
the information need the complexity of
the information need there are two
different things that the bird SQL paper
did one was looking at the actual
accuracy of the SQL statements but then
they were also looking at the efficiency
of the SQL statement itself they are
different ways in which I could join
tables and get the answer out and they
in some of the comparisons one query
could take 42 seconds the other one
could take 5 Seconds right so there's
obviously an efficiency and an
effectiveness score with BIRD SQL do
you see that in other benchmarks as well
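The 42-seconds-versus-5-seconds contrast described above is easy to reproduce in miniature: two SQL statements that return the same rows can differ widely in execution cost, which is what an efficiency score captures on top of plain accuracy. A runnable sketch using only the standard library, with an invented toy schema:

```python
import sqlite3
import time

# Invented toy schema, purely to make the efficiency point reproducible.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, "EMEA" if i % 2 else "AMER") for i in range(200)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 200, float(i % 50)) for i in range(5000)])

# Two statements with identical result sets: a correlated subquery per
# customer versus a single join with aggregation.
slow_sql = """SELECT c.id,
                     (SELECT SUM(o.total) FROM orders o WHERE o.customer_id = c.id)
              FROM customers c WHERE c.region = 'EMEA'"""
fast_sql = """SELECT c.id, SUM(o.total)
              FROM customers c JOIN orders o ON o.customer_id = c.id
              WHERE c.region = 'EMEA' GROUP BY c.id"""

def run(sql):
    t0 = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return time.perf_counter() - t0, rows

t_slow, rows1 = run(slow_sql)
t_fast, rows2 = run(fast_sql)
assert sorted(rows1) == sorted(rows2)  # identical answers, different cost
```

An accuracy metric treats both statements as equally correct; an efficiency metric rewards the one that the database engine can execute cheaply.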
I think that was their Innovation you
know that was one of the things they
brought uh
was I think also may be related to the
complexity of the queries um when you
reach a level of complexity there's now
many ways of answering that um by
crafting many joins or maybe some
subselects um so they can vary
considerably as you say in the actual
efficiency that the time it takes to
execute um so I think that was maybe
something that came out of just
developing uh a more complex uh
Benchmark Michael where do you think
there might be a good direction to go
for the next set of benchmarks of what
are the next set of hard questions that
you would like to see if you could uh
get a benchmark on order what might be
valuable uh I do I do think that there's
a lot of questions and and it was a true
in spider uh it's true in bird that have
a a very simple result set it's just
asking for a particular number uh I
think a lot of what people want to do
with their data is to build a result set
which they can then visualize so it's it
would have many rows uh with some
aggregation over you know often location
and time uh you know that could then be
presented in a
dashboard um that that's not a a high
percentage uh of bird not very many
queries that are suitable for that kind
of of uh visualized analysis so that
that's what I would like to see and
followup queries right once you have
that kind of aggregation you could then
potentially even have turn it into like
a multi-turn type of stuff I want to
highlight a few things about bird SQL uh
I I was generally very impressed
there are close to 13,000 unique
questions as part of the test data sets
there's close to 100 big databases
it represents about what 30 GB plus of
data that we're querying right so a whole
bunch of different professional domains
stuff like that and on that Benchmark IBM's
uh Granite models are number one we
crushed it we have the ExSL plus our
Granite 20 billion parameter model from
IBM Research that is currently the king
of bird SQL can you just highlight in a
couple sentences what are some of the
innovations that IBM has done in the in
that space to get the SQL just nailed
yeah I think uh you know it's a
multi-stage uh multi multi-piece uh
model so uh one thing we call schema
linking is just identifying what piece
of this database is actually relevant
for answering your question typically
you're going to have many tables many columns
for each table uh but only a small
fraction are actually relevant for what
you're asking um so that was a big focus
is first narrowing down uh what piece of
of the database is relevant um then
there's another part where we try to
identify what are the conditions uh you
need to match against in the database uh
so that the values in the database can
be represented in many ways and this is
part of having realistic uh databases
you know the it can be represented by Y
and N for a Boolean column or zero and
one or true and false uh
so dates and places can be represented
so many different ways so just that
ability to understand uh
what you need to include in your query
that will actually match the values in
the
database uh that was also a
significant part I'll tell you from my
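The two pieces just described, schema linking to narrow a big database down to the relevant columns and value matching to normalize how conditions are written, can be sketched crudely. Be aware that the schema, the column names, and the lexical-overlap scoring are all invented stand-ins for the learned components in the actual system:

```python
import re

# Toy schema; table and column names are invented for illustration.
SCHEMA = {
    "orders": ["order_id", "customer_id", "order_date", "total_amount"],
    "customers": ["customer_id", "customer_name", "region", "is_active"],
}

def _words(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def schema_link(question, schema, top_k=3):
    """Rank columns by lexical overlap with the question: a crude stand-in
    for the learned schema-linking step described above."""
    q = _words(question)
    scored = []
    for table, cols in schema.items():
        for col in cols:
            overlap = len(q & _words(col))
            if overlap:
                scored.append((overlap, f"{table}.{col}"))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Value matching: the same Boolean fact may be stored as Y/N, 0/1, or
# true/false, so conditions must be normalized before they can match rows.
_TRUTHY = {"y": True, "yes": True, "true": True, "1": True,
           "n": False, "no": False, "false": False, "0": False}

def canonical_bool(raw):
    return _TRUTHY[str(raw).strip().lower()]
```

For a question like "total amount per customer region", the linker surfaces `orders.total_amount` first, which is the narrowing-down step; the value normalizer handles the Y/N-versus-0/1 mess mentioned above.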
perspective this whole area is one of
the more exciting and interesting like
spaces and opportunities for for folks
who like worked with me a few years ago
once upon a time I used to manage um
I managed analytics teams a couple
different times um in my life and we
used to have an analytics
request Channel um within our digital
team that I used to joke and tell people
was my least favorite slack channel in
all of IBM um and it actually basically
functioned as a text to SQL model which
is that people would just dump in
requests and then our analytics team
would just go off and return you know an
hour later or a day later with an Excel
file being like I ran some SQL here's
your answer and I would just sit there
and stew at that channel and just be mad
all day long uh basically cuz I would
look at that and be like you guys should
learn how to do this yourselves like why
are you asking the team these insanely
basic questions like if you can't answer
these questions like how are you even
doing your job like all the time sort of
thing and so like when I think about
this space like the Nirvana to me is
that people the questions that people
would dump in the analytics request
Channel they can just ask to an llm in
natural language but one of the one of
the tricky things I'd love the panelists'
thoughts on this is a lot of things
people want is like their core kpis um
that sometimes they're measured on it's
like how am I doing on performance on
this thing and like getting the wrong
data in some of those like depending on
the type of kpi like a lot of
consequences uh sometimes of not getting
that data accurate um so you know I'm
just curious to team's thoughts on like
you know do we think we'll be more
productive in this space early on
empowering analysts to be faster do
we like you know is that Nirvana stage
of you know just business users asking
questions in natural language you know
getting results back do we think that's
closer than it feels um right now like
how like where are we on that maturity
curve we're going to get to a point
where it's useful for a data analyst
first uh that that's going to come well
before it's useful for somebody who
doesn't understand SQL uh but I
think that uh you know a business User
it's it's it's on the horizon it's um
with with tools for explainability to to
provide some justification and
verification for you that this is
actually answering what you're asking uh
I think that uh you can gain some
confidence in the output the llm even if
you're not familiar with SQL I I think
there are a few different ways in which
different uh vendors are going after
this Market you have the classic
business intelligence platforms right
analytics and BI uh tools Gartner just
released theirs this week again
Microsoft Salesforce Tableau or Oracle
Google BigQuery or the ThoughtSpots
of the world top right right so you'll
have all these leaders uh who have been
bi tools that are now adding natural language
queries to it then you have another camp
of Databricks and Snowflake that's
where the real engines are for
queries right so you have query that
comes in you're going to fire off in
real time cross multiple different sets
of data sets there the speed of ingestion is
going to be very very quick there
they are investing quite heavily in
understanding what is in the data so
given a particular column and there's a
header for that that says abcore XYZ now
what does that name actually translate
to in natural language so there's a lot
of work that's been done in
interrogating and discovering the data
creating the right set of metadata
attached to it so that when somebody
says hey I'm looking for numbers from
this particular region across X then you
have a good set of mechanics to go
and translate that to what it means
in Snowflake's Iceberg tables or what does it
mean in different formats across
Databricks so there's going to be a good
mash-up between the pure BI players
adding natural language and then the
data side of the of the world
extrapolating and exposing more metadata
around it so you can nail the actual
data where you need to go bring it from
in the lineage and then all the
governance falls on the left hand side
where which data set has been approved
for what kind of reporting if you're
looking at say creating a report on
sustainability there has to be a data
catalog that's OS so now we're starting
to understand how the data quality and
the data catalogs and discovery there's
there's a lot of effort that's going in
that space and now you're starting to
scale up on the natural language inside
of BI reporting dashboards that is
still critical Microsoft is attacking
this with Power BI where in the first
iteration of it Power BI Copilot was
really focused more on the developers so
my team will go and build out a whole
bunch of different dashboards there's AI
baked in for me to go switch out
different panels so on so forth
over time now they're adding uh ways in
which an end user business analyst can
just ask a natural language question
but then then you have to go uh ask
follow-ups and say hey did you mean this
region or that one and so on so forth
but it's great to see this blend of data
getting better and the tools that are
analyzing and interrogating that data are
getting even better as well we released
our own watsonx BI uh tool that again
tries to mesh that together from our
watsonx.data set and on the right hand
side you have a better natural language
query system that goes and pulls out as
well but it's a very very hot field for
a lot of motion I was writing an article
about how the dots on the Gartner have
moved on the bi it's incredible how much
movement has happened and so many people
are now moving into the leader space in
the Gartner quadrant whereas we didn't
have those many last year one other
interesting piece and maybe as a last
word Marina I'll throw it over to you
I've also I was also just
reading a bunch of like articles Reddit
threads things like that about people
who are trying to make text-to-SQL work
inside of their companies and like one
of the interesting things people are
saying Well turns out documentation is
useful for humans and also for LLMs
and so people were talking about
actually building some like rag patterns
that were like just taking the
documentation for their database and
putting it into the context for um for
the llm so it actually like had some
sense of like what some of these fields
um were supposed to mean so I'm just
curious if you've seen you know anything
like interesting or noteworthy kind of
you know just in the rags space around
how it like how it intersects with the
whole kind of text-to-SQL area I mean I
will add to that that documentation is
useful uh examples of hey I ran this
kind of kpi report for one company for
one thing can you do the same thing but
for something else those kind of
examples are very useful a reason that
you could I think make a lot of progress
and we are making a lot of progress with
this bi space it's very similar just in
uh code creation space you're you're
wanting particular functions in
particular order and you can very
rapidly build up a sequence of things
that you need and check whether it makes
any kind of sense so like what Michael
and Shobhit were saying that it can be a
little bit hard to go straight to an end
user who doesn't understand but it's
quite easy to go to a data analyst who
looks at it and goes all right I
immediately can see where this is going
off the rails give me this information
this this this that ends up speeding you
up a whole lot but the human can
immediately tell where things are good
where things are bad so what kind of
knowledge is uh helpful to throw in and
especially examples of oh yeah this it's
like this but for this company for this
time period for for this kpi that's
where you're going to get a lot of speed
up because you end up just doing a lot
of the same thing over and over and over
again and that always means it's a place
that's ripe for this this human in the
loop
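The pattern described in this segment, pulling database documentation and prior example queries into the model's context, can be sketched as follows. The docs, the example query, and the keyword-overlap retriever are all invented stand-ins; a production RAG setup would use an embedding index instead:

```python
import re

# Invented schema documentation and a prior-query example, for illustration.
DOCS = [
    "orders.total_amount: gross order value in USD, before refunds",
    "customers.region: sales region code, one of AMER, EMEA, APJ",
    "orders.order_date: ISO-8601 date the order was placed",
]
EXAMPLES = [
    ("revenue by region",
     "SELECT c.region, SUM(o.total_amount) FROM orders o "
     "JOIN customers c ON c.customer_id = o.customer_id GROUP BY c.region"),
]

def retrieve(question, snippets, k=2):
    """Keyword-overlap retrieval: a stand-in for an embedding-based retriever."""
    q = set(re.findall(r"[a-z]+", question.lower()))
    score = lambda s: len(q & set(re.findall(r"[a-z]+", s.lower())))
    return sorted(snippets, key=score, reverse=True)[:k]

def build_prompt(question):
    # Paste the most relevant schema notes and worked examples into the
    # context so the model knows what the cryptic field names actually mean.
    notes = "\n".join(retrieve(question, DOCS))
    shots = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in EXAMPLES)
    return f"Schema notes:\n{notes}\n\n{shots}\n\nQ: {question}\nSQL:"
```

This is the "documentation is useful for LLMs too" point in code form: the prompt for "total order amount by region" carries the `orders.total_amount` doc line and a reviewed example query, which is also exactly the kind of "same report, different KPI" reuse described above.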
[Music]
Every week we have a thread where we talk about what people are seeing in the industry and what we think is interesting. Obviously we covered some pretty big news so far; the last piece is, there was an article going around talking about books for, like, $2.99 that were appearing in Kindle as ads, and they're clearly 100% AI-generated books. They were nighttime stories for kids and parents, essentially, which usually you don't hear; usually, and I think this was a joke in the article, the nighttime story for parents is just a book, right, as opposed to a bedtime story. And it would be, like, "using your mind control powers for good," and it's just all of this uncanny-valley ebook content that's out there. We thought it'd be a good idea to talk about what's happening in the content space. We've been using the term "AI slop" a little bit, and there have been a few things going around the internet; I'd love to get the panel's experience with some of this. Some of the most recent ones I've seen: there was a video of Dave Chappelle that went moderately viral, which was not Dave Chappelle speaking; somebody had one of the LLMs write a routine and then basically piped it into a Chappelle voice. There was a video going around of Toys R Us's first attempt to do an ad with Sora. There are way too many SEO stories of people lighting domains on fire building LLM-generated content. Maybe my hot take to start this off, and then I'll throw it to everyone: content, and content generation, almost feels like a red herring for LLMs right now. It was the first thing people latched onto, like, "oh, I can get this thing to write a blog post or a movie script," and none of those use cases, I don't know if y'all have seen any of them, have seemed to produce really anything useful right now. And there are lots of other areas where many interesting things are happening, like most of the conversation we've had here today, but content still takes up a weirdly high percentage of the discussion in the market. I don't know if it's just not there yet, but I'm not seeing good things happen in this space. So I'd love to open it up to the group: one, have you seen anything else that has struck you, like, how did we get here to this place? And then, do you disagree with the take that content is almost a distraction in the LLM world right now?

I
don't know if content is a distraction. It is something that is easy to create and easy to consume, so most of the things we've been talking about here are a little bit harder to understand, whereas you see a picture of shrimp Jesus and it's funny, and you're like, all right, that was funny; I'll allow that there's one good use case, then: shrimp Jesus, that's funny. Slop I would categorize as some kind of mixture of funny, annoying, and dangerous. Funny is shrimp Jesus. Annoying is these Kindle ads, where you look at them and you're like, okay, the title is weird, the girl in the picture has eight fingers; this is clearly some kind of low-grade content that is trying to see if it can get my attention. Then there's dangerous, which is the AI guide to mushroom foraging that is going to kill you if you go ahead and read it; that was going around a little while ago. And you're going to get this again, because it is so easy to generate. Amazon lets you complain about these booksellers and take stuff down, but it's just as easy to put stuff back up again, so it's just going to continue to come back, and continue to come back, and be on various social media platforms, because it's funny. I actually think there's something beneficial to this, because it continues to keep in the public eye a reminder of just how easy it is to turn these things into trolls, so that you don't fall into a sense of complacency from only hearing "oh, this is successful, this is useful." As soon as you turn it to evil means, it actually still continues to be very easy and very, very clever. So I do think there's some use to that.
Brian, I'm going to disagree with you. I think content creation is an amazing use case for enterprises. Within IBM we've had a big partnership with Adobe, and we do a lot of our marketing end to end. As long as it's done within constraints, you're using brand-approved guidelines, you're using content that's been vetted and so on and so forth, and there's a human in the loop, content generation has delivered amazing value for us at IBM, and we've done the same for our clients. And I'll take a more hot take on this. This is something that Yann had shared; he runs all the AI for Meta. He described how they're using LLMs to filter content before people post. Two years back, when they were looking at somebody about to post something on Facebook or Instagram, and they were trying to validate whether it was hate speech and so on and so forth, they would have a one-in-four chance of actually catching that content. Now, with their Llama models, they're able to catch close to 90 to 94% of those about-to-be-posted items; they can flag them as something that should not be posted. So this is AI being leveraged for good. Instead of saying, "hey, I can just generate a whole bunch of content," the reverse is also true: I'm leveraging these LLMs for some of my banks, where we're dealing with social engineering attacks, security related, and we're able to identify which of these are actually social engineering hacks. So there's a flip side to this, and I would say content creation is an amazing superpower for LLMs if they're used rightly: the right context, in the enterprise, with the right guardrails around it.

You did say one thing which
I think is underappreciated, and this is kind of what I meant by the red-herring piece of it: we actually do have some use cases on the team where we're using it now, and it's saved us a gazillion hours, so there are real things that have helped from a productivity perspective. But the second thing you mentioned, that Yann talked about: I think they're an underrated analytics tool, actually. It's a different type of NLP analytics that you couldn't do without human judgment before, but now you have this other type of computer that can do a different sort of analysis, and we have so many use cases around that sort of activity. That sort of analysis, summarization, how that plays into internal workflows: there's this whole universe of value that we see there, and sometimes I think it gets a little bit lost in "oh, you can create content." Well, you can analyze it too, and you can do all sorts of other things with language. So I do know there are legitimately some good use cases, but sometimes I get a little sad that we don't talk more about the kind of use case that you just talked about there.

Yeah, I'll say
that for a long time we've had an abundance of some low-quality content, for blog posts and books; that's true, the internet has not, like, won the Academy Award, right? I think it also speaks to how the danger might be a little overstated: we've developed defenses against human-created fakes, which have been around for much longer than LLMs or other models have been able to create any kind of convincing fake. So I think both the danger and the value of some of this content creation, of which humans have already created an abundance, are probably pretty low.

That's true. I
tell you, when I saw the Kindle thing: YouTube had all these sort of interesting challenges with, like, what they were calling algorithmic slop, back before gen AI was even cool. And, you know, I look at a show like CoComelon. There's probably only a very specific audience that's going to resonate with this, but every time I hear CoComelon, I just die a little bit inside; I'm like, put on Bluey. And so I look at CoComelon, and it just feels like an algorithm created a child-tuned engagement engine, essentially. And I see that Kindle stuff and I'm like, oh no, I can feel more CoComelons coming, you know, especially for kids. So I don't know whether I would put that under the umbrella of dangerous, but some of it makes me a little nervous on some level. That's far away from an enterprise use case, but maybe as a parent I see some of those things and I'm side-eyeing, side-eyeing a little bit.

It is so interesting, by
the way. I don't know if you've read about the company, but it is absolutely based on tracking a lot of KPIs, like how long people watch the videos, and trying to game that as much as possible and maximize it. Bluey, which I love, is wonderful content, and it's not built for that at all, whereas CoComelon really is "how long can we keep the kids engaged?" They track, like, every second what the kids are watching. So there's a lot of AI being put to use there, although it's still being put to use by people who have decided that that's their goal.

Yes, I
think overall, if you're looking at the content-creation part, and you touched on this with the Toys R Us ad with Sora as well, the ultimate use case would be... I don't care much about who created the ad itself, right? Toys R Us did an amazing job, releasing at Cannes their ad that was 80% created by Sora. Amazing work. But I really want to fast-forward to a point where, in the movie that I'm seeing on my screen, I want to be in there. Like, I would love to have a scene with Beyoncé, right?

That was not where I thought you were going. I was like, "it's like I'm the star of the movie."

Imagine, right? If I'm watching Avengers, and I'm a big Avengers fan, I would want to be helping Iron Man in that particular scene, fighting Thanos, right? Just imagine the power of inserting a version of yourself inside a movie that you're watching. I think that's the future I want to live in. Don't tell my wife I prefer Beyoncé and all those...

We're just going to cut this part of the pod. So, like, it's over. But I think it
will be very interesting to see how this space shapes up. There are places where I could see an algorithm, like a reinforcement-learning type of thing, producing CoComelon on steroids, and parents, people, looking at that and being like, "what is that stuff?" But at the same time, I think there are plenty of places where there's a ton of opportunity, and there's real stuff people are doing today. But there are also other interesting things, like around analysis, that I wish people would pay maybe a little bit more attention to. So, in any case: Marina, Michael, thank you for joining us today. It was a great discussion, and we will see you back here next time on Mixture of Experts. Thanks, all. Thanks so much for