AI Shaping Sports, Code, and Personas
Key Points
- The episode kicks off by exploring how AI could transform major sports events like Wimbledon, the Euros, and Copa America, from performance analytics to enhancing fan experiences.
- A new study from the *IEEE Transactions on Software Engineering* examines GPT’s ability to solve coding tasks, raising concerns about over‑reliance on AI tools for novice programmers.
- Researchers have released a paper on generating “1 billion personas” as synthetic data, sparking discussion about whether such massive persona libraries can truly capture human diversity for training LLMs.
- Host Tim Hwang is joined by IBM experts Skyler Speakman, Kaoutar El Maghraoui, and newcomer Aaron Baughman to dissect these AI trends and their broader implications.
- The conversation also touches on the practical insights gained from attending Wimbledon in person, illustrating real‑world opportunities for AI integration in sports.
**Source:** [https://www.youtube.com/watch?v=aMcYxjZMRuA](https://www.youtube.com/watch?v=aMcYxjZMRuA)
**Duration:** 00:41:24

Sections
- [00:00:00](https://www.youtube.com/watch?v=aMcYxjZMRuA&t=0s) **AI's Impact on Sports, Coding, and Synthetic Data** - In this episode Tim Hwang previews discussions on AI's role in reshaping athletic competition, a new study evaluating GPT's coding abilities, and a paper proposing a billion-persona synthetic data framework to alleviate data scarcity.

Full Transcript
Hello, and happy Friday. You're listening to Mixture of Experts. I'm your host, Tim Hwang, back from the kitchen. Each week, Mixture of Experts brings together an amazing group of, well, experts to tackle, debate, and explain the biggest trends in the fast-moving world of artificial intelligence.

This week on the show we're going to cover three stories. First, with the Euros, Wimbledon, and the Copa America all happening in a matter of weeks, we talk about AI and sports. Will AI have a role in shaping the nature of the game, and if so, how? "They might just see gameplay, whereas we see data and an opportunity to derive these types of insights, to help find that signal in the noise, to provide those serendipitous moments that connect people to the game."

Second, a new study out in the IEEE Transactions on Software Engineering reveals new data about GPT's performance on coding tasks. What does it tell us about the future of coding assistants? "I worry a bit about this over-reliance on AI tools for problem solving, especially as you're learning in the early stages of programming."

And third, a fascinating paper out on arXiv entitled "Scaling Synthetic Data Creation with 1 Billion Personas," from Tencent's Seattle lab. Does it provide a way forward for resolving data bottlenecks, and what can we use personas for in the future? "How confident are we about the coverage of these 1 billion personas, and do the underlying large language models really understand being a Maasai warrior?"
As always, I'm joined by an incredible group of panelists who will help us navigate what has been another action-packed week in AI. Today we've got three panelists: Skyler Speakman, a senior research scientist at IBM; Kaoutar El Maghraoui, principal research scientist at the IBM AI Engineering and AI Hardware Center; and joining us for the very first time, Aaron Baughman, IBM Fellow and Master Inventor. Welcome to the show,
[Music]
everyone.

First up: it's been a very busy season if you're into watching sports. Wimbledon, the Euros, and the Copa America are all happening basically this week and last week. I know on Mixture of Experts we've largely talked about AI as a kind of internal business process, but particularly with all the sports in the air, I've been thinking this might be a really good opportunity for us to talk a little bit about the ways in which AI might reshape sports itself.

Aaron, I want to start with you, as our new panelist on the show, just to pick on you a little bit. You were actually at Wimbledon, and I'm curious, as someone who works in AI day in and day out: why did you go, and what did you think? I'm sure as you experienced this kind of tennis tournament you were thinking, oh, actually, there might be a lot of ways for AI to apply here. So, just as an initial place to start, give us the report on
Wimbledon.

Yeah, so it's always fascinating to watch how we operationalize lots of these AI techniques, in this case with our partner Wimbledon, and I was lucky enough to go. We've been doing this with them for almost 30 years now. This year we focused a lot on generative AI, but we also don't want to forget about classical AI, because both of them are very important, and we use many different techniques to do it. But to actually be there during the tennis, in the thick of the space, is very interesting, because there are many aspects: how's the technology performing, how's the consumer acceptance of the tech, and then how's the back-office acceptance as well. It's always nice to watch people around the world use it. We get billions of users every single year who use our systems. In this case, for the generative AI, just at the halfway mark, people had spent 2,319 hours just looking at and reading the generative content that we produce.
Well, if I can ask you to back up a little bit: our listeners won't necessarily be familiar with what you were working on. We'd love to hear a little more about what technology you were mostly focused on this year, and what people were doing with it.

Yeah, so we're looking at bringing the game to you in a personalized way. What we like to do is mix in different aspects: we like to rank players, we like to predict who might win a match, and then we want to create content to catch you up, so that if you join partway through the tournament, we have these digestible nuggets that consumers around the world can view to understand what's happening in the match. They might just see gameplay, whereas we see data and an opportunity to derive these types of insights, to help find that signal in the noise, to provide those serendipitous moments that connect people to the game.

Yeah, for sure.
I was talking with a friend recently about this. As someone who got into football (soccer, that is) during the pandemic, my experience of the sport has largely been a visual experience. It's so interesting to me that, having never gone to a game, I'm a huge fan who watches all the time, but my primary experience is intermediated, through social media and what I see on TV. It sounds like there's been a similar exercise to figure out how AI plays a role in that interface, from the fan and the viewer, to get more out of the game. I guess, Aaron, I'm curious: any lessons learned, things that you thought worked really well in this work?
Yeah. So we used lots of different sensors around the courts to gather data. We use the Hawk-Eye system, which has up to nine different cameras that track the ball and the players, and we get all sorts of stats streaming to us. But there's just this deluge of information, and it's hard for people to comprehend it. So one of the lessons learned, I think, was to create these digestible narrations, pre-match and post-match, about the players, so fans can go in and quickly read up on their favorites. Another aspect we learned is that sometimes it's nice to inject information that somebody wouldn't ordinarily know about, or read about, or even think about; it's nice to watch that happen and spread. That's one of the pillars. Very quickly, the other pillar would be on the operations side: it's always great to have humans, machines, and algorithms working together to create a symbiotic experience that can be used whether you're on mobile or on site as a fan. It's really evolving into this sort of Moneyball 2.0.

Yeah, I think the
power of AI in sports really is transformative. AI here plays a multifaceted role, not just on the commentary side and the user experience; there are lots of different applications where we can see its power. Things like performance analysis and athlete training: using wearable technology, you can have sensors that collect data on an athlete's movements, biometrics, and so on, and analyze that data to provide insights on where the athlete can improve. Video analysis: analyzing footage of training sessions and games, assessing techniques, identifying weaknesses. Game strategy is also a big application. Health and injury prevention, with things like injury diagnosis: algorithms can assist in diagnosing injuries through image analysis. Fan engagement and experience, of course, is the fun part, like Aaron was talking about, with personalized content, chatbots and virtual assistants, and even augmented reality. You can have a truly immersive experience with AR and VR; imagine watching a game as if you're there. I think that can be a lot of fun. And of course, for game and event management, there are also areas where you can use AI for scheduling, logistics, crowd management, and ticketing. So there are lots of areas where AI, and also generative AI, can really play a transformative role, and I can only see this growing.

That's
right. Yeah, I love the idea that in the future you'll be able to get whatever commentator you want, generated algorithmically on the fly. Like, I want George Washington to narrate my sports game, and to have that audio generated on the fly would be really interesting. I also think this point about the back end is really interesting, about all the operations it will help with. And Skyler, I know you were particularly interested in the idea that teams able to really manage all this data will have a huge advantage in the future. It'll be a wonderful world where someone managing a top tennis player will also be trying to get H100s to run their own fine-tuning runs.

So maybe two questions along those lines, both to Aaron. Do you know if any of the tennis players have used this? Are they looking at the narratives that were generated? Has it reached the player side? I know we're talking about consumer-facing tech at this point, but have any of the players commented? And the second one is: when is IBM going to bring this technology to esports, where the data is almost already in a more usable format? There can be just as much hype and excitement and drama in some of these more recent esports, and I think there's a great opportunity to bring this technology to electronic gaming. That's one of my favorite pastimes. So yeah, Aaron, any comments on
those?

Yeah, so first: great questions and suggestions. Do players actually use some of our information? It's really funny: some of them do, some of them don't. Some of them are very superstitious; if they were to look at one of our predictions, it would sort of mentally affect the way they play the upcoming game, so some coaches do not let their players look at some of our features. And some properties aren't even allowed, during a game, to show any sort of predictions, or sometimes even generated content, because it might influence play as well. But on the other side of the coin, some players have used it, and they do look at the stats that we boil down. We also had a project with the US Open, where we worked with some ATP players to help them train: they would see videos of themselves playing, and we would find highlights of what they played, so it was like a dashboard in a developmental center. So there's that aspect. And I was curious: do any of you play tennis or sports, and, even not very well, would you use these kinds of insights?
I think I would try to use them, especially to maybe help with my performance. I'm not an active sports person, but I'd hope they could help me improve my technique and things like that. But another thing: is there any downside to this? I worry a little bit about bias and fairness with certain athletes. These are always red flags we can raise with the use of AI: AI systems can inherit biases present in the training data. Could this lead to unfair treatment of athletes? For example, biased scouting algorithms might overlook some talented individuals from underrepresented groups. So there are some dangers. I don't know, Aaron, is this something you think the current algorithms take seriously, or is it still early on, where we're just evaluating the technology right now and maybe starting to look seriously into these concerns: privacy issues, bias, fairness?

Yeah, fantastic topic
that could take hours to talk through. But yes: fairness, transparency, and explainability, which in gen AI might involve chain of thought to understand what the models output. But a quick story. In tennis we used to measure, and still do measure, the excitement of videos. We'll look at signals like sound, gestures, and score, and we quickly found that somebody who's an amateur might have a really exciting shot, but because they're not a very popular player, there aren't a lot of people around them to make a very loud cheer. Whereas a top-five-ranked player might make a routine shot that's not that exciting but gets a huge cheer, because there just happen to be a lot of people there. So we take that into account, and we'll debias with post-processors based on different restrictive traits that we track. Because, yeah, it's real, and we work to make sure we can debias these cases. There are many debiasing methods, and in the gen AI space I think we're just beginning on that. And a question for you all: we try to balance creativity with factualness in this generative content. How do you think the field can do a better job at that, with respect to hallucination when you're more creative, whereas you need fact-checkers, and so on and so forth?

I
think one thing that will happen, if we continue to collect this data, is that you'll be able to ask questions like: how exciting was the current top star when they were just starting out? You'll be able to go back in time ten years and look at that top star when nobody was following him but he was still making the great shots. We probably can't do that now, because we don't have as much historic data, but eventually you'll really be able to watch entire careers play out over time. At least with your example of the player who's not popular now but made a great shot, you'll be able to ask that same question. Michael Jordan famously didn't make his high school varsity team; that type of perspective. But we're not going to be able to do that with the snapshots of data we have currently.

Yeah, and I'm
hoping that some of these tools will actually help teams see around corners. Some of the most interesting times in sports are when someone comes up with an entirely new strategy that totally changes the nature of the game, and hopefully with data there's a chance to identify things we might not have spotted otherwise. So, to wrap up this section: Aaron, maybe a question to throw back to you. It seems like you've done a bunch of work in tennis, with Wimbledon and the US Open, and I'm curious whether, as an AI researcher, there's a dream sport you'd really want to apply some of your techniques to, or ones that haven't really been investigated. I assume one of the reasons for tennis is that it's a lot more controlled: you can set up a bunch of cameras, and there's a defined place where everything happens. But I'm curious, from almost a computer science point of view: what's the next most exciting sport to get AI-ified, and why?

Yeah. So we've focused a lot on
tennis and golf. We're doing some fantasy football, which borders on e-gaming, and we did do some e-gaming with Overwatch, which was very interesting. But I think the next step is exploring the intersection of gaming with that of a sport, because I really enjoy the challenge of e-gaming: the physics engine can change, you can get new skills and new abilities on the fly, you can get power-ups. It's different, and your models have to adjust very quickly, and maybe quickly learn online about a new hero that's transported into the game. That's interesting. And in a real live aspect, one of my favorite sports to watch is basketball; I would love to look at that and analyze more of the team aspects in play. And then also look at the Olympics. I saw an article where, I believe it's NBC, they're going to be using generative AI to recap some of the matches, so I'm very curious about what they're going to do and how it's going to be accepted by the population. But yeah, that's my answer, and I'm sticking with it.
[Music]
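The crowd-size debiasing Aaron describes for excitement scoring can be sketched in a few lines. This is a hypothetical illustration, not IBM's actual pipeline: the signal mix, the weights, and the log-scale audio correction are all assumptions made up for the example.

```python
import math

# Hypothetical excitement scoring with a post-processing debias step.
# Raw excitement mixes crowd audio, player gestures, and match-score
# tension; popular players draw bigger crowds, which inflates the audio
# signal, so we renormalize the audio term by estimated crowd size.

def raw_excitement(audio_db: float, gesture_score: float, score_tension: float) -> float:
    """Naive weighted mix of signals (weights are illustrative)."""
    return 0.5 * (audio_db / 100.0) + 0.3 * gesture_score + 0.2 * score_tension

def debiased_excitement(audio_db: float, gesture_score: float,
                        score_tension: float, crowd_size: int,
                        reference_crowd: int = 5000) -> float:
    """Rescale the audio signal as if a reference-sized crowd had produced
    it, so a small crowd's big cheer is comparable to a large crowd's."""
    # Perceived loudness grows sublinearly with crowd size, so we correct
    # in log space (an assumption for this sketch, not a measured model).
    correction = math.log(reference_crowd + 1) / math.log(crowd_size + 1)
    adjusted_audio = min(audio_db * correction, 120.0)  # cap at a plausible max dB
    return raw_excitement(adjusted_audio, gesture_score, score_tension)

# An amateur's great shot in front of 200 people vs. a star's routine
# shot in front of 12,000: after debiasing, the amateur's shot ranks higher.
amateur = debiased_excitement(audio_db=70, gesture_score=0.9, score_tension=0.8, crowd_size=200)
star = debiased_excitement(audio_db=95, gesture_score=0.2, score_tension=0.3, crowd_size=12000)
print(amateur > star)
```

The post-processor leaves the gesture and score signals untouched and only corrects the channel that crowd size contaminates, which is the general shape of the fix described above: measure, identify the biased signal, renormalize.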
There's been, of course, a lot of hype around the ability of generative AI to assist with software engineering, and a lot of excitement about the idea that at some point AI might just do the coder's job entirely, end to end. Copilot, of course, one of the most successful products of the gen AI era, is a great example of this. And there's a great paper that came out just last week in the IEEE Transactions on Software Engineering entitled "No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT." Basically, the idea is to say: okay, we know these systems can code, but how good are they at it? It benchmarks ChatGPT against a number of different coding challenges to assess how well it generates code, and I think there are two interesting findings I wanted to discuss with the group today.

The first is that ChatGPT in particular turns out to have huge variance in its ability to do coding tasks: for tasks labeled hard in this benchmark, it's only able to get them right about 40% of the time, while for easy tasks it's up to about 89%. I'm curious, as folks on the call who all code and presumably use tools like Copilot: there's been a narrative that these coding assistants are basically just Stack Exchange++, that they just help you search the internet and get an answer for easy things. Do you buy the skepticism of the paper, which is to say that for the really hard tasks we're still just not seeing LLMs or generative AI advance the state of the art or accelerate our ability to solve truly hard CS problems? And is that a temporary thing, or a ceiling we're all running into? Kaoutar, do you want to respond to that? I know you might have a view, particularly when it comes to the coding task and its relationship to hardware.

Yeah, I really
enjoyed reading the paper. I think it's a very nice study evaluating GPT on coding challenges, and it revealed mixed performance, like you showed, influenced by the training data cutoff and the inherent limitations of existing models. For simple tasks it's doing fantastic, and I think it will continue to. For complex things it still has limitations. Gen AI today still struggles with understanding the broader context of a project, which leads, for example, to suggestions that don't really fit the overall design or architecture. Especially when you have complex system designs, with multiple components and multiple APIs that need to interact with each other, that requires reasoning, and there are still limitations with gen AI when it comes to reasoning. So the complexity of the context, the integration challenges of putting multiple components together, and how those interfaces communicate with each other: that's still a bit of a challenge for gen AI. As we improve the contextual understanding and accuracy of gen AI models, I think we'll see better results, better integration, and a better user experience on these coding challenges, but there's still a lot of research to be done. And things like best practices in software engineering, clear system design, and prompt engineering can also enhance these tools. But I think we're still in the early stages.

A fun contextual story for this: we
just hosted about 40 high school students here at the lab, to show them what industrial research looks like. They were asking our software engineers questions like: do you use a copilot or generative AI in your code? And the answer was, kind of, a little bit, but to a person, everyone down our line kept referencing Stack Overflow. So I think your comparison is right: what you really want is a nice integration of the tools we're already used to, like a Stack Overflow sitting inside your IDE, letting you code so much more smoothly. It was a great example, because the high school students weren't as familiar with this thing called Stack Overflow while our software engineers swore by it; that was a point where the old and the new hit together. Those types of resources are incredibly useful, and it will be interesting to see how much these code generators are really taking from Stack Overflow.

In that paper they did a really cool analysis: they broke down the coding questions from before and after 2018, I think (don't quote me on the date), and the model did very well on the old questions and very poorly on the new ones, suggesting the LLM is not keeping up with the most recent content and experience on Stack Overflow. It was really cool to see that breakdown: the LLMs doing great on older, established questions, perhaps with answers already sitting on Stack Overflow, and not so well on the more recent coding challenges that came up after training. So I think we're going to see these kinds of comparisons between what exists on Stack Overflow and what's been incorporated into the LLMs, but it's a really cool place to see it play out.

Wouldn't that require
frequent retraining, or readjustment of the models?

Yeah, I think it can be a solved problem. I was just giving hats off to the researchers, who understood that nuance in the coding ability of the models and said, wait a minute, this model was trained at roughly this time; let's see if we can ask it coding questions that didn't exist, at least in the common Stack Overflow universe, at that time, and then report the performance breakdowns. But yes, retraining, and constantly taking new information into account, would be a way to try to address that.

Yeah, and I think
this is one of the really interesting challenges it brings up, because updating the training data through pre-training is actually a cost-intensive task; these models don't get pre-trained every single day. And so there's this odd dynamic the paper suggests: if you're working with older languages, you're actually in trouble, because these models can automate much of your work, and the best way to survive is to migrate, so there's more pressure to adopt new languages; and yet those new languages are simultaneously the ones the model isn't very good at assisting with. So it imagines a kind of bifurcated world, with a bunch of older systems for which AIs can basically automate most of the coding, and then a frontier of code that essentially can't get automated away. That has really interesting implications for where we'll see the impact of the technology.

I think this dependence on AI,
this over-reliance on AI tools: I think the danger could be a decline in problem-solving skills. Is that going to be a problem, especially for young generations of programmers, if for all the simple tasks you just go ask ChatGPT, "write code that does this for me"? Usually you learn coding from these simple examples and then build on top of them to get to more complex problems. So I worry a bit about this over-reliance on AI tools for problem solving, especially as you're learning in the early stages of programming. And as you build, maybe that's going to require new skills that we need to develop as programmers, coders, and software developers: figuring out how to use these tools efficiently, and how to know whether a solution is plausible or whether I need to change and tweak it. Another thing I see is what this means for debugging when there are issues: if I've relied heavily on these copilots to write code for me, will I be able to debug things properly when they fail, or should I also rely on AI to help me debug? So it's interesting to see how the interplay of all these things will come into play, and the role of humans and coders here. I don't know all the answers; maybe you have some insights. But there is a downside to this, alongside, of course, lots of advantages in terms of enhancing code productivity. There are challenges we have to think about as well.
yeah I've seen in the field that uh many
people they'll consult different types
of code assistants, because there are many different models specialized around different types of tasks. This agentic architecture, where you bring together a mixture of experts, such as many different large language models, is almost like coding by crowd in an automated way. So now it seems like these developers, scientists, and operations experts have to have the analytical capability to discern what the best technique is, because you're going to get many different opinions, coding styles, and perhaps even languages sent to you. One important principle I always try to follow is that there's no free lunch: there is no perfect algorithm suitable for solving every single problem; it depends on the context of the problem at hand. With that in mind, the human really understands the context of their audience, what they're trying to build, and where they can deploy it, whereas these code assistants, at least today, know only a limited amount of the context, and therefore it's important to get multiple large language model opinions on what they should or shouldn't do.

One area I have a lot of interest in is automatic transpilation of code. Say you're running an application in one language, let's say Python; maybe it could be transpiled into Rust on the fly, where it might be less memory intensive, and a human could look at the result and say "yes, I agree" or "no, I don't agree," taking that analytical approach. It's all emerging, and I'm really excited about the future, and about what we at IBM can also do with InstructLab, using skill building to help with this, as was mentioned before. And then there's the timeliness of the data a model can understand: between in-context learning and fine-tuning, there are many approaches, and there are going to be many more in the future.
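The "coding by crowd" idea mentioned above can be sketched as picking, from several candidate implementations, the one that passes the most tests. This is a minimal illustration, not any product's actual pipeline: the candidate functions stand in for outputs from different specialized code models, and the test list is a hypothetical example.

```python
# Hypothetical candidates, standing in for outputs from different code
# models. Each is supposed to clamp an integer to the range [lo, hi].
def candidate_a(x, lo, hi):  # correct implementation
    return max(lo, min(x, hi))

def candidate_b(x, lo, hi):  # buggy: ignores the upper bound
    return max(lo, x)

def candidate_c(x, lo, hi):  # buggy: bounds are applied in the wrong order
    return min(lo, max(x, hi))

def pick_by_crowd(candidates, tests):
    """Return the candidate that passes the most (args, expected) tests."""
    def score(fn):
        return sum(1 for args, expected in tests if fn(*args) == expected)
    return max(candidates, key=score)

tests = [((5, 0, 10), 5), ((-3, 0, 10), 0), ((42, 0, 10), 10)]
best = pick_by_crowd([candidate_a, candidate_b, candidate_c], tests)
```

In practice the candidates would come from different LLMs, and the "tests" could be a real unit-test suite or a human reviewer's judgment, as described in the conversation.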
Yeah, I think one big thing LLMs will probably play a big role in: the average state of documentation for code is very, very poor, and I feel like one enormous use case, even outside of coding assistance, is just taking a piece of code and making sure it's well documented. That will be a huge quality-of-life improvement for this kind of work.

I love that, because documentation is always an afterthought, and you never have time to do it, so that would be a huge help.

Yeah, that's right. It'll be funny if the biggest thing isn't automated code; it will just be making sure that someone's doing a good job documenting everything.
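As a minimal sketch of the documentation gap described above: Python's standard-library `ast` module can flag functions that lack docstrings, which is exactly the kind of inventory an LLM-based documenter would start from. The sample source is illustrative.

```python
import ast

def undocumented_functions(source: str) -> list[str]:
    """Return names of functions in `source` that have no docstring."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and ast.get_docstring(node) is None
    ]

sample = '''
def documented():
    """Has a docstring."""
    return 1

def bare():
    return 2
'''
print(undocumented_functions(sample))  # ['bare']
```

A documentation assistant could then feed each flagged function to a model and ask it to draft the missing docstring for human review.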
Great. Well, I want to take us to our last topic of the day. There was another wild paper. If you're kind of a weirdo like me and just browse arXiv for fun, this is one of the papers that popped up recently and caught my eye, and the way it did that is basically people doing SEO with their papers: the title is "Scaling Synthetic Data Creation with 1 Billion Personas." With a name like that, you've got to click it, you've got to download it, you've got to read it, and some of us even need to print it out, like Skyler here. It's actually a pretty simple idea, but I think this particular group of experts would be really good to tackle it.

To give the overall background: there's often a need to generate synthetic data, because collecting real data out of the real world is very expensive and comes with all these operational difficulties, so people are always trying to come up with ways of creating data from scratch that they can generate on the fly and use to train their models, because that relaxes the bottleneck. These researchers out of Tencent's Seattle lab said, well, maybe one fun way of doing this is to instantiate what they call personas, which are like personalities: your job is as a dog catcher, or your job is as a professional coder at IBM. Their observation is that we can get these different personas to do different tasks and output data for us, and those tasks will generate very different reactions. It turns out that if you ask a dog catcher to generate code for you, it will look different from the code you get if you prompt the model with "you're an expert coder." The argument they make is that with all these personas, we have a scalable way of generating lots of training data, and they do a couple of experiments showing that you can use the synthetic data to train an LLM to do math problems effectively.
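The persona trick described above is, at its core, prompt templating. A minimal sketch, loosely modeled on the paper's idea; the personas, task string, and `generate()` stub are illustrative, not the authors' actual pipeline:

```python
# Persona-conditioned prompting for synthetic data: each persona steers
# the model toward a different region of output space.
personas = [
    "a dog catcher in a small town",
    "a professional coder at IBM",
    "a high-school math teacher",
]

TASK = "Write a short word problem about percentages."

def build_prompt(persona: str, task: str) -> str:
    return f"You are {persona}. {task}"

def generate(prompt: str) -> str:
    # Stand-in for a real LLM call; echoes the prompt so the sketch runs.
    return f"[model output for: {prompt}]"

synthetic_data = [generate(build_prompt(p, TASK)) for p in personas]
```

Scaling this to a billion personas is then "just" a matter of generating and curating the persona descriptions themselves, which is where most of the paper's machinery goes.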
I guess, maybe just to kick it off, particularly with you on the line, Kaoutar: I think the big question here is, nowadays, how much is compute the bottleneck and how much is data the bottleneck? Because it feels like this paper describes a world where, if you just have lots of compute, you can generate all the data you need, but a few episodes ago we were talking about how difficult it is to come by compute. So I'm really interested in your take: what is the bottleneck right now in the machine learning workflow?
Yeah, that's a very good question. Of course, with gen AI, compute is a big bottleneck right now, especially with the matrix multiplications, which take a huge amount of compute on current accelerators and hardware. When it comes to data bottlenecks, it depends on the industry. In certain industries we don't have much data, especially in areas like Industry 4.0, where you have machines and you need to understand their operations. Sometimes there is a lot of noisy data, or you have sensors whose data you haven't collected for an extended period of time, so you don't have enough to train a good model to predict anomalies or do things like that. So in certain areas and certain industries there is a huge lack of data. When it comes to text, for example, we have an abundance of text online right now; however, that text is sometimes not properly formatted, or there's a lot of noise and redundancy in it. So I see both of them as bottlenecks, and it depends on the industry, the sector, the use cases. Data could be a huge bottleneck if you don't have enough of it, or if you have tons of noisy data and need to curate the right data and the right context to build the model, and in that case synthetic data generation could be a huge help. Of course, compute is still a bottleneck, especially right now with the hardware shortage we have in accelerators, and these large models need all this compute. We've talked in other episodes about new approaches, matmul-free approaches, in-memory computing, neuromorphic computing, and so on, trying to reduce that compute bottleneck. So I see both of them as bottlenecks, depending on the context, the use case, the industry.

Yeah, no, for sure.
There's one reaction I had to the paper, which was basically that this just makes all the existing bottlenecks more bottlenecky, right? It turns out the great way to get data is more compute, so it's just more pressure: people want even more chips.

Exactly, it's like a chicken-and-egg problem here.

That's right. I guess, Skyler, I'll turn to you as someone who has printed out the paper.

Yes, although as someone who's printed out many papers and not read them.

I don't want to imply that you have read them. But have we solved the synthetic data problem? How do you like this approach? What do you think about it?

I do like the approach. I think it's quite creative, and they scaled it in a way that I probably wouldn't have gone with. I've actually used ChatGPT to write bedtime stories for our kids, right there alongside them, and case in point: they play Minecraft, so they will basically say, "write a story, but make it about Minecraft." You've now basically created a persona of a Minecraft player who's responding to the prompt. We've been doing that at an individual scale, and this paper has taken it up to the billion-persona level, keeping track of all those generated stories in order to try to get that diversity. Very cool from that angle.

But I want to spend a bit of time talking about that very important word at the end there: diversity. How confident are we about the coverage of these 1 billion personas, and do the underlying large language models really understand being, say, a Maasai warrior? The Maasai are a tribe here in Kenya, and yes, you can ask the large language model to take on that persona; whether the generated output from the persona of a Maasai warrior matches reality, I don't know how well they really covered that. But hats off to the authors for the idea of taking generated text from all of these different types of personalities and actually putting a number behind it of a billion. Very cool. We have not solved the synthetic data question yet, and I think the most obvious question that comes up is: how do we know that those personas are well represented in the underlying model? So those are some of my thoughts on that piece.
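The coverage worry raised here is partly measurable. One crude, illustrative check on whether a persona library is actually diverse is to prune personas whose descriptions overlap too heavily; this word-overlap sketch is a stand-in for the embedding-similarity deduplication a real pipeline would likely use, and the personas and threshold are made up:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two persona descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def prune_personas(personas: list[str], threshold: float = 0.5) -> list[str]:
    """Greedily keep only personas that are not too similar to any kept one."""
    kept: list[str] = []
    for p in personas:
        if all(jaccard(p, k) < threshold for k in kept):
            kept.append(p)
    return kept

personas = [
    "a Maasai warrior from Kenya",
    "a Maasai warrior living in Kenya",   # near-duplicate, gets pruned
    "a professional coder at IBM",
]
print(prune_personas(personas))
```

The count that survives pruning is a rough proxy for effective diversity: a billion personas that collapse to a few thousand clusters would not cover much more of the population than the clusters themselves.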
Yeah, for sure. And we're in kind of an interesting place where the only way we could validate whether these personas are accurate is to have real-world data about them. So there's this weird chicken-and-egg issue: I don't know how well validated they are, but in order to validate them we might very well need exactly the data we were trying to avoid collecting. I see Aaron about to come in. Aaron, do you want to jump in?
Yeah, I just saw a really interesting stat: the average human can read, what is it, about a million words in a year, and these algorithms can read about six orders of magnitude more in a single month. They're just thirsty for this data. And that projects, don't quote me on this, but around 2030 to 2032 we're going to run out of useful data in many different domains, so being able to synthesize data is very important. But if you stratify the data in as many ways as in the paper, the danger is: are you watering down all the different personas? Could we agglomerate them, prune them a little, because they're not really that different? A billion personas is almost close to the human population on Earth.
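The back-of-the-envelope numbers above (explicitly hedged in the conversation) can be made concrete. Assuming a human reads roughly a million words a year and a model ingests six orders of magnitude more per month, both figures taken from the discussion rather than a measured source:

```python
HUMAN_WORDS_PER_YEAR = 1_000_000                       # rough figure from the discussion
MODEL_WORDS_PER_MONTH = HUMAN_WORDS_PER_YEAR * 10**6   # "six orders of magnitude more"

model_words_per_year = MODEL_WORDS_PER_MONTH * 12
ratio = model_words_per_year / HUMAN_WORDS_PER_YEAR
print(f"model ingests {ratio:.1e}x a human's yearly reading")  # 1.2e+07x
```

At that consumption rate, a fixed pool of human-written text is exhausted quickly, which is the motivation for the data-exhaustion projections and for synthetic data in the first place.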
That's one area. But then what I think would be interesting to hear about is the notion of a Turing test 2.0: how do we ensure that these new personas really do pass muster, and that they actually produce the transparency we need, the explainability, chain of thought, fairness? Because we are going to be splitting up data in different ways, and there could be side effects from that. So I was curious what folks thought.
I think the idea of trusted personas, or pruning personas, that Aaron mentioned is very important. Can we distill all of these personas into a few that we can trust, that give us the best accuracy, and prune away the rest? Because we're talking about a very large number of personas here, billions. Another thing: how do we tie this to real problem-solving scenarios or industrial use cases? What does that mean if, for example, I'm trying to build a foundation model for factory failure diagnosis? Is there a persona for that? Can we maybe talk about different skills, for example the engineer, the maintenance person, the chip designer? All of those could be personas that we bring together so they contribute different skills and then collaborate within this LLM or foundation model to solve a particular problem, like having multiple experts working together. So I think the idea of taking this to the real world to solve real problems could be profound and have lots of implications. I like the scaling idea that Skyler mentioned, and the diversity aspect, but this needs to be validated, especially for solving real-world problems.

So, final thoughts?
I think it's very exciting, where we are and where we're going, and the combination of generative AI techniques with classical techniques is critical. I've seen the term floating around: the "AI sandwich," where you might use neurosymbolic pieces around these generative AI pieces. Neural networks have been around for a long time, but one last thought I had is that Mother Nature is the ultimate teacher, and we have a lot to learn from our own brains. I'm excited about what's next.
Great, thank you. Well, as always, there's more to talk about than we have time for. Kaoutar, Skyler, Aaron, thanks for coming back on the show, and we'll hopefully have you back for a future episode. So thanks for joining us. If you enjoyed what you heard, a reminder, as always, that you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And thanks to you all out in radio land.