# Scaling Generative AI: Challenges and Solutions

## Key Points
- Model sizes have exploded from thousands to billions and even trillions of parameters, demanding ever‑more powerful hardware just to train and run them.
- The amount of data consumed by these models is growing orders of magnitude faster than human reading capacity, with synthetic data projected to exceed real‑world data by around 2030.
- User demand is soaring—ChatGPT jumped from 1 million users in five days to 100 million a year later—creating an “unfathomable” compute load when model size, data, and requests are multiplied together.
- To handle this scale, an agentic architecture that orchestrates specialized models is required, focusing on efficient inference and resource management.
- Practical scaling tactics include batch‑based generation paired with CDN caching and edge‑side personalization, as well as cache‑centric approaches that serve the bulk of requests without overloading GPU hardware.
## Source

**Source:** [https://www.youtube.com/watch?v=RLdD831I8hk](https://www.youtube.com/watch?v=RLdD831I8hk)
**Duration:** 00:07:28

## Sections

- [00:00:00](https://www.youtube.com/watch?v=RLdD831I8hk&t=0s) **Exponential Growth Challenges in Generative AI** - The speaker explains how model size, data volume, and user demand are scaling exponentially, creating hardware, cost, and data‑availability hurdles, and predicts synthetic data will surpass real data by 2030.

## Full Transcript
Running these generative AI algorithms at scale can be very challenging, overwhelming, and costly. In fact, there are three areas I want to highlight where exponential growth is occurring over time; if I were to log-scale the chart, the curves would flatten out, but the growth is real.

The first is model size. At the beginning these models had thousands of parameters; then they moved to millions, and now billions and even trillions. That requires serious hardware just to train and run these very large algorithms.

The second is data size. The data these models consume is growing fast; think about Granite, think about Llama. A human can read about a million words every single year, but a system like this can read about six orders of magnitude more in just a single month. We're actually beginning to run out of data; by 2030, I think we're going to see synthetic data overtake real-world data.

The third is demand. Over time, these models have become integral to our daily lives. Look at ChatGPT: just five days after it was released it had 1 million users, and about a year later there were about 100 million users. Every time we sit down to write a piece, we can solve the blank-page problem by using these models to prompt us, to tell us what we should think about and how we should write.

Now, if I take these three factors and multiply them together, we get an unfathomable compute scale that we need in order to run these systems (again, on a log scale). What this means is that we need an agentic architecture to run specialized models and help make them more usable. So what can we really do to build more manageable and usable systems for inference? Let's go ahead and find out.
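To see why multiplying these three factors is so punishing, here is a back-of-envelope sketch; every number in it (model size, tokens per request, request rate, per-GPU throughput) is an illustrative assumption, not a figure from the talk:

```python
# Back-of-envelope compute estimate: multiplying model size, data volume
# per request, and demand together. All numbers are illustrative.

PARAMS = 70e9                   # assumed model size: 70B parameters
FLOPS_PER_TOKEN = 2 * PARAMS    # ~2 FLOPs per parameter per generated token
TOKENS_PER_REQUEST = 500        # assumed average response length
REQUESTS_PER_SECOND = 100_000   # assumed global demand

flops_per_second = FLOPS_PER_TOKEN * TOKENS_PER_REQUEST * REQUESTS_PER_SECOND

GPU_FLOPS = 1e15                # assumed effective throughput of one GPU
gpus_needed = flops_per_second / GPU_FLOPS

print(f"{flops_per_second:.2e} FLOP/s -> ~{gpus_needed:,.0f} GPUs")
```

Even with generous per-GPU throughput, the product of the three curves lands in the thousands of accelerators, which is the "unfathomable" scale the talk refers to.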
Next: generative AI algorithms can be scaled across hundreds of GPUs. You can put them on V100s, A100s, or even look at the different H-series parts that NVIDIA provides, or hardware from other vendors. Even so, with hundreds of thousands of different types of requests per second, this can strain the system, and it can also strain the underlying hardware. To help out, let's look at a couple of strategies we can take.

The first is a batch-based generative AI system. Here, we generate very dynamic fill-in-the-blank sentences from the output of these large language models. We then store them on a content delivery network (CDN), cached all around the world. At the edge, we pull in those fill-in-the-blank sentences, insert the personalized information, and serve the result to the user, so it becomes a very personalized experience.
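A minimal sketch of that batch-plus-edge pattern, with a plain dict standing in for the CDN and hypothetical pre-generated templates:

```python
# Batch-based generation sketch: LLM output is pre-generated offline as
# fill-in-the-blank templates, cached (here: a dict standing in for a CDN),
# and personalized at the edge at request time. Templates are invented.

cdn_cache = {
    "welcome": "Welcome back, {name}! You have {n_items} items in your cart.",
    "digest":  "Hi {name}, here are {n_items} stories picked for you today.",
}

def serve_at_edge(template_id: str, **profile) -> str:
    """Edge step: pull the cached template and fill in per-user fields."""
    template = cdn_cache[template_id]   # CDN hit, no GPU involved
    return template.format(**profile)   # cheap string-level personalization

print(serve_at_edge("welcome", name="Ada", n_items=3))
```

The expensive LLM call happens once per template in the offline batch step; each user request costs only a cache lookup and a string fill.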
The second is cache-based generative AI. Here the whole strategy is to cache as much content as we possibly can on CDN servers around the world. We find the most common cases we can generate content for and push that up front; the long tail of content is then generated on demand and served. This gives us the best of both worlds: maybe 90% of the requests per second are served straight from cache, and the other 10% is created on demand.
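The cache-plus-fallback flow can be sketched as follows; `generate` is a hypothetical stand-in for a real model call:

```python
# Cache-based serving sketch: common requests hit a pre-populated cache;
# the long tail falls through to on-demand generation.

def generate(prompt: str) -> str:
    return f"generated answer for: {prompt}"   # pretend GPU-backed call

cache: dict[str, str] = {}
hits = misses = 0

# Offline: pre-generate the most common cases.
for common in ["what is ai", "reset password"]:
    cache[common] = generate(common)

def serve(prompt: str) -> str:
    global hits, misses
    if prompt in cache:
        hits += 1                  # served from CDN-style cache, no GPU
        return cache[prompt]
    misses += 1                    # long tail: generate on demand
    answer = generate(prompt)
    cache[prompt] = answer         # optionally promote into the cache
    return answer

serve("what is ai"); serve("what is ai"); serve("rare question")
print(f"hits={hits} misses={misses}")
```

In a real deployment the dict would be a CDN or distributed cache, and the hit ratio (the talk's 90/10 split) depends on how well the pre-generated set covers actual traffic.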
The other approach is what's called an agentic architecture. This type of architecture is emerging: you take these very large, complex models and break them down into smaller, specialized models, almost like a mixture of experts, and these agents can communicate with each other. One example would be having a large language model judge the output of another large language model. You could also have a large language model be self-introspective and then pass its output to another specialized model that transforms the information before it is served up. These smaller models have smaller footprints, so they can run and be scaled across hundreds of different types of GPUs.

Now, these models may not all run on commodity machines or on whatever GPUs your team happens to have, such as a 32 GB GPU. Some of them can: there are some Granite models that are small enough, even some Llama models that could fit, and other vendors have options as well. But to run the vast majority of the most powerful models, you need other classes of hardware.
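A toy sketch of such a pipeline, where small stand-in functions play the roles of the writer, judge, and transformation models; none of these are real model calls:

```python
# Agentic-architecture sketch: small specialized "agents" chained together,
# with one agent judging another's output before a third transforms it.

def writer_agent(task: str) -> str:
    return f"draft about {task}"       # stand-in for a generation model

def judge_agent(draft: str) -> bool:
    # A real judge would be another LLM scoring the draft's quality.
    return "draft" in draft

def transform_agent(draft: str) -> str:
    return draft.upper()               # stand-in for a formatting model

def run_pipeline(task: str) -> str:
    draft = writer_agent(task)
    if not judge_agent(draft):         # judge model vets the writer's output
        draft = writer_agent(task + " (retry)")
    return transform_agent(draft)      # specialized model transforms result

print(run_pipeline("scaling"))
```

Each stage could be a different small model on different hardware, which is what makes the footprint per request much lighter than one monolithic model.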
To combat some of this, one option is a technique called model distillation. The whole idea is to extract the information that really matters for the domain we're working in. We can take that information and do in-context learning, but traditionally you would teach another, smaller model through gradient updates, so it becomes more powerful, fine-tuned, and applicable to your problem in a very accurate way.
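As a toy illustration of teaching a smaller model through gradient updates, here is a minimal sketch in which a one-parameter student is fitted to a teacher's soft outputs; both models and all numbers are invented for illustration:

```python
import math

# Distillation sketch: a tiny "student" (one weight, logistic output) is
# trained by gradient descent to match a fixed "teacher" model's soft
# outputs. Both models are toy stand-ins for real networks.

def teacher(x: float) -> float:
    return 1 / (1 + math.exp(-2.0 * x))   # fixed teacher: weight 2.0

def student(x: float, w: float) -> float:
    return 1 / (1 + math.exp(-w * x))

xs = [-2.0, -1.0, 0.5, 1.0, 2.0]          # small "domain" dataset
w, lr = 0.0, 0.5
for _ in range(500):
    # Gradient of squared error between student and teacher outputs.
    grad = 0.0
    for x in xs:
        s = student(x, w)
        grad += (s - teacher(x)) * s * (1 - s) * x
    w -= lr * grad                         # gradient update on the student

print(f"distilled weight: {w:.2f}")        # approaches the teacher's 2.0
```

Real distillation works the same way at scale: the student's parameters are nudged until its output distribution matches the teacher's on domain data.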
The second method is called a student-teacher approach. Here, instead of looking at the data per se, we want to create a new behavior: a new skill, or a composite skill, based on the task we want, whether that's text extraction, summarization, or just fluent writing. You have a teacher model that already knows some of the task; you could even have a bank of teacher models that are asked questions by your student models. The data flows in that direction, and the output from the teacher models goes back to the students, so they can learn over time and develop the skills they need.
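That data flow can be sketched with lookup-style teachers standing in for real models; every name and function here is hypothetical:

```python
# Student-teacher sketch: student models ask questions, a bank of teacher
# models answers, and the answers flow back as training data the student
# can later learn from. Teachers are toy functions, not real models.

teacher_bank = {
    "summarization": lambda text: text.split(".")[0] + ".",
    "extraction":    lambda text: [w for w in text.split() if w.istitle()],
}

student_training_data = []   # (skill, question, teacher answer) triples

def student_asks(skill: str, question: str):
    answer = teacher_bank[skill](question)                    # teacher answers
    student_training_data.append((skill, question, answer))   # flows back
    return answer

student_asks("summarization", "Granite is a model family. It has many sizes.")
student_asks("extraction", "Ada met Grace in London.")
print(len(student_training_data), "training pairs collected")
```

The collected pairs are what a real student model would be fine-tuned on, skill by skill, to compose the behaviors the task needs.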
Now, if I look at another approach, where I want to shrink a model, quantization is a nice option. Here I compress the model into a much smaller footprint: I take the 32-bit floating-point numbers and make them much smaller, say an 8-bit representation of each floating-point number, or even a 4-bit representation.

There are different pros and cons depending on the order in which you do it. One option is to quantize before training. This requires more compute resources during training, but it creates a smaller model that can still maintain much of its accuracy, however you measure accuracy, when you apply it at inference time. You can also quantize post-training. That places lighter requirements on compute during training, but when you apply the model at inference, your accuracy might go down. Those trade-offs are something you need to keep in mind as you apply this compression technique.
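A minimal post-training quantization sketch with made-up weights, mapping fp32 values to symmetric int8 codes and measuring the roundtrip error that causes the accuracy drop mentioned above:

```python
# Post-training quantization sketch: 32-bit floats are mapped to 8-bit
# integer codes with a symmetric scale, then dequantized. The roundtrip
# error is the accuracy cost of compressing after training.

weights = [0.81, -0.53, 0.02, -1.97, 1.45]   # pretend fp32 model weights

scale = max(abs(w) for w in weights) / 127   # symmetric int8 range [-127, 127]

q = [round(w / scale) for w in weights]      # quantize: int8 codes
deq = [code * scale for code in q]           # dequantize for inference

max_err = max(abs(w - d) for w, d in zip(weights, deq))
print(f"int8 codes: {q}, max roundtrip error: {max_err:.4f}")
```

Each weight now needs 8 bits instead of 32, a 4x memory reduction, and the worst-case error is bounded by half the quantization step; a 4-bit scheme would shrink the model further at the cost of a coarser grid and larger error.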
If you like this video and want to see more like it, please like and subscribe. If you have any questions or want to share your thoughts about this topic, please leave a comment below.