# Scaling Deep Learning with PyTorch DDP

**Source:** [https://www.youtube.com/watch?v=85RfazjDPwA](https://www.youtube.com/watch?v=85RfazjDPwA)
**Duration:** 00:18:28

## Summary

- PyTorch enables scalable deep learning by providing modular building blocks and utilities like Distributed Data Parallel (DDP) to train larger neural networks efficiently.
- DDP works by overlapping gradient computation with communication, synchronizing gradients bucket-wise to keep GPU utilization near 100% and avoid idle workers.
- While DDP can handle models that fit on a single GPU (e.g., a 7B-parameter Llama variant), it becomes ineffective for much larger models that exceed one GPU's memory capacity.
- For training such massive models (30B-70B parameters), PyTorch offers additional strategies beyond DDP to partition the model across multiple GPUs, allowing training that would otherwise be impossible on a single device.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=85RfazjDPwA&t=0s) **Scaling Deep Models with PyTorch DDP** - The speaker outlines how PyTorch facilitates cost-effective large-scale neural network training, highlighting the memory/computation trade-offs of bigger models and explaining the three-step Distributed Data Parallel workflow for synchronized gradient updates.
- [00:03:03](https://www.youtube.com/watch?v=85RfazjDPwA&t=183s) **Scaling Large Models with FSDP** - The speaker explains that Distributed Data Parallel works only for models that fit on one GPU, so Fully Sharded Data Parallel shards oversized models across multiple GPUs to keep all GPUs fully utilized.
- [00:06:06](https://www.youtube.com/watch?v=85RfazjDPwA&t=366s) **FSDP Overlap and Network Choices** - The speaker explains how Fully Sharded Data Parallel (FSDP) overlaps communication and computation to keep GPUs busy, and debunks the myth that large-scale model training always requires InfiniBand, showing that Ethernet can also be efficient.
- [00:09:14](https://www.youtube.com/watch?v=85RfazjDPwA&t=554s) **Understanding PyTorch Eager Execution** - The speaker explains how PyTorch's eager mode translates each Python line into independent GPU instructions, illustrating CPU/GPU queues and the resulting efficiency considerations.
- [00:12:20](https://www.youtube.com/watch?v=85RfazjDPwA&t=740s) **TorchDynamo: Efficient Graph Compilation** - The speaker explains how scaling models leads to idle GPU time and rising costs, and introduces PyTorch 2.0's TorchDynamo compiler, which converts eager code into a fused static graph to regain efficiency while preserving programming flexibility.
- [00:15:28](https://www.youtube.com/watch?v=85RfazjDPwA&t=928s) **Torch.compile Boosts Training and Inference** - IBM highlights torch.compile delivering order-of-magnitude speedups for both training and inference, emphasizing community collaboration and open-source large language models like Meta's Llama.

## Full Transcript
We are here to talk to you about PyTorch and how it's a great tool
for you to scale your deep learning models in a cheap, fast, and efficient way.
So when we're scaling neural networks up, what that means is
we are adding more and more layers
to your neural network.
And what this does is it allows your model to capture more nuance in your data
and allows your model to perform more complex tasks.
Well, all of this doesn't come for free.
The model will start to require more memory and more compute.
Let's take the Llama model, for example.
It has four variants
ranging from 7 billion parameters
to 70 billion parameters.
The smallest variant, 7 billion,
was trained using 2 trillion tokens
and it required around
roughly 200,000 GPU hours.
That's a long time.
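(For a rough sense of scale: 200,000 GPU hours on a single GPU would be more than 20 years of wall-clock time; even spread perfectly across 1,000 GPUs, it is still over eight days.)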
So PyTorch, along with offering
the modular building blocks to build your neural networks,
it also offers some utilities to distribute this training workload.
And let's take a look at one of them
called Distributed Data Parallel (DDP).
So when you're training your model
with DDP, there are three steps that need to happen.
First, the forward pass, where you take the data,
pass it through the model,
and compute the loss.
Second, that loss is back-propagated
through the model, which gives you the gradients.
The third step, where we update the model's weights, is preceded
by a synchronization, where all the computed gradients
from each of these replicas are communicated and averaged across all of them.
Now, the hallmark of DDP and really all distributed training
is an overlap of
the computation and communication.
Essentially what
that means is simultaneously we are doing the back propagation
while communicating all of the calculated gradients.
This saves us time and keeps the GPU running
at a near 100% utilization.
At a very high level,
what that looks like:
You first divide the model into buckets.
Each replica calculates
the gradients of the first bucket,
and while it is calculating
the gradients of the second bucket,
the first bucket's gradients are synchronized.
So the communication happens simultaneously
with the computation of the gradients in the second bucket.
And similarly, as each bucket is being calculated,
the preceding bucket's gradients are being communicated.
This ensures that all the GPUs are running
at full utilization and you're not having any idle workers.
They're all working very hard to train your model.
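To make those steps concrete, here is a minimal sketch of a DDP training loop; it assumes one process per GPU on a single node launched with torchrun, and the model, dataset, and hyperparameters are placeholders rather than anything from the video:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(model, dataset):
    # One process per GPU, launched e.g. with `torchrun --nproc_per_node=8 train.py`
    dist.init_process_group("nccl")
    local_rank = dist.get_rank()      # assumes a single node, so rank == local GPU index
    torch.cuda.set_device(local_rank)

    # DDP replicates the model on every GPU and registers hooks that all-reduce
    # gradient buckets as soon as they are ready during backward().
    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank], bucket_cap_mb=25)

    sampler = DistributedSampler(dataset)   # each replica sees a different slice of the data
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for inputs, labels in loader:
        inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
        loss = torch.nn.functional.cross_entropy(ddp_model(inputs), labels)  # 1. forward pass
        loss.backward()        # 2. backward pass; bucket gradients are synchronized as they fill
        optimizer.step()       # 3. weight update, identical on every replica
        optimizer.zero_grad()

    dist.destroy_process_group()
```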
And this is the
case where the model fits in one GPU.
For example, the 7 billion model can fit
on any cloud GPU today.
But when you start scaling that up, like, for example, 70 billion
or even the 30 billion models, it's very difficult to fit them in one GPU.
So in that paradigm, in those regimes,
the DDP approach does not work.
Now I'll call upon Raghu to talk to us about what in PyTorch allows us
to train models which are larger than what your single GPU can accommodate.
All right, so you've heard
all about distributed data parallel, and how you can scale a small model
on a very large number of GPUs and reduce your training times.
So now let's look at what happens
when we have a model that does not fit in a single GPU.
And that is where FSDP comes to your rescue.
FSDP stands for "Fully Sharded Data Parallel"
and what fully sharded data parallel does is it
takes the model, breaks it down into what are called
units, and all of these units are
then sharded across GPUs.
So you can take that--
Think of it as shredding the model and then each GPU owns
small portions of this model.
So that's where shards are coming in.
Now, what
happens after that is very much like what DDP is doing.
Instead of the whole model, think of it in terms of a unit.
And during the initial unit
construction, from the shards,
you are going to do an AllGather.
And this happens in the forward pass.
So you're gathering the unit, you're computing on top of it.
And this happens across all the units.
So as soon as your unit is computed, you free that unit's memory
and then you go to the next unit and so on.
And that's the forward pass.
Once the entire forward pass is
computed, your loss is
computed, and you go--
So this is your forward pass.
And now you're going to do an AllGather
in the backward pass.
And during the backward pass, what is happening is you are computing
again, very much like here, you are gathering the unit, you're computing
the back propagation, but just in the reverse fashion.
Once you have computed the gradients,
then those are again,
very much like what you have done for your DDP.
Those are synchronized across all the GPUs that are responsible for holding
that particular portion of the model.
So once you are synchronized, that completes your entire step
and then you continue doing all of this and then FSDP, very much like DDP,
you are going to utilize overlap in a significant way
because imagine in DDP there was only one single synchronization step, in FSDP,
I have more opportunities for doing overlap and keeping those GPUs
continuously busy while you are doing your computation.
And that is how you achieve scale of these models.
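As a rough sketch of what this looks like in code: MyLargeTransformer and batch are placeholders, the size-based wrapping policy is just one reasonable choice, and it assumes a PyTorch version that ships FSDP under torch.distributed.fsdp.

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())   # assumes a single node

model = MyLargeTransformer()  # placeholder: a model too large for one GPU

# Submodules above the size threshold become FSDP "units"; each unit's
# parameters are sharded across all GPUs and only all-gathered just before
# that unit's forward/backward computation, then freed again.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)
sharded_model = FSDP(model,
                     auto_wrap_policy=wrap_policy,
                     device_id=torch.cuda.current_device())

optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-4)

loss = sharded_model(batch).sum()  # forward: all-gather each unit, compute, release it
loss.backward()                    # backward: all-gather again, then synchronize gradients
optimizer.step()                   # each rank updates only the parameter shards it owns
```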
So what are the typical ways in which people train these large models?
Most of the world knows about them as HPC systems,
very large scale systems with very high speed interconnects,
state of the art GPUs and servers and training these models.
So what happens if I don't have an HPC system but, you know, I have a good node?
So let's say that instead of this GPU, we call this a node.
Typically nodes have eight,
maybe 16 GPUs in them.
And many of these typical HPC systems
have an HPC interconnect between nodes;
people may have heard the term InfiniBand,
a very high-speed interconnect.
And people may have heard the term Ethernet.
Ethernet is far more
common in many environments, whether it is cloud
or on-prem environments, whereas InfiniBand is less common.
So there is a common misconception that these larger models
will always need InfiniBand. That is true
for some of the larger models, and InfiniBand is faster.
But with Ethernet also, you can train these models in an efficient manner.
And that is where IBM helped: we worked with the PyTorch community
and introduced this concept of a rate limiter.
What the rate limiter is doing is really balancing the trade-off
between the overlap of communication and computation
by managing the memory in the GPU better.
And that is how you reduce the amount
of communication per GPU computation time step: you are increasing the amount
of compute while you're keeping that communication constant.
And that is how you can get away with training on Ethernet
while achieving similar kinds of benefits as InfiniBand.
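In recent PyTorch releases, one knob that plays this role is the limit_all_gathers flag on the FSDP wrapper; a minimal sketch, reusing the model and wrapping setup from the earlier FSDP example:

```python
# limit_all_gathers acts as a rate limiter: when True, FSDP stops the CPU
# thread from queuing too many all-gathers ahead of the GPU, which bounds the
# memory held by gathered units and keeps communication per compute step in
# check, which is useful on slower interconnects such as Ethernet.
sharded_model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
    limit_all_gathers=True,
)
```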
So Raghu, we saw quite a few things about how PyTorch helps
in scaling up your deep learning workloads and your AI workloads.
We saw FSDP, we saw the rate limiter API.
And they do a pretty good job.
Absolutely.
And we also saw how
interconnects play a role on how do you scale up and so on, right?
But what happens inside a single node between that CPU and GPU?
Tell me a little bit more about it.
Yeah, you know,
it's an interesting question you raise, because there are
some more efficiency gains that we can opt into.
And let's take a look at, you know, what's happening on the board.
So the reason-- one of the reasons --why PyTorch is
so popular is because of its programming paradigm
called "eager mode".
And eager mode allows developers
to have a very Pythonic approach to their programs.
It allows dynamism, like, you know, you can do your if-else blocks.
You can have for-loops within that.
It's very flexible.
Yeah.
And we already spoke about CPU and GPU.
Yeah.
So I'm going to draw a small little schematic over here,
which sort of illustrates what's happening in
eager mode.
Let's say this is your CPU queue
and this is your GPU
queue. And queue is essentially
the order of instructions that each chip is launching.
Mm hmm.
Now, taking a step back, your AI programs,
your AI models are essentially a sequence of operations.
Small AI models, small deep networks have comparatively fewer instructions.
And larger models have many more.
Mm hmm. Mm hmm.
So every line of code that I write
in my Python program basically translates to going to the GPU.
Is that essentially what's happening? Yes.
Almost every line is an independent instruction
that your CPU is deriving from the program.
And so we can probably think of your model,
the execution of your model, as a sequence of instructions over here.
Mm hmm.
And because you're using a hardware accelerator like a GPU,
you're going to queue up those instructions
onto the relevant operations specific to the back end that you are using.
Okay.
So let's say you have CUDA operations
on an Nvidia GPU.
So this is kind of how eager mode works.
This is the appropriate paradigm that allows you to iterate
or interact with your program in a very, you know, 1-to-1 way.
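A toy illustration of that 1-to-1 mapping (the shapes and operations here are arbitrary, not from the video):

```python
import torch

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

# In eager mode, each line below is dispatched to the GPU as its own kernel
# launch, in exactly the order the Python interpreter reaches it.
h = x @ w           # matmul kernel
h = torch.relu(h)   # elementwise ReLU kernel
h = h + 1.0         # elementwise add kernel
y = h.sum()         # reduction kernel
```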
Okay.
So what is happening between,
you know, these times when-- what happens there?
And that's it, you know, you've hit the
nail on the head:
these empty spaces over here
are actually times when the GPU is not working.
These are
idle.
And, you know, that's what we do
not like about our GPUs, because they're a very expensive resource.
We don't like them to be idle.
We want them to be working all through,
like you want to maximize the utilization over there.
And these get more and more as you scale.
These are the model size, right? Exactly.
Like, as your models get larger, the queue starts getting much longer.
And as the queue starts getting longer, the number of these idle spaces
start increasing
quite a lot.
Yeah.
And these essentially translate to a lot of costs, unnecessary costs.
You're burning GPU hours without actually getting any bang for your buck.
Okay, so how do we address this?
So here we have an interesting tradeoff.
It poses a pretty interesting engineering challenge.
We want that flexibility.
It makes programming fun. Mm hmm.
But we also want that efficiency. Mm hmm.
So earlier, last year,
at the PyTorch conference, we announced PyTorch 2.0.
So PyTorch 2.0 packs in a very interesting new paradigm.
So it's essentially a new compiler
called TorchDynamo.
And this still allows you
to program your PyTorch code, as you always have.
It's still got that flexibility, that interactivity.
But what it does differently is instead of having
separate instructions
queued up like in eager mode,
your program is essentially converted to a graph of operations.
It's almost a static graph.
And so all of these instructions are sort of merged together.
Mm hmm.
And they form like it's a 1-to-m.
Mm hmm.
And sometimes you might have graph breaks, you know,
where you might want to come back to the eager mode paradigm.
So your entire program has been translated
to two instructions over here instead of many different instructions.
And on the GPU,
again, you're queuing up only two instructions
instead of all of n instructions over here.
Yeah.
And quite often you won't even have this break.
So your entire program, your entire model, is this one block of instructions in it
which executes seamlessly on the GPU.
That is pretty cool.
And how do you actually achieve this from a programing standpoint?
Like what does, say me, as a developer,
what do I need to do to get this?
One line of code.
That's wonderful.
All of this
goodness in one API called torch.compile.
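In practice that one line looks roughly like this, where model and inputs stand in for any nn.Module and batch of data:

```python
import torch

# TorchDynamo captures the Python program as a graph and hands it to a
# compiler backend (Inductor by default), which fuses and lowers the ops.
compiled_model = torch.compile(model)

# Training and inference calls look exactly the same as before:
loss = compiled_model(inputs).sum()
loss.backward()
```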
So that's like super easy for me as a developer to get up to.
What kind of speed ups have you seen with torch.compile?
Let's say, just for training.
Oh, torch.compile
primarily targets training workloads
and depending on the architecture and the model, it varies.
But we have seen speed ups of at least 40 times. Wow.
So that is an order of magnitude.
Many times faster than plain eager mode.
Okay, wonderful.
And at IBM, what we have been doing is to really work with the community
and to look at torch.compile from an inference lens.
So what I feel is the beauty of compile
is that it not only works for training, giving
that kind of speed up, but it actually works for inferencing as well.
The same kind of gains that you're seeing there, we see being provided too.
Are you saying the same fundamental principle applies?
I understand, but it's not exclusive to just training.
It's not exclusive to training.
So I think torch.compile is the new paradigm that is going to
change efficiency, completely across both training and inferencing.
So fantastic times for PyTorch.
And for the users. Yes, I know.
And what else are we doing to get PyTorch to be more
community
oriented, to grow the adoption of larger models?
Yeah, I mean, it's no secret like we're living right now
in the age of language models, large language models which are
just awe-inspiring in what simple algorithms are able to do.
Mm hmm.
Well, not so simple, but you're probably familiar with Llama.
It's a language model from Meta. Yeah. Yeah.
And a lot of that code is open sourced.
You know, the model itself, the way it's there, available to the community
under a very permissive license.
Moreover,
we also have a lot of scripts available on how you can be using these downstream.
Okay.
There is one GitHub repository that you should check out.
Okay. It's called llama recipes.
So it's under facebookresearch/llama-recipes on GitHub.
Oh, we'll add a link to that in the video.
And you'll notice
quite interestingly that they also use FSDP there too.
Okay, especially for fine-tuning Llama, you know.
Yeah.
You want to adapt all of that knowledge and that model
to a particular domain or to a particular topic of interest.
Yeah. So you want to fine tune it.
Mm hmm. And of course, like we already saw how
[crosstalk] really works and everything.
Yeah. Yeah.
And how FSDP helps resolve a lot of that with an easy to use API.
And you can see that in practice
at the llama recipes repository.
Wonderful! So I think everything is coming together
with PyTorch and the models, the way we train them, tune them,
and I'm sure you know, very soon it'll be around inferencing tools.
So this is fantastic news for the PyTorch community.
Yeah, I'm really looking forward to what the community comes up with
and what they're, how they start using this.
Absolutely!
Probably had a like a, probably a new step in this age.
Absolutely.
Looking forward to that. Thank you, Suraj.
Thanks Raghu.
Thanks for watching.
And if you like the video, please click on like and subscribe to our channel.
And if you want to learn more about PyTorch, about the concepts
we spoke about in this video, and even Llama,
do check out the links below in the video description.