
Scaling Deep Learning with PyTorch DDP

Key Points

  • PyTorch enables scalable deep learning by providing modular building blocks and utilities like Distributed Data Parallel (DDP) to train larger neural networks efficiently.
  • DDP overlaps gradient computation with communication, synchronizing gradients bucket-wise to keep GPU utilization near 100% and avoid idle workers.
  • DDP handles models that fit on a single GPU (e.g., a 7B-parameter Llama variant), but it becomes ineffective for models that exceed one GPU's memory capacity.
  • For training such massive models (30B-70B parameters), PyTorch offers strategies beyond DDP that partition the model across multiple GPUs, enabling training that would otherwise be impossible on a single device.

**Source:** [https://www.youtube.com/watch?v=85RfazjDPwA](https://www.youtube.com/watch?v=85RfazjDPwA)
**Duration:** 00:18:28

## Sections

- [00:00:00](https://www.youtube.com/watch?v=85RfazjDPwA&t=0s) **Scaling Deep Models with PyTorch DDP** - How PyTorch facilitates cost-effective large-scale neural network training, highlighting the memory/computation trade-offs of bigger models and the three-step Distributed Data Parallel workflow for synchronized gradient updates.
- [00:03:03](https://www.youtube.com/watch?v=85RfazjDPwA&t=183s) **Scaling Large Models with FSDP** - Distributed Data Parallel works only for models that fit on one GPU, so Fully Sharded Data Parallel shards oversized models across multiple GPUs to keep all GPUs fully utilized.
- [00:06:06](https://www.youtube.com/watch?v=85RfazjDPwA&t=366s) **FSDP Overlap and Network Choices** - How FSDP overlaps communication and computation to keep GPUs busy, and why large-scale model training does not always require InfiniBand: Ethernet can also be efficient.
- [00:09:14](https://www.youtube.com/watch?v=85RfazjDPwA&t=554s) **Understanding PyTorch Eager Execution** - How PyTorch's eager mode translates each Python line into independent GPU instructions, illustrated with CPU/GPU queues and the resulting efficiency considerations.
- [00:12:20](https://www.youtube.com/watch?v=85RfazjDPwA&t=740s) **TorchDynamo: Efficient Graph Compilation** - How scaling models leads to idle GPU time and rising costs, and how PyTorch 2.0's TorchDynamo compiler converts eager code into a fused static graph to regain efficiency while preserving programming flexibility.
- [00:15:28](https://www.youtube.com/watch?v=85RfazjDPwA&t=928s) **torch.compile Boosts Training and Inference** - IBM highlights torch.compile delivering order-of-magnitude speedups for both training and inference, emphasizing community collaboration and open-source large language models like Meta's Llama.

## Full Transcript
[0:00] We are here to talk to you about PyTorch and how it's a great tool for you to scale your deep learning models in a cheap, fast, and efficient way. When we're scaling neural networks up, what that means is we are adding more and more layers to your neural network. This allows your model to capture more nuance in your data and to perform more complex tasks. But all of this doesn't come for free: the model will start to require more memory and more compute.

[0:36] Take the Llama model, for example. It has four variants ranging from 7 billion parameters to 70 billion parameters. The smallest variant, 7 billion, was trained using 2 trillion tokens, and it required roughly 200,000 GPU hours. That's a long time.

[1:03] So PyTorch, along with offering the modular building blocks to build your neural networks, also offers some utilities to distribute this training workload. Let's take a look at one of them, called Distributed Data Parallel (DDP). When you're training your model with DDP, there are three steps that need to happen. First, the forward pass, where you take the data and pass it through the model, which computes the loss. Second, the loss is back-propagated through the model, which gives you the gradients. The third step, where we update the model's weights, is preceded by a synchronization, where all the computed gradients from each of these replicas are communicated with each other.

[1:57] Now, the hallmark of DDP, and really of all distributed training, is an overlap of computation and communication. Essentially, that means we are doing the back-propagation while simultaneously communicating all of the calculated gradients. This saves us time and keeps the GPU running at near 100% utilization.
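The three-step loop just described can be sketched in plain Python. This is a toy simulation of the idea, not the real PyTorch API: `all_reduce_mean`, `ddp_step`, and the one-parameter "model" are made-up names for illustration, standing in for DDP's forward/backward, gradient all-reduce, and identical optimizer step on every replica.

```python
# Toy simulation of DDP's three steps across N replicas (pure Python, no GPUs).
# `all_reduce_mean` and `ddp_step` are illustrative names, not the PyTorch API.

def all_reduce_mean(grads_per_replica):
    """Average same-shaped gradient lists across replicas: what an
    all-reduce with mean does during DDP's synchronization step."""
    n = len(grads_per_replica)
    return [sum(gs) / n for gs in zip(*grads_per_replica)]

def ddp_step(weights, batches, lr=0.1):
    """One DDP step: each replica computes local gradients on its own
    micro-batch, gradients are averaged, then every replica applies the
    same update, so all replicas stay identical."""
    # 1) forward + backward on each replica: for loss 0.5*(w*x)^2 on a
    #    single scalar weight, the gradient w.r.t. w is w * x * x
    local_grads = [[w * x * x for w in weights] for x in batches]
    # 2) synchronize: average gradients across replicas
    synced = all_reduce_mean(local_grads)
    # 3) identical optimizer step on every replica
    return [w - lr * g for w, g in zip(weights, synced)]

weights = [2.0]       # one scalar parameter, replicated on every "GPU"
batches = [1.0, 3.0]  # two replicas, one micro-batch each
new_weights = ddp_step(weights, batches)
# grads: replica A -> 2*1*1 = 2, replica B -> 2*3*3 = 18; mean = 10
print(new_weights)  # [1.0]  (2.0 - 0.1 * 10)
```

In real DDP the averaging also happens bucket-wise, so the all-reduce of earlier buckets runs while later buckets are still being back-propagated, which is the overlap the speaker describes next.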
[2:24] At a very high level, here is what that looks like. You first divide the model into buckets. Each replica calculates the gradients of the first bucket, and while it is calculating the gradients of the second bucket, those first-bucket gradients are synchronized. The synchronization happens simultaneously with the computation of the gradients in the second bucket. Similarly, as each bucket is being calculated, the preceding bucket's gradients are being communicated. This ensures that all the GPUs are running at full utilization and you don't have any idle workers; they're all working very hard to train your model.

[3:14] And this is the case where the model fits in one GPU. For example, the 7 billion model can fit on any cloud GPU today. But when you start scaling that up to, for example, the 70 billion or even the 30 billion models, it's very difficult to fit them in one GPU. In those regimes, the DDP approach does not work. Now I'll call upon Raghu to talk to us about what in PyTorch allows us to train models which are larger than what your single GPU can accommodate.

[3:51] All right, so you've heard all about Distributed Data Parallel, and how you scale a small model on a very large number of GPUs to reduce your training times. Now let's look at what happens when we have a model that does not fit in a single GPU. That is where FSDP comes to your rescue. FSDP stands for "Fully Sharded Data Parallel," and what it does is take the model, break it down into what are called units, and shard all of these units across GPUs. Think of it as shredding the model, so that each GPU owns a small portion of it. That's where the shards come in. Now, what happens after that is very much like what DDP is doing.
[4:48] Think of it this way: instead of a model, you are working with a unit. During the initial unit construction, from the shards, you are going to do an AllGather, and this happens in the forward pass. So you're gathering the unit and computing on top of it, and this happens across all the units. As soon as your unit is computed, you release that unit's memory, and then you go to the next unit, and so on. That's the forward pass. Once the entire forward pass is computed, your loss is computed.

[5:29] Now you're going to do an AllGather in the backward pass as well. During the backward pass, you are computing again: very much like before, you are gathering the unit and computing the back-propagation, just in the reverse order. Once you have computed the gradients, then, very much like what you have done in DDP, those are synchronized across all the GPUs that are responsible for holding that particular portion of the model. Once you are synchronized, that completes your entire step, and then you keep repeating all of this. And in FSDP, very much like DDP, you are going to utilize overlap in a significant way, because imagine: in DDP there was only one single synchronization step, while in FSDP I have more opportunities for overlap and for keeping those GPUs continuously busy while you are doing your computation. That is how you achieve scale for these models.

[6:39] So what are the typical ways in which people train these large models? Most of the world knows about them as HPC systems: very large scale systems with very high speed interconnects, state-of-the-art GPUs, and servers. So what happens if I have "not an HPC system" but, you know, a good node? So let's say that instead of this single GPU, we call this a node.
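The FSDP flow just described (shard, AllGather a unit, compute, release, repeat) can be sketched in plain Python. This is a toy simulation only: `shard`, `all_gather`, and the two-unit "model" are made-up names for illustration, not the real PyTorch FSDP API.

```python
# Toy pure-Python sketch of the FSDP flow: parameters are sharded across
# "GPUs", each unit is AllGathered just in time for compute, then its full
# copy is released. Illustrative names only, not the PyTorch API.

NUM_GPUS = 2

def shard(unit_params, num_gpus=NUM_GPUS):
    """Split one unit's parameter list so each GPU owns a slice."""
    k = (len(unit_params) + num_gpus - 1) // num_gpus
    return [unit_params[i * k:(i + 1) * k] for i in range(num_gpus)]

def all_gather(shards):
    """Reconstruct the full unit from every GPU's shard (both the forward
    and the backward pass do this, one unit at a time)."""
    return [p for s in shards for p in s]

# A "model" of two units, four parameters each, sharded across 2 GPUs.
model = [shard([1.0, 2.0, 3.0, 4.0]), shard([5.0, 6.0, 7.0, 8.0])]

peak_full_params = 0
x = 1.0
for unit_shards in model:                  # forward pass, unit by unit
    full_unit = all_gather(unit_shards)    # gather this unit only
    peak_full_params = max(peak_full_params, len(full_unit))
    x = x + sum(full_unit)                 # stand-in for the real compute
    del full_unit                          # release the unit's memory

print(x)                 # 37.0 = 1 + (1+2+3+4) + (5+6+7+8)
print(peak_full_params)  # 4: only one full unit is ever materialized
```

The point of the sketch is the memory shape: each GPU permanently holds only its shards (two parameters per unit here), and at most one fully gathered unit exists at a time, which is why a model too big for one GPU can still be trained.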
[7:05] Typically nodes have eight, maybe 16, GPUs in them. The typical HPC system has an HPC interconnect, where people may have heard the term InfiniBand: a very high speed interconnect between nodes. People may also have heard the term Ethernet. Ethernet is far more common in many environments, whether cloud or on-prem, whereas InfiniBand is less common. So there is a common misconception that these larger models will always need InfiniBand. That is true for some of the larger models, and InfiniBand is faster, but with Ethernet you can also train these models in an efficient manner.

[8:01] And that is where IBM worked with the PyTorch community and introduced this concept of a rate limiter. What the rate limiter is doing is really balancing the trade-off between the overlap of communication and computation by managing the memory in the GPU better. That is how you reduce the amount of communication per GPU computation time step: you are increasing the amount of compute while keeping that communication constant. And that is how you can get away with training on Ethernet while achieving similar kinds of benefits as InfiniBand.

[8:48] So, Raghu, we saw quite a few things about how PyTorch helps in scaling up your deep learning workloads and your AI workloads. We saw FSDP, we saw the rate limiter API, and they do a pretty good job. Absolutely. And we also saw how interconnects play a role in how you scale up, right? But what happens inside a single node, between the CPU and the GPU? Tell me a little bit more about that.

[9:16] Yeah, you know, it's an interesting question you raise, because there are some more efficiency gains that we can opt into there. Let's take a look at what's happening on the board.
[9:30] One of the reasons why PyTorch is so popular is its programming paradigm, called "eager mode." Eager mode allows developers to have a very Pythonic approach to their programs. It allows dynamism: you can have your if-else blocks, you can have for-loops within that. It's very flexible. And we already spoke about CPU and GPU, so I'm going to draw a small schematic over here which illustrates what's happening in eager mode.

[10:08] Let's say this is your CPU queue and this is your GPU queue, where a queue is essentially the order of instructions that each chip is launching. Now, taking a step back: your AI programs, your AI models, are essentially a sequence of operations. Small AI models, small deep networks, have comparatively fewer instructions, and larger models have many more.

[10:41] So every line of code that I write in my Python program basically translates to something going to the GPU? Is that essentially what's happening? Yes. Almost every line is an independent instruction that your CPU is deriving from the program. So we can think of your model, the execution of your model, as a sequence of instructions over here. And because you're using a hardware accelerator like a GPU, you're going to queue up those instructions as the relevant operations specific to the backend that you are using. So let's say you have CUDA operations on an Nvidia GPU.

[11:35] This is kind of how eager mode works. This is the paradigm that allows you to iterate on, and interact with, your program in a very 1-to-1 way. Okay. So what is happening between these instructions, in those gaps? What happens there?
[11:52] And there, you've hit the nail on the head: these empty spaces over here are actually times when the GPU is not working. These are idle. And that's what we do not like about our GPUs, because they're a very expensive resource. We don't like them to be idle; we want them to be working all through, to maximize the utilization there. And these gaps get more and more numerous as you scale. As the model size grows, right? Exactly: as your models get larger, the queue starts getting much longer, and as the queue gets longer, the number of these idle spaces increases quite a lot. These essentially translate to a lot of unnecessary cost. You're burning GPU hours without actually getting any bang for your buck.

[12:46] Okay, so how do we address this? Here we have an interesting trade-off; it poses a pretty interesting engineering challenge. We want that flexibility, because it makes programming fun, but we also want that efficiency. So last year at the PyTorch conference, we announced 2.0. PyTorch 2.0 packs in a very interesting new paradigm: it's essentially a new compiler called TorchDynamo. This still allows you to program your PyTorch code as you always have; it's still got that flexibility, that interactivity. But what it does differently is that, instead of having separate instructions queued up like in eager mode, your program is essentially converted to a graph of operations. It's almost a static graph, so all of these instructions are sort of merged together; they form a 1-to-many mapping. And sometimes you might have graph breaks, where you come back to the eager-mode paradigm.
[14:03] So your entire program has been translated to two instructions over here, instead of many different instructions. And on the GPU, again, you're queuing up only two instructions instead of all n instructions. Quite often you won't even have that break, so your entire program, your entire model, is one block of instructions which executes seamlessly on the GPU.

[14:46] That is pretty cool. And how do you actually achieve this from a programming standpoint? Like, what do I, as a developer, need to do to get this? One line of code. That's wonderful. All of this goodness in one API, called torch.compile. So that's super easy for me as a developer to adopt. What kind of speedups have you seen with torch.compile, let's say just for training?

[15:20] Oh, torch.compile primarily targets training workloads, and depending on the architecture and the model it varies, but we have seen speedups of up to 40 times. Wow. So that is an order of magnitude, or more, over eager mode.

[15:39] Okay, wonderful. And at IBM, what we have been doing is really working with the community and looking at torch.compile through an inference lens. What I feel is the beauty of compile is that it not only works for training, giving that kind of speedup, but it actually works for inferencing as well. The same kind of gains that you're seeing for training, we see being provided there too. Are you saying the same fundamental principle applies, that it's not exclusive to training? It's not exclusive to training. So I think torch.compile is the new paradigm that is going to change efficiency completely, across both training and inferencing. Fantastic times for PyTorch. And for the users. Yes, indeed.
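The queue picture the speakers draw can be illustrated with a toy in plain Python: in eager mode every operation becomes its own "kernel launch," while a compiled graph fuses the same operations into a single launch. None of this is the real TorchDynamo machinery; `launch`, `launch_log`, and the three-op "model" are invented for the illustration.

```python
# Toy illustration: eager mode issues one "kernel launch" per operation,
# a compiled/fused graph issues one launch for the whole sequence.
# `launch_log` stands in for the GPU queue from the video.

launch_log = []

def launch(name, fn, x):
    """Pretend to enqueue one GPU instruction, then run it."""
    launch_log.append(name)
    return fn(x)

# --- eager mode: every line becomes its own instruction in the queue ---
def model_eager(x):
    x = launch("mul2", lambda v: v * 2, x)
    x = launch("add1", lambda v: v + 1, x)
    x = launch("square", lambda v: v * v, x)
    return x

# --- compiled mode: the same three ops fused into one graph launch ---
def model_compiled(x):
    return launch("fused_graph", lambda v: ((v * 2) + 1) ** 2, x)

eager_out = model_eager(3)        # queue grows by 3 instructions
eager_launches = len(launch_log)

launch_log.clear()
compiled_out = model_compiled(3)  # queue grows by only 1 instruction
compiled_launches = len(launch_log)

print(eager_out, compiled_out)            # 49 49  (same result either way)
print(eager_launches, compiled_launches)  # 3 1
```

The numerical result is identical either way; what changes is launch count, and since each launch carries CPU-side overhead and a potential idle gap on the GPU, fewer launches means less idle time. In real PyTorch 2.0 the switch is the one line the speakers mention, along the lines of `model = torch.compile(model)`.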
[16:23] And what else are we doing to get PyTorch to be more community oriented, and to grow the adoption of larger models? Yeah, I mean, it's no secret that we're living right now in the age of large language models, which are just awe-inspiring in what simple (well, not so simple) algorithms are able to do. You're probably familiar with Llama, a language model from Meta. A lot of that code is open source, and the model itself is available to the community under a very permissive license. Moreover, there are also a lot of scripts available on how you can use these models downstream.

[17:05] There is one GitHub repository that you should check out, called llama-recipes; it's under facebookresearch/llama-recipes on GitHub. We'll add a link to that in the video. And you'll notice, quite interestingly, that they also use FSDP there, especially for fine-tuning Llama. You want to adapt all of that knowledge and that model to a particular domain or a particular topic of interest, so you want to fine-tune it. And of course, we already saw how FSDP really works, and how it helps resolve a lot of that with an easy-to-use API. You can see that in practice in the llama-recipes repository.

[17:48] Wonderful! So I think everything is coming together with PyTorch: the models, the way we train them and tune them, and I'm sure very soon it will be around inferencing tools as well. This is fantastic news for the PyTorch community. Yeah, I'm really looking forward to what the community comes up with and how they start using this. Absolutely, it's probably a new step in this age.
[18:11] Looking forward to that. Thank you, Suraj. Thanks, Raghu. Thanks for watching. If you liked the video, please click like and subscribe to our channel. And if you want to learn more about PyTorch, about the concepts we spoke about in this video, and even Llama, do check out the links in the video description.