PyTorch Basics: Data Prep and Modeling

Key Points

  • PyTorch is an open‑source machine‑learning and deep‑learning framework hosted by the PyTorch Foundation (part of the Linux Foundation) that offers a community‑driven, openly governed ecosystem.
  • It streamlines the typical training workflow—data preparation, model building, training, and testing—by providing built‑in utilities for each stage.
  • For data handling, PyTorch supplies Dataset and DataLoader classes that efficiently download, batch, shuffle, and iterate over potentially massive datasets (from gigabytes to petabytes).
  • Model construction is simplified with a rich library of pre‑defined layers (e.g., linear, convolutional) and activation functions, allowing users to define complex deep‑learning architectures with minimal friction.
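
The Dataset and DataLoader workflow from the points above can be sketched in a few lines; the tensors, sizes, and batch size below are illustrative stand-ins, not details from the video:

```python
# A minimal sketch of the Dataset/DataLoader workflow; the tensors,
# sizes, and batch size are illustrative, not details from the video.
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(100, 4)            # 100 samples, 4 features each
labels = torch.randint(0, 2, (100,))      # one binary label per sample
dataset = TensorDataset(features, labels)

# The DataLoader batches, shuffles, and gives you an iterator.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

batch_features, batch_labels = next(iter(loader))
```

The same `DataLoader` interface works with custom `Dataset` subclasses for data that does not fit in memory.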

Full Transcript

# PyTorch Basics: Data Prep and Modeling

**Source:** [https://www.youtube.com/watch?v=fJ40w_2h8kk](https://www.youtube.com/watch?v=fJ40w_2h8kk)
**Duration:** 00:11:56

## Sections

- [00:00:00](https://www.youtube.com/watch?v=fJ40w_2h8kk&t=0s) **Introducing PyTorch and Its Core Workflow** - An expert outlines PyTorch as an open-source machine-learning framework backed by the Linux Foundation and walks through its primary features, including data preparation, model building, training, and testing.
- [00:03:38](https://www.youtube.com/watch?v=fJ40w_2h8kk&t=218s) **Nonlinearity and Loss Functions in Training** - The speaker explains that adding nonlinearity prevents a model from reducing to a straight line, then outlines the training loop: randomly initializing parameters, performing a forward pass, computing loss against the target, and using PyTorch's loss functions to guide the model toward the desired output.
- [00:07:10](https://www.youtube.com/watch?v=fJ40w_2h8kk&t=430s) **PyTorch: Easy, Flexible, Multi-Platform** - The speaker highlights PyTorch's beginner-friendly, Pythonic design, comprehensive documentation, and its flexibility to run on CPUs, GPUs, distributed clusters, and mobile devices, while also addressing how users can become contributors.
- [00:10:46](https://www.youtube.com/watch?v=fJ40w_2h8kk&t=646s) **PyTorch Community Highlights & Benefits** - The speaker describes IBM's contributions to PyTorch, including FSDP, storage improvements, compiler optimizations, benchmarking, and documentation, while encouraging viewers to join the active community.

## Full Transcript
0:00 PyTorch has emerged as the de facto standard for machine learning and deep learning. And I know a little bit about PyTorch, but I've brought in an expert, Sahdev Zala, to teach us all more about PyTorch. So, Sahdev, what is PyTorch?

0:16 Hi Brad! So it's a framework for machine learning and deep learning. And what I mean by that is you can use PyTorch to build your models, because it provides you all of the building blocks. It provides you all the functionality to run faster training on that model. And it's an open source project under the PyTorch Foundation, which is part of the Linux Foundation, so there is a dynamic community behind the project.

0:46 Oh, great! So it's got an ecosystem and it's in the Foundation. So that means you're going to have open governance and a level playing field. That's wonderful. Well, Sahdev, can you tell me about the key features of PyTorch?

0:58 Yeah, sure. That's a great question. So let me just mention the common steps of model training. First, you need to prep your data, your dataset, for training, and ideally you also want to do it for testing. Then the other steps are: you're going to build your model, you're going to train it, and, as I mentioned, you're going to test it.

1:34 Okay. So those look like some pretty straightforward features. Why don't you tell me about the first one? What do you mean by prepping the data?

1:41 Right. So, the datasets you're going to use for your model may be small while you're learning, but for larger models these datasets can be huge: 10 terabytes, even petabytes in size. So how do you use this data to train your model and then test it? PyTorch provides you two things here: Dataset and DataLoader classes that help you easily feed this data into your training and testing.

2:14 Okay. How does this help me? Does it speed things up?

2:17 Well, that's a good question.
So, it helps you download the data and make it accessible for your training and testing. And the DataLoader provides you an iterator over this data so that you can train in batches, because you're not going to feed one sample at a time; you're going to train using the batch sizes that you want. It also provides you other things, like shuffling the data. You don't want to feed the data in a fixed order, because then your model is only memorizing the data rather than learning. So it will shuffle for you as well, and it has other features too.

2:53 Very nice. Well, does it also help you build models?

2:56 Absolutely. So, once you think you're ready with your preparation, because PyTorch takes care of all the complexity, the next step would be building the model: defining your model. And for that, what you need is layers, because deep learning models are made of multiple layers. So you need different layers, like a linear layer or a convolutional layer, and there are many others that are provided to you by PyTorch. And besides layers, there are things like activation functions that you'll be using to add nonlinearity to your model; those are also provided to you by PyTorch. So you don't have to do anything but call those functions.

3:38 What do you mean by nonlinearity?

3:40 So, it's a mathematical term, right, like linear. If you don't add nonlinearity, you basically just get one straight line. And in real life, not every change in X produces the same change in your Y output. So it adds that nonlinearity for you. And the next step would be training, and there I can talk more about the training side, Brad.

4:12 Well, so tell me about the features: what does it do to help you train?

4:15 So training will require you to use a loss function.
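As a rough sketch of the model-building step described above (predefined layers plus an activation function for nonlinearity), where all the layer sizes are made-up examples:

```python
# A sketch of defining a model from predefined layers plus an activation
# function; all the layer sizes here are made-up examples.
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(4, 8),   # a linear (fully connected) layer
    nn.ReLU(),         # activation function that adds nonlinearity
    nn.Linear(8, 1),   # output layer
)

x = torch.randn(3, 4)  # a batch of 3 samples with 4 features each
out = model(x)         # forward pass
```

Without the `nn.ReLU()` in the middle, the two linear layers would collapse into one linear map: the straight line the speaker mentions.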
And a loss function is basically there to find out the loss that you're going to have. When you run this model, like a forward pass from the input, you get some output. Well, you're not going to have the correct output every time magically; there's no magic there. You can have lots of parameters in between, and you're just going to randomize them in the beginning. You get some output, but then you're going to have a loss function to calculate the loss from the desired output.

4:47 So you want your model to reach a certain expectation, and typically during the training process the model is falling short, and you're seeing how much it's falling short from where you want it to be.

4:58 Exactly. So there are multiple loss functions, and PyTorch provides them to you; again, you call them according to the needs of your model. Once you have used the loss function, the next big thing is finding the gradient of this loss with respect to your parameters. So PyTorch provides the backward propagation for you, the autograd feature. That is by far one of the most popular features of PyTorch: it will calculate the gradients for you.

5:37 So if we all think back to our calculus days, gradients are the piece that helps you tweak the model and get it the way you want it, and PyTorch has it built in for you.

5:49 Exactly. So once you've got the gradients, you basically run the optimizer's step function, which is again provided to you by PyTorch. And like you said, you're going to tweak the parameters, you're going to optimize them over a number of iterations that you define. After that number of iterations you reach a level where you say, you know what, that should be enough training; I've done, say, 3x or 5x iterations. And at that point you are ready to test it.
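The loop just described, a forward pass, a loss, autograd's backward pass, and an optimizer step, can be written as a toy example; the data, model, learning rate, and iteration count here are arbitrary stand-ins, not details from the video:

```python
# A toy version of the training loop: forward pass, loss, autograd's
# backward pass, and an optimizer step. Data, model, and hyperparameters
# are arbitrary stand-ins, not details from the video.
import torch
from torch import nn

torch.manual_seed(0)                 # reproducible toy example
x = torch.randn(64, 4)               # toy inputs
y = torch.randn(64, 1)               # toy regression targets

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()               # one of PyTorch's built-in loss functions
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

initial_loss = loss_fn(model(x), y).item()

for _ in range(50):                  # a fixed number of training iterations
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(x), y)      # forward pass, then compute the loss
    loss.backward()                  # autograd computes all the gradients
    optimizer.step()                 # the optimizer tweaks the parameters

final_loss = loss_fn(model(x), y).item()

# Evaluation mode, as discussed next: forward pass only, no gradients.
model.eval()
with torch.no_grad():
    predictions = model(x)
```

The `model.eval()` plus `torch.no_grad()` pattern at the end is the usual way to run the gradient-free forward pass used for testing.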
6:22 And is that a big deal for these models, having to do this testing? Do I just test once I'm done, or is it more than that?

6:28 Yeah. So the next step from here is the test side. You need to test it; strictly speaking it's optional, but as part of testing PyTorch provides an eval (evaluation) function, so you can evaluate your model. And at that point, you're not going to calculate the gradients, you're not going to use the loss function; you basically just do the forward pass and see what you're getting. If you're happy with it, then you're pretty much ready to use the model. If you're not, then you're going to do further training. And the datasets I mentioned earlier, the ones used for training and the ones used for testing, are two different datasets.

7:10 So as part of the testing, I'm getting to decide: hey, is my model good enough? I think I'm ready to go with it.

7:15 Pretty much, yeah.

7:17 Well, it all seems a little complicated to me. Is PyTorch really easy to use?

7:22 Well, yes, that's one of the best things I love about PyTorch. It's easy to get started, it's easy to install, and it's easy to use, because it's Pythonic; the "Py" in PyTorch is for Python. So you know how much data scientists just love Python. PyTorch is in Python, and it's easily used by data scientists. And if someone doesn't know Python, they can learn it quickly as well. PyTorch.org provides a lot of good documentation and tutorials that will help you get started very quickly, and it's also flexible. So I mentioned the training, right? You can run your PyTorch training on a CPU just using tensors, the data structure (multi-dimensional arrays) that PyTorch uses.
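Tensors, the multi-dimensional arrays just mentioned, take only a couple of lines to try; the `cuda` availability check below is the standard pattern for using a GPU when one is present:

```python
# Tensors are PyTorch's multi-dimensional array data structure; the same
# code runs on a CPU or a GPU.
import torch

t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 tensor (CPU by default)

# Use a GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t = t.to(device)

doubled = t * 2                              # elementwise math on either device
```
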
They can be run on CPUs, they can be run on GPUs; you can do the training on multiple CPUs and GPUs on a single machine, or in a distributed environment on multiple machines with multiple GPUs. And as part of that, you can just run PyTorch on your laptop and play with it. There is also mobile development going on to bring PyTorch to your mobile devices.

8:44 So yeah, there are a lot of options. It supports a lot of platforms: GPUs, CPUs. Well, what if I want to be a contributor?

8:53 Well, that's a great question, and something I love as a contributor myself. It's actually very easy. PyTorch is part of the PyTorch Foundation, as I mentioned. There's a dynamic community behind it, very friendly; lots of people are going to help you get started and contribute. As long as you sign the CLA and follow the code of conduct, those are the things to do, and you are ready to contribute. The community also provides weekly office hours.

9:22 Office hours, that's huge. I can come in as a new person and say, hey, could you help me out, or can you give me an easy first item to work on? I could do that in office hours.

9:30 Yes, exactly. And there are things like the "good first issues" you can easily find, and documentation issues to get started with, and you can ask questions. And besides the office hours, their Slack channel is another option.

9:42 And one of the classic tips when you join a new project is to ask for a mentor and ask them to put you to work on something. Because when they put you to work on something, they're going to be very interested in what you're doing, and they're going to give you timely reviews and answer all your questions. So tell me more about how IBM is contributing to PyTorch.

10:01 Yeah, sure. Well, IBM is contributing to PyTorch in a big way, like IBM always does. By using PyTorch, we are going to contribute to help the community and grow the community.
And as part of that, we are working on many different things, including something called FSDP, Fully Sharded Data Parallel. It's an advanced topic, but it helps you shard the model parameters across multiple GPUs and multiple machines for fast training of your large models, which may not fit on a single GPU or CPU. So we are contributing there. There are really good blog posts out there; just search for "IBM FSDP PyTorch" and you will find them quickly. I highly recommend reading them. We also provide improvements on the storage side for training, and compiler optimizations. And besides that, benchmarking, test-side improvements, and documentation. And we have multiple developers working in the community.

11:04 So it sounds like there are lots of nice features to help it support those large foundation models: supporting multiple GPUs and running in a distributed fashion. And a lot of work being done for benchmarking, seeing how fast things are running, and obviously a lot of work on the documentation to help others get started. It's fabulous.

11:21 It is. It's amazing. I'm so glad to be part of the community.

11:27 Well, thank you, Sahdev. I've learned a lot today. This is fabulous. We hope that you've learned a lot about PyTorch, and we encourage you to come join the community. We really enjoy working on PyTorch and pushing forward with your deep learning and machine learning initiatives. Thanks for watching our video. And don't forget, if you liked it, remember to hit like and subscribe.