
Gradient Descent Explained Through Neural Networks

Key Points

  • Gradient descent is likened to navigating a dark mountain, taking small steps in the direction that feels most downhill to eventually reach the lowest point, which mirrors how the algorithm iteratively reduces error.
  • In neural networks, weights and biases determine how input data is processed, and training adjusts these parameters using labeled data so the model can correctly map inputs (e.g., shapes or house features) to desired outputs.
  • The cost (or loss) function quantifies the mismatch between the network’s predictions and actual values; gradient descent minimizes this cost by moving opposite the gradient of the function.
  • The size of each step in gradient descent is controlled by the learning rate, which must be chosen carefully to ensure steady convergence without overshooting.
  • Real‑world examples—classifying drawn squiggles and predicting house prices—illustrate how the model’s predictions improve as gradient descent repeatedly updates weights and biases to lower the cost.
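The update rule and learning-rate trade-off described in the key points can be sketched in a few lines. This is a minimal illustration, not code from the video; the toy cost C(w) = (w − 3)² and all names are made up for the example:

```python
# Toy cost function C(w) = (w - 3)**2, minimized at w = 3.
# Its gradient is dC/dw = 2 * (w - 3); each step moves opposite the gradient.

def gradient_descent(start, learning_rate, steps):
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)         # slope of the cost at the current w
        w -= learning_rate * grad  # small step in the downhill direction
    return w

w_good = gradient_descent(start=10.0, learning_rate=0.1, steps=100)  # settles near 3
w_bad  = gradient_descent(start=10.0, learning_rate=1.1, steps=100)  # overshoots and diverges
```

Because each step is scaled by the gradient, a small learning rate walks steadily down to the minimum, while one that is too large bounces past it and climbs back out; that is the overshooting risk the key points mention.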

Full Transcript

**Source:** [https://www.youtube.com/watch?v=i62czvwDlsw](https://www.youtube.com/watch?v=i62czvwDlsw)
**Duration:** 00:07:02
Gradient descent is like trying to find your way down a dark mountain. You can't see where you're going, so you have to feel your way around. You take small steps in the direction that feels the most downhill, and eventually, if you keep going, you'll find your way to the bottom. That's gradient descent. Let's get into it.

Gradient descent is a common optimization algorithm used to train machine learning models and neural networks. By training on data, these models can learn over time, and because they're learning over time, they can improve their accuracy. A neural network consists of connected neurons arranged in layers, and those layers have weights and biases which describe how we navigate through the network.

We provide the neural network with labeled training data to determine what we should set these weights and biases to. For example, I could input a shape, and then we could use the neural network to learn that this squiggle as an input represents a particular output: the number three. After we train the neural network, we can provide it with more labeled data, like another squiggle, and see whether it can correctly resolve that squiggle to the number six. If it gets some of these squiggles wrong, the weights and biases can be adjusted, and then we try again.

Now, how can gradient descent help us here? Well, gradient descent is used to find the minimum of something called a cost function. So what is a cost function? It's a function that tells us how far off our predictions are from the actual values. The idea is that we want to minimize this cost function to get the best predictions. To do this, we take small steps in the direction that reduces the cost function the most. If we think about this on a graph, we start at some point and keep going downhill, reducing our cost function as we go. The size of the steps that we take is called the learning rate.

Let's consider another example: a neural network that, instead of dealing with squiggles, predicts how much a house will sell for. First we train the network on a labeled data set; let's say that data has some information like the location of a house, the size of the house, and how much it sold for. With that, we can then try our model on new labeled data. Here's another example: we've got a house, its location (let's do it by ZIP code), and its size, 3,000 square feet. Input that into our neural network: how much does this house sell for? Our neural network makes a forecast; it says, we think it sold for three hundred thousand dollars. We compare that forecast to the actual sale price, which was four hundred fifty thousand dollars. Not a good guess; we have a large cost function. The weights and biases now need to be adjusted, and then the model can try again. Did it do any better over the entire labeled data set, or did it do worse? That's what gradient descent can help us with.

Now, there are three types of gradient descent learning algorithms, so let's take a look at them.

First of all, we've got a type of gradient descent called batch. This sums the error for each point in a training set, updating the model only after all the training examples have been evaluated, hence the term batch. How well does this do? Computationally it is efficient, so you can give it a high rating, because we're doing things in one big batch. But what about processing time? Well, we can end up with long processing times using batch gradient descent, because with large training data sets it needs to store all of that data in memory and process it.

So that's batch. Another option is stochastic gradient descent, which evaluates each training example one at a time instead of in a batch. Since you only need to hold one training example, they're easy to store in memory, and you get individual updates much faster. So in terms of speed, that's fast, but in terms of computational efficiency, that's lower.

Now, there is a happy medium, and that is called mini-batch. Mini-batch gradient descent splits the training data set into small batches and performs an update on each of those batches. That is a nice balance of computational efficiency and speed.

Now, gradient descent does come with its own challenges. For example, it can struggle to find the global minimum in non-convex problems. Ours was a nice convex problem with a clearly defined bottom: when the slope of the cost function is close to zero, or at zero, the model stops learning. But if we don't have a convex shape, and instead have something known as a saddle point, it can mislead gradient descent, because the algorithm thinks it's at the bottom before it really is, while the surface keeps going down further. It's called a saddle shape because it kind of looks like a horse saddle, I guess.

Another challenge is that in deeper neural networks, gradient descent can suffer from vanishing gradients or exploding gradients. Vanishing gradients are when the gradient is too small, and the earlier layers in the network learn more slowly than the later layers. Exploding gradients, on the other hand, are when the gradient is too large, and that can create an unstable model. But despite those challenges, gradient descent is a powerful optimization algorithm, and it is commonly used to train machine learning models and neural networks today. It's a clever way to get you back down that mountain safely.

If you have any questions, please drop us a line below, and if you want to see more videos like this in the future, please like and subscribe. Thanks for watching.
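The three gradient descent variants covered in the transcript differ only in how many examples feed each parameter update. A toy side-by-side sketch, with made-up house data and a single-weight model price ≈ w × size (all numbers illustrative, simplified from the video's location-plus-size example):

```python
# Made-up training set: (size in square feet, sale price in dollars).
# The data follows price = 150 * size, so the ideal weight is w = 150.
data = [(1000, 150_000), (2000, 300_000), (3000, 450_000)]

def grad(w, batch):
    # Gradient of the mean squared error over the given batch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(w, lr, batches):
    for batch in batches:        # one weight update per batch
        w -= lr * grad(w, batch)
    return w

lr = 1e-7  # tiny learning rate: squared-feet inputs make the gradients large

w_batch = train(0.0, lr, [data] * 200)               # batch: whole set per update
w_sgd   = train(0.0, lr, [[p] for p in data] * 200)  # stochastic: one example per update
w_mini  = train(0.0, lr, [data[:2], data[2:]] * 200) # mini-batch: small chunks
```

All three land near w = 150, but batch makes one well-averaged update per pass, stochastic makes many cheap noisy ones, and mini-batch sits in between, which is the balance of computational efficiency and speed the transcript describes.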