From Dye Diffusion to Image Generation

Key Points

  • The speaker uses the analogy of dye diffusing in water to illustrate how diffusion models add and later remove noise to generate images from text prompts.
  • In forward diffusion, a training image is gradually corrupted with Gaussian noise over many timesteps using a Markov chain, so each step depends only on the immediately preceding noisy image.
  • The addition of noise is demonstrated with a simple RGB pixel example, where random values sampled from a normal distribution slightly alter the original color (e.g., pure red at (255, 0, 0) becomes (253, 2, 0)).
  • Diffusion models learn the reverse process—denoising the noisy image step‑by‑step—to reconstruct a clear picture, enabling tools like DALL‑E‑3 to turn complex textual descriptions into photorealistic images.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=x2GRE-RzmD8](https://www.youtube.com/watch?v=x2GRE-RzmD8)
**Duration:** 00:11:53

Sections

  • [00:00:00](https://www.youtube.com/watch?v=x2GRE-RzmD8&t=0s) **Diffusion Models Explained with Dye Analogy** - The speaker uses a dye-in-water analogy to introduce forward and reverse diffusion and explains how diffusion models add and then remove noise to generate images from text prompts, as in DALL-E-3.
  • [00:03:09](https://www.youtube.com/watch?v=x2GRE-RzmD8&t=189s) **Gradual Gaussian Noise Diffusion** - The speaker explains how repeatedly adding random Gaussian noise to image pixels shifts colors, blurs structures, and ultimately transforms a clear picture into white noise, governed by a variance (noise) scheduler.
  • [00:06:20](https://www.youtube.com/watch?v=x2GRE-RzmD8&t=380s) **Understanding Diffusion Model Noise Removal** - The speaker explains how a diffusion model predicts and subtracts added noise from a random image step by step to reconstruct a clear picture, and introduces conditional (guided) diffusion for incorporating text prompts.
  • [00:09:24](https://www.youtube.com/watch?v=x2GRE-RzmD8&t=564s) **Self-Attention and Classifier-Free Guidance** - The speaker explains how self-attention guidance and classifier-free guidance direct diffusion models to link text embeddings with denoising steps, allowing the generation of novel images from random noise.
If I drop red dye into this beaker of water, the laws of physics say that the particles will diffuse throughout the beaker until the system reaches equilibrium. Now, what if I wanted to somehow reverse this process to get back to the clear water? Keep this idea in mind, because this concept of physical diffusion is what motivates the approach for text-to-image generation with diffusion models.

Diffusion models power popular image tools like DALL-E-3 and Stable Diffusion, where you can go from a prompt like "a turtle wearing sunglasses playing basketball" to a hyper-realistic image of just that. At a high level, diffusion models are a type of deep neural network that learn to add noise to a picture and then learn how to reverse that process to reconstruct a clear image. I know this might sound abstract, so to unpack this, I'm going to walk through three important concepts that each build off each other, starting first with forward diffusion.

Going back to the beaker, think of how the drop of dye diffused and spread throughout the glass until the water was no longer clear. Similarly, with forward diffusion we add noise to a training image over a series of timesteps until the image starts to lose its features and becomes unrecognizable. This noise is added by what's called a Markov chain, which basically means that the current state of the image depends only on the most recent state.

So as an example, let's start with an image of a person (my beautiful stick figure here), labeled X at time T = 0. For simplicity, imagine that this image is made of just three RGB pixels, and we can represent the color of these pixels on our x, y, z plane here.
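The Markov-chain forward step the speaker describes is usually written as follows (this is the standard DDPM formulation, which the video does not name explicitly):

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
```

Here \(\beta_t\) is the noise variance at step \(t\), set by the variance scheduler discussed later in the video; because \(x_t\) is sampled only from \(x_{t-1}\), each step depends solely on the immediately preceding noisy image.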
Where the coordinates of each of our pixels correspond to their R, G, and B values.

As we move to the next timestep, T = 1, we add random Gaussian noise to our image. Think of Gaussian noise as looking a bit like the specks of static you get when you flip to a TV channel with a weak connection. Mathematically, adding Gaussian noise involves randomly sampling from a Gaussian distribution (also known as a normal distribution or bell curve) in order to obtain numbers that are added to each of the values of our RGB pixels.

To make this more concrete, let's look at this pixel in particular. The color coordinates of this pixel in the original image at time zero start at (255, 0, 0), corresponding to pure red. Now as we add noise to the image going to timestep one, we randomly sample values from our Gaussian distribution, and say we obtain the random values -2, 2, and 0. Adding these together, we get a new pixel with color values (253, 2, 0), and we can represent this new color on our plane here and show the change with an arrow. So what just happened is that this pixel, which was pure red in the original image at time zero, has become slightly less red, shifted in the direction of green, at time T = 1.

If we continue this process, say to timestep two, we add more and more random Gaussian noise to our image, again by randomly sampling values from our Gaussian distribution and using them to adjust the color values of each of our pixels, gradually destroying any order, form, or structure that can be found in the image.
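The single-pixel example above can be sketched in a few lines of NumPy. This is a minimal illustration of one noising step, not the video's actual implementation; the `sigma` parameter and function name are assumptions for the sketch:

```python
import numpy as np

def add_gaussian_noise(pixel, sigma=2.0, rng=None):
    """One forward-diffusion step on a single RGB pixel: sample noise from
    a zero-mean normal distribution and add it to each color channel."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.normal(loc=0.0, scale=sigma, size=3)
    return pixel + noise

# The transcript's worked example: pure red plus the sampled noise (-2, 2, 0)
pure_red = np.array([255.0, 0.0, 0.0])
noisy = pure_red + np.array([-2.0, 2.0, 0.0])  # 255-2=253, 0+2=2, 0+0=0
```

In practice the noise is sampled fresh at every step, so `add_gaussian_noise` gives a different result each call; the hard-coded `(-2, 2, 0)` line just reproduces the speaker's example.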
If we repeat this process many times, say over a thousand timesteps, shapes and edges in the image become more and more blurred, and over time our person completely disappears. What we end up with is completely white noise, a full screen of TV static.

How quickly we go from a clear picture to an image of random noise is largely dictated by what's called the noise scheduler, or variance scheduler. This scheduling parameter controls the variance of our Gaussian distribution, where a higher variance corresponds to a larger probability of selecting a noise value that is higher in magnitude, resulting in more drastic jumps and changes in the color of each pixel.

After forward diffusion comes the opposite: reverse diffusion. This is similar to taking the beaker of red water and somehow removing the red dye to get back to the clear water. Similarly, for reverse diffusion we start with our image of random noise, and we somehow remove the noise that was added to it, in a very structured and controlled manner, in order to reconstruct a clear image.

To help explain this, there is a quote by the famous sculptor Michelangelo, who once said, "Every block of stone has a statue inside it and it is the job of the sculptor to discover it." In the same way, think of reverse diffusion like this: every image of random noise has a clear picture inside it, and it is the job of the diffusion model to reveal it. This can be done by training a type of convolutional neural network called a U-Net to learn this reverse diffusion process.
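The thousand-step noising process governed by a variance scheduler can be sketched with the DDPM-style closed form, which jumps straight to any timestep instead of looping. The linear schedule values below are a common convention, assumed here rather than stated in the video:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=None):
    """Noise a clean image x0 directly to timestep t using the closed-form
    forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps,
    where a_bar_t is the cumulative product of (1 - beta) up to step t."""
    rng = rng if rng is not None else np.random.default_rng()
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)  # Gaussian noise, eps ~ N(0, I)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Linear variance schedule over 1000 timesteps, as in the transcript's example
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.array([[255.0, 0.0, 0.0]] * 3) / 255.0  # toy 3-pixel image, scaled to [0, 1]
x_final = forward_diffuse(x0, t=999, betas=betas)  # essentially pure noise
```

By the last step the cumulative signal coefficient is nearly zero, which is the "person completely disappears" moment the speaker describes; a larger variance schedule gets there in fewer steps.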
So if we start with an image of completely random noise at a random time T, the model learns to predict the noise that was added to this image at the previous timestep. Say the model predicts that a lot of noise was added in the upper-left corner here. The model's objective is to minimize the mean squared error between the predicted noise and the actual noise that was added during forward diffusion. We can then take this scaled noise prediction and subtract it from our image at time T in order to obtain a prediction of what the slightly less noisy image looked like at time T - 1.

On our graph for reverse diffusion, the model essentially learns how to backtrace its steps from each pixel's noise-augmented colors back to its original, de-noised colors. If we repeat this process many times, the model learns, over time, to remove noise in very structured sequences and patterns in order to reveal more features of an image, say slowly revealing an arm and a leg. It repeats this process until it gets to one final noise prediction, one final noise removal, and then, finally, a clear picture. Our person has magically reappeared.

So now that we've covered forward and reverse diffusion, it's time to introduce text into the picture with a new concept called conditional diffusion, or guided diffusion. Up to this point I've been describing unconditional diffusion, because the image generation was done without any influence from outside factors. With conditional diffusion, on the other hand, the process is guided by, or conditioned on, a text prompt. So the first step is that we have to represent our text with an embedding.
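The training objective and the subtract-the-noise step described above can be sketched as follows. This is a simplified DDPM-style formulation under assumed notation (the video gives no equations), and the extra sampling noise added at each reverse step is omitted for brevity:

```python
import numpy as np

def noise_mse(eps_pred, eps_true):
    """Training objective: mean squared error between the U-Net's predicted
    noise and the noise actually added during forward diffusion."""
    return np.mean((eps_pred - eps_true) ** 2)

def denoise_step(x_t, eps_pred, beta_t, alpha_bar_t):
    """One reverse-diffusion step (posterior mean only): subtract the scaled
    noise prediction from x_t to estimate the slightly less noisy image at
    timestep t - 1."""
    return (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(1.0 - beta_t)
```

Running `denoise_step` repeatedly from T down to 1, with a fresh noise prediction each time, is the step-by-step "backtracing" the speaker describes.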
Think of an embedding as a numeric representation, a numeric vector, that is able to capture the semantic meaning of natural language input. So as an example, an embedding model is able to understand that the word KING is more closely related to the word MAN than it is to the word WOMAN.

During training, the embeddings of these text descriptions are paired with the respective images they describe in order to form a corpus of image-text pairs that is used to train the model to learn this conditional reverse diffusion process. In other words, it learns how much noise to remove, and in which patterns, given the current image, now also taking into account the different features of the embedded text.

One method for incorporating these embeddings is what's called self-attention guidance, which basically forces the model to pay attention to how specific portions of the prompt influence the generation of certain regions or areas of the image. Another method is called classifier-free guidance. Think of this method as helping to amplify the effect that certain words in the prompt have on how the image is generated.

So putting this all together, the model is able to learn the relationship between the meaning of words and how they correlate with certain de-noising sequences that gradually reveal different features, shapes, and edges in the picture. Once this process is learned, the model can be used to generate a completely new image. So first, the user's text description has to be embedded. Then the model starts with an image of completely random noise.
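The "amplify the effect of the prompt" idea behind classifier-free guidance has a standard one-line formula, sketched below. The guidance scale default of 7.5 is a common choice in Stable Diffusion pipelines, assumed here rather than taken from the video:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: combine the unconditional and text-conditioned
    noise predictions, amplifying the direction the prompt pushes the model in.
    A scale of 1 applies no extra amplification; larger values follow the
    prompt more strongly."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At each denoising step the model is run twice, once with the text embedding and once without, and the guided prediction above is the one actually subtracted from the image.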
And it uses this text embedding, along with the conditional reverse diffusion process it learned during training, to remove noise from the image in structured patterns, kind of like removing fog from the image, until a new image has been generated.

So the sophisticated architecture of these diffusion models allows them to pick up on complex patterns and to create images they have never seen before. In fact, the applications of diffusion models span beyond just text-to-image use cases. Other use cases involve image-to-image models, inpainting missing components into an image, and even creating other forms of media like audio or video. Diffusion models have been applied in different fields, everything from marketing to medicine to even molecular modeling.

Speaking of molecules, let's check on our beaker. If only I could... Well, would you look at that, reverse diffusion! Anyways, thank you for watching. I hope you enjoyed this video and I will see you all next time. Peace.