# From Dye Diffusion to Image Generation
## Key Points
- The speaker uses the analogy of dye diffusing in water to illustrate how diffusion models add and later remove noise to generate images from text prompts.
- In forward diffusion, a training image is gradually corrupted with Gaussian noise over many timesteps using a Markov chain, so each step depends only on the immediately preceding noisy image.
- The addition of noise is demonstrated with a simple RGB pixel example, where random values sampled from a normal distribution slightly alter the original color (e.g., pure red becomes 253‑2‑0).
- Diffusion models learn the reverse process—denoising the noisy image step‑by‑step—to reconstruct a clear picture, enabling tools like DALL‑E‑3 to turn complex textual descriptions into photorealistic images.
## Sections
- [00:00:00](https://www.youtube.com/watch?v=x2GRE-RzmD8&t=0s) **Diffusion Models Explained with Dye Analogy** - The speaker uses a dye‑in‑water analogy to introduce forward and reverse diffusion and explains how diffusion models add and then remove noise to generate images from text prompts, as in tools like DALL‑E‑3.
- [00:03:09](https://www.youtube.com/watch?v=x2GRE-RzmD8&t=189s) **Gradual Gaussian Noise Diffusion** - The speaker explains how repeatedly adding random Gaussian noise to image pixels shifts colors, blurs structures, and ultimately transforms a clear picture into white noise, governed by a variance (noise) scheduler.
- [00:06:20](https://www.youtube.com/watch?v=x2GRE-RzmD8&t=380s) **Understanding Diffusion Model Noise Removal** - The speaker explains how a diffusion model predicts and subtracts added noise from a random image step‑by‑step to reconstruct a clear picture, and introduces conditional (guided) diffusion for incorporating text prompts.
- [00:09:24](https://www.youtube.com/watch?v=x2GRE-RzmD8&t=564s) **Self-Attention and Classifier-Free Guidance** - The speaker explains how self‑attention guidance and classifier‑free guidance direct diffusion models to link text embeddings with denoising steps, allowing the generation of novel images from random noise.
**Source:** [https://www.youtube.com/watch?v=x2GRE-RzmD8](https://www.youtube.com/watch?v=x2GRE-RzmD8)
**Duration:** 00:11:53

## Full Transcript
If I drop red dye into this beaker
of water, the
laws of physics say that the
particles will diffuse throughout
the beaker until the system
reaches equilibrium.
Now, what if I wanted to somehow
reverse this process to get back
to the clear water?
Keep this idea in mind
because this concept of physical
diffusion is what motivates the
approach for text to image
generation with diffusion models.
Diffusion models power popular
image tools like DALL-E-3 and
Stable Diffusion, where you can
go from a prompt like "a turtle
wearing sunglasses playing
basketball" to a hyper-realistic
image of just that.
At a high level, diffusion models
are a type of deep neural network
that learn to add noise
to a picture and then learn how to
reverse that process to
reconstruct a clear image.
I know this might sound abstract,
so to unpack this more, I'm going
to walk through three important
concepts that each build off each
other.
Starting first with Forward
Diffusion.
Going back to the beaker, think
of how the drop of dye diffused
and spread out throughout the
glass until the water was no
longer clear.
Similarly with Forward diffusion,
we're going to add noise
to a training image over
a series of time steps until
the image starts to lose its
features and becomes
unrecognizable.
Now this noise is added by what's
called a Markov chain,
which basically means that each
noisy state of the image depends
only on the immediately preceding state.
So as an example, let's start with
an image of a person.
my beautiful stick figure here,
and label this image X
at time T equals zero.
For simplicity, imagine that
this image is made of just three
RGB pixels and we can
represent the color of these
pixels on our x, y, z plane
here.
Where the coordinates
of each of our pixels correspond
to their R, G, and
B values.
So as we move to
the next timestep, T equals
to one... We
now add random
Gaussian noise to our image.
Think of Gaussian noise as looking
a bit like those specks
of static you get on your TV
when you flip to a channel that
has a weak connection.
Now, mathematically adding
Gaussian noise involves randomly
sampling from
a Gaussian distribution,
a.k.a.
a normal distribution or bell
curve, in order to obtain
numbers that will be added
to each of the values of our RGB
pixels.
So to make this more concrete,
let's look at this pixel
in particular.
The color coordinates of this
pixel in the original image
at time zero, start
off at 255, 0, 0, corresponding
to the color red.
Pure red.
Now as we add noise
to the image going to timestep
one, this involves
randomly sampling values from our
Gaussian distribution.
And say we obtain
random values of
-2, 2, and 0.
Adding these to the original
values, we get a new pixel
with color values 253, 2,
0
and we can represent this new
color on our plane here.
And show the change in this color
with an arrow.
So what just happened basically
is that this pixel
that was pure red in the original
image at time zero
has now become slightly less
red in the direction of green
at time t equals one.
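The single-pixel arithmetic just described can be sketched in a few lines (a toy illustration added here, not shown in the video; clipping to the 0-255 range is an assumption to keep RGB values valid, and the standard deviation of 2.0 is an invented example value):

```python
import numpy as np

# Reproducing the example by hand: pure red plus sampled noise (-2, 2, 0)
pixel = np.array([255.0, 0.0, 0.0])     # pure red at time t = 0
noise = np.array([-2.0, 2.0, 0.0])      # values drawn from a Gaussian
noisy = np.clip(pixel + noise, 0, 255)  # keep RGB values in the valid range
print(noisy)  # [253.   2.   0.]

# In general, each timestep samples fresh noise from a normal distribution
def add_gaussian_noise(pixel, rng, sigma=2.0):
    return np.clip(pixel + rng.normal(0.0, sigma, size=pixel.shape), 0, 255)
```

Each call to `add_gaussian_noise` nudges the pixel's color a little further, which is exactly the drift the arrow on the plane represents.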
So if we continue this process,
so on and so forth,
say we go to timestep two,
adding more and more
random Gaussian noise to our
image.
Again by randomly sampling values
from our Gaussian distribution
and using them to
randomly adjust the color values
of each of our pixels,
gradually destroying any
order or form
or structure that can be found in
the image.
If we repeat this process many
times,
say over a thousand timesteps, what
happens is that shapes
and edges in the image start
to become more and more blurred,
and over time, our person
completely disappears.
And what we end up with is
completely white noise,
or a full screen of
just TV static.
So how quickly we go
from a clear picture
to an image of random noise
is largely dictated by what's
called the noise scheduler
or the variance scheduler.
This scheduling parameter controls
the variance
of our Gaussian distribution.
Where a higher variance
corresponds to
larger probabilities of selecting
a noise value that is higher
in magnitude, thus resulting
in more drastic jumps
and changes for each
color of each pixel.
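As a rough sketch of how a variance schedule might drive the forward process (the linear schedule and its endpoints 1e-4 and 0.02 are common DDPM defaults assumed here, not values given in the video):

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Corrupt x0 step by step; each step depends only on the previous one
    (the Markov property described above)."""
    x = x0.copy()
    for beta in betas:                    # beta = variance added at this step
        noise = rng.normal(size=x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(42)
betas = np.linspace(1e-4, 0.02, 1000)   # linear variance schedule
x0 = np.array([1.0, 0.5, -0.3])         # stand-in for normalized pixel values
xT = forward_diffusion(x0, betas, rng)  # after 1000 steps: essentially pure noise
```

A larger beta at a given step means a wider Gaussian and therefore more drastic color jumps, which is the scheduler's role in the explanation above.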
So after forward diffusion comes
the opposite - reverse diffusion.
This would be like taking the
beaker of red water
and somehow removing the red
dye to get back to the clear
water.
Similarly for reverse
diffusion, we're going to start
with our image of random noise.
And we're going to somehow remove
the noise that was added to it
in very structured and controlled
manners in order to
reconstruct a clear image.
So to help me explain this more,
there's this quote by the famous
sculptor named Michelangelo,
who once said, "Every block
of stone has a statue inside
it and it's the job of the
sculptor to discover it."
In the same way, think of reverse
diffusion this way: every image
of random noise has a clear
picture inside it.
And it's the job of the diffusion
model to reveal it.
So this can be done by training a
type of convolutional neural
network called a U-Net to
learn this reverse diffusion
process.
So if we start with an
image of completely random noise
at a random time T,
the model learns how to predict
the noise that was
added to this image
at the previous time step.
So say that this
model predicts that the noise that
was added to this image was
a lot in the upper left hand
corner here.
And so the model's objective here
is to minimize the mean squared
error between the
predicted noise and the actual
noise that was added during
forward diffusion.
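That objective can be written directly (a minimal sketch; the numbers below are invented purely for illustration):

```python
import numpy as np

def noise_mse(predicted_noise, actual_noise):
    """Training objective: mean squared error between the noise the U-Net
    predicted and the noise actually added during forward diffusion."""
    return np.mean((predicted_noise - actual_noise) ** 2)

actual = np.array([0.5, -1.0, 0.25])     # noise added during forward diffusion
predicted = np.array([0.4, -0.8, 0.30])  # what the model guessed
loss = noise_mse(predicted, actual)      # ((-0.1)^2 + (0.2)^2 + (0.05)^2) / 3
```

Training drives this loss toward zero, so the model's noise predictions become usable for the subtraction step described next.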
We can then take this scaled noise
prediction and subtract
it, or remove it, from
our image at time t in
order to obtain a prediction of
what
the slightly less
noisy image looked like
at time t minus one.
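A single denoising step might look like this (a simplified, DDPM-style update; the scaling coefficients are assumed from the standard formulation rather than stated in the video, and the small noise term real samplers add at intermediate steps is omitted for clarity):

```python
import numpy as np

def reverse_step(x_t, predicted_noise, beta, alpha_bar_t):
    """One denoising step: subtract the scaled noise prediction from x_t
    to estimate the slightly less noisy image at t - 1."""
    alpha = 1.0 - beta
    coef = beta / np.sqrt(1.0 - alpha_bar_t)   # how much predicted noise to remove
    return (x_t - coef * predicted_noise) / np.sqrt(alpha)

x_t = np.array([0.2, -0.5, 0.9])      # noisy image at time t
eps_hat = np.array([0.1, -0.2, 0.3])  # model's noise prediction
x_prev = reverse_step(x_t, eps_hat, beta=0.02, alpha_bar_t=0.5)
```

Repeating this step a thousand times, each time with a fresh noise prediction, is what walks the image back from static to a clear picture.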
So on our graph here for
reverse diffusion, the model
essentially learns how to
retrace its steps
from each pixel's augmented colors
back to its original, de-noised colors.
Now, if we repeat this process
many times, over time,
the model learns how to remove
noise in very structured
sequences and patterns in order
to reveal more features
of an image.
Say slowly revealing
an arm and a leg.
It repeats this process until it
gets back to
one final noise prediction,
one final noise removal,
and then finally, a clear
picture.
And our person has magically
reappeared.
So now that we've covered forward
and reverse diffusion, it's time
to introduce text into the picture
by introducing a new concept
called conditional diffusion or
guided diffusion.
Up to this point, I've been
describing unconditional diffusion
because the image generation was
done without any influence from
outside factors.
On the other hand, with
conditional diffusion, the process
will be guided by or conditioned
on some text prompt.
So the first step is we have to
represent our text with an
embedding.
Think of an embedding as a numeric
representation, or a numeric vector,
that is able to capture the semantic
meaning of natural language input.
So as an example, an
embedding model is able to
understand that the word
KING
is more closely related to the
word MAN than it
is to the word WOMAN.
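That intuition can be illustrated with toy vectors and cosine similarity (the 3-dimensional embeddings below are invented for illustration only; real embedding models produce vectors with hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means same direction (closely related); near 0 means unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings chosen so that "king" sits closer to "man" than to "woman"
king = np.array([0.9, 0.8, 0.1])
man = np.array([0.8, 0.9, 0.1])
woman = np.array([0.2, 0.9, 0.8])

print(cosine_similarity(king, man) > cosine_similarity(king, woman))  # True
```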
So during training, the
embeddings of these text
descriptions are paired
with the respective images
they describe in order
to form a corpus of
image and text pairs
that are used to train this
model to learn this conditional
reverse diffusion process.
In other words, learning how much
noise to remove in
which patterns, given
the current image,
and now taking into account the
different features of the embedded
text.
One method for incorporating these
embeddings is what's called self
attention guidance, which
basically forces the model to
pay attention to how specific
portions of the prompt
influence the generation of
certain regions or
areas of the image.
Another method is called
classifier-free guidance.
Think of this method as helping
to amplify the effect
that certain words in
the prompt have on how the image
is generated.
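Classifier-free guidance is commonly implemented by extrapolating between two noise predictions, one made with the prompt and one made without it; a sketch of that formula, with invented numbers and an example guidance scale (the video does not give the formula, so this reflects the widely used formulation rather than the speaker's exact method):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Amplify the prompt's effect by pushing the noise prediction away from
    the unconditional prediction, toward (and past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.5])   # noise predicted with an empty prompt
eps_cond = np.array([1.0, 0.25])    # noise predicted with the text prompt
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)  # [2. 0.]
```

A scale of 1.0 reproduces the plain conditional prediction; larger scales exaggerate whatever direction the prompt pushed the prediction in, which is the "amplifying" effect described above.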
So putting this all together, this
means that the model is able
to learn the relationship
between the meaning of words
and how they correlate with
certain de-noising sequences
that gradually reveal different
features and shapes and edges
in the picture.
So once this process is learned,
the model can be used to generate
a completely new image.
So first, the user's text description
has to be embedded.
has to be embedded.
Then the model starts with
an image of completely random
noise.
And it uses this text
embedding along
with the conditional reverse
diffusion process it learned
during training,
to remove noise from the image
in structured patterns, you
know, kind of like removing fog
from the image until
a new image has been generated.
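Putting those steps together, the generation loop might be sketched as follows (heavily simplified; `model` is a hypothetical stand-in for a trained conditional U-Net, and the per-step update is abbreviated relative to real samplers):

```python
import numpy as np

def generate(text_embedding, model, betas, rng, shape):
    """Conditional sampling sketch: start from pure random noise and
    denoise it step by step, guided by the text embedding."""
    x = rng.normal(size=shape)                 # completely random noise
    for t in reversed(range(len(betas))):
        eps_hat = model(x, t, text_embedding)  # predict the noise at step t
        x = x - np.sqrt(betas[t]) * eps_hat    # remove a small piece of it
    return x

# A dummy "model" that predicts zero noise, just to exercise the loop
dummy_model = lambda x, t, emb: np.zeros_like(x)
rng = np.random.default_rng(7)
sample = generate(np.zeros(4), dummy_model,
                  np.linspace(1e-4, 0.02, 10), rng, shape=(3,))
```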
So the sophisticated architecture
of these diffusion models allows
them to pick up on complex
patterns and also to create images
that they've never seen before.
In fact, the applications
of diffusion models span beyond
just text-to-image use cases.
Some other use cases involve
image-to-image models, inpainting
missing components into an image,
and even creating other forms of
media like audio or video.
In fact, diffusion models have
been applied in different fields,
everything from the marketing
field to the medical field
to even molecular modeling.
Speaking of molecules, let's
check on our beaker.
If only I could.
.. Well, would you look at that,
reverse diffusion!
Anyways, thank
you for watching.
I hope you enjoyed this video and
I will see you all next time.
Peace.