We begin by generating images from text embeddings. All results use random seed 180.
A key part of diffusion is the forward process, which takes a clean image and adds noise to it.
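As a concrete reference, here is a minimal sketch of the forward process, assuming a precomputed `alphas_cumprod` schedule (the cumulative product of the model's noise schedule):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    with eps ~ N(0, I). `alphas_cumprod` holds abar_t for each timestep."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```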
Let's first try to denoise these images with a classical method: Gaussian blur filtering.
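A minimal sketch using torchvision's built-in blur; the kernel size and sigma here are illustrative and would be tuned per noise level:

```python
import torchvision.transforms.functional as TF

# Classical baseline: blur away the high-frequency noise (and, unavoidably,
# high-frequency image detail along with it).
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```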
Now, we'll use a pretrained diffusion model to denoise.
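A sketch of one-step denoising, assuming a diffusers-style UNet whose call returns an object with a `.sample` field (the argument names here are assumptions):

```python
def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
    """Estimate the noise in x_t with the UNet, then solve the forward-process
    equation for the clean image in a single step:
        x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)"""
    with torch.no_grad():
        eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    abar_t = alphas_cumprod[t]
    return (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()
```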
In one-step denoising, you should see that the denoising UNet does a much better job of projecting the image onto the natural image manifold, but it does get worse as you add more noise. This makes sense, as the problem is much harder with more noise! Diffusion models, however, are designed to denoise iteratively, which we implement in this part.
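A sketch of the iterative loop under the same assumptions; `timesteps` is a strided schedule (e.g. 990, 960, ..., 0), and the per-step variance term is omitted for brevity:

```python
def iterative_denoise(unet, x, timesteps, i_start, prompt_embeds, alphas_cumprod):
    """Denoise along a strided timestep schedule. Each step estimates x_0
    and interpolates between it and the current noisy image."""
    for i in range(i_start, len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = abar_t / abar_next          # effective one-step alpha
        beta = 1 - alpha
        with torch.no_grad():
            eps = unet(x, t, encoder_hidden_states=prompt_embeds).sample
        x0 = (x - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()
        x = (abar_next.sqrt() * beta / (1 - abar_t)) * x0 \
            + (alpha.sqrt() * (1 - abar_next) / (1 - abar_t)) * x
    return x
```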
In the previous part, we used the diffusion model to denoise an image. Another thing we can do with the iterative_denoise function is to generate images from scratch: by setting i_start = 0 and passing in random noise, we effectively denoise pure noise. We will use the prompt "a high quality photo".
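Usage then looks like the following (the tensor shape and `embeds` lookup are illustrative):

```python
# Start from pure noise and denoise all the way down the schedule.
x_T = torch.randn(1, 3, 64, 64, device=device)
sample = iterative_denoise(unet, x_T, timesteps, i_start=0,
                           prompt_embeds=embeds["a high quality photo"],
                           alphas_cumprod=alphas_cumprod)
```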
We now generate images using classifier-free guidance (CFG): we compute both a conditional and an unconditional noise estimate, then extrapolate past the conditional one, which noticeably improves image quality.
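A sketch of the CFG noise estimate; the guidance scale `gamma` shown here is illustrative:

```python
def cfg_noise_estimate(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: run the UNet twice and extrapolate past the
    conditional estimate by a factor gamma > 1."""
    with torch.no_grad():
        eps_c = unet(x_t, t, encoder_hidden_states=cond_embeds).sample
        eps_u = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample
    return eps_u + gamma * (eps_c - eps_u)
```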
We're going to take the original test image, noise it a little, and force it back onto the image manifold without any conditioning. Effectively, we'll get an image similar to the test image, provided the noise level is low enough. This follows the SDEdit algorithm.
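A sketch of the procedure, reusing the `forward` and `iterative_denoise` helpers sketched above:

```python
def sdedit(unet, im, i_start, timesteps, prompt_embeds, alphas_cumprod):
    """Noise the input to timesteps[i_start], then denoise from there.
    A smaller i_start means more noise, hence less fidelity to the input."""
    x = forward(im, timesteps[i_start], alphas_cumprod)
    return iterative_denoise(unet, x, timesteps, i_start,
                             prompt_embeds, alphas_cumprod)
```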
This procedure works particularly well if we start with a nonrealistic image (e.g., a painting, a sketch, or some scribbles) and project it onto the natural image manifold.
We can use the same procedure to implement inpainting (following the RePaint paper).
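A sketch of the mask trick; `denoise_step` is a hypothetical helper standing in for one iteration of the denoising loop above:

```python
def inpaint(unet, im, mask, timesteps, prompt_embeds, alphas_cumprod):
    """After every denoising step, reset the pixels outside the mask to the
    (appropriately noised) original image, so only the masked region
    (mask == 1) is actually generated."""
    x = torch.randn_like(im)
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        x = denoise_step(unet, x, t, t_next, prompt_embeds, alphas_cumprod)
        x = mask * x + (1 - mask) * forward(im, t_next, alphas_cumprod)
    return x
```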
Now, we will do the same thing as SDEdit, but guide the projection with a text prompt. This is no longer pure "projection to the natural image manifold" but also adds control using language.
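This amounts to running the same projection with a different prompt embedding, e.g. (the prompt and parameters here are illustrative):

```python
# Same projection as SDEdit, just guided by a different text prompt.
edited = sdedit(unet, test_im, i_start=10, timesteps=timesteps,
                prompt_embeds=embeds["a rocket ship"],
                alphas_cumprod=alphas_cumprod)
```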
Visual anagrams can be created by making our noise estimate the average of the noise estimates of two different prompts.
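A sketch of the anagram noise estimate, flipping the image vertically for the second prompt:

```python
def anagram_noise_estimate(unet, x_t, t, embeds_1, embeds_2):
    """Denoise prompt 1 right-side up and prompt 2 upside down, then average.
    The sample looks like prompt 1 normally and prompt 2 when flipped."""
    with torch.no_grad():
        eps_1 = unet(x_t, t, encoder_hidden_states=embeds_1).sample
        x_flip = torch.flip(x_t, dims=[-2])        # flip vertically
        eps_2 = torch.flip(
            unet(x_flip, t, encoder_hidden_states=embeds_2).sample, dims=[-2])
    return (eps_1 + eps_2) / 2
```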
Hybrid images can be created by making our noise estimate the low pass noise estimate of one prompt added to the high pass noise estimate of another prompt.
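A sketch of the hybrid estimate, using a Gaussian blur as the low-pass filter (kernel size and sigma are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, embeds_far, embeds_near):
    """Low frequencies follow one prompt (visible from far away), high
    frequencies follow the other (visible up close)."""
    with torch.no_grad():
        eps_1 = unet(x_t, t, encoder_hidden_states=embeds_far).sample
        eps_2 = unet(x_t, t, encoder_hidden_states=embeds_near).sample
    low = TF.gaussian_blur(eps_1, kernel_size=33, sigma=2.0)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=33, sigma=2.0)
    return low + high
```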
In this project, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections.
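As a rough picture of the architecture, here is a heavily simplified sketch; the channel counts, activations, and block counts are illustrative, not the exact architecture used:

```python
import torch
import torch.nn as nn

class SimpleUNet(nn.Module):
    """Two downsampling stages, a bottleneck, and two upsampling stages; the
    skip connections concatenate encoder features onto the decoder path."""
    def __init__(self, in_ch=1, hidden=64):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.GELU())
        self.down2 = nn.Sequential(
            nn.Conv2d(hidden, hidden * 2, 3, stride=2, padding=1), nn.GELU())
        self.mid = nn.Sequential(
            nn.Conv2d(hidden * 2, hidden * 2, 3, padding=1), nn.GELU())
        self.up1 = nn.Sequential(
            nn.ConvTranspose2d(hidden * 4, hidden, 4, stride=2, padding=1), nn.GELU())
        self.up2 = nn.Conv2d(hidden * 2, in_ch, 3, padding=1)

    def forward(self, x):
        d1 = self.down1(x)                            # full resolution
        d2 = self.down2(d1)                           # half resolution
        m = self.mid(d2)
        u1 = self.up1(torch.cat([m, d2], dim=1))      # skip connection from d2
        return self.up2(torch.cat([u1, d1], dim=1))   # skip connection from d1
```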
Given a noisy image, we will train a denoiser to map it back to the clean image, optimizing using the L2 loss function. We will visualize the noising process over time.
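A sketch of the noising used to build training pairs (the sigma values listed are illustrative):

```python
# Build a training pair: z = x + sigma * eps, with eps ~ N(0, I).
def noise_image(x, sigma):
    return x + sigma * torch.randn_like(x)

# Visualize the noising process on a clean digit `x` at increasing sigmas.
noisy = [noise_image(x, s) for s in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]]
```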
Now, we will train the model to perform denoising.
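A minimal training-loop sketch; the optimizer and hyperparameters are illustrative:

```python
import torch.nn as nn
import torch.optim as optim

def train_denoiser(model, loader, sigma=0.5, epochs=5, device="cuda"):
    """Minimize the L2 loss between the denoised output and the clean image."""
    opt = optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for x, _ in loader:                 # labels are unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)
            loss = nn.functional.mse_loss(model(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
```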
Our denoiser was trained on MNIST digits noised with sigma = 0.5. Let's see how it performs on sigmas it wasn't trained for.
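A sketch of the sweep (the test sigmas are illustrative):

```python
# Evaluate the sigma = 0.5 denoiser on noise levels it never saw in training.
for s in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
    z = x_test + s * torch.randn_like(x_test)   # x_test: a batch of test digits
    with torch.no_grad():
        x_hat = model(z)
```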
We need a way to inject scalar t (time) into our UNet model to condition it.
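One common way to do this, sketched below, is a small MLP that embeds t into a per-channel vector broadcast onto an intermediate feature map; the exact injection points are a design choice:

```python
class FCBlock(nn.Module):
    """Embed a scalar timestep into a per-channel vector that can be
    broadcast onto a feature map."""
    def __init__(self, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, out_ch), nn.GELU(),
                                 nn.Linear(out_ch, out_ch))

    def forward(self, t):
        # t: (B,) timesteps, normalized to [0, 1]
        return self.net(t.view(-1, 1))[:, :, None, None]   # (B, C, 1, 1)

# Inside the UNet's forward pass, the embedding modulates an intermediate
# feature map, e.g.:
#   d2 = d2 + self.t_embed(t)   # broadcasts over the spatial dimensions
```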
Training our time-conditioned UNet is now pretty easy. Basically, we pick a random image from the training set and a random t, noise the image to timestep t, and train the denoiser to predict that noise. We repeat this for different images and different t values until the model converges and we are happy.
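A DDPM-style training sketch under the assumptions above (T and the learning rate are illustrative, and `alphas_cumprod` is assumed precomputed):

```python
def train_time_conditioned(model, loader, alphas_cumprod, T=300, device="cuda"):
    """Noise a clean image to a random timestep and regress the UNet's
    output onto the noise that was injected."""
    opt = optim.Adam(model.parameters(), lr=1e-3)
    for x, _ in loader:
        x = x.to(device)
        t = torch.randint(0, T, (x.shape[0],), device=device)
        eps = torch.randn_like(x)
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps
        loss = nn.functional.mse_loss(model(x_t, t / T), eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
```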
To make the results better and give us more control over image generation, we can also optionally condition our UNet on the class of the digit (0 through 9).
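A sketch of the class-conditioned variant: the one-hot label is zeroed out a fraction of the time so the model also learns an unconditional estimate, which classifier-free guidance needs at sampling time (hyperparameters illustrative):

```python
def train_class_conditioned(model, loader, alphas_cumprod, T=300,
                            p_uncond=0.1, device="cuda"):
    """Feed a one-hot digit label alongside t, sometimes dropping the label
    so the model also learns the unconditional distribution."""
    opt = optim.Adam(model.parameters(), lr=1e-3)
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        c = nn.functional.one_hot(y, num_classes=10).float()
        keep = (torch.rand(x.shape[0], 1, device=device) > p_uncond).float()
        c = c * keep                         # dropped labels become all-zeros
        t = torch.randint(0, T, (x.shape[0],), device=device)
        eps = torch.randn_like(x)
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps
        loss = nn.functional.mse_loss(model(x_t, t / T, c), eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
```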