We begin by generating images from text embeddings. All results use random seed 180.
A key part of diffusion is the forward process, which takes a clean image and adds noise to it.
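As a concrete reference, here is a minimal sketch of the forward process, assuming a precomputed `alphas_cumprod` schedule (the cumulative product of the model's noise schedule):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    with eps ~ N(0, I). `alphas_cumprod` holds abar_t for each timestep."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```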
Let's first try to denoise these images with a classical method: Gaussian blur filtering.
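A minimal sketch using torchvision's built-in blur; the kernel size and sigma here are illustrative and would be tuned per noise level:

```python
import torchvision.transforms.functional as TF

# Classical baseline: blur away the high-frequency noise (and, unavoidably,
# high-frequency image detail along with it).
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```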
Now, we'll use a pretrained diffusion model to denoise.
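A sketch of one-step denoising, assuming a diffusers-style UNet whose call returns an object with a `.sample` field (the argument names here are assumptions):

```python
def one_step_denoise(unet, x_t, t, prompt_embeds, alphas_cumprod):
    """Estimate the noise in x_t with the UNet, then solve the forward-process
    equation for the clean image in a single step:
        x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)"""
    with torch.no_grad():
        eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    abar_t = alphas_cumprod[t]
    return (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()
```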
In one-step denoising, you should see that the denoising UNet does a much better job of projecting the image onto the natural image manifold, but it does get worse as you add more noise. This makes sense, as the problem is much harder with more noise! Diffusion models, however, are designed to denoise iteratively, which we implement in this part.
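A sketch of the iterative loop under the same assumptions; `timesteps` is a strided schedule (e.g. 990, 960, ..., 0), and the per-step variance term is omitted for brevity:

```python
def iterative_denoise(unet, x, timesteps, i_start, prompt_embeds, alphas_cumprod):
    """Denoise along a strided timestep schedule. Each step estimates x_0
    and interpolates between it and the current noisy image."""
    for i in range(i_start, len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = abar_t / abar_next          # effective one-step alpha
        beta = 1 - alpha
        with torch.no_grad():
            eps = unet(x, t, encoder_hidden_states=prompt_embeds).sample
        x0 = (x - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()
        x = (abar_next.sqrt() * beta / (1 - abar_t)) * x0 \
            + (alpha.sqrt() * (1 - abar_next) / (1 - abar_t)) * x
    return x
```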
In the previous part, we used the diffusion model to denoise an image. Another thing we can do with the iterative_denoise function is to generate images from scratch: by setting i_start = 0 and passing in random noise, we effectively denoise pure noise. We will use the prompt "a high quality photo".
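Usage then looks like the following (the tensor shape and `embeds` lookup are illustrative):

```python
# Start from pure noise and denoise all the way down the schedule.
x_T = torch.randn(1, 3, 64, 64, device=device)
sample = iterative_denoise(unet, x_T, timesteps, i_start=0,
                           prompt_embeds=embeds["a high quality photo"],
                           alphas_cumprod=alphas_cumprod)
```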
We now generate images using classifier-free guidance (CFG): we compute both a conditional and an unconditional noise estimate, then extrapolate past the conditional one, which noticeably improves image quality.
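A sketch of the CFG noise estimate; the guidance scale `gamma` shown here is illustrative:

```python
def cfg_noise_estimate(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: run the UNet twice and extrapolate past the
    conditional estimate by a factor gamma > 1."""
    with torch.no_grad():
        eps_c = unet(x_t, t, encoder_hidden_states=cond_embeds).sample
        eps_u = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample
    return eps_u + gamma * (eps_c - eps_u)
```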
We're going to take the original test image, noise it a little, and force it back onto the image manifold without any conditioning. Effectively, we'll get an image similar to the test image, provided the noise level is low enough. This follows the SDEdit algorithm.
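A sketch of the procedure, reusing the `forward` and `iterative_denoise` helpers sketched above:

```python
def sdedit(unet, im, i_start, timesteps, prompt_embeds, alphas_cumprod):
    """Noise the input to timesteps[i_start], then denoise from there.
    A smaller i_start means more noise, hence less fidelity to the input."""
    x = forward(im, timesteps[i_start], alphas_cumprod)
    return iterative_denoise(unet, x, timesteps, i_start,
                             prompt_embeds, alphas_cumprod)
```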
This procedure works particularly well if we start with a nonrealistic image (e.g., a painting, a sketch, or some scribbles) and project it onto the natural image manifold.
We can use the same procedure to implement inpainting (following the RePaint paper).
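A sketch of the mask trick; `denoise_step` is a hypothetical helper standing in for one iteration of the denoising loop above:

```python
def inpaint(unet, im, mask, timesteps, prompt_embeds, alphas_cumprod):
    """After every denoising step, reset the pixels outside the mask to the
    (appropriately noised) original image, so only the masked region
    (mask == 1) is actually generated."""
    x = torch.randn_like(im)
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        x = denoise_step(unet, x, t, t_next, prompt_embeds, alphas_cumprod)
        x = mask * x + (1 - mask) * forward(im, t_next, alphas_cumprod)
    return x
```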
Now, we will do the same thing as SDEdit, but guide the projection with a text prompt. This is no longer pure "projection to the natural image manifold" but also adds control using language.
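This amounts to running the same projection with a different prompt embedding, e.g. (the prompt and parameters here are illustrative):

```python
# Same projection as SDEdit, just guided by a different text prompt.
edited = sdedit(unet, test_im, i_start=10, timesteps=timesteps,
                prompt_embeds=embeds["a rocket ship"],
                alphas_cumprod=alphas_cumprod)
```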
Visual anagrams can be created by making our noise estimate the average of the noise estimates of two different prompts.
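A sketch of the anagram noise estimate, flipping the image vertically for the second prompt:

```python
def anagram_noise_estimate(unet, x_t, t, embeds_1, embeds_2):
    """Denoise prompt 1 right-side up and prompt 2 upside down, then average.
    The sample looks like prompt 1 normally and prompt 2 when flipped."""
    with torch.no_grad():
        eps_1 = unet(x_t, t, encoder_hidden_states=embeds_1).sample
        x_flip = torch.flip(x_t, dims=[-2])        # flip vertically
        eps_2 = torch.flip(
            unet(x_flip, t, encoder_hidden_states=embeds_2).sample, dims=[-2])
    return (eps_1 + eps_2) / 2
```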
Hybrid images can be created by making our noise estimate the low pass noise estimate of one prompt added to the high pass noise estimate of another prompt.
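A sketch of the hybrid estimate, using a Gaussian blur as the low-pass filter (kernel size and sigma are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, embeds_far, embeds_near):
    """Low frequencies follow one prompt (visible from far away), high
    frequencies follow the other (visible up close)."""
    with torch.no_grad():
        eps_1 = unet(x_t, t, encoder_hidden_states=embeds_far).sample
        eps_2 = unet(x_t, t, encoder_hidden_states=embeds_near).sample
    low = TF.gaussian_blur(eps_1, kernel_size=33, sigma=2.0)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=33, sigma=2.0)
    return low + high
```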
In this project, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections.
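As a rough picture of the architecture, here is a heavily simplified sketch; the channel counts, activations, and block counts are illustrative, not the exact architecture used:

```python
import torch
import torch.nn as nn

class SimpleUNet(nn.Module):
    """Two downsampling stages, a bottleneck, and two upsampling stages; the
    skip connections concatenate encoder features onto the decoder path."""
    def __init__(self, in_ch=1, hidden=64):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.GELU())
        self.down2 = nn.Sequential(
            nn.Conv2d(hidden, hidden * 2, 3, stride=2, padding=1), nn.GELU())
        self.mid = nn.Sequential(
            nn.Conv2d(hidden * 2, hidden * 2, 3, padding=1), nn.GELU())
        self.up1 = nn.Sequential(
            nn.ConvTranspose2d(hidden * 4, hidden, 4, stride=2, padding=1), nn.GELU())
        self.up2 = nn.Conv2d(hidden * 2, in_ch, 3, padding=1)

    def forward(self, x):
        d1 = self.down1(x)                            # full resolution
        d2 = self.down2(d1)                           # half resolution
        m = self.mid(d2)
        u1 = self.up1(torch.cat([m, d2], dim=1))      # skip connection from d2
        return self.up2(torch.cat([u1, d1], dim=1))   # skip connection from d1
```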
Given a noisy image, we will train a denoiser to map it back to the clean image, optimizing using the L2 loss function. We will visualize the noising process over time.
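A sketch of the noising used to build training pairs (the sigma values listed are illustrative):

```python
# Build a training pair: z = x + sigma * eps, with eps ~ N(0, I).
def noise_image(x, sigma):
    return x + sigma * torch.randn_like(x)

# Visualize the noising process on a clean digit `x` at increasing sigmas.
noisy = [noise_image(x, s) for s in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]]
```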
Now, we will train the model to perform denoising.
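A minimal training-loop sketch; the optimizer and hyperparameters are illustrative:

```python
import torch.nn as nn
import torch.optim as optim

def train_denoiser(model, loader, sigma=0.5, epochs=5, device="cuda"):
    """Minimize the L2 loss between the denoised output and the clean image."""
    opt = optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for x, _ in loader:                 # labels are unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)
            loss = nn.functional.mse_loss(model(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
```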
Our denoiser was trained on MNIST digits noised with sigma = 0.5. Let's see how it performs on sigmas it wasn't trained for.
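A sketch of the sweep (the test sigmas are illustrative):

```python
# Evaluate the sigma = 0.5 denoiser on noise levels it never saw in training.
for s in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
    z = x_test + s * torch.randn_like(x_test)   # x_test: a batch of test digits
    with torch.no_grad():
        x_hat = model(z)
```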
We need a way to inject scalar t (time) into our UNet model to condition it.
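One common way to do this, sketched below, is a small MLP that embeds t into a per-channel vector broadcast onto an intermediate feature map; the exact injection points are a design choice:

```python
class FCBlock(nn.Module):
    """Embed a scalar timestep into a per-channel vector that can be
    broadcast onto a feature map."""
    def __init__(self, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, out_ch), nn.GELU(),
                                 nn.Linear(out_ch, out_ch))

    def forward(self, t):
        # t: (B,) timesteps, normalized to [0, 1]
        return self.net(t.view(-1, 1))[:, :, None, None]   # (B, C, 1, 1)

# Inside the UNet's forward pass, the embedding modulates an intermediate
# feature map, e.g.:
#   d2 = d2 + self.t_embed(t)   # broadcasts over the spatial dimensions
```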
Training our time-conditioned UNet is now pretty easy. Basically, we pick a random image from the training set and a random t, noise the image to timestep t, and train the denoiser to predict that noise. We repeat this for different images and different t values until the model converges and we are happy.
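A DDPM-style training sketch under the assumptions above (T and the learning rate are illustrative, and `alphas_cumprod` is assumed precomputed):

```python
def train_time_conditioned(model, loader, alphas_cumprod, T=300, device="cuda"):
    """Noise a clean image to a random timestep and regress the UNet's
    output onto the noise that was injected."""
    opt = optim.Adam(model.parameters(), lr=1e-3)
    for x, _ in loader:
        x = x.to(device)
        t = torch.randint(0, T, (x.shape[0],), device=device)
        eps = torch.randn_like(x)
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps
        loss = nn.functional.mse_loss(model(x_t, t / T), eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
```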
To make the results better and give us more control over image generation, we can also optionally condition our UNet on the class of the digit (0 through 9).
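A sketch of the class-conditioned variant: the one-hot label is zeroed out a fraction of the time so the model also learns an unconditional estimate, which classifier-free guidance needs at sampling time (hyperparameters illustrative):

```python
def train_class_conditioned(model, loader, alphas_cumprod, T=300,
                            p_uncond=0.1, device="cuda"):
    """Feed a one-hot digit label alongside t, sometimes dropping the label
    so the model also learns the unconditional distribution."""
    opt = optim.Adam(model.parameters(), lr=1e-3)
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        c = nn.functional.one_hot(y, num_classes=10).float()
        keep = (torch.rand(x.shape[0], 1, device=device) > p_uncond).float()
        c = c * keep                         # dropped labels become all-zeros
        t = torch.randint(0, T, (x.shape[0],), device=device)
        eps = torch.randn_like(x)
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps
        loss = nn.functional.mse_loss(model(x_t, t / T, c), eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
```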