Fun With Diffusion Models!

Alec Thompson

Part A: The Power of Diffusion Models!

Setup

To start, we generate images from text embeddings using the pretrained diffusion model. I am using seed 180 for all results.

an oil painting of a snowy mountain village
a man wearing a hat
a rocket ship

Sampling Loops

Implementing the Forward Process

A key part of diffusion is the forward process, which takes a clean image and adds noise to it.
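Concretely, the forward process computes x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I) and abar_t the cumulative product of the noise schedule. Below is a minimal sketch in PyTorch; throughout these sketches, names like unet, alphas_cumprod, timesteps, and prompt_embeds are stand-ins for the pretrained model's pieces, not the exact identifiers from my code.

    import torch

    def forward(im, t, alphas_cumprod):
        """Noise a clean image to timestep t: x_t = sqrt(abar)*x_0 + sqrt(1-abar)*eps."""
        abar = alphas_cumprod[t]
        eps = torch.randn_like(im)               # fresh Gaussian noise
        return abar.sqrt() * im + (1 - abar).sqrt() * eps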

Campanile Noising Process

Classical Denoising

Let's try to denoise these images using a classical method: Gaussian blur filtering, which averages away some of the noise (along with real image detail).
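A sketch of this baseline using torchvision's built-in blur; the kernel size and sigma here are eyeballed, not tuned:

    import torchvision.transforms.functional as TF

    # Blurring removes high-frequency noise but smears real detail too,
    # so the "denoised" result is soft and still visibly corrupted.
    denoised = TF.gaussian_blur(noisy_im, kernel_size=7, sigma=2.0)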

Gaussian Denoising Process

One-Step Denoising

Now, we'll use a pretrained diffusion model to denoise.
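Given the model's noise estimate, we can invert the forward equation for a one-step estimate of the clean image. A sketch, assuming a diffusers-style UNet that returns its noise prediction as .sample:

    import torch

    @torch.no_grad()
    def one_step_denoise(unet, x_t, t, alphas_cumprod, prompt_embeds):
        abar = alphas_cumprod[t]
        eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample  # predicted noise
        # invert x_t = sqrt(abar)*x_0 + sqrt(1-abar)*eps for x_0
        return (x_t - (1 - abar).sqrt() * eps) / abar.sqrt()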

Noisy image at t=250
Noisy image at t=500
Noisy image at t=750

Iterative Denoising

In one-step denoising, you should see that the denoising UNet does a much better job of projecting the image onto the natural image manifold, but it does get worse as you add more noise. This makes sense, as the problem is much harder with more noise! But diffusion models are designed to denoise iteratively. In this part we will implement this.
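A sketch of the loop over a strided schedule of timesteps (e.g. 990, 960, ..., 0), reusing the helpers above; the small per-step variance term is omitted for brevity:

    import torch

    @torch.no_grad()
    def iterative_denoise(unet, x, timesteps, alphas_cumprod, prompt_embeds, i_start=0):
        for i in range(i_start, len(timesteps) - 1):
            t, t_next = timesteps[i], timesteps[i + 1]
            abar, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
            alpha = abar / abar_next                 # effective alpha for this stride
            eps = unet(x, t, encoder_hidden_states=prompt_embeds).sample
            x0 = (x - (1 - abar).sqrt() * eps) / abar.sqrt()   # current clean estimate
            # blend the clean estimate with the current noisy image
            x = (abar_next.sqrt() * (1 - alpha) / (1 - abar)) * x0 \
                + (alpha.sqrt() * (1 - abar_next) / (1 - abar)) * x
        return x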

Iterative Denoising Process
Denoising Process Comparison

Diffusion Model Sampling

In the previous part, we used the diffusion model to denoise an image. Another thing we can do with the iterative_denoise function is generate images from scratch: set i_start = 0 and pass in random noise, which effectively denoises pure noise. We will use the prompt "a high quality photo".
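With the sketch above, sampling is just denoising pure noise from the top of the schedule (the 64x64 RGB shape is an assumption about the stage-one model):

    import torch

    x = torch.randn(1, 3, 64, 64)   # pure Gaussian noise
    sample = iterative_denoise(unet, x, timesteps, alphas_cumprod,
                               prompt_embeds, i_start=0)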

Iterative Denoise Sampling

Classifier-Free Guidance (CFG)

We now generate images using classifier-free guidance (CFG): at each step we compute both a conditional and an unconditional noise estimate, then extrapolate past the unconditional estimate in the direction of the conditional one.
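Per step, the combined estimate looks like this; gamma = 7 is a typical guidance scale and null_embeds stands for the empty-prompt embedding (both assumptions):

    eps_cond = unet(x, t, encoder_hidden_states=prompt_embeds).sample
    eps_uncond = unet(x, t, encoder_hidden_states=null_embeds).sample
    gamma = 7.0   # gamma > 1 pushes past the unconditional estimate
    eps = eps_uncond + gamma * (eps_cond - eps_uncond)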


Image-to-image Translation

We're going to take the original test image, noise it a little, and force it back onto the image manifold without any conditioning. Effectively, we're going to get an image that is similar to the test image (with a low-enough noise level). This follows the SDEdit algorithm.
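A sketch composing the two helpers above; smaller i_start means more noise, so the result strays further from the input:

    def sdedit(unet, im, i_start, timesteps, alphas_cumprod, prompt_embeds):
        # partially noise the input, then project it back onto the image manifold
        x = forward(im, timesteps[i_start], alphas_cumprod)
        return iterative_denoise(unet, x, timesteps, alphas_cumprod,
                                 prompt_embeds, i_start=i_start)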

SDEdit Campanile
SDEdit Golden Gate Bridge
SDEdit Eiffel Tower

Editing Hand-Drawn and Web Images

This procedure works particularly well if we start with a nonrealistic image (e.g., a painting, a sketch, some scribbles) and project it onto the natural image manifold.

SDEdit Crab Clipart
SDEdit Drawn Car
SDEdit Drawn Tree

Inpainting

We can use the same procedure to implement inpainting (following the RePaint paper).
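A sketch: run the usual iterative loop, but after every step force the pixels outside the mask back to a re-noised copy of the original, where mask = 1 marks the region to fill in. denoise_step is a hypothetical helper standing for one iteration of the iterative_denoise loop above.

    import torch

    @torch.no_grad()
    def inpaint(unet, orig, mask, timesteps, alphas_cumprod, prompt_embeds):
        x = torch.randn_like(orig)
        for i in range(len(timesteps) - 1):
            t, t_next = timesteps[i], timesteps[i + 1]
            # denoise_step: one step of the iterative_denoise loop (hypothetical helper)
            x = denoise_step(unet, x, t, t_next, alphas_cumprod, prompt_embeds)
            # RePaint trick: keep the generated region, re-noise the rest from the original
            x = mask * x + (1 - mask) * forward(orig, t_next, alphas_cumprod)
        return x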

Inpainting Examples

Text-Conditional Image-to-image Translation

Now, we will do the same thing as SDEdit, but guide the projection with a text prompt. This is no longer pure "projection to the natural image manifold" but also adds control using language.
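In code this is just the sdedit sketch from above with different embeddings; rocket_embeds is a hypothetical handle for the "a rocket ship" prompt embedding:

    out = sdedit(unet, test_im, i_start, timesteps, alphas_cumprod, rocket_embeds)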

Rocket Ship to Campanile
Rocket Ship to Golden Gate Bridge
Rocket Ship to Eiffel Tower

Visual Anagrams

Visual anagrams can be created by making our noise estimate the average of two estimates: one for the first prompt on the image as-is, and one for the second prompt computed on the flipped image and then flipped back. The result looks like one prompt right-side up and like the other upside down.
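A sketch of one step; embeds_a and embeds_b are hypothetical handles for the two prompt embeddings:

    import torch

    flip = lambda im: torch.flip(im, dims=[-2])   # flip the image vertically

    eps_a = unet(x, t, encoder_hidden_states=embeds_a).sample
    eps_b = flip(unet(flip(x), t, encoder_hidden_states=embeds_b).sample)
    eps = (eps_a + eps_b) / 2   # denoise toward both prompts at once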

Visual Anagrams

Hybrid Images

Hybrid images can be created by making our noise estimate the low-pass of one prompt's noise estimate added to the high-pass of another's, so one prompt dominates from far away and the other up close.
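A sketch of one step, using a Gaussian blur as the low-pass filter; the kernel size and sigma are plausible choices, not tuned values:

    import torchvision.transforms.functional as TF

    eps_far = unet(x, t, encoder_hidden_states=embeds_far).sample    # seen from afar
    eps_near = unet(x, t, encoder_hidden_states=embeds_near).sample  # seen up close
    lowpass = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)
    highpass = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)
    eps = lowpass + highpass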

Hybrid Images

Part B: Diffusion Models from Scratch!

Training a Single-Step Denoising UNet

Implementing the UNet

In this project, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections.
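A minimal sketch of such a UNet for 28x28 MNIST inputs; the channel counts and block structure here are illustrative, not the exact spec:

    import torch
    import torch.nn as nn

    def conv_block(cin, cout):
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.GELU(),
            nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.GELU(),
        )

    class UNet(nn.Module):
        """Two downsamples, a bottleneck, two upsamples, with skip connections."""
        def __init__(self, ch=64):
            super().__init__()
            self.enc1 = conv_block(1, ch)
            self.enc2 = conv_block(ch, ch * 2)
            self.mid = conv_block(ch * 2, ch * 2)
            self.dec2 = conv_block(ch * 4, ch)   # input: upsampled mid + enc2 skip
            self.dec1 = conv_block(ch * 2, ch)   # input: upsampled dec2 + enc1 skip
            self.out = nn.Conv2d(ch, 1, 1)
            self.pool = nn.MaxPool2d(2)
            self.up = nn.Upsample(scale_factor=2)

        def forward(self, x):
            s1 = self.enc1(x)               # 28x28
            s2 = self.enc2(self.pool(s1))   # 14x14
            m = self.mid(self.pool(s2))     # 7x7 bottleneck
            d2 = self.dec2(torch.cat([self.up(m), s2], dim=1))
            d1 = self.dec1(torch.cat([self.up(d2), s1], dim=1))
            return self.out(d1)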

Using the UNet to Train a Denoiser

Given a noisy image z = x + sigma * eps, we will train a denoiser to map it back to the clean image x, optimizing the L2 loss between the denoised output and x. We will first visualize the noising process over a range of sigma values.
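The noising itself is one line; a sketch with sigma = 0.5, where x is a batch of clean digits and unet is the denoiser above:

    import torch

    sigma = 0.5
    z = x + sigma * torch.randn_like(x)    # noisy input
    loss = ((unet(z) - x) ** 2).mean()     # L2 between denoised output and clean x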

Adding noise to a clean image

Training

Now, we will train the model to perform denoising.
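A sketch of the training loop; Adam with lr = 1e-4 is an assumption, and train_loader stands for a standard MNIST DataLoader:

    import torch

    opt = torch.optim.Adam(unet.parameters(), lr=1e-4)
    for epoch in range(5):
        for x, _ in train_loader:                 # labels unused for denoising
            z = x + 0.5 * torch.randn_like(x)     # noise each batch on the fly
            loss = ((unet(z) - x) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()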

Unconditional UNet Training Loss Curve
Denoising Process at Epoch 1
Denoising Process at Epoch 5

Out-of-Distribution Testing

Our denoiser was trained on MNIST digits noised with sigma = 0.5. Let's see how it performs on noise levels it wasn't trained for.
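A sketch of the sweep; the list of sigma values is illustrative:

    import torch

    with torch.no_grad():
        for sigma in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
            z = x + sigma * torch.randn_like(x)
            denoised = unet(z)    # trained only at sigma = 0.5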

Denoising process at different noise levels

Training a Diffusion Model

Adding Time Conditioning to UNet

We need a way to inject the scalar timestep t into our UNet to condition it on how much noise it should expect.
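One common way, and a sketch of what I assume here, is a small fully-connected block that maps the normalized timestep to a per-channel bias added to intermediate feature maps:

    import torch
    import torch.nn as nn

    class FCBlock(nn.Module):
        """Embed a normalized scalar t in [0, 1] as a per-channel bias."""
        def __init__(self, cout):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(1, cout), nn.GELU(),
                                     nn.Linear(cout, cout))

        def forward(self, t):
            # (B,) -> (B, cout, 1, 1) so it broadcasts over H x W feature maps
            return self.net(t[:, None])[:, :, None, None]

    # inside the UNet forward pass, e.g.:  h = self.dec2(...) + self.t_embed2(t)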

Training the UNet

Training our time-conditioned UNet is now pretty easy. Basically, we pick a random image from the training set and a random timestep t, noise the image to that timestep, and train the denoiser to predict the noise we added. We repeat this for different images and different t values until the model converges and we are happy.
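A sketch of that loop with a DDPM-style schedule; T = 300 and the beta range are assumptions:

    import torch

    T = 300
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1 - betas, dim=0)

    opt = torch.optim.Adam(unet.parameters(), lr=1e-3)
    for x, _ in train_loader:
        t = torch.randint(0, T, (x.shape[0],))         # one random t per image
        eps = torch.randn_like(x)
        ab = alphas_bar[t].view(-1, 1, 1, 1)
        x_t = ab.sqrt() * x + (1 - ab).sqrt() * eps    # forward-process noising
        loss = ((unet(x_t, t / T) - eps) ** 2).mean()  # predict the added noise
        opt.zero_grad()
        loss.backward()
        opt.step()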

Time-Conditioned UNet Training Loss Curve

Sampling from the UNet

Denoising Process at Epoch 5
Denoising Process at Epoch 20

Adding Class-Conditioning to UNet

To make the results better and give us more control for image generation, we can also optionally condition our UNet on the class of the digit 0-9.
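A sketch of the conditioning side of the training step: the class goes in as a one-hot vector, and (a standard trick I'm assuming here) it is zeroed out about 10% of the time so the model still learns an unconditional mode for classifier-free guidance at sampling time:

    import torch
    import torch.nn.functional as F

    c = F.one_hot(labels, num_classes=10).float()   # labels: (B,) digit classes
    drop = torch.rand(c.shape[0]) < 0.1             # drop conditioning w.p. 0.1
    c[drop] = 0.0                                   # zero vector = "no class"
    loss = ((unet(x_t, t / T, c) - eps) ** 2).mean()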

Class-Conditioned UNet Training Loss Curve

Sampling from the Class-Conditioned UNet

Class-Conditioned Denoising Process at Epoch 5
Class-Conditioned Denoising Process at Epoch 20