Tutorial: Fine-tuning Stable Diffusion with PyTorch DDP

Train Stable Diffusion 1.5 to inpaint landscape photographs. The whole thing runs on 10 images and finishes in a few minutes on a single GPU, so you can see the full cycle (upload, build, train, pull results) without waiting around. Once you have the workflow down, swap in your own dataset and scale up.

The script uses PyTorch DDP, so if the machine has multiple GPUs it splits each batch across them automatically. With one GPU it just runs normally.

Quick start

Clone the example, upload it, and queue a job:

squadron examples clone stable-diffusion-ddp && cd stable-diffusion-ddp
squadron project init
squadron project upload
squadron job queue --latest

Then view its progress on the Jobs page.

The manifest declares two builtin volumes: the SD 1.5 model weights (builtin/stable-diffusion-v1-5) and a small set of landscape photos (builtin/landscape-images). The daemon mounts both into the container for you, and they're downloaded on demand. There's no separate download step.

Once the build finishes you'll see training loss printed every 10 steps:

  step 10/5000  loss=0.1823
  step 20/5000  loss=0.1701
  ...

Every 100 steps, the script saves a checkpoint and generates an inpainting demo image so you can see the model improving. The manifest sets artifact_upload_order = "LIFO", which means the most recent checkpoint always uploads first. Without this, a backlog of older checkpoints could delay the newest one. You can view and download these artifacts online while the job is still running.
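To see why the upload order matters, here is a small sketch of a backlog of pending artifacts drained in each order. It is an illustration only: the next_upload helper and the checkpoint file names are hypothetical, and this is not the daemon's actual implementation.

```python
from collections import deque


def next_upload(pending, order):
    """Remove and return the next artifact to upload from a deque of pending artifacts."""
    if order == "LIFO":
        return pending.pop()      # newest artifact first
    return pending.popleft()      # oldest artifact first (FIFO)


# Hypothetical backlog: three checkpoints written while the uploader was busy.
backlog = ["checkpoint-100.pt", "checkpoint-200.pt", "checkpoint-300.pt"]

lifo_first = next_upload(deque(backlog), "LIFO")
fifo_first = next_upload(deque(backlog), "FIFO")
print("LIFO uploads first:", lifo_first)
print("FIFO uploads first:", fifo_first)
```

With LIFO the newest checkpoint is the first one out, so the latest state of the run is available even when older uploads are still pending.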

Prerequisites

The Squadron CLI installed (see Installation)
At least one registered machine with an NVIDIA GPU (see Quickstart)

The example's manifest sets cuda = "13.0" and min_memory = "16GB", so the scheduler only considers machines whose driver supports CUDA 13.0 or newer and that can give a container 16 GB. If nothing in your organization qualifies, the job stays queued rather than failing. Lower either value in .squadron/manifest.toml and upload again if your hardware sits below those numbers.
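The matching rule described above can be sketched as a small predicate. This is an assumption about how such a check might look, not Squadron's scheduler code: machine_qualifies, parse_version, and parse_gb are hypothetical names, and only sizes written with a GB suffix are handled.

```python
def parse_version(text):
    """Turn a version string such as "13.0" into a comparable tuple such as (13, 0)."""
    parts = [int(part) for part in text.split(".")]
    while len(parts) < 2:
        parts.append(0)           # treat "13" the same as "13.0"
    return tuple(parts)


def parse_gb(text):
    """Turn a size such as "16GB" into a number of gigabytes."""
    if not text.endswith("GB"):
        raise ValueError(f"unsupported size: {text!r}")
    return float(text[:-2])


def machine_qualifies(driver_cuda, available_memory,
                      required_cuda="13.0", min_memory="16GB"):
    """Return True if a machine meets both manifest requirements."""
    cuda_ok = parse_version(driver_cuda) >= parse_version(required_cuda)
    memory_ok = parse_gb(available_memory) >= parse_gb(min_memory)
    return cuda_ok and memory_ok


print(machine_qualifies("13.0", "32GB"))   # meets both requirements
print(machine_qualifies("12.4", "32GB"))   # driver too old
print(machine_qualifies("13.1", "8GB"))    # not enough memory
```

A job whose requirements no machine satisfies would, per the text above, simply stay queued until the manifest values are lowered or a qualifying machine is registered.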

The training script

Only the UNet is fine-tuned. The VAE (image encoder/decoder) stays frozen, and there's no text encoder to load.

Training is unconditional: instead of running a text encoder, the UNet sees a fixed null embedding (a zeros tensor) in place of caption conditioning. This is enough for inpainting, where the masked region and surrounding pixels provide the context.

Because the training set is just 10 fixed images, the script encodes them all to latent space once with the VAE before training starts, then stores the latents in a TensorDataset. The VAE drops out of the training loop after that, so each step works on a cached latent. Roughly, each step does the following:

Add random noise to a cached latent at a random timestep
Predict that noise with the UNet, conditioned on a null text embedding
Compute the MSE loss between the predicted and actual noise
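The three steps above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the example's main.py: random tensors stand in for the cached VAE latents, a single convolution stands in for the UNet, and the linear beta schedule with 1000 timesteps is an assumed choice. The real UNet would also receive the timestep and the null embedding as inputs.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in for the 10 cached latents (SD 1.5 latents are 4 x 64 x 64 for 512 x 512 images).
latents = torch.randn(10, 4, 64, 64)
loader = DataLoader(TensorDataset(latents), batch_size=2, shuffle=True)

# Assumed noise schedule: 1000 timesteps with linearly spaced betas.
num_timesteps = 1000
betas = torch.linspace(1e-4, 0.02, num_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Stand-in for the UNet: one convolution that maps a noisy latent to a noise estimate.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)


def training_step(clean_latents):
    """Run one noise-prediction step and return the loss as a float."""
    batch_size = clean_latents.shape[0]
    # Step 1: add random noise at a random timestep per sample.
    noise = torch.randn_like(clean_latents)
    timesteps = torch.randint(0, num_timesteps, (batch_size,))
    signal_level = alphas_cumprod[timesteps].view(-1, 1, 1, 1)
    noisy_latents = signal_level.sqrt() * clean_latents + (1 - signal_level).sqrt() * noise
    # Step 2: predict the noise. A real UNet call would also take the timesteps
    # and a zeros tensor in place of the text embedding.
    predicted_noise = model(noisy_latents)
    # Step 3: MSE between predicted and actual noise, then an optimizer update.
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


losses = [training_step(batch[0]) for batch in loader]
print(f"ran {len(losses)} steps, last loss={losses[-1]:.4f}")
```

Because the latents are cached in a TensorDataset, the loop never touches the VAE; only the model being trained runs on each step.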

The dataset is the builtin/landscape-images volume: nature and landscape photographs at 512x512. The script takes only the first 10 images. The set is deliberately tiny so that a run finishes in a few minutes on a single GPU and you can see how the pieces fit together.

At checkpoint intervals, the script generates inpainting demos: it picks images from the training set, masks out a random rectangle, and runs the denoising loop to fill the masked region. These demo images are written to /artifacts/ so you can watch the model's inpainting ability improve over time.
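The masking part of the demo can be illustrated with a short sketch that draws a random rectangle on a grid. The function name, the 64 x 64 grid, and the size bounds (a quarter to a half of each side) are assumptions made for illustration; the example script may choose its rectangles differently.

```python
import random


def random_rectangle_mask(height, width, rng, min_frac=0.25, max_frac=0.5):
    """Return a height x width grid with 1 inside a random rectangle and 0 elsewhere."""
    rect_h = rng.randint(int(height * min_frac), int(height * max_frac))
    rect_w = rng.randint(int(width * min_frac), int(width * max_frac))
    # Choose the top-left corner so the rectangle stays inside the grid.
    top = rng.randint(0, height - rect_h)
    left = rng.randint(0, width - rect_w)
    mask = [[0] * width for _ in range(height)]
    for row in range(top, top + rect_h):
        for col in range(left, left + rect_w):
            mask[row][col] = 1
    return mask, (top, left, rect_h, rect_w)


rng = random.Random(0)  # seeded so the sketch is repeatable
mask, (top, left, rect_h, rect_w) = random_rectangle_mask(64, 64, rng)
masked_cells = sum(sum(row) for row in mask)
print(f"rectangle at ({top}, {left}), size {rect_h} x {rect_w}, {masked_cells} masked cells")
```

Cells marked 1 are the region the denoising loop has to fill; cells marked 0 keep the original image content.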

Distributed training

The script wraps the UNet in PyTorch DDP. Each GPU keeps a full copy of the UNet and trains on a different slice of the batch, and DDP averages the gradients across them as part of every backward pass. The SD 1.5 UNet fits comfortably on one GPU, so there's no need to shard it; DDP just multiplies throughput.

run.sh launches one worker per GPU with torchrun, counting the GPUs inside the container with nvidia-smi -L. The daemon passes every GPU on the machine into the container, so that count is the full GPU count. To override it, set GPUS_PER_NODE yourself.
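The gradient averaging that DDP performs can be checked by hand on a toy problem. The sketch below is plain Python, not DDP code: it uses a hypothetical one-parameter model (prediction = w * x) and shows that averaging the gradients from two equal-sized batch slices gives the same result as computing the gradient on the whole batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# One worker seeing the whole batch.
full_batch_grad = mse_grad(w, xs, ys)

# Two workers, each seeing an equal-sized slice of the same batch.
shard_grads = [
    mse_grad(w, xs[:2], ys[:2]),
    mse_grad(w, xs[2:], ys[2:]),
]
averaged_grad = sum(shard_grads) / len(shard_grads)

print(full_batch_grad, averaged_grad)
```

The equality depends on the slices being the same size, which is why each worker gets an equal share of the batch. Every worker then applies the same averaged gradient, so the model copies stay identical.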

Environment variables

The learning rate is configurable at queue time with -e:

| Variable | Default | Notes |
| --- | --- | --- |
| `LEARNING_RATE` | `2e-5` | AdamW learning rate |

Since each job runs independently, you can queue several at once to sweep learning rates:

for lr in 1e-5 2e-5 5e-5 1e-4; do
  squadron job queue --latest -e LEARNING_RATE="$lr"
done
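On the script side, a variable such as LEARNING_RATE is typically read from the process environment with a fallback to the default. The sketch below shows one way this might look; read_learning_rate is a hypothetical helper, and the example's main.py may do this differently.

```python
import os


def read_learning_rate(environ=os.environ, default="2e-5"):
    """Read LEARNING_RATE from the environment, falling back to the documented default."""
    raw = environ.get("LEARNING_RATE", default)
    learning_rate = float(raw)    # raises ValueError on non-numeric input
    if learning_rate <= 0:
        raise ValueError(f"LEARNING_RATE must be positive, got {raw!r}")
    return learning_rate


learning_rate = read_learning_rate()
print(f"using learning rate {learning_rate}")
```

Validating the value at startup means a mistyped sweep entry fails immediately instead of partway through training.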

Using your own dataset

Upload your images as a volume:

squadron volume upload ./my-images --name my-images --kind dataset

Then add it to the volumes list in .squadron/manifest.toml:

volumes = ["builtin/stable-diffusion-v1-5", "my-images"]

Update DATASET_PATH in main.py to point at the new volume's mount path (/volumes/my-images). You may need to modify the data-loading code in main.py to match your dataset's format. The Volumes page has the full pattern.
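As a starting point for adapting the data loading, the sketch below collects image files from a directory in a stable order. It is a hedged illustration, not the example's loader: list_images and IMAGE_SUFFIXES are hypothetical names, the suffix list is an assumption about your files, and the temporary directory only exists to make the demo self-contained.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def list_images(dataset_path, limit=None):
    """Return image files under dataset_path in sorted order, optionally capped at limit."""
    root = Path(dataset_path)
    files = sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
    return files if limit is None else files[:limit]


# Demo against a throwaway directory standing in for a mounted volume.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.png", "a.jpg", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [path.name for path in list_images(tmp)]

print(found)
```

In the training script you would call it with the mount path instead, for example list_images("/volumes/my-images", limit=10) to mirror the first-10-images behavior of the example.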

Iterating

Update main.py or the env var defaults, then upload a new version. Run both commands from inside the cloned directory, so the CLI picks the project name up from the manifest:

squadron project upload --version-name v2
squadron job queue --latest --name sd-run-v2

If you'd rather pass --project explicitly, use the name that examples clone generated rather than stable-diffusion-ddp. Cloning appends a random suffix so that several people can clone the same example without colliding, and squadron project ls will show you the real name.

List all uploaded versions:

squadron project version ls

Next steps

Adapting Your Code has tips for modifying your training script to run on Squadron
The CLI reference has details on all the CLI commands used in this tutorial