Tutorial: Fine-tuning Stable Diffusion with PyTorch DDP
Train Stable Diffusion 1.5 to inpaint landscape photographs. The whole thing runs on 10 images and finishes in a few minutes on a single GPU, so you can see the full cycle (upload, build, train, pull results) without waiting around. Once you have the workflow down, swap in your own dataset and scale up.
The script uses PyTorch DDP, so if the machine has multiple GPUs it splits each batch across them automatically. With one GPU it just runs normally.
Quick start
Clone the example, upload it, and queue a job:
cycle examples clone stable-diffusion-ddp && cd stable-diffusion-ddp
cycle project init
cycle project test # Optional
cycle project upload
cycle job queue --latest
cycle job tail --latest
The manifest declares two builtin volumes: the SD 1.5 model weights (builtin/stable-diffusion-v1-5) and a small set of landscape photos (builtin/landscape-images). The daemon mounts both into the container for you. There's no separate download step.
Once the build finishes you'll see training loss printed every 10 steps:
step 10/5000 loss=0.1823
step 20/5000 loss=0.1701
...
Every 100 steps, the script saves a checkpoint and generates an inpainting demo image so you can see the model improving. Checkpoints and demo images stream back to your machine through job tail in real time. The manifest sets artifact_upload_order = "LIFO", so the most recent checkpoint always uploads first. Without it, uploads that fall behind would queue up with the newest checkpoint stuck at the back.
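To make the effect of the upload order concrete, here is a toy model of a backlog of checkpoints waiting to upload. It is not the daemon's implementation: the `upload_order` helper and the checkpoint names are hypothetical, made up for illustration.

```python
from collections import deque


def upload_order(produced, policy):
    """Return the order in which a backlog of artifacts would upload.

    `produced` lists artifact names oldest-first; `policy` is "FIFO" or "LIFO".
    Hypothetical helper: it only models a backlog that already exists.
    """
    if policy not in ("FIFO", "LIFO"):
        raise ValueError(f"unknown policy: {policy!r}")
    pending = deque(produced)
    order = []
    while pending:
        # LIFO takes the newest artifact, FIFO takes the oldest.
        order.append(pending.pop() if policy == "LIFO" else pending.popleft())
    return order


backlog = ["ckpt-100", "ckpt-200", "ckpt-300"]
print(upload_order(backlog, "FIFO"))  # oldest checkpoint goes first
print(upload_order(backlog, "LIFO"))  # newest checkpoint goes first
```

With LIFO, the checkpoint you most likely care about is the first one to arrive, even when older ones are still waiting.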
In a second terminal, point TensorBoard at the streaming directory to watch train/loss and train/lr update live:
tensorboard --logdir .cycleswap/jobs/<job-name>/artifacts/tensorboard
Open http://localhost:6006. TensorBoard re-reads the directory as new events arrive, so the curves update as training progresses. If you don't have it yet, pip install tensorboard.
Prerequisites
- The Cycleswap CLI installed (see Installation)
- At least one registered machine with a GPU (see Quickstart)
The training script
Only the UNet is fine-tuned. The VAE (image encoder/decoder) stays frozen, and there's no text encoder to load.
Training is unconditional: instead of running a text encoder, the UNet sees a fixed null embedding (a zeros tensor) in place of caption conditioning. This is enough for inpainting, where the masked region and surrounding pixels provide the context.
Because the training set is just 10 fixed images, the script encodes them all to latent space once with the VAE before training starts, then stores the latents in a TensorDataset. The VAE drops out of the training loop after that, so each step works on a cached latent:
- Random noise is added to a cached latent at a random timestep
- The UNet predicts the noise, conditioned on a null text embedding
- MSE loss between predicted and actual noise drives the weight update
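The three steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code in main.py: the latents are random tensors, a small convolution stands in for the UNet, and the noise schedule is a plain linear DDPM schedule that may differ from the scheduler the script uses.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins (assumptions): 10 cached latents shaped like SD 1.5 latents
# (4 channels, 64x64 for 512x512 images) and a tiny conv net instead of the UNet.
latents = torch.randn(10, 4, 64, 64)
loader = DataLoader(TensorDataset(latents), batch_size=2, shuffle=True)
model = nn.Conv2d(4, 4, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Linear noise schedule; alphas_cumprod[t] is the share of signal left at step t.
num_timesteps = 1000
betas = torch.linspace(1e-4, 0.02, num_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def train_step(x0):
    """Run one noise-prediction update on a batch of cached latents."""
    noise = torch.randn_like(x0)
    t = torch.randint(0, num_timesteps, (x0.shape[0],))  # one timestep per sample
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * x0 + (1.0 - a).sqrt() * noise     # forward diffusion
    pred = model(noisy)  # the real UNet also receives t and the null embedding
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


(batch,) = next(iter(loader))
print(f"loss={train_step(batch):.4f}")
```

Because the latents are computed once and held in a TensorDataset, the loop never touches the VAE.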
The dataset is the builtin/landscape-images builtin volume: nature and landscape photographs at 512x512. The script takes the first 10 images, which is deliberately tiny, so a run finishes in a few minutes on a single GPU and you can see how the pieces fit together.
At checkpoint intervals, the script generates inpainting demos: it picks images from the training set, masks out a random rectangle, and runs the denoising loop to fill the masked region. These demo images land in /artifacts/ so you can watch the model's inpainting ability improve over training.
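Masking a random rectangle might look like the sketch below. The `random_rect_mask` helper and the size limits (each side between a quarter and a half of the image) are assumptions for illustration; the script's own sizes are not stated here.

```python
import torch


def random_rect_mask(height, width, generator=None):
    """Return a (1, height, width) float mask: 1 inside a random rectangle, 0 elsewhere."""
    # Side lengths between a quarter and a half of each dimension (arbitrary choice).
    h = int(torch.randint(height // 4, height // 2 + 1, (1,), generator=generator))
    w = int(torch.randint(width // 4, width // 2 + 1, (1,), generator=generator))
    # Pick the top-left corner so the rectangle stays inside the image.
    top = int(torch.randint(0, height - h + 1, (1,), generator=generator))
    left = int(torch.randint(0, width - w + 1, (1,), generator=generator))
    mask = torch.zeros(1, height, width)
    mask[:, top:top + h, left:left + w] = 1.0
    return mask


image = torch.rand(3, 512, 512)        # stand-in for a training image in [0, 1]
mask = random_rect_mask(512, 512)
masked_image = image * (1.0 - mask)    # zero out the region the model must fill
print(int(mask.sum()), "pixels masked")
```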
Distributed training
The script wraps the UNet in PyTorch DDP. Each GPU keeps a full copy of the UNet and trains on a different slice of the batch, and DDP averages the gradients across them after every backward pass. The SD 1.5 UNet fits comfortably on one GPU, so there's no need to shard it; DDP just multiplies throughput.
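The gradient averaging can be checked on a CPU without any distributed setup. The sketch below simulates two workers inside one process, using a small linear layer as a stand-in for the UNet; it does not use torch.distributed. It shows that averaging the gradients of two equal-sized slices gives the same result as one pass over the whole batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 1)                      # stand-in for the UNet
x, y = torch.randn(4, 8), torch.randn(4, 1)  # one batch of 4 samples


def grads(inputs, targets):
    """Gradient of the mean-squared error w.r.t. the weight for one slice."""
    model.zero_grad()
    nn.functional.mse_loss(model(inputs), targets).backward()
    return model.weight.grad.clone()


full = grads(x, y)                  # a single process sees the whole batch
shard_a = grads(x[:2], y[:2])       # "GPU 0" sees the first half
shard_b = grads(x[2:], y[2:])       # "GPU 1" sees the second half
averaged = (shard_a + shard_b) / 2  # the average DDP computes across workers

print(torch.allclose(full, averaged, atol=1e-6))
```

The equality holds because the loss is a mean over samples and the slices are the same size.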
torchrun always launches the script, so there's no separate single-GPU path. With one GPU it's just a process group of one: the same DDP code runs, the gradient sync has nothing to average, and training proceeds as usual. That's why the code has no if distributed branches.
torchrun --nproc_per_node=$GPUS_PER_NODE in run.sh launches the workers. The daemon sets GPUS_PER_NODE based on what's available on the machine.
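torchrun passes each worker its position in the job through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. A minimal sketch of reading them follows; the `distributed_env` helper and its fallback values for running outside torchrun are assumptions, not code from main.py.

```python
import os


def distributed_env(environ=os.environ):
    """Read the rank variables that torchrun sets for each worker process."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global worker index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # GPU index on this machine
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total number of workers
    }


env = distributed_env()
is_main = env["rank"] == 0
print(env, "main process" if is_main else "worker")
```

A common pattern is to let only rank 0 write checkpoints and demo images so workers don't overwrite each other's files. Whether main.py does exactly this is not stated here.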
Environment variables
The learning rate is configurable at queue time with -e:
| Variable | Default | Notes |
|---|---|---|
| LEARNING_RATE | 2e-5 | AdamW learning rate |
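On the script side, picking up the variable might look like the sketch below. The `read_learning_rate` helper is hypothetical; only the variable name and the 2e-5 default come from the table above.

```python
import os


def read_learning_rate(environ=os.environ, default=2e-5):
    """Return LEARNING_RATE as a float, falling back to the default when unset."""
    raw = environ.get("LEARNING_RATE")
    if raw is None or raw.strip() == "":
        return default
    value = float(raw)  # raises ValueError on input such as "fast"
    if value <= 0:
        raise ValueError(f"LEARNING_RATE must be positive, got {value}")
    return value


learning_rate = read_learning_rate()
print(f"learning rate: {learning_rate:g}")
```

Failing early on a bad value saves a queued job from running with a learning rate you didn't intend.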
Since each job runs independently, you can queue several at once to sweep learning rates:
for lr in 1e-5 2e-5 5e-5 1e-4; do
  cycle job queue --latest -e LEARNING_RATE=$lr
done
Using your own dataset
Upload your images as a volume:
cycle volume upload ./my-images --name my-images --kind dataset
Then add it to the volumes list in manifest.toml:
volumes = ["builtin/stable-diffusion-v1-5", "my-images"]
Update DATASET_PATH in main.py to point at the new volume's mount path (/volumes/my-images). You may need to modify the data-loading code in main.py to match your dataset's format. The Volumes page has the full pattern.
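If your volume is a flat directory of image files, the data-loading change could start from a sketch like the one below. The `first_images` helper, the suffix list, and the flat layout are all assumptions; adapt them to how your dataset is organized.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}  # assumed file types


def first_images(dataset_path, limit=10):
    """Return up to `limit` image paths from a flat directory, sorted by name."""
    root = Path(dataset_path)
    files = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not files:
        raise FileNotFoundError(f"no images found in {root}")
    return files[:limit]


# Demo on a temporary directory; in main.py you would pass DATASET_PATH instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.png", "a.jpg", "notes.txt"]:
        (Path(tmp) / name).touch()
    print([p.name for p in first_images(tmp)])  # ['a.jpg', 'b.png']
```

Sorting the paths keeps the selection stable across runs, so every job trains on the same images.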
Iterating
Update main.py or the env var defaults, then upload a new version:
cycle project upload --version-name v2
cycle job queue --project stable-diffusion-ddp --latest --name sd-run-v2
List all uploaded versions:
cycle project version ls
Next steps
- Adapting Your Code has tips for modifying your training script to run on Cycleswap
- The CLI reference has details on all the CLI commands used in this tutorial