Adapting Your Code

Most programs need two small changes to run well on Squadron: save your work periodically, and pick up where you left off on restart.

Why? Jobs run on idle machines. If someone sits down at their workstation while your job is running, the job goes back into the queue and gets reassigned to a different machine. The new machine mounts whatever the previous run wrote to /artifacts, so your code can read those files and resume, but only if you wrote something to resume from.

If your program already writes output files and can restart from partial results, you may not need to change anything. Read on to see the pattern.

1. Save results to `/artifacts`

Anything your code writes to /artifacts inside the container gets uploaded automatically while the job runs. Write checkpoints there at regular intervals.

import torch

# Inside the training loop: save a checkpoint every SAVE_EVERY steps
if step % SAVE_EVERY == 0:
    checkpoint = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(checkpoint, f"/artifacts/checkpoint-{step}.pt")

You can save whatever makes sense for your workload. Training scripts typically save model weights and optimizer state. Data processing pipelines might save the index of the last record processed. Simulations might dump their state to a file.
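For a data processing pipeline, the saved state can be as small as one number. The sketch below is one way to record the index of the last processed record. The function names and the `progress.json` file name are hypothetical, not part of Squadron. It writes to a temporary file and renames it, so an interrupted job leaves either the old file or the new one rather than a truncated one. This assumes /artifacts behaves like a regular local filesystem; this page does not say how the uploader treats the temporary file.

```python
import json
import os
from pathlib import Path


def save_progress(save_dir, last_index):
    """Record the index of the last fully processed record."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    tmp = save_dir / "progress.json.tmp"
    tmp.write_text(json.dumps({"last_index": last_index}))
    # os.replace swaps the file in one step on the same filesystem,
    # so a reader never sees a half-written progress.json.
    os.replace(tmp, save_dir / "progress.json")


def load_progress(save_dir):
    """Return the last processed index, or -1 if no progress file exists."""
    path = Path(save_dir) / "progress.json"
    if not path.exists():
        return -1
    return json.loads(path.read_text())["last_index"]
```

A pipeline would call `save_progress("/artifacts", i)` every so often and start from `load_progress("/artifacts") + 1`.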

2. Resume from `/artifacts` on startup

When your program starts, check whether a previous run left anything in /artifacts. If it did, load it and continue from there instead of starting from scratch.

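from pathlib import Path

import torch

# Sort by step number. A plain sorted() compares names as text,
# which would put checkpoint-1000.pt before checkpoint-200.pt.
checkpoints = sorted(
    Path("/artifacts").glob("checkpoint-*.pt"),
    key=lambda p: int(p.stem.split("-")[1]),
)
if checkpoints:
    # The last entry is the checkpoint with the highest step
    state = torch.load(checkpoints[-1])
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"]
    print(f"Resuming from step {start_step}")
else:
    start_step = 0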
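To see saving and resuming work together without PyTorch, here is a toy sketch that can be run locally. Everything in it is hypothetical: the `run` function, the `state.json` file name, and the `stop_after` parameter, which simulates the job being pulled off a machine partway through. It stores the next step to run rather than the last step finished, which is one way to avoid repeating the checkpointed step after a restart.

```python
import json
from pathlib import Path


def run(save_dir, total_steps, save_every, stop_after=None):
    """Toy job: add up the step numbers, checkpointing into save_dir.

    stop_after simulates an interruption after that many steps in this run.
    Returns the final total, or None if the run was interrupted.
    """
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    ckpt = save_dir / "state.json"

    # Resume if a previous run left a checkpoint behind.
    if ckpt.exists():
        state = json.loads(ckpt.read_text())
    else:
        state = {"next_step": 0, "total": 0}

    done_this_run = 0
    for step in range(state["next_step"], total_steps):
        if stop_after is not None and done_this_run >= stop_after:
            return None  # interrupted; work since the last checkpoint is lost
        state["total"] += step
        state["next_step"] = step + 1
        done_this_run += 1
        if state["next_step"] % save_every == 0:
            ckpt.write_text(json.dumps(state))
    return state["total"]
```

Calling `run(d, 10, 2, stop_after=5)` and then `run(d, 10, 2)` on the same directory gives the same total as one uninterrupted run. The second call redoes only the step that finished after the last checkpoint was written.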

Detecting the container environment

The environment variable INSIDE_SQUADRON_CONTAINER is set to 1 inside every job container. You can use it to switch behavior depending on whether your code is running locally or on a remote machine.

import os

if os.environ.get("INSIDE_SQUADRON_CONTAINER"):
    save_dir = "/artifacts"
else:
    save_dir = "./outputs"
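If several parts of a script need the save location, the check can be wrapped in a small helper. This is a minimal sketch; `get_save_dir` and `checkpoint_path` are hypothetical names, and the only Squadron-specific detail is the INSIDE_SQUADRON_CONTAINER variable described above. It assumes /artifacts already exists inside the container, so it creates only the local fallback directory.

```python
import os
from pathlib import Path


def get_save_dir(local_default="./outputs"):
    """Return /artifacts inside a job container, else a local directory."""
    if os.environ.get("INSIDE_SQUADRON_CONTAINER"):
        return Path("/artifacts")
    path = Path(local_default)
    # Create the local fallback so the first save does not fail.
    path.mkdir(parents=True, exist_ok=True)
    return path


def checkpoint_path(step, save_dir=None):
    """Build the checkpoint file path for a given step."""
    base = Path(save_dir) if save_dir is not None else get_save_dir()
    return base / f"checkpoint-{step}.pt"
```

With this in place, the save and resume code can use `checkpoint_path(step)` and `get_save_dir().glob("checkpoint-*.pt")` instead of hard-coding /artifacts, so the same script runs unchanged on a laptop.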

Next steps

The Stable Diffusion tutorial shows this pattern in a real training script with DDP and TensorBoard
Job Outputs explains how /artifacts uploading works
Job Queue covers the reassignment mechanics
