Adapting Your Code
Most programs need two small changes to run well on Cycleswap: save your work periodically, and pick up where you left off on restart.
Why? Jobs run on idle machines. If someone sits down at their workstation while your job is running, the job goes back into the queue and gets reassigned to a different machine. The new machine downloads whatever the previous run wrote to /artifacts before starting the container, so your code can resume, but only if you wrote something to resume from.
If your program already writes output files and can restart from partial results, you may not need to change anything. Read on to see the pattern.
1. Save results to /artifacts
Anything your code writes to /artifacts inside the container gets uploaded automatically while the job runs. Write checkpoints there at regular intervals.
# Save a checkpoint every N steps
if step % SAVE_EVERY == 0:
    checkpoint = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(checkpoint, f"/artifacts/checkpoint-{step}.pt")

You can save whatever makes sense for your workload. Training scripts typically save model weights and optimizer state. Data processing pipelines might save the index of the last record processed. Simulations might dump their state to a file.
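For a workload without model weights, the same idea can be as small as a progress file. The following is a minimal sketch of the data-processing case, not part of Cycleswap itself: the file name progress.json and the helpers save_progress, load_progress, and process_records are hypothetical, and the doubling step stands in for real work. It writes through a temporary file and renames it, a general precaution so that an interrupted write does not leave a truncated progress file behind. The sketch only tracks position; a real pipeline would also write its output under /artifacts.

```python
import json
import os
from pathlib import Path


def save_progress(artifact_dir: Path, next_index: int) -> None:
    """Record the index of the next unprocessed record (hypothetical helper)."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    tmp = artifact_dir / "progress.json.tmp"
    tmp.write_text(json.dumps({"next_index": next_index}))
    # Rename in one step so a reader never sees a half-written progress.json.
    os.replace(tmp, artifact_dir / "progress.json")


def load_progress(artifact_dir: Path) -> int:
    """Return the index to resume from, or 0 if no progress file exists."""
    path = artifact_dir / "progress.json"
    if not path.exists():
        return 0
    return json.loads(path.read_text())["next_index"]


def process_records(records, artifact_dir: Path, save_every: int = 100):
    """Process records from the saved position; return this run's results."""
    start = load_progress(artifact_dir)
    results = []
    for i in range(start, len(records)):
        results.append(records[i] * 2)  # placeholder for real work
        if (i + 1) % save_every == 0:
            save_progress(artifact_dir, i + 1)
    # Mark the whole input as done once the loop finishes.
    save_progress(artifact_dir, len(records))
    return results
```

Inside a job you would call process_records(records, Path("/artifacts")). If the job is reassigned, the next run reads progress.json and skips the records already handled.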
2. Resume from /artifacts on startup
When your program starts, check whether a previous run left anything in /artifacts. If it did, load it and continue from there instead of starting from scratch.
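from pathlib import Path

import torch

# Sort by the numeric step in the filename. A plain sorted() compares names
# as strings and would rank checkpoint-900.pt after checkpoint-1000.pt.
checkpoints = sorted(
    Path("/artifacts").glob("checkpoint-*.pt"),
    key=lambda p: int(p.stem.split("-")[-1]),
)
if checkpoints:
    state = torch.load(checkpoints[-1])
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"]
    print(f"Resuming from step {start_step}")
else:
    start_step = 0

Detecting the container environment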
The environment variable INSIDE_CYCLESWAP_CONTAINER is set to 1 inside every job container. You can use it to switch behavior depending on whether your code is running locally or on a remote machine.
import os

if os.environ.get("INSIDE_CYCLESWAP_CONTAINER"):
    save_dir = "/artifacts"
else:
    save_dir = "./outputs"

Optional: Monitoring a running job
Files written to /artifacts stream back to your machine while the job runs. cycle job tail mirrors them into .cycleswap/jobs/<job-name>/artifacts/ locally, so you can inspect intermediate output without waiting for the job to finish.
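Any plain file works for this kind of inspection. As a minimal sketch, a job could append one JSON line per step to a metrics file and you could read it from the local mirror. The file name metrics.jsonl and the helpers log_metric and read_metrics are hypothetical, and the sketch assumes that a file appended to over time is re-synced as it grows, which this page does not spell out.

```python
import json
from pathlib import Path


def log_metric(artifact_dir: Path, step: int, **values: float) -> None:
    """Append one JSON line of metrics to metrics.jsonl (hypothetical file name)."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    record = {"step": step, **values}
    with open(artifact_dir / "metrics.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_metrics(mirror_dir: Path) -> list[dict]:
    """Read the complete lines of metrics.jsonl, stopping at a partial line."""
    path = mirror_dir / "metrics.jsonl"
    if not path.exists():
        return []
    records = []
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            break  # likely a line still being written or transferred
    return records
```

In the container you would call log_metric(Path("/artifacts"), step, loss=loss), and on your own machine read_metrics(Path(".cycleswap/jobs/<job-name>/artifacts")) with your job's name filled in.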
You can use TensorBoard to watch training metrics update live. Write event files to a subdirectory of /artifacts (e.g. /artifacts/tensorboard/) and point TensorBoard at the local mirror:
tensorboard --logdir .cycleswap/jobs/<job-name>/artifacts/tensorboard

The curves update as new events arrive. The Stable Diffusion tutorial walks through this setup end to end.
Next steps
- The Stable Diffusion tutorial shows this pattern in a real training script with DDP and TensorBoard
- Job Outputs explains how /artifacts uploading works
- Job Queue covers the reassignment mechanics