
Adapting Your Code

Most programs need two small changes to run well on Cycleswap: save your work periodically, and pick up where you left off on restart.

Why? Jobs run on idle machines. If someone sits down at their workstation while your job is running, the job goes back into the queue and gets reassigned to a different machine. Before starting the container, the new machine downloads whatever the previous run wrote to /artifacts, so your code can resume, but only if it wrote something to resume from.

If your program already writes output files and can restart from partial results, you may not need to change anything. Read on to see the pattern.

1. Save results to /artifacts

Anything your code writes to /artifacts inside the container gets uploaded automatically while the job runs. Write checkpoints there at regular intervals.

import torch

# Inside the training loop: save a checkpoint every SAVE_EVERY steps
if step % SAVE_EVERY == 0:
    checkpoint = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(checkpoint, f"/artifacts/checkpoint-{step}.pt")

You can save whatever makes sense for your workload. Training scripts typically save model weights and optimizer state. Data processing pipelines might save the index of the last record processed. Simulations might dump their state to a file.
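
For a data processing pipeline, the checkpoint can be as small as one number. The sketch below is one way to record the index of the last record processed. The file name progress.json and both helper names are illustrative, not part of Cycleswap. It writes to a temporary file and then renames it, on the assumption that a job can be interrupted in the middle of a write.

```python
import json
import os
from pathlib import Path


def save_progress(artifacts_dir, last_index):
    """Record the index of the last fully processed record."""
    target = Path(artifacts_dir) / "progress.json"  # hypothetical file name
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps({"last_index": last_index}))
    # os.replace swaps the file in one step, so progress.json is always
    # either the old version or the new one, never a partial write.
    os.replace(tmp, target)


def load_progress(artifacts_dir):
    """Return the saved index, or -1 if no progress file exists yet."""
    target = Path(artifacts_dir) / "progress.json"
    if not target.exists():
        return -1
    return json.loads(target.read_text())["last_index"]
```

A pipeline built this way would call save_progress("/artifacts", i) every so many records and, on startup, continue from load_progress("/artifacts") + 1.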

2. Resume from /artifacts on startup

When your program starts, check whether a previous run left anything in /artifacts. If it did, load it and continue from there instead of starting from scratch.

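from pathlib import Path

import torch

# Order by step number. A plain sorted() compares names as text and would
# place checkpoint-1000.pt before checkpoint-900.pt.
checkpoints = sorted(
    Path("/artifacts").glob("checkpoint-*.pt"),
    key=lambda p: int(p.stem.split("-")[-1]),
)
if checkpoints:
    state = torch.load(checkpoints[-1])
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"]
    print(f"Resuming from step {start_step}")
else:
    start_step = 0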
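
Saving and resuming meet in the main loop. The following sketch uses JSON files in place of torch.save so that it runs without PyTorch. The names run_job, latest_checkpoint, and stop_after are made up for illustration, and stop_after only simulates the job being sent back to the queue. Whether a restart continues at the saved step or the one after it depends on whether the checkpoint is written before or after that step's work. This sketch writes it after, so it continues at the next step.

```python
import json
from pathlib import Path


def latest_checkpoint(artifacts_dir):
    """Return the checkpoint file with the highest step number, or None."""
    files = Path(artifacts_dir).glob("checkpoint-*.json")
    # Compare step numbers as integers, not as strings.
    return max(files, key=lambda p: int(p.stem.split("-")[-1]), default=None)


def run_job(artifacts_dir, total_steps, save_every, stop_after=None):
    """Run steps 1..total_steps, resuming from the newest checkpoint."""
    state = {"step": 0, "total": 0}
    found = latest_checkpoint(artifacts_dir)
    if found is not None:
        state = json.loads(found.read_text())

    for step in range(state["step"] + 1, total_steps + 1):
        state["total"] += step  # stand-in for one unit of real work
        state["step"] = step
        if step % save_every == 0:
            path = Path(artifacts_dir) / f"checkpoint-{step}.json"
            path.write_text(json.dumps(state))
        if stop_after is not None and step == stop_after:
            return None  # simulated interruption: work since the last save is lost
    return state["total"]
```

If the first run is interrupted between two saves, the second run repeats only the steps after the last checkpoint and ends with the same result as an uninterrupted run.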

Detecting the container environment

The environment variable INSIDE_CYCLESWAP_CONTAINER is set to 1 inside every job container. You can use it to switch behavior depending on whether your code is running locally or on a remote machine.

import os

if os.environ.get("INSIDE_CYCLESWAP_CONTAINER"):
    save_dir = "/artifacts"
else:
    save_dir = "./outputs"
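
If several parts of a program need the output location, the check can live in one helper. This is only a sketch: output_dir is a hypothetical name, and ./outputs is the same local fallback used above.

```python
import os
from pathlib import Path


def output_dir():
    """Return /artifacts inside a job container, ./outputs elsewhere."""
    if os.environ.get("INSIDE_CYCLESWAP_CONTAINER"):
        return Path("/artifacts")
    path = Path("./outputs")
    path.mkdir(parents=True, exist_ok=True)  # the local directory may not exist yet
    return path
```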

Optional: Monitoring a running job

Files written to /artifacts stream back to your machine while the job runs. The cycle job tail command mirrors them into .cycleswap/jobs/<job-name>/artifacts/ locally, so you can inspect intermediate output without waiting for the job to finish.
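
As one example of intermediate output worth inspecting, a job can append a line of metrics per step to a text file. The sketch below assumes nothing about Cycleswap beyond the /artifacts directory; log_metrics and metrics.jsonl are made-up names.

```python
import json
from pathlib import Path


def log_metrics(artifacts_dir, step, **metrics):
    """Append one JSON object per call to metrics.jsonl."""
    path = Path(artifacts_dir) / "metrics.jsonl"  # hypothetical file name
    record = {"step": step, **metrics}
    # Open, append, and close on every call so each line is written out promptly.
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Calling log_metrics("/artifacts", step, loss=loss_value) from the main loop would leave a file you can read from the local mirror. How quickly new lines show up there depends on how partial files are uploaded, which this page does not specify.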

You can use TensorBoard to watch training metrics update live. Write event files to a subdirectory of /artifacts (e.g. /artifacts/tensorboard/) and point TensorBoard at the local mirror:

tensorboard --logdir .cycleswap/jobs/<job-name>/artifacts/tensorboard

The curves update as new events arrive. The Stable Diffusion tutorial walks through this setup end to end.

Next steps