Job queue

When you run squadron job queue, the job enters a queue. Within about a minute, it gets assigned to an available machine and starts running. This page explains what happens between queuing and execution.

How jobs get assigned

The system checks for pending jobs roughly every 60 seconds. When it finds one, it picks the least-loaded machine that meets the job's requirements and assigns the job there. The machine picks it up within a few seconds.

A machine can run at most one job at a time. If all machines are busy, the job waits in the queue until one is free.
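The selection rule described above can be sketched in a few lines of Python. This is only an illustration of "pick the least-loaded free machine", not the scheduler's actual code: the Machine class, its load field, and pick_machine are hypothetical names, and how load is really measured is not specified here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    name: str
    load: float        # hypothetical load metric; lower means less loaded
    running_job: bool  # True if the machine already runs a job


def pick_machine(machines: List[Machine]) -> Optional[Machine]:
    """Return the least-loaded machine that is free, or None if all are busy."""
    # A machine runs at most one job, so machines with a job are skipped.
    free = [m for m in machines if not m.running_job]
    if not free:
        return None  # the job stays queued until a later pass
    return min(free, key=lambda m: m.load)
```

Returning None models the case where every machine is busy and the job simply waits for the next pass.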

GPU matching

If your manifest sets cuda in the [job] section (e.g. cuda = "13.0"), the job only runs on machines whose GPU driver supports that CUDA version or newer. This is a scheduling constraint, not something that installs a driver in the container. If cuda is omitted (the default), the job can run anywhere.
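As a rough sketch of the "that CUDA version or newer" rule, the check below compares versions numerically. The function names and the idea that each machine reports a single highest supported CUDA version are assumptions made for illustration, and the sketch assumes versions are written as "major.minor".

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "13.0" into (13, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def gpu_eligible(required_cuda: Optional[str], driver_max_cuda: Optional[str]) -> bool:
    """Hypothetical check: can a job with this cuda setting run on this machine?"""
    if required_cuda is None:
        return True   # cuda omitted: the job can run anywhere
    if driver_max_cuda is None:
        return False  # assumed: a machine with no GPU driver cannot satisfy a cuda constraint
    return parse_version(driver_max_cuda) >= parse_version(required_cuda)
```

Comparing parsed tuples rather than raw strings matters because a plain string comparison would rank "9.2" above "13.0".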

Memory matching

Every manifest must set min_memory in the [job] section. The scheduler uses this value when assigning jobs: only machines with enough memory are eligible. Once the job is assigned to a machine, it is granted 90% of that machine's total memory.

If no machine in your organization has enough memory, squadron job queue will still succeed but print a warning, and the job will remain queued.
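To make the arithmetic concrete, here is a small sketch of the 90% grant and a possible eligibility check. Two things are assumptions, not documented behavior: memory is expressed in MiB, and min_memory is compared against the granted 90% share rather than the machine's total. All function names are hypothetical.

```python
from typing import Dict, List


def granted_memory_mib(total_mib: int) -> int:
    """90% of the machine's total memory, rounded down to a whole MiB."""
    return total_mib * 90 // 100


def has_enough_memory(min_memory_mib: int, total_mib: int) -> bool:
    # ASSUMPTION: the requirement is checked against the granted share.
    return granted_memory_mib(total_mib) >= min_memory_mib


def eligible_machines(min_memory_mib: int, machines: Dict[str, int]) -> List[str]:
    """Names of machines (name -> total MiB) that could take the job."""
    return [
        name
        for name, total_mib in machines.items()
        if has_enough_memory(min_memory_mib, total_mib)
    ]
```

Under these assumptions, a machine with 65536 MiB grants 58982 MiB to the job. An empty result corresponds to the case where the job is accepted with a warning and remains queued.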

Running your own jobs while busy

When you're actively using your machine, it's marked BUSY. Other people's jobs won't get assigned to it. But your own jobs still can, by default, since you probably don't want your own work to stall just because you're at your desk.

Idle machines are always preferred. Your job will only be assigned to a busy machine (that you own) if no idle machines anywhere are available.

To opt out of this behavior entirely, run squadron daemon forbid-busy-assign. See Daemon for details.
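The preference order in this section can be sketched as a two-step choice: idle machines first, then the job owner's own busy machines. The Machine class and its fields are hypothetical, the allow_busy_assign field stands in for the effect of squadron daemon forbid-busy-assign on that machine, and the sketch ignores other constraints such as memory or CUDA.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    name: str
    owner: str
    busy: bool                      # the owner is actively using the machine
    load: float                     # hypothetical load metric; lower is better
    allow_busy_assign: bool = True  # False models the forbid-busy-assign opt-out


def choose_machine(machines: List[Machine], job_owner: str) -> Optional[Machine]:
    """Prefer any idle machine; fall back to the job owner's own busy machine."""
    idle = [m for m in machines if not m.busy]
    if idle:
        return min(idle, key=lambda m: m.load)
    own_busy = [
        m for m in machines
        if m.busy and m.owner == job_owner and m.allow_busy_assign
    ]
    if own_busy:
        return min(own_busy, key=lambda m: m.load)
    return None  # nothing suitable; the job keeps waiting
```

Other people's busy machines never appear in either candidate list, which matches the rule that only your own jobs can land on your busy machine.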

What happens when a machine goes down

Jobs don't get stuck. If a machine goes offline, runs too long, or becomes busy while running someone else's job, the job goes back into the queue and gets reassigned to a different machine on the next pass.

When the new machine picks up the job, the files the previous run wrote to /artifacts are available again. The daemon does not copy anything down in advance; it fetches each file from storage the first time your code reads it, so a large checkpoint only transfers if you actually load it. Your code can resume where it left off, as long as you write checkpoints to /artifacts and check for them at startup.

Heads up

Your program should be written to resume from /artifacts. Save checkpoints there periodically, and on startup, look for existing checkpoints to continue from. If you don't do this, a reassigned job restarts from scratch every time it moves to a new machine. See Adapting Your Code for details.
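A minimal sketch of the save-and-resume pattern, in Python. The file name checkpoint.json, the JSON format, and the step counter are placeholders chosen for illustration; a real job would store whatever state it needs. The write-then-rename step assumes /artifacts behaves like an ordinary filesystem, which is not confirmed here.

```python
import json
import os
from pathlib import Path

CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name


def load_checkpoint(artifacts_dir: Path) -> dict:
    """Return the saved state if a checkpoint exists, else a fresh state."""
    path = artifacts_dir / CHECKPOINT_NAME
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0}


def save_checkpoint(artifacts_dir: Path, state: dict) -> None:
    """Write to a temporary file, then rename it over the old checkpoint."""
    tmp = artifacts_dir / (CHECKPOINT_NAME + ".tmp")
    tmp.write_text(json.dumps(state))
    # Assumes rename works as on a local filesystem, so an interrupted
    # write does not leave a half-written checkpoint behind.
    os.replace(tmp, artifacts_dir / CHECKPOINT_NAME)


def run(artifacts_dir: Path, total_steps: int, checkpoint_every: int = 10) -> dict:
    state = load_checkpoint(artifacts_dir)  # resume if a previous run got this far
    while state["step"] < total_steps:
        state["step"] += 1                  # stand-in for one unit of real work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(artifacts_dir, state)
    save_checkpoint(artifacts_dir, state)   # record the final state
    return state
```

Inside a job you would call something like run(Path("/artifacts"), total_steps=1000). If the job is reassigned, the next run starts from the last saved step instead of from zero.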

Job states

Status	Meaning
`ACTIVE`	Queued or currently running
`PAUSED`	You paused it; the system skips it
`SUCCESS`	The container exited with code 0
`FAILURE`	The container exited with a non-zero code

Pausing and resuming

You can pause a job from the CLI (squadron job pause) or from the Jobs page. A paused job stays in the system but won't be assigned. Resume it to put it back in the queue.

Admins and Owners can pause or resume anyone's jobs. Regular members can only pause their own.

Email notifications

If the manifest sets email_on_success or email_on_failure to true, you get an email when the job finishes with the matching outcome. The email goes to the address set in Settings, which defaults to the email associated with your login. You can also override these flags per job when queuing from the CLI.

Completed jobs in the dashboard

The Jobs page shows completed jobs from the last 30 days, up to a maximum of 100 jobs. Older jobs are still accessible via squadron job ls.