Notes on setting up Slurm on a single GPU machine, adapted from slurm-for-dummies. Useful if you want job scheduling on a local workstation without a full cluster.

Install and configure

Update and install the packages:

sudo apt update && sudo apt upgrade -y
sudo apt install slurm-wlm munge libmunge2 libmunge-dev -y

Start Munge (handles authentication between Slurm components):

sudo systemctl enable munge
sudo systemctl start munge
munge -n | unmunge | grep STATUS
# should print STATUS: SUCCESS

Slurm config

Create /etc/slurm/slurm.conf (adjust CPUs, RealMemory, and GPU count to match your machine):

ClusterName=localhost
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/affinity

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/none

GresTypes=gpu
NodeName=localhost CPUs=1 RealMemory=1000 Gres=gpu:1 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
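
To get starting values for CPUs, RealMemory, and the GPU count, something like the following sketch may help. It assumes Linux with nproc and /proc/meminfo available and NVIDIA device nodes named /dev/nvidia0, /dev/nvidia1, and so on. node_line is a hypothetical helper name, and the 5% memory headroom is an arbitrary choice. Running slurmd -C prints the hardware as Slurm itself detects it and is the more authoritative source.

```shell
#!/bin/bash
# node_line (hypothetical helper): print a candidate NodeName line for slurm.conf.
node_line() {
    local cpus mem_mb gpus dev
    # Logical CPUs visible to this shell
    cpus=$(nproc)
    # MemTotal is in kB; convert to MB and keep ~5% headroom (assumed margin)
    mem_mb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 * 0.95}' /proc/meminfo)
    # Count /dev/nvidiaN device nodes (skips nvidiactl, nvidia-uvm, ...)
    gpus=0
    for dev in /dev/nvidia[0-9]*; do
        if [ -e "$dev" ]; then gpus=$((gpus + 1)); fi
    done
    echo "NodeName=localhost CPUs=${cpus} RealMemory=${mem_mb} Gres=gpu:${gpus} State=UNKNOWN"
}

node_line
```

If no GPU device nodes are found the line reports gpu:0, in which case check the driver before going further.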

GPU setup (GRES)

Create /etc/slurm/gres.conf:

NodeName=localhost Name=gpu File=/dev/nvidia0

If you have multiple GPUs, add a line per device (/dev/nvidia0, /dev/nvidia1, etc.) and update Gres=gpu:N in slurm.conf.
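
For a two-GPU machine, the gres.conf lines can be generated rather than typed. A small sketch, where gres_lines is a hypothetical helper and the device paths are passed in explicitly:

```shell
#!/bin/bash
# gres_lines (hypothetical helper): emit one gres.conf line per device path given.
gres_lines() {
    local dev
    for dev in "$@"; do
        echo "NodeName=localhost Name=gpu File=${dev}"
    done
}

# Example for two GPUs; redirect to /etc/slurm/gres.conf once the output looks right
gres_lines /dev/nvidia0 /dev/nvidia1
# prints:
# NodeName=localhost Name=gpu File=/dev/nvidia0
# NodeName=localhost Name=gpu File=/dev/nvidia1
```

With this output, slurm.conf would need Gres=gpu:2 on the node line.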

Start it up

sudo systemctl enable slurmctld slurmd
sudo systemctl start slurmctld slurmd

Check that the node is up:

sinfo
scontrol show node

Test with a GPU job

Save the following as gpu_test.sh:

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --output=gpu_test_%j.out
#SBATCH --error=gpu_test_%j.err
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
 
nvidia-smi

Submit it:

sbatch gpu_test.sh
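
To submit and read the result in one go, a wrapper along these lines may be convenient. It is a sketch rather than part of the original notes: run_gpu_test is a hypothetical name, it assumes the script writes to gpu_test_%j.out as above, and it polls squeue once a second until the job leaves the queue.

```shell
#!/bin/bash
# run_gpu_test (hypothetical helper): submit a batch script, wait, print its output.
run_gpu_test() {
    local script="${1:-gpu_test.sh}" jobid
    # --parsable makes sbatch print only the job ID
    jobid=$(sbatch --parsable "$script") || return 1
    # Wait while squeue still lists the job (-h: no header, -j: filter by job ID)
    while squeue -h -j "$jobid" 2>/dev/null | grep -q .; do
        sleep 1
    done
    # Output file name follows the --output=gpu_test_%j.out pattern in the script
    cat "gpu_test_${jobid}.out"
}
```

Call it as run_gpu_test gpu_test.sh from the directory you submit from. If the GPU is wired up correctly, the output should be the usual nvidia-smi table.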

If things go wrong

  • Slurm logs: /var/log/slurm/slurmctld.log and /var/log/slurm/slurmd.log
  • GPU not showing up? Check that nvidia-smi works first, then run scontrol show node | grep Gres
  • Node stuck in drain state? Run sudo scontrol update NodeName=localhost State=RESUME
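
Before resuming a drained node, it is usually worth seeing why Slurm drained it. A minimal sketch, assuming scontrol show node prints a Reason= field for drained or down nodes; node_reason is a hypothetical helper name:

```shell
#!/bin/bash
# node_reason (hypothetical helper): print the Reason= field for a node, if any.
node_reason() {
    local node="${1:-localhost}"
    scontrol show node "$node" | grep -o 'Reason=.*' \
        || echo "no Reason set for ${node}"
}
```

Call it as node_reason localhost. sinfo -R gives a similar per-node summary. A reason such as Low RealMemory likely means the RealMemory value in slurm.conf is larger than what the node actually reports.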