Notes on setting up Slurm on a single GPU machine, adapted from slurm-for-dummies. Useful if you want job scheduling on a local workstation without a full cluster.
Install and configure
Update and install the packages:
sudo apt update && sudo apt upgrade -y
sudo apt install slurm-wlm munge libmunge2 libmunge-dev -y
Start Munge (handles authentication between Slurm components):
sudo systemctl enable munge
sudo systemctl start munge
munge -n | unmunge | grep STATUS
# should print STATUS: Success (0)
Slurm config
Create /etc/slurm/slurm.conf (older Debian/Ubuntu packages use /etc/slurm-llnl/ instead). Adjust CPUs, RealMemory, and the GPU count to match your machine:
ClusterName=localhost
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/affinity
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/none
GresTypes=gpu
NodeName=localhost CPUs=1 RealMemory=1000 Gres=gpu:1 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
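The CPUs and RealMemory values can be read off the machine rather than guessed. Running slurmd -C prints the hardware as slurmd detects it, already in slurm.conf syntax. As a rough alternative, the sketch below builds a single NodeName line from values you pass in. The helper names node_line and mem_total_mb are hypothetical (not part of Slurm), and the sketch assumes the node is called localhost as in the config above.

```shell
# node_line CPUS MEM_MB GPUS: print a NodeName line for slurm.conf.
# Hypothetical helper for illustration; assumes the node is named localhost.
node_line() (
    cpus="$1"
    mem_mb="$2"
    gpus="$3"
    printf 'NodeName=localhost CPUs=%s RealMemory=%s Gres=gpu:%s State=UNKNOWN\n' \
        "$cpus" "$mem_mb" "$gpus"
)

# mem_total_mb [FILE]: total RAM in MB, rounded down, read from a
# /proc/meminfo-style file (defaults to /proc/meminfo; "-" reads stdin).
mem_total_mb() (
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
)

# Example, run on the target machine with one GPU:
#   node_line "$(nproc)" "$(mem_total_mb)" 1
```

Rounding memory down is deliberate: if RealMemory in slurm.conf is larger than what slurmd finds on the machine, the node tends to end up drained, so it is safer to configure slightly less than the detected total.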
GPU setup (GRES)
Create /etc/slurm/gres.conf:
NodeName=localhost Name=gpu File=/dev/nvidia0
If you have multiple GPUs, add a line per device (/dev/nvidia0, /dev/nvidia1, etc.) and update Gres=gpu:N in slurm.conf.
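For a machine with several GPUs, the per-device lines can be generated instead of typed by hand. This is a minimal sketch: gres_lines is a hypothetical helper name, and it assumes the device files are numbered contiguously from /dev/nvidia0.

```shell
# gres_lines COUNT: print one gres.conf line per GPU,
# for /dev/nvidia0 through /dev/nvidia(COUNT-1).
# Hypothetical helper; assumes contiguous device numbering from 0.
gres_lines() (
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'NodeName=localhost Name=gpu File=/dev/nvidia%d\n' "$i"
        i=$((i + 1))
    done
)

# Example: one line for every GPU that nvidia-smi lists.
#   gres_lines "$(nvidia-smi -L | wc -l)" | sudo tee /etc/slurm/gres.conf
```

Whatever count you use here should match the N in Gres=gpu:N in slurm.conf.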
Start it up
sudo systemctl enable slurmctld slurmd
sudo systemctl start slurmctld slurmd
Check that the node is up:
sinfo
scontrol show node
Test with a GPU job
gpu_test.sh:
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --output=gpu_test_%j.out
#SBATCH --error=gpu_test_%j.err
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
nvidia-smi
Submit it:
sbatch gpu_test.sh
If things go wrong
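- Slurm logs: /var/log/slurm/slurmctld.log and /var/log/slurm/slurmd.log. The config above sets neither SlurmctldLogFile nor SlurmdLogFile, so the daemons may log to syslog instead; if the files are missing, try journalctl -u slurmctld -u slurmd.
- GPU not showing up? Check that nvidia-smi works first, then run scontrol show node | grep Gres
- Node stuck in drain state? Run sudo scontrol update NodeName=localhost State=RESUME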
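The drain check can be scripted so the node is only resumed when it actually needs it. The sketch below is one way to do that, not an official recipe: node_state and needs_resume are hypothetical helper names, and the parsing assumes scontrol show node prints a State=... field, as it does for the setup described here. Before resuming, it is worth looking at why the node was drained (sinfo -R lists the reason).

```shell
# node_state: read `scontrol show node` output on stdin and print
# the value of the first field that starts with State=.
node_state() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^State=/) { sub(/^State=/, "", $i); print $i; exit }
    }'
}

# needs_resume STATE: succeed if the state string mentions DRAIN or DOWN.
needs_resume() {
    case "$1" in
        *DRAIN*|*DOWN*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   state="$(scontrol show node localhost | node_state)"
#   if needs_resume "$state"; then
#       sudo scontrol update NodeName=localhost State=RESUME
#   fi
```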