Based on the GitHub repository you’ve linked: https://github.com/SergioMEV/slurm-for-dummies, I’ll adapt the instructions for setting up Slurm on a single GPU machine. Here’s a modified guide tailored to your use case:
Setting up Slurm on a Single GPU Machine
- Update your system:
sudo apt update && sudo apt upgrade -y
- Install required packages:
sudo apt install slurm-wlm munge libmunge2 libmunge-dev -y
- Set up Munge:
sudo systemctl enable munge
sudo systemctl start munge
Test Munge:
munge -n | unmunge | grep STATUS
You should see a line reporting STATUS: Success (0).
- Configure Slurm: Create a Slurm configuration file:
sudo nano /etc/slurm/slurm.conf
Add the following content (adjust based on your system):
ClusterName=localhost
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/affinity
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/none
GresTypes=gpu
# Define the node once: CPU, memory, and GPU settings belong on the same NodeName line.
NodeName=localhost CPUs=1 RealMemory=1000 Gres=gpu:1 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
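The CPUs=1 and RealMemory=1000 values above are placeholders. As a rough sketch, a single NodeName definition can be derived from the machine itself. The helper names node_line and mem_mb_from_meminfo are hypothetical (they are not Slurm commands), and Gres=gpu:1 assumes exactly one GPU. If slurmd is already installed, slurmd -C prints the hardware it detects and is the more direct source.

```shell
#!/bin/bash
# Sketch: build a NodeName line for slurm.conf from the local hardware.
# Helper names are illustrative; they are not part of Slurm.

# Total RAM in MiB, read from a meminfo-style file (default: /proc/meminfo).
mem_mb_from_meminfo() {
  awk '/^MemTotal:/ {print int($2 / 1024)}' "${1:-/proc/meminfo}"
}

# Format the node definition; Gres=gpu:1 assumes a single GPU.
node_line() {
  echo "NodeName=localhost CPUs=$1 RealMemory=$2 Gres=gpu:1 State=UNKNOWN"
}

# nproc reports logical CPUs visible to this shell.
node_line "$(nproc)" "$(mem_mb_from_meminfo)"
```

The printed line is meant to replace the node definition in slurm.conf, so that localhost is defined only once.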
- Configure GRES for GPU:
sudo nano /etc/slurm/gres.conf
Add:
NodeName=localhost Name=gpu File=/dev/nvidia0
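/dev/nvidia0 is the usual path for the first GPU, but it is worth confirming that the device file exists before starting the daemons. A small sketch that prints one gres.conf line per NVIDIA device file it finds. gres_lines is a hypothetical helper, and its optional directory argument exists only so it can be tried against a scratch directory.

```shell
#!/bin/bash
# Sketch: print a gres.conf line for each /dev/nvidiaN device file.
# gres_lines is an illustrative name, not a Slurm tool.
gres_lines() {
  local dev_dir="${1:-/dev}" dev
  # nvidia[0-9]* matches nvidia0, nvidia1, ... but not nvidiactl or nvidia-uvm.
  for dev in "$dev_dir"/nvidia[0-9]*; do
    [ -e "$dev" ] || continue   # the glob matched nothing
    echo "NodeName=localhost Name=gpu File=$dev"
  done
}

gres_lines
```

If this prints nothing, the NVIDIA driver is probably not loaded. If it prints more than one line, the Gres=gpu:1 count in slurm.conf would need to match the number of devices.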
- Start Slurm services:
sudo systemctl enable slurmctld slurmd
sudo systemctl start slurmctld slurmd
- Verify Slurm is running:
sudo scontrol show node
sinfo
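To make the verification step scriptable, the node state reported by sinfo can be compared against the states in which jobs can start. A sketch, where node_is_usable is a hypothetical helper and the list of accepted states is my assumption about what counts as healthy on a one-node cluster:

```shell
#!/bin/bash
# Sketch: succeed if the given node state means the node can take or is running jobs.
# node_is_usable is an illustrative name, not a Slurm command.
node_is_usable() {
  case "$1" in
    idle|mixed|allocated|completing) return 0 ;;
    *) return 1 ;;   # down, drained, unknown, or flagged states such as "idle*"
  esac
}

# Example use once the daemons are up (requires sinfo on PATH):
#   state=$(sinfo -h -N -o '%T')
#   node_is_usable "$state" || echo "node state is '$state'; check the logs"
```

If the node stays down after the underlying problem is fixed, sudo scontrol update NodeName=localhost State=RESUME should return it to service.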
Using Slurm with GPU
Create a test job script (gpu_test.sh):
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --output=gpu_test_%j.out
#SBATCH --error=gpu_test_%j.err
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
nvidia-smi
Submit the job:
sbatch gpu_test.sh
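sbatch replies with a line of the form "Submitted batch job <id>", and %j in the script's --output pattern expands to that job ID. A sketch for locating the output file after submission. job_output_file is a hypothetical helper tied to the gpu_test_%j.out pattern used above.

```shell
#!/bin/bash
# Sketch: map sbatch's reply ("Submitted batch job 42") to the job's output file name.
# job_output_file is an illustrative name, not a Slurm command.
job_output_file() {
  local job_id="${1##* }"   # keep the text after the last space
  echo "gpu_test_${job_id}.out"
}

# Example use (requires a running Slurm setup):
#   reply=$(sbatch gpu_test.sh)
#   squeue                              # lists the job while it is pending or running
#   cat "$(job_output_file "$reply")"   # read the output once the job has finished
```

If the GPU was allocated correctly, the output file should contain the usual nvidia-smi table.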
Troubleshooting
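- Check Slurm logs:
sudo less /var/log/slurm/slurmctld.log
sudo less /var/log/slurm/slurmd.log
If these files do not exist, the daemons may be logging to the system journal instead; try sudo journalctl -u slurmctld -u slurmd.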
- Ensure NVIDIA drivers are correctly installed:
nvidia-smi
- Verify Slurm recognizes the GPU:
scontrol show node | grep Gres
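Before digging through logs, it can save time to confirm that every command this guide relies on is actually installed. A sketch: missing_tools is a hypothetical helper, and the command list simply mirrors the commands used above.

```shell
#!/bin/bash
# Sketch: print the name of each required command that is not on PATH.
# missing_tools is an illustrative name, not a Slurm command.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# The daemons (slurmctld, slurmd) usually live in /usr/sbin, which may not be
# on PATH for every user; run this as the user who will manage the services.
missing_tools munge unmunge slurmctld slurmd sbatch sinfo scontrol nvidia-smi
```

No output means all of the commands were found. Any name printed points to a package or driver that still needs installing.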
This setup creates a single-node Slurm cluster on your GPU machine, allowing you to submit and run GPU jobs using Slurm. Remember to adjust the configuration based on your specific hardware and requirements.
Citations: [1] https://github.com/SergioMEV/slurm-for-dummies