Based on the GitHub repository you've linked, https://github.com/SergioMEV/slurm-for-dummies, here's a guide adapted for setting up Slurm on a single GPU machine:

Setting up Slurm on a Single GPU Machine

  1. Update your system:

    sudo apt update && sudo apt upgrade -y
  2. Install required packages:

    sudo apt install slurm-wlm munge libmunge2 libmunge-dev -y
  3. Setup Munge:

    sudo systemctl enable munge
    sudo systemctl start munge

    Test Munge:

    munge -n | unmunge | grep STATUS

    You should see STATUS: SUCCESS.
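If unmunge reports an error instead of SUCCESS, the usual cause is wrong ownership or permissions on the key file. A hedged fix, assuming the Debian default paths:

```shell
#!/bin/sh
# If unmunge reports a credential error, the usual cause is key
# ownership/permissions; these are the Debian default paths.
if [ -e /etc/munge/munge.key ]; then
  sudo chown -R munge: /etc/munge /var/lib/munge /var/log/munge
  sudo chmod 0700 /etc/munge
  sudo chmod 0400 /etc/munge/munge.key
  sudo systemctl restart munge
else
  echo "/etc/munge/munge.key not found; reinstall the munge package"
fi
```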

  4. Configure Slurm: Create a Slurm configuration file:

    sudo nano /etc/slurm/slurm.conf

    Add the following content (adjust based on your system):

    ClusterName=localhost
    SlurmctldHost=localhost
    MpiDefault=none
    ProctrackType=proctrack/linuxproc
    ReturnToService=1
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    SlurmdSpoolDir=/var/spool/slurmd
    SlurmUser=slurm
    StateSaveLocation=/var/spool/slurm
    SwitchType=switch/none
    TaskPlugin=task/affinity
    
    SchedulerType=sched/backfill
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
    
    AccountingStorageType=accounting_storage/none
    JobAcctGatherType=jobacct_gather/none
    
    GresTypes=gpu
    
    NodeName=localhost CPUs=1 RealMemory=1000 Gres=gpu:1 State=UNKNOWN
    PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
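The CPUs and RealMemory values are placeholders, and Slurm will drain the node if they exceed the real hardware. Once slurm-wlm is installed, slurmd -C prints a ready-made NodeName line; the sketch below derives the same numbers with standard tools, assuming a Linux /proc layout:

```shell
#!/bin/sh
# Derive CPUs and RealMemory (in MB) for the NodeName line.
# `slurmd -C` prints equivalent output once slurm-wlm is installed.
cpus=$(nproc)
mem_mb=$(( $(awk '/^MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
echo "NodeName=localhost CPUs=${cpus} RealMemory=${mem_mb} Gres=gpu:1 State=UNKNOWN"
```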
    
  5. Configure GRES for GPU:

    sudo nano /etc/slurm/gres.conf

    Add:

    NodeName=localhost Name=gpu File=/dev/nvidia0
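Before starting the daemons, it's worth confirming that the device file named in gres.conf actually exists; a small check, assuming a single NVIDIA GPU at /dev/nvidia0:

```shell
#!/bin/sh
# Verify the GPU device file referenced in gres.conf is present.
for dev in /dev/nvidia0; do
  if [ -e "$dev" ]; then
    echo "found $dev"
  else
    echo "missing $dev (check the NVIDIA driver, or adjust gres.conf)"
  fi
done
```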
    
  6. Start Slurm services:

    sudo systemctl enable slurmctld slurmd
    sudo systemctl start slurmctld slurmd
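If slurmctld fails to start, a common cause is that the spool directories named in the config above do not exist yet; a hedged fix creating them with the expected ownership:

```shell
#!/bin/sh
# Create the state/spool directories named in slurm.conf.
if id slurm >/dev/null 2>&1; then
  sudo mkdir -p /var/spool/slurm /var/spool/slurmd
  sudo chown slurm: /var/spool/slurm
else
  echo "slurm user not found; is slurm-wlm installed?"
fi
```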
  7. Verify Slurm is running:

    sudo scontrol show node
    sinfo
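A common hiccup after the first start (or after a config edit) is the node sitting in a drained or down state in sinfo; a hedged helper that resumes it, assuming the single-node names used above:

```shell
#!/bin/sh
# Resume the node if it is drained/down (assumes the single-node
# config above; harmless to re-run).
state=$(sinfo -h -N -o '%T' 2>/dev/null || echo "unknown")
case "$state" in
  drained*|down*) sudo scontrol update NodeName=localhost State=RESUME ;;
  *) echo "node state: $state" ;;
esac
```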

Using Slurm with GPU

Create a test job script (gpu_test.sh):

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --output=gpu_test_%j.out
#SBATCH --error=gpu_test_%j.err
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
 
nvidia-smi

Submit the job:

sbatch gpu_test.sh
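Once submitted, the job can be followed from the queue; a sketch using sbatch's --parsable flag, which prints only the numeric job ID (the output filename follows the %j pattern from the script):

```shell
#!/bin/sh
# Submit gpu_test.sh and report where its output will land.
jobid=$(sbatch --parsable gpu_test.sh 2>/dev/null || echo "")
if [ -n "$jobid" ]; then
  squeue -j "$jobid"
  echo "output will land in gpu_test_${jobid}.out"
else
  echo "submission failed; is slurmctld running?"
fi
```

For a quick interactive check without a batch script, srun --gres=gpu:1 nvidia-smi runs nvidia-smi on the allocated GPU.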

Troubleshooting

  • Check the Slurm logs: sudo less /var/log/slurm/slurmctld.log and sudo less /var/log/slurm/slurmd.log
  • Ensure NVIDIA drivers are correctly installed: nvidia-smi
  • Verify Slurm recognizes the GPU: scontrol show node | grep Gres
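After any edit to slurm.conf or gres.conf, the running daemons need to pick up the change; a sketch:

```shell
#!/bin/sh
# Apply config changes to the running daemons without a full restart.
if command -v scontrol >/dev/null 2>&1; then
  sudo scontrol reconfigure || echo "reconfigure failed (needs root and a running slurmctld)"
else
  echo "scontrol not found; is slurm-wlm installed?"
fi
```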

This setup creates a single-node Slurm cluster on your GPU machine, allowing you to submit and run GPU jobs using Slurm. Remember to adjust the configuration based on your specific hardware and requirements.

Citations: [1] https://github.com/SergioMEV/slurm-for-dummies