
# Slurm HA Lab Guide (Preflight → Bring-up → HA Verification)

This doc is a battle-tested walkthrough for my Slurm HA lab, including the pitfalls we actually hit and how we recovered.

My lab setup:

  • Network: 10.250.6.0/23
  • ctld1=10.250.6.50, compute1=10.250.6.52, storage1=10.250.6.53, ctld2=10.250.6.54
  • OS: Ubuntu 22.04
  • storage1 acts as NTP server + NFS server
  • No Docker (removed on purpose)

⚠️ Note (important): in this run we went with Option A (cgroup enforcement disabled) to avoid the Slurm 21.08 + Ubuntu 22.04 cgroup-v2 friction.
This keeps the lab stable, but it has real limitations (see Section 8).



0) Basic sanity checks (All nodes)

Run on ctld1 / ctld2 / compute1 / storage1:

hostnamectl
ip -br a

ping -c 2 storage1
ping -c 2 ctld1

Fix /etc/hosts (Avoid the “localhost / 127.0.1.1” trap)

On every node, use a clean /etc/hosts so hostname resolution is consistent everywhere:

sudo tee /etc/hosts >/dev/null <<'EOF'
127.0.0.1 localhost

10.250.6.50 ctld1
10.250.6.54 ctld2
10.250.6.52 compute1
10.250.6.53 storage1

# IPv6
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
EOF

getent hosts ctld1 ctld2 compute1 storage1

✅ Goal: getent hosts ... returns the expected IPs on every node.
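If you'd rather confirm this from one seat, a quick loop like the sketch below works (assuming passwordless ssh as tommy to every node, the same pattern used later in Section 3.3; adjust user/hosts to your lab):

for h in ctld1 ctld2 compute1 storage1; do
  echo "==> $h"
  ssh tommy@$h 'getent hosts ctld1 ctld2 compute1 storage1'
done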



1) Chrony (Time sync) — storage1 as NTP server

Slurm + Munge are extremely sensitive to time drift. Fix time sync first.


1.1 Install and configure Chrony on storage1

sudo apt-get update
sudo apt-get install -y chrony

sudo sed -i 's/^pool /#pool /g' /etc/chrony/chrony.conf

sudo tee -a /etc/chrony/chrony.conf >/dev/null <<'EOF'
server time.google.com iburst
server time.cloudflare.com iburst

# Allow clients in my subnet
allow 10.250.6.0/23

# Safety net: if upstream is down, still serve time (higher stratum, won't override real sources)
local stratum 10
EOF

sudo systemctl restart chrony
sudo systemctl status chrony --no-pager
sudo ss -ulnp | grep ':123'

chronyc -n sources -v
chronyc tracking

✅ Goal: chronyc sources shows a selected source marked ^*, and chronyc tracking shows Leap status: Normal.


1.2 Configure Chrony clients on ctld1 / ctld2 / compute1

Run on each client node:

sudo apt-get update
sudo apt-get install -y chrony

sudo sed -i 's/^pool /#pool /g' /etc/chrony/chrony.conf

grep -q '^server storage1\b' /etc/chrony/chrony.conf || \
  echo 'server storage1 iburst' | sudo tee -a /etc/chrony/chrony.conf

sudo systemctl restart chrony

chronyc activity
chronyc -n sources -v
chronyc tracking

✅ Goal: client shows ^* 10.250.6.53 ... (or storage1) and becomes synchronized.
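To spot-check all three clients at once (same ssh-as-tommy assumption as elsewhere in this doc), something like this is enough; Reference ID should point at storage1:

for h in ctld1 ctld2 compute1; do
  echo "==> $h"
  ssh tommy@$h 'chronyc tracking | grep -E "Reference ID|Leap status"'
done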


Chrony pitfall we hit: storage1 sees no UDP/123 traffic

If storage1 sees no UDP/123 traffic from a client (tcpdump is empty), it’s usually not Chrony’s fault—something is blocking NTP (firewall / vSphere DFW / ACL).

# On storage1 (watch for client traffic)
sudo tcpdump -vv -ni any 'udp port 123 and host 10.250.6.50'

🧠 If you see storage1 talking to the internet (google/cloudflare) but never seeing client NTP packets, you’re looking at a network policy issue, not a chrony.conf issue.



2) NFS (Shared config + shared state)

We export from storage1:

  • /srv/slurm/etc → shared Slurm configs
  • /srv/slurm/state → shared Slurm state (critical for HA behavior)

2.1 Setup NFS server on storage1

sudo apt-get update
sudo apt-get install -y nfs-kernel-server

sudo mkdir -p /srv/slurm/etc /srv/slurm/state

# pitfall: on some installs, exports.d may not exist
sudo mkdir -p /etc/exports.d

2.2 Create a slurm system user (same UID/GID on all nodes)

This matters because NFS permissions are numeric. If UID/GID differ across nodes, writes will fail in confusing ways.

Run on all nodes (including storage1):

SLURM_UID=64030
SLURM_GID=64030

getent group slurm >/dev/null || sudo groupadd -g $SLURM_GID slurm
id -u slurm >/dev/null 2>&1 || sudo useradd -u $SLURM_UID -g $SLURM_GID -r -M -s /usr/sbin/nologin slurm

id slurm

✅ Goal: id slurm prints the same UID/GID on all nodes.
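To catch a UID/GID mismatch early, compare the outputs in one shot (ssh as tommy assumed); any node printing a different line is the one to fix before touching NFS:

for h in ctld1 ctld2 compute1 storage1; do
  echo "==> $h"
  ssh tommy@$h 'id slurm'
done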


2.3 Fix directory ownership/permissions on storage1

# state directories used by slurmctld and slurmd
sudo mkdir -p /srv/slurm/state/slurmctld /srv/slurm/state/slurmd

# allow Slurm daemons (user slurm) to write state
sudo chown -R slurm:slurm /srv/slurm/state
sudo chmod 2775 /srv/slurm/state /srv/slurm/state/slurmctld /srv/slurm/state/slurmd

# configs: pick who should edit (example: tommy)
sudo chown -R tommy:tommy /srv/slurm/etc
sudo chmod 2775 /srv/slurm/etc

🔒 Why not just allow root? Because NFS exports usually keep root_squash (safer).
That means client “root” becomes nobody, so sudo touch on the client fails—by design.


2.4 Export the directories from storage1

sudo tee /etc/exports.d/slurm.exports >/dev/null <<'EOF'
/srv/slurm/etc   10.250.6.0/23(rw,sync,no_subtree_check)
/srv/slurm/state 10.250.6.0/23(rw,sync,no_subtree_check)
EOF

sudo exportfs -ra
sudo exportfs -v | grep -E '/srv/slurm/(etc|state)' || true
sudo systemctl restart nfs-kernel-server

2.5 Mount from ctld1 / ctld2 / compute1

Run on each client node:

sudo apt-get update
sudo apt-get install -y nfs-common

sudo mkdir -p /mnt/slurm/etc /mnt/slurm/state

# mount test
sudo mount -t nfs -o vers=4 10.250.6.53:/srv/slurm/etc   /mnt/slurm/etc
sudo mount -t nfs -o vers=4 10.250.6.53:/srv/slurm/state /mnt/slurm/state

df -h | grep /mnt/slurm

Persist it (and avoid duplicates):

grep -q '10\.250\.6\.53:/srv/slurm/etc' /etc/fstab || \
  echo '10.250.6.53:/srv/slurm/etc   /mnt/slurm/etc   nfs4  rw,_netdev,hard,timeo=600,retrans=2  0  0' | sudo tee -a /etc/fstab

grep -q '10\.250\.6\.53:/srv/slurm/state' /etc/fstab || \
  echo '10.250.6.53:/srv/slurm/state /mnt/slurm/state nfs4  rw,_netdev,hard,timeo=600,retrans=2  0  0' | sudo tee -a /etc/fstab

sudo umount /mnt/slurm/etc /mnt/slurm/state
sudo mount -a

df -h | grep /mnt/slurm
systemctl status remote-fs.target --no-pager

2.6 Write test (do NOT test as root; root_squash will fail it)

# tommy writes etc (if you allow it)
touch /mnt/slurm/etc/_write_test_from_$(hostname) && echo OK || echo FAIL

# slurm writes state (what we really care about)
sudo -u slurm touch /mnt/slurm/state/_write_test_from_$(hostname) && echo OK || echo FAIL

ls -l /mnt/slurm/etc/_write_test_from_$(hostname) /mnt/slurm/state/_write_test_from_$(hostname)

✅ Goal: /state write succeeds as user slurm.
If you used sudo touch ... and got Permission denied, that was root_squash doing its job.



3) Munge (Shared authentication key)

Munge is the authentication passport used between Slurm components.


3.1 Install Munge on all nodes

Ubuntu 22.04 note: there is no munge-tools package.
You only need munge + libmunge2 for the basics.

Run on all nodes:

sudo apt-get update
sudo apt-get install -y munge libmunge2

sudo systemctl enable --now munge
systemctl status munge --no-pager

command -v munge unmunge remunge

3.2 Generate key on ctld1

sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024 status=none
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
sudo systemctl restart munge

munge -n | unmunge | egrep 'STATUS|ENCODE_HOST'

✅ Goal: STATUS success, and ENCODE_HOST shows ctld1 (10.250.6.50) (not localhost).


3.3 Distribute the same key to ctld2 / compute1 / storage1

We used the safer approach: scp then sudo install.

On ctld1:

sudo install -m 0644 /etc/munge/munge.key /tmp/munge.key
sudo chown $USER:$USER /tmp/munge.key
sha256sum /tmp/munge.key

for h in ctld2 compute1 storage1; do
  echo "==> scp to $h"
  scp /tmp/munge.key tommy@$h:/tmp/munge.key
done

for h in ctld2 compute1 storage1; do
  echo "==> install on $h"
  ssh -tt tommy@$h '
    sudo install -o munge -g munge -m 0400 /tmp/munge.key /etc/munge/munge.key &&
    sudo systemctl restart munge &&
    sudo rm -f /tmp/munge.key
  '
done

sudo rm -f /tmp/munge.key

🧨 Pitfall we hit: piping the key over ssh + sudo can blow up because sudo wants a TTY.
That’s why scp + sudo install is boring… and boring is good.
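For the record, the pattern that bit us looked roughly like this. Treat it as an anti-example: if the remote sudo needs a password, it has no TTY to prompt on and just aborts instead of writing the key.

# ANTI-EXAMPLE (do not use)
sudo cat /etc/munge/munge.key | ssh tommy@ctld2 'sudo tee /etc/munge/munge.key >/dev/null'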


3.4 Verify key matches everywhere

echo "ctld1:"; sudo sha256sum /etc/munge/munge.key
for h in ctld2 compute1 storage1; do
  echo "$h:"
  ssh -tt tommy@$h 'sudo sha256sum /etc/munge/munge.key'
done

3.5 Cross-host authentication test (Most important)

for h in ctld2 compute1 storage1; do
  echo "==> ctld1 -> $h"
  munge -n | ssh tommy@$h unmunge | egrep 'STATUS|ENCODE_HOST' | head
done

for h in ctld2 compute1 storage1; do
  echo "==> $h -> ctld1"
  ssh tommy@$h 'munge -n' | unmunge | egrep 'STATUS|ENCODE_HOST' | head
done

✅ Goal: every decode succeeds.


Munge pitfall we hit: ENCODE_HOST: localhost (127.0.1.1)

That happens when the node’s hostname resolves via 127.0.1.1 (Ubuntu default behavior).
Fix /etc/hosts (Section 0) so hostname -> real IP, then restart munge.
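After fixing /etc/hosts on the affected node, a quick re-check confirms the fix took:

getent hosts "$(hostname)"              # should show the node's 10.250.6.x address, not 127.0.1.1
sudo systemctl restart munge
munge -n | unmunge | grep ENCODE_HOST   # should now show the real hostname/IP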



4) Slurm bring-up (No Docker)

Now the environment is Slurm-ready. The minimum pieces:

  • storage1: slurmdbd + MariaDB (accounting DB)
  • ctld1 / ctld2: slurmctld (HA controllers)
  • compute1: slurmd (compute daemon)

📌 In our run, sacctmgr/sacct were already working by the time we reached Section 5, which means slurmdbd + the DB were up.
Still, below is the clean bring-up flow so it's reproducible.


4.1 Install Slurm packages (example: Ubuntu repo)

Run on each node depending on role:

ctld1 / ctld2

sudo apt-get update
sudo apt-get install -y slurmctld slurm-client

compute1

sudo apt-get update
sudo apt-get install -y slurmd slurm-client

storage1

sudo apt-get update
sudo apt-get install -y slurmdbd mariadb-server

⚠️ Version note: the Ubuntu 22.04 repo can hand you Slurm 21.08.x (we got slurmd version 21.08.5).
That version is where the cgroup-v2 pain usually starts on Jammy (see Section 8).
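The DB side wasn't captured command-by-command in this run, so here is a minimal sketch of what storage1 needs before slurmdbd will start. The database name (slurm_acct_db), DB user (slurm), and password are placeholder assumptions; they must match whatever your slurmdbd.conf uses (the master copy lives in /mnt/slurm/etc in this lab):

# MariaDB: accounting DB + a slurm DB user (password is a placeholder)
sudo mysql <<'SQL'
CREATE DATABASE IF NOT EXISTS slurm_acct_db;
CREATE USER IF NOT EXISTS 'slurm'@'localhost' IDENTIFIED BY 'CHANGE_ME';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
SQL

# slurmdbd.conf must be readable only by the slurm user
sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
sudo chmod 0600 /etc/slurm/slurmdbd.conf

sudo systemctl enable --now slurmdbd
sudo journalctl -u slurmdbd -n 50 --no-pager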


4.2 Place configs on shared NFS

We keep configs on /mnt/slurm/etc, then copy/symlink into /etc/slurm so the services still read from the paths they expect.

On ctld1 (authoritative editor):

sudo mkdir -p /mnt/slurm/etc
sudo mkdir -p /etc/slurm

# example:
# /mnt/slurm/etc/slurm.conf
# /mnt/slurm/etc/slurmdbd.conf
# /mnt/slurm/etc/cgroup.conf (only if you enable cgroup - we disabled it)

# copy/sync into /etc/slurm
sudo install -m 0644 /mnt/slurm/etc/slurm.conf /etc/slurm/slurm.conf

Then sync /etc/slurm/slurm.conf to ctld2 and compute1 (or mount /mnt/slurm/etc on them and install locally the same way).
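A minimal sync sketch from ctld1, mirroring the scp + install pattern from Section 3.3 (passwordless ssh as tommy assumed):

for h in ctld2 compute1; do
  echo "==> $h"
  scp /mnt/slurm/etc/slurm.conf tommy@$h:/tmp/slurm.conf
  ssh -tt tommy@$h 'sudo install -m 0644 /tmp/slurm.conf /etc/slurm/slurm.conf && rm -f /tmp/slurm.conf'
done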

✅ Goal: all nodes have the same slurm.conf.
We verified this in our run with sha256sum /etc/slurm/slurm.conf across nodes.


4.3 Minimal HA lines to confirm in slurm.conf

On both controllers:

sudo egrep -n '^(SlurmctldHost|ControlMachine|BackupController|ControlAddr|BackupAddr|SlurmctldPort|SlurmdPort|StateSaveLocation)' /etc/slurm/slurm.conf

Example we saw:

  • SlurmctldHost=ctld1(10.250.6.50)
  • SlurmctldHost=ctld2(10.250.6.54)
  • SlurmctldPort=6817
  • SlurmdPort=6818
  • StateSaveLocation=/mnt/slurm/state/slurmctld

✅ Goal: StateSaveLocation is on shared NFS so either controller can take over cleanly.
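For orientation, a stripped-down slurm.conf carrying just the HA-relevant pieces might look like the sketch below. The SlurmctldHost/port/StateSaveLocation values are the ones from this run; ClusterName, SlurmUser, and the NodeName/PartitionName lines are illustrative assumptions, so keep your own values there:

# /etc/slurm/slurm.conf (HA-relevant sketch, not a full config)
ClusterName=lab                                # assumption: use your own
SlurmUser=slurm
SlurmctldHost=ctld1(10.250.6.50)               # primary controller
SlurmctldHost=ctld2(10.250.6.54)               # backup controller
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/mnt/slurm/state/slurmctld   # shared NFS, so either ctld can take over

# illustrative node/partition lines: match your real hardware
NodeName=compute1 NodeAddr=10.250.6.52 CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=compute1 Default=YES MaxTime=INFINITE State=UP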



5) Accounting (slurmdbd + sacctmgr) — fix permission denied

Symptom we hit:

  • sacctmgr add account ... → Access/permission denied

Fix: create a proper admin association (root admin) and then add accounts/users.

On ctld1:

# 1) inspect current DB
sudo sacctmgr -n list user format=User,AdminLevel,DefaultAccount
sudo sacctmgr -n list account format=Account,Organization,Description

# 2) ensure root account exists
# (-P gives parsable output, so grep sees unpadded names)
sudo sacctmgr -nP list account format=Account | grep -qx root || \
  sudo sacctmgr -i add account name=root Description="Slurm root" Organization="lab"

# 3) ensure root user is admin
sudo sacctmgr -nP list user format=User | grep -qx root || \
  sudo sacctmgr -i add user name=root account=root AdminLevel=Administrator

# 4) default account
sudo sacctmgr -nP list account format=Account | grep -qx default || \
  sudo sacctmgr -i add account name=default Description="Default" Organization="lab"

# 5) your user
sudo sacctmgr -nP list user format=User | grep -qx tommy || \
  sudo sacctmgr -i add user name=tommy account=default

# 6) verify
sudo sacctmgr -n show assoc format=Cluster,Account,User,AdminLevel

Cluster check:

sudo sacctmgr -i show cluster


6) Smoke test (submit a job end-to-end)

On ctld1:

scontrol ping
sinfo

sbatch --wrap="echo whoami=$(whoami); hostname; sleep 2; date"
squeue
sacct -X --format=JobID,User,Account,State,Elapsed,NodeList%20

✅ Goal: job completes on compute1 and shows in sacct with correct account.



7) HA verification runbook (the “don’t lie to myself” version)

The question that matters here: what's the most reliable way to confirm HA?
The practical answer: logs + a functional job run.


7.1 Journal commands that include timestamps

Use one of these:

# follow with ISO timestamps
sudo journalctl -u slurmctld -f -o short-iso

# last N lines with timestamps
sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager

# since a time point
sudo journalctl -u slurmctld --since "2025-12-16 02:00:00" -o short-iso --no-pager

🧠 That -o short-iso is the “show me the time” switch.


7.2 Baseline: both controllers online

From either node:

scontrol ping

Expected baseline behavior:

  • both slurmctld processes are running
  • one controller is active (primary in practice), the other is standby

⚠️ Important: scontrol ping labels “(primary)/(backup)” do not flip dynamically.
They reflect how the nodes are configured (ctld1 is “primary host” in config, ctld2 is “backup host”), not who is currently active.


7.3 Find the actual active controller

On ctld1 and ctld2, check for “Running as primary controller”:

sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager | egrep -i 'Running as primary|taking over|standby|not responding'

  • If you see Running as primary controller, that node is currently active.
  • If you see ... taking over, that node just became active.

7.4 Simulate failover (stop the active one)

Example: stop ctld1 slurmctld, watch ctld2.

On ctld1

sudo systemctl stop slurmctld

On ctld2

sudo journalctl -u slurmctld -f -o short-iso | egrep -i 'taking over|Running as primary|not responding'

You should see something like:

  • ControlMachine ctld1 not responding, BackupController1 ctld2 taking over

Then prove it’s not “just logs”:

sbatch --wrap="echo from=$(hostname); whoami=$(whoami); sleep 2; date"
sacct -X --format=JobID,User,Account,State,Elapsed,NodeList%20 | tail -n 5

✅ Pass condition: a job runs successfully while ctld1 is down.


7.5 Bring ctld1 back (don’t fight the cluster)

sudo systemctl start slurmctld

At this point:

  • It’s OK if ctld2 stays active.
  • It’s OK if roles don’t “flip” in scontrol ping output.
  • The goal is service continuity, not ego (“my ctld1 must be king”).

7.6 About scontrol takeover (why it felt weird)

takeover is a forceful action. If you hammer it or do it at the wrong time, you can briefly make both sides unhappy.

If you must force a role change, the safest pattern is (commands sketched right after this list):

  1. ensure both controllers are healthy
  2. run takeover once
  3. wait for logs to settle
  4. run a job to confirm
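As a concrete sketch (run from a controller node; scontrol takeover needs root or the SlurmUser):

# 1) both controllers healthy?
scontrol ping

# 2) force the role change once
sudo scontrol takeover

# 3) let the logs settle before judging the result
sudo journalctl -u slurmctld -n 50 -o short-iso --no-pager

# 4) prove it with a real job
sbatch --wrap='hostname; sleep 2; date'
sacct -X --format=JobID,State,NodeList%20 | tail -n 3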

✅ Our conclusion held: “It doesn’t matter which node is primary; what matters is that both controllers are online and jobs keep running.”



8) The cgroup incident (and what we changed)

Symptom we hit:

  • compute node got DRAIN
  • job failed with Plugin initialization failed
  • slurmd.log showed:
    • unable to mount cpuset cgroup namespace: Device or resource busy
    • Couldn't load specified plugin name for task/cgroup

We confirmed compute1 was on cgroup v2:

stat -fc %T /sys/fs/cgroup
mount | grep -E 'cgroup|cgroup2'

What we did (Option A): Disable cgroup enforcement

This got jobs running immediately.
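The exact lines aren't reproduced above, so treat this as a sketch of what “Option A” typically means in slurm.conf on this stack (standard Slurm plugin names; double-check against your installed version before copying):

# slurm.conf: cgroup enforcement off (Option A sketch)
ProctrackType=proctrack/linuxproc    # instead of proctrack/cgroup
TaskPlugin=task/none                 # instead of task/cgroup
# JobAcctGatherType=jobacct_gather/linux   # optional: basic accounting without cgroups

# then push the config everywhere, restart daemons, and un-drain the node
sudo systemctl restart slurmctld                      # on the active controller
sudo systemctl restart slurmd                         # on compute1
sudo scontrol update nodename=compute1 state=resume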

✅ Result: stable lab + HA verification possible.


What this means (real limitations)

With cgroup enforcement disabled:

  • Slurm cannot reliably enforce CPU/memory limits per job
  • isolation is weaker (jobs can be noisier neighbors)
  • accounting may be less precise
  • runaway processes are harder to contain

🧠 Forward-looking (customer-facing):
For production on Ubuntu 22.04, the real fix is usually to run a Slurm version with proper cgroup v2 support (22.05 or newer), or to pin the host to cgroup v1 if you must stay on older Slurm.
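If you do have to pin Ubuntu 22.04 back to cgroup v1 for an older Slurm, the usual lever is a kernel command-line switch. A sketch (assumes GRUB, run once, needs a reboot):

# tell systemd to boot the legacy (v1/hybrid) cgroup hierarchy
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&systemd.unified_cgroup_hierarchy=0 /' /etc/default/grub
sudo update-grub
sudo reboot

# after reboot: this should no longer report cgroup2fs
stat -fc %T /sys/fs/cgroup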



9) Quick troubleshooting cheatsheet (by node)

ctld1 / ctld2 (controllers)

# health
scontrol ping
sinfo
scontrol show node compute1

# logs (with timestamps)
sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager
sudo journalctl -u slurmctld -f -o short-iso

# who is active (look for "Running as primary controller")
sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager | egrep -i 'Running as primary|taking over|standby|not responding'

# config drift check
sha256sum /etc/slurm/slurm.conf

compute1 (compute daemon)

systemctl status slurmd --no-pager
sudo journalctl -u slurmd -n 200 -o short-iso --no-pager

# common error zone
sudo tail -n 200 /var/log/slurm/slurmd.log 2>/dev/null || true

# cgroup reality check
stat -fc %T /sys/fs/cgroup
mount | grep -E 'cgroup|cgroup2'

storage1 (NTP + NFS)

# NTP
systemctl status chrony --no-pager
sudo ss -ulnp | grep ':123'
chronyc -n sources -v
chronyc tracking

# NFS
systemctl status nfs-kernel-server --no-pager
sudo exportfs -v | grep -E '/srv/slurm/(etc|state)' || true

# packet-level NTP debug (when in doubt)
sudo tcpdump -vv -ni any 'udp port 123'


Done: “Slurm-prepared” definition

I consider the cluster “Slurm-prepared” when:

  • Chrony
    • storage1: synced to upstream (Leap status: Normal)
    • clients: synced to storage1 (selected ^* source)
  • NFS
    • all clients mounted via fstab
    • user slurm can write to /mnt/slurm/state
  • Munge
    • same munge.key hash on every node
    • cross-host decode tests all succeed
  • Slurm
    • sbatch works and sacct records jobs
    • HA failover works: stop one controller, the other continues running jobs

✅ If you pass all of the above, you’re basically done.
After that, it’s not “setup”, it’s “operations”.
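For convenience, most of that checklist collapses into a one-screen sweep from ctld1 (ssh as tommy assumed; storage1 has no /mnt/slurm mount, hence the || true):

for h in ctld1 ctld2 compute1 storage1; do
  echo "===== $h ====="
  ssh -tt tommy@$h '
    chronyc tracking | grep "Leap status"
    df -h | grep /mnt/slurm || true
    sudo sha256sum /etc/munge/munge.key
  '
done

scontrol ping
sinfo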