# Slurm HA Lab Guide (Preflight → Bring-up → HA Verification)
This doc is a battle-tested walkthrough for my Slurm HA lab, including the pitfalls we actually hit and how we recovered.
My lab setup:
- Network: 10.250.6.0/23
  - ctld1 = 10.250.6.50
  - compute1 = 10.250.6.52
  - storage1 = 10.250.6.53
  - ctld2 = 10.250.6.54
- OS: Ubuntu 22.04
- storage1 acts as NTP server + NFS server
- No Docker (removed on purpose)
⚠️ Note (important): in this run we took Option A (cgroup enforcement disabled) to avoid Slurm 21.08 + Ubuntu 22.04 cgroup-v2 friction.
This makes the lab stable, but has real limitations (see Section 8).
0) Basic sanity checks (All nodes)
Run on ctld1 / ctld2 / compute1 / storage1:
hostnamectl
ip -br a
ping -c 2 storage1
ping -c 2 ctld1
Fix /etc/hosts (Avoid the “localhost / 127.0.1.1” trap)
On every node, use a clean /etc/hosts so hostname resolution is consistent everywhere:
sudo tee /etc/hosts >/dev/null <<'EOF'
127.0.0.1 localhost
10.250.6.50 ctld1
10.250.6.54 ctld2
10.250.6.52 compute1
10.250.6.53 storage1
# IPv6
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
EOF
getent hosts ctld1 ctld2 compute1 storage1
✅ Goal:
getent hosts ... returns the expected IPs on every node.
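A quick self-check that catches the 127.0.1.1 trap early (a small sketch; assumes the hostnames above):
# the node's own hostname must resolve to its real IP, not 127.0.1.1
getent hosts "$(hostname)"
getent hosts "$(hostname)" | grep -q '^127\.0\.1\.1' && echo "WARN: still resolving to 127.0.1.1" || echo "OK"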
1) Chrony (Time sync) — storage1 as NTP server
Slurm + Munge are extremely sensitive to time drift. Fix time sync first.
1.1 Install and configure Chrony on storage1
sudo apt-get update
sudo apt-get install -y chrony
sudo sed -i 's/^pool /#pool /g' /etc/chrony/chrony.conf
sudo tee -a /etc/chrony/chrony.conf >/dev/null <<'EOF'
server time.google.com iburst
server time.cloudflare.com iburst
# Allow clients in my subnet
allow 10.250.6.0/23
# Safety net: if upstream is down, still serve time (higher stratum, won't override real sources)
local stratum 10
EOF
sudo systemctl restart chrony
sudo systemctl status chrony --no-pager
sudo ss -ulnp | grep ':123'
chronyc -n sources -v
chronyc tracking
✅ Goal:
chronyc sources shows a selected source marked ^*, and chronyc tracking shows Leap status: Normal.
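Once the clients from 1.2 are pointed at storage1, you can also confirm from the server side that it is actually serving them (chronyc clients needs root on storage1):
sudo chronyc clients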
1.2 Configure Chrony clients on ctld1 / ctld2 / compute1
Run on each client node:
sudo apt-get update
sudo apt-get install -y chrony
sudo sed -i 's/^pool /#pool /g' /etc/chrony/chrony.conf
grep -q '^server storage1\b' /etc/chrony/chrony.conf || \
echo 'server storage1 iburst' | sudo tee -a /etc/chrony/chrony.conf
sudo systemctl restart chrony
chronyc activity
chronyc -n sources -v
chronyc tracking
✅ Goal: client shows
^* 10.250.6.53 ... (or storage1) and becomes synchronized.
Chrony pitfall we hit: storage1 sees no UDP/123 traffic
If storage1 sees no UDP/123 traffic from a client (tcpdump is empty), it’s usually not Chrony’s fault—something is blocking NTP (firewall / vSphere DFW / ACL).
# On storage1 (watch for client traffic)
sudo tcpdump -ni any udp port 123 and host 10.250.6.50 -vv
🧠 If you see storage1 talking to the internet (google/cloudflare) but never seeing client NTP packets, you’re looking at a network policy issue, not a chrony.conf issue.
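If the block turns out to be the host firewall rather than an external policy, a sketch of the fix (this assumes ufw is in use; a vSphere DFW or upstream ACL has to be fixed on that side instead):
# on storage1: allow NTP from the lab subnet
sudo ufw allow from 10.250.6.0/23 to any port 123 proto udp
sudo ufw status verbose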
2) NFS (Shared config + shared state)
We export from storage1:
- /srv/slurm/etc → shared Slurm configs
- /srv/slurm/state → shared Slurm state (critical for HA behavior)
2.1 Setup NFS server on storage1
sudo apt-get update
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /srv/slurm/etc /srv/slurm/state
# pitfall: on some installs, exports.d may not exist
sudo mkdir -p /etc/exports.d
2.2 Create a slurm system user (same UID/GID on all nodes)
This matters because NFS permissions are numeric. If UID/GID differ across nodes, writes will fail in confusing ways.
Run on all nodes (including storage1):
SLURM_UID=64030
SLURM_GID=64030
getent group slurm >/dev/null || sudo groupadd -g $SLURM_GID slurm
id -u slurm >/dev/null 2>&1 || sudo useradd -u $SLURM_UID -g $SLURM_GID -r -M -s /usr/sbin/nologin slurm
id slurm
✅ Goal:
id slurm prints the same UID/GID on all nodes.
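A quick way to compare from one place instead of logging into each node (a sketch; assumes ssh access as tommy, same as the Munge steps later):
for h in ctld1 ctld2 compute1 storage1; do
  echo -n "$h: "
  ssh tommy@$h 'id slurm'
done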
2.3 Fix directory ownership/permissions on storage1
# state directories used by slurmctld and slurmd
sudo mkdir -p /srv/slurm/state/slurmctld /srv/slurm/state/slurmd
# allow Slurm daemons (user slurm) to write state
sudo chown -R slurm:slurm /srv/slurm/state
sudo chmod 2775 /srv/slurm/state /srv/slurm/state/slurmctld /srv/slurm/state/slurmd
# configs: pick who should edit (example: tommy)
sudo chown -R tommy:tommy /srv/slurm/etc
sudo chmod 2775 /srv/slurm/etc
🔒 Why not just allow root? Because NFS exports usually keep root_squash (safer).
That means client "root" becomes nobody, so sudo touch on the client fails by design.
2.4 Export the directories from storage1
sudo tee /etc/exports.d/slurm.exports >/dev/null <<'EOF'
/srv/slurm/etc 10.250.6.0/23(rw,sync,no_subtree_check)
/srv/slurm/state 10.250.6.0/23(rw,sync,no_subtree_check)
EOF
sudo exportfs -ra
sudo exportfs -v | grep -E '/srv/slurm/(etc|state)' || true
sudo systemctl restart nfs-kernel-server
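From any client (once nfs-common from 2.5 is installed), confirm the exports are actually visible:
showmount -e 10.250.6.53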
2.5 Mount from ctld1 / ctld2 / compute1
Run on each client node:
sudo apt-get update
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/slurm/etc /mnt/slurm/state
# mount test
sudo mount -t nfs -o vers=4 10.250.6.53:/srv/slurm/etc /mnt/slurm/etc
sudo mount -t nfs -o vers=4 10.250.6.53:/srv/slurm/state /mnt/slurm/state
df -h | grep /mnt/slurm
Persist it (and avoid duplicates):
grep -q '10\.250\.6\.53:/srv/slurm/etc' /etc/fstab || \
echo '10.250.6.53:/srv/slurm/etc /mnt/slurm/etc nfs4 rw,_netdev,hard,timeo=600,retrans=2 0 0' | sudo tee -a /etc/fstab
grep -q '10\.250\.6\.53:/srv/slurm/state' /etc/fstab || \
echo '10.250.6.53:/srv/slurm/state /mnt/slurm/state nfs4 rw,_netdev,hard,timeo=600,retrans=2 0 0' | sudo tee -a /etc/fstab
sudo umount /mnt/slurm/etc /mnt/slurm/state
sudo mount -a
df -h | grep /mnt/slurm
systemctl status remote-fs.target --no-pager
2.6 Write test (Do NOT use sudo/root here)
# tommy writes etc (if you allow it)
touch /mnt/slurm/etc/_write_test_from_$(hostname) && echo OK || echo FAIL
# slurm writes state (what we really care about)
sudo -u slurm touch /mnt/slurm/state/_write_test_from_$(hostname) && echo OK || echo FAIL
ls -l /mnt/slurm/etc/_write_test_from_$(hostname) /mnt/slurm/state/_write_test_from_$(hostname)
✅ Goal:
the write to /mnt/slurm/state succeeds as user slurm.
If you used sudo touch ... and got Permission denied, that was root_squash doing its job.
3) Munge (Shared authentication key)
Munge is the authentication passport used between Slurm components.
3.1 Install Munge on all nodes
Ubuntu 22.04 note: there is no munge-tools package.
You only need munge + libmunge2 for the basics.
Run on all nodes:
sudo apt-get update
sudo apt-get install -y munge libmunge2
sudo systemctl enable --now munge
systemctl status munge --no-pager
command -v munge unmunge remunge
3.2 Generate key on ctld1
sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024 status=none
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
sudo systemctl restart munge
munge -n | unmunge | egrep 'STATUS|ENCODE_HOST'
✅ Goal: STATUS success, and
ENCODE_HOST shows ctld1 (10.250.6.50), not localhost.
3.3 Distribute the same key to ctld2 / compute1 / storage1
We used the safer approach: scp then sudo install.
On ctld1:
sudo install -m 0644 /etc/munge/munge.key /tmp/munge.key
sudo chown $USER:$USER /tmp/munge.key
sha256sum /tmp/munge.key
for h in ctld2 compute1 storage1; do
echo "==> scp to $h"
scp /tmp/munge.key tommy@$h:/tmp/munge.key
done
for h in ctld2 compute1 storage1; do
echo "==> install on $h"
ssh -tt tommy@$h '
sudo install -o munge -g munge -m 0400 /tmp/munge.key /etc/munge/munge.key &&
sudo systemctl restart munge &&
sudo rm -f /tmp/munge.key
'
done
sudo rm -f /tmp/munge.key
🧨 Pitfall we hit: piping the key over ssh + sudo can blow up because sudo wants a TTY.
That's why scp + sudo install is boring… and boring is good.
3.4 Verify key matches everywhere
echo "ctld1:"; sudo sha256sum /etc/munge/munge.key
for h in ctld2 compute1 storage1; do
echo "$h:"
ssh -tt tommy@$h 'sudo sha256sum /etc/munge/munge.key'
done
3.5 Cross-host authentication test (Most important)
for h in ctld2 compute1 storage1; do
echo "==> ctld1 -> $h"
munge -n | ssh tommy@$h unmunge | egrep 'STATUS|ENCODE_HOST' | head
done
for h in ctld2 compute1 storage1; do
echo "==> $h -> ctld1"
ssh tommy@$h 'munge -n' | unmunge | egrep 'STATUS|ENCODE_HOST' | head
done
✅ Goal: every decode succeeds.
Munge pitfall we hit: ENCODE_HOST: localhost (127.0.1.1)
That happens when the node’s hostname resolves via 127.0.1.1 (Ubuntu default behavior).
Fix /etc/hosts (Section 0) so hostname -> real IP, then restart munge.
4) Slurm bring-up (No Docker)
Now the environment is Slurm-ready. The minimum pieces:
- storage1: slurmdbd + MariaDB (accounting DB)
- ctld1 / ctld2: slurmctld (HA controllers)
- compute1: slurmd (compute daemon)
📌 You already had sacctmgr/sacct working later, which means slurmdbd + DB were up in your run.
Still, below is the clean bring-up flow so it’s reproducible.
4.1 Install Slurm packages (example: Ubuntu repo)
Run on each node depending on role:
ctld1 / ctld2
sudo apt-get update
sudo apt-get install -y slurmctld slurm-client
compute1
sudo apt-get update
sudo apt-get install -y slurmd slurm-client
storage1
sudo apt-get update
sudo apt-get install -y slurmdbd mariadb-server
⚠️ Version note: the Ubuntu 22.04 repo can give Slurm 21.08.x in some cases (as you saw: slurmd version 21.08.5).
That version is where the cgroup-v2 pain usually starts on Jammy (see Section 8).
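Worth recording what the repo actually gave you before going further (run whichever daemons exist on that node):
slurmd -V 2>/dev/null
slurmctld -V 2>/dev/null
sinfo --version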
4.2 Place configs on shared NFS
We keep configs on /mnt/slurm/etc, then copy/symlink into /etc/slurm (so the service expects the normal paths).
On ctld1 (authoritative editor):
sudo mkdir -p /mnt/slurm/etc
sudo mkdir -p /etc/slurm
# example:
# /mnt/slurm/etc/slurm.conf
# /mnt/slurm/etc/slurmdbd.conf
# /mnt/slurm/etc/cgroup.conf (only if you enable cgroup - we disabled it)
# copy/sync into /etc/slurm
sudo install -m 0644 /mnt/slurm/etc/slurm.conf /etc/slurm/slurm.conf
Then sync /etc/slurm/slurm.conf to ctld2 and compute1 (or mount /mnt/slurm/etc on them and install locally the same way).
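A minimal sketch of that "install locally the same way" step on ctld2 / compute1 (assumes /mnt/slurm/etc is mounted there per Section 2.5):
sudo mkdir -p /etc/slurm
sudo install -m 0644 /mnt/slurm/etc/slurm.conf /etc/slurm/slurm.conf
sha256sum /etc/slurm/slurm.conf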
✅ Goal: all nodes have the same slurm.conf.
You already verified this with sha256sum /etc/slurm/slurm.conf across nodes.
4.3 Minimal HA lines to confirm in slurm.conf
On both controllers:
sudo egrep -n '^(SlurmctldHost|ControlMachine|BackupController|ControlAddr|BackupAddr|SlurmctldPort|SlurmdPort|StateSaveLocation)' /etc/slurm/slurm.conf
Example we saw:
SlurmctldHost=ctld1(10.250.6.50)
SlurmctldHost=ctld2(10.250.6.54)
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/mnt/slurm/state/slurmctld
✅ Goal:
StateSaveLocation is on shared NFS so either controller can take over cleanly.
5) Accounting (slurmdbd + sacctmgr) — fix permission denied
Symptom we hit:
sacctmgr add account ... → Access/permission denied
Fix: create a proper admin association (root admin) and then add accounts/users.
On ctld1:
# 1) inspect current DB
sudo sacctmgr -n list user format=User,AdminLevel,DefaultAccount
sudo sacctmgr -n list account format=Account,Organization,Description
# 2) ensure root account exists
sudo sacctmgr -n list account format=Account | grep -qw root || \
sudo sacctmgr -i add account name=root Description="Slurm root" Organization="lab"
# 3) ensure root user is admin
sudo sacctmgr -n list user format=User | grep -qw root || \
sudo sacctmgr -i add user name=root account=root AdminLevel=Administrator
# 4) default account
sudo sacctmgr -n list account format=Account | grep -qw default || \
sudo sacctmgr -i add account name=default Description="Default" Organization="lab"
# 5) your user
sudo sacctmgr -n list user format=User | grep -qw tommy || \
sudo sacctmgr -i add user name=tommy account=default
# 6) verify
sudo sacctmgr -n show assoc format=Cluster,Account,User,AdminLevel
Cluster check:
sudo sacctmgr -i show cluster
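If show cluster comes back empty, the cluster named in slurm.conf still has to be registered in the accounting DB. A sketch (the name below is hypothetical; use your ClusterName value):
grep -i '^ClusterName' /etc/slurm/slurm.conf
sudo sacctmgr -i add cluster mylab   # replace "mylab" with your ClusterName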
6) Smoke test (submit a job end-to-end)
On ctld1:
scontrol ping
sinfo
sbatch --wrap="echo whoami=$(whoami); hostname; sleep 2; date"
squeue
sacct -X --format=JobID,User,Account,State,Elapsed,NodeList%20
✅ Goal: the job completes on compute1 and shows in sacct with the correct account.
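The job's stdout lands in the submission directory as slurm-<jobid>.out by default, which is a quick way to eyeball what actually ran:
ls -lt slurm-*.out | head -n 3
cat "$(ls -t slurm-*.out | head -n 1)"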
7) HA verification runbook (the “don’t lie to myself” version)
You asked “what’s the most reliable way to confirm HA?”
Here’s the practical answer: logs + functional job run.
7.1 Journal commands that include timestamps
Use one of these:
# follow with ISO timestamps
sudo journalctl -u slurmctld -f -o short-iso
# last N lines with timestamps
sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager
# since a time point
sudo journalctl -u slurmctld --since "2025-12-16 02:00:00" -o short-iso --no-pager
🧠 That -o short-iso is the "show me the time" switch.
7.2 Baseline: both controllers online
From either node:
scontrol ping
Expected baseline behavior:
- both slurmctld processes are running
- one controller is active (primary in practice), the other is standby
⚠️ Important:
The scontrol ping labels "(primary)/(backup)" do not flip dynamically.
They reflect how the nodes are configured (ctld1 is “primary host” in config, ctld2 is “backup host”), not who is currently active.
7.3 Find the actual active controller
On ctld1 and ctld2, check for “Running as primary controller”:
sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager | egrep -i 'Running as primary|taking over|standby|not responding'
- If you see Running as primary controller, that node is currently active.
- If you see ... taking over, that node just became active.
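A sketch that asks both controllers at once from one shell (assumes ssh as tommy with sudo, same as the Munge steps):
for h in ctld1 ctld2; do
  echo "==> $h"
  ssh -tt tommy@$h "sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager" \
    | egrep -i 'Running as primary|taking over|standby|not responding' | tail -n 3
done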
7.4 Simulate failover (stop the active one)
Example: stop ctld1 slurmctld, watch ctld2.
On ctld1
sudo systemctl stop slurmctld
On ctld2
sudo journalctl -u slurmctld -f -o short-iso | egrep -i 'taking over|Running as primary|not responding'
You should see something like:
ControlMachine ctld1 not responding, BackupController1 ctld2 taking over
Then prove it’s not “just logs”:
sbatch --wrap="echo from=$(hostname); whoami=$(whoami); sleep 2; date"
sacct -X --format=JobID,User,Account,State,Elapsed,NodeList%20 | tail -n 5
✅ Pass condition: a job runs successfully while ctld1 is down.
7.5 Bring ctld1 back (don’t fight the cluster)
sudo systemctl start slurmctld
At this point:
- It’s OK if ctld2 stays active.
- It's OK if roles don't "flip" in scontrol ping output.
- The goal is service continuity, not ego ("my ctld1 must be king").
7.6 About scontrol takeover (why it felt weird)
takeover is a forceful action. If you hammer it or do it at the wrong time, you can briefly make both sides unhappy.
If you must force a role change, the safest pattern is (sketched after this list):
- ensure both controllers are healthy
- run takeover once
- wait for logs to settle
- run a job to confirm
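A minimal sketch of that pattern (commands only; the exact log wording varies by Slurm version):
scontrol ping                       # both controllers should answer before forcing anything
sudo scontrol takeover              # instructs the backup controller to take over; run it once
sleep 10
sudo journalctl -u slurmctld -n 50 -o short-iso --no-pager | egrep -i 'taking over|Running as primary'
sbatch --wrap='hostname; date'      # functional confirmation, not just logs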
✅ Your conclusion was correct: "It doesn't matter which one is primary; what matters is that both controllers are online and jobs can run."
8) The cgroup incident (and what we changed)
Symptom we hit:
- compute node got DRAIN
- job failed with Plugin initialization failed
- slurmd.log showed:
  - unable to mount cpuset cgroup namespace: Device or resource busy
  - Couldn't load specified plugin name for task/cgroup
We confirmed compute1 was on cgroup v2:
stat -fc %T /sys/fs/cgroup
mount | grep -E 'cgroup|cgroup2'
What we did (Option A): Disable cgroup enforcement
This got jobs running immediately.
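A sketch of what disabling cgroup enforcement usually means in slurm.conf (assuming the default proctrack/cgroup + task/cgroup were in use; adjust to what your config actually had):
# 1) in the shared slurm.conf, switch enforcement off:
#      ProctrackType=proctrack/linuxproc    (instead of proctrack/cgroup)
#      TaskPlugin=task/affinity             (instead of task/cgroup; task/none also works)
# 2) re-sync /etc/slurm/slurm.conf to all nodes (Section 4.2), then restart daemons:
sudo systemctl restart slurmctld     # on ctld1 / ctld2
sudo systemctl restart slurmd        # on compute1
# 3) clear the DRAIN state once slurmd is healthy:
sudo scontrol update NodeName=compute1 State=RESUME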
✅ Result: stable lab + HA verification possible.
What this means (real limitations)
With cgroup enforcement disabled:
- Slurm cannot reliably enforce CPU/memory limits per job
- isolation is weaker (jobs can be noisier neighbors)
- accounting may be less precise
- runaway processes are harder to contain
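To see what the running cluster currently reports for these settings (useful when revisiting this trade-off later):
scontrol show config | egrep -i 'ProctrackType|TaskPlugin|JobAcctGatherType'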
🧠 Forward-looking (customer-facing):
For production on Ubuntu 22.04, the "real fix" is usually to use a Slurm version with proper cgroup v2 support (22.05 or newer), or to align the host to cgroup v1 if you must stay on older Slurm.
9) Quick troubleshooting cheatsheet (by node)
ctld1 / ctld2 (controllers)
# health
scontrol ping
sinfo
scontrol show node compute1
# logs (with timestamps)
sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager
sudo journalctl -u slurmctld -f -o short-iso
# who is active (look for "Running as primary controller")
sudo journalctl -u slurmctld -n 200 -o short-iso --no-pager | egrep -i 'Running as primary|taking over|standby|not responding'
# config drift check
sha256sum /etc/slurm/slurm.conf
compute1 (compute daemon)
systemctl status slurmd --no-pager
sudo journalctl -u slurmd -n 200 -o short-iso --no-pager
# common error zone
sudo tail -n 200 /var/log/slurm/slurmd.log 2>/dev/null || true
# cgroup reality check
stat -fc %T /sys/fs/cgroup
mount | grep -E 'cgroup|cgroup2'
storage1 (NTP + NFS)
# NTP
systemctl status chrony --no-pager
ss -ulnp | grep ':123'
chronyc -n sources -v
chronyc tracking
# NFS
systemctl status nfs-kernel-server --no-pager
exportfs -v | grep -E '/srv/slurm/(etc|state)' || true
# packet-level NTP debug (when in doubt)
sudo tcpdump -ni any udp port 123 -vv
Done: “Slurm-prepared” definition
I consider the cluster “Slurm-prepared” when:
- Chrony
  - storage1: synced to upstream (Leap status: Normal)
  - clients: synced to storage1 (selected ^* source)
- NFS
  - all clients mounted via fstab
  - user slurm can write to /mnt/slurm/state
- Munge
  - same munge.key hash on every node
  - cross-host decode tests all succeed
- Slurm
  - sbatch works and sacct records jobs
  - HA failover works: stop one controller, the other continues running jobs
✅ If you pass all of the above, you’re basically done.
After that, it’s not “setup”, it’s “operations”.