[New tutorial] Installing Slurm on a single node with Ubuntu 24.04
<h2 id="背景">Background</h2><p>Most tutorials found online are outdated and no longer apply.</p>
<h2 id="详细步骤">Detailed steps</h2>
<p>1. Install Slurm</p>
<pre><code>sudo apt install slurm-wlm slurm-wlm-doc -y
</code></pre>
<p>Check that the installation succeeded:</p>
<pre><code>slurmd --version
</code></pre>
<p>If the output is <code>slurm-wlm 23.11.4</code>, the installation succeeded.<br>
2. Configure Slurm.<br>
Open the configuration file with:</p>
<pre><code>sudo vi /etc/slurm/slurm.conf
</code></pre>
<p>Enter the following content:</p>
<pre><code># ClusterName is a name of your own choosing
ClusterName=cool
ControlMachine=master
#ControlAddr=
#BackupController=
#BackupAddr=
#
MailProg=/usr/bin/s-nail
SlurmUser=slurm
#SlurmdUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
PartitionName=CPU Nodes=master Default=NO MaxTime=INFINITE State=UP
#NodeName=master State=UNKNOWN
NodeName=master Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
</code></pre>
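<p>The last line of the configuration depends on the hardware of the machine. As a rough sketch, the three topology values can also be read from <code>lscpu</code> output. The helper name <code>node_line_from_lscpu</code> is hypothetical, and the field labels assume English <code>lscpu</code> output (hence <code>LC_ALL=C</code> in the usage comment).</p>

```shell
#!/bin/bash
# Sketch: build a slurm.conf NodeName line from lscpu-style text on stdin.
# node_line_from_lscpu is a hypothetical helper name, not part of Slurm.
node_line_from_lscpu() {
    local name="$1" text sockets cores threads
    text=$(cat)
    # Each field looks like "Socket(s):          1"; keep the value after the colon.
    sockets=$(printf '%s\n' "$text" | awk -F: '/^Socket\(s\):/ {gsub(/[ \t]/, "", $2); print $2}')
    cores=$(printf '%s\n' "$text" | awk -F: '/^Core\(s\) per socket:/ {gsub(/[ \t]/, "", $2); print $2}')
    threads=$(printf '%s\n' "$text" | awk -F: '/^Thread\(s\) per core:/ {gsub(/[ \t]/, "", $2); print $2}')
    echo "NodeName=$name Sockets=$sockets CoresPerSocket=$cores ThreadsPerCore=$threads State=UNKNOWN"
}

# Usage on a real machine:
#   LC_ALL=C lscpu | node_line_from_lscpu "$(hostname)"
```

<p>Alternatively, <code>slurmd -C</code> prints the hardware configuration that slurmd itself detects, which can be compared against the generated line.</p>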
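<p>Change the following parameters to match your machine; do not copy the configuration above verbatim:<br>
ControlMachine=your hostname, shown by <code>hostname</code><br>
PartitionName=the partition (queue) name, which you can choose freely, for example <code>CPU</code><br>
Nodes=your hostname, shown by <code>hostname</code><br>
NodeName=your hostname, shown by <code>hostname</code><br>
Sockets=the number of physical CPUs in the server, shown by <code>cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l</code><br>
CoresPerSocket=the number of cores per CPU, shown by <code>cat /proc/cpuinfo| grep "cpu cores"| uniq</code><br>
ThreadsPerCore=1 or 2, depending on hyper-threading.<br>
Run the script below to find out which:</p>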
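<pre><code>#!/bin/bash
# Number of physical CPUs (sockets)
cpunum=$(grep "physical id" /proc/cpuinfo | sort -u | wc -l)
echo "CPU sockets: $cpunum"
# Cores per physical CPU
cpuhx=$(grep "cpu cores" /proc/cpuinfo | sort -u | head -n 1 | awk -F": " '{print $2}')
echo "Cores per CPU: $cpuhx"
# Logical processors (hardware threads) seen by the kernel
cpuxc=$(grep -c "^processor" /proc/cpuinfo)
echo "Logical processors: $cpuxc"
# With hyper-threading, each core exposes 2 logical processors
if [[ $((cpunum * cpuhx * 2)) -eq $cpuxc ]]; then
    echo "Hyper-threading is enabled"
else
    echo "Hyper-threading is not enabled"
fi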
</code></pre>
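<p>Enter 2 if hyper-threading is enabled, otherwise enter 1.<br>
3. Create the directories. Use the following commands to create the required directories:</p>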
<pre><code>sudo mkdir -p /var/spool/slurmd
sudo mkdir -p /var/spool/slurmctld
sudo chown -R slurm:slurm /var/spool/slurmd
sudo chown -R slurm:slurm /var/spool/slurmctld
sudo chmod -R 755 /var/spool/slurmd
sudo chmod -R 755 /var/spool/slurmctld
</code></pre>
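<p>To confirm that the directories ended up with the intended permissions, a small check like the following may help. It is only a sketch: <code>check_dir_mode</code> is a hypothetical helper name, and <code>stat -c</code> assumes GNU coreutils, as shipped with Ubuntu.</p>

```shell
#!/bin/bash
# Sketch: report whether a directory exists with the expected permission bits.
# check_dir_mode is a hypothetical helper name; stat -c is the GNU coreutils form.
check_dir_mode() {
    local dir="$1" want="$2" have
    if [[ ! -d "$dir" ]]; then
        echo "$dir: missing"
        return 1
    fi
    have=$(stat -c '%a' "$dir")
    if [[ "$have" == "$want" ]]; then
        echo "$dir: ok ($have)"
    else
        echo "$dir: mode is $have, expected $want"
        return 1
    fi
}

# Usage:
#   check_dir_mode /var/spool/slurmd 755
#   check_dir_mode /var/spool/slurmctld 755
```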
<p>4. Start Slurm</p>
<pre><code>sudo systemctl enable slurmctld --now
sudo systemctl enable slurmd --now
</code></pre>
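<p>After enabling the two services, it can be useful to confirm that systemd reports both as active. The following is a minimal sketch; <code>slurm_services_active</code> is a hypothetical helper name, while <code>systemctl is-active</code> is the standard systemd query.</p>

```shell
#!/bin/bash
# Sketch: check that both Slurm daemons are reported active by systemd.
# slurm_services_active is a hypothetical helper name.
slurm_services_active() {
    local svc state rc=0
    for svc in slurmctld slurmd; do
        # "systemctl is-active" prints the unit state, e.g. active, inactive or failed
        state=$(systemctl is-active "$svc" || true)
        echo "$svc: $state"
        [[ "$state" == "active" ]] || rc=1
    done
    return $rc
}

# Usage:
#   slurm_services_active || echo "at least one Slurm service is not running"
```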
<p>5. Make sure the node state is initialized</p>
<pre><code># Replace master with your own hostname (the NodeName set in slurm.conf)
sudo scontrol update NodeName=master State=RESUME
</code></pre>
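<p>Whether the node actually left the down or unknown state can be checked with <code>sinfo</code>. The sketch below uses hypothetical helper names (<code>node_state</code>, <code>node_is_usable</code>) and assumes the node name <code>master</code> from the sample configuration; the list of "usable" states is a simplification.</p>

```shell
#!/bin/bash
# Sketch: print the state sinfo reports for one node.
# node_state and node_is_usable are hypothetical helper names.
node_state() {
    local node="$1"
    # -N: one line per node, -h: no header, -n: restrict to this node,
    # -o "%T": print the node state in extended form (idle, down, drained, ...)
    sinfo -N -h -n "$node" -o "%T" | sort -u
}

# Treat idle, mixed and allocated as states in which the node accepts or runs jobs.
node_is_usable() {
    case "$(node_state "$1")" in
        idle|mixed|allocated) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   node_state master
#   node_is_usable master || echo "node is not ready, check the slurmd log"
```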
<p>6. Test whether the setup works</p>
<pre><code>srun --partition=CPU --time=00:01:00 --ntasks=1 hostname
</code></pre>
<p>If the command prints the hostname, the setup works.</p>
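<p>The same check can be run as a batch job. The following sketch writes a minimal job script that mirrors the <code>srun</code> options; the helper name <code>write_test_job</code> and the file names are hypothetical, and the partition defaults to <code>CPU</code> as in the sample configuration.</p>

```shell
#!/bin/bash
# Sketch: write a minimal batch script that mirrors the srun test.
# write_test_job and the file name test_job.sh are hypothetical.
write_test_job() {
    local path="$1" partition="${2:-CPU}"
    cat > "$path" <<EOF
#!/bin/bash
#SBATCH --partition=$partition
#SBATCH --time=00:01:00
#SBATCH --ntasks=1
#SBATCH --output=test_job_%j.out
hostname
EOF
}

# Usage:
#   write_test_job test_job.sh CPU
#   sbatch test_job.sh      # %j in the output name is replaced by the job ID
```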
<h2 id="报错处理">Troubleshooting</h2>
<p>1. If an error occurs when starting the services, run the following commands again:</p>
<pre><code>sudo chmod -R 755 /var/spool/slurmd
sudo chmod -R 755 /var/spool/slurmctld
</code></pre>
<p>Then restart the services:</p>
<pre><code>sudo systemctl restart slurmd
sudo systemctl restart slurmctld
</code></pre>
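<p>If the services still fail, the log files named in the configuration above (<code>/var/log/slurm/slurmctld.log</code> and <code>/var/log/slurm/slurmd.log</code>) are usually the first place to look. A minimal sketch for pulling out recent error lines follows; <code>recent_errors</code> is a hypothetical helper name, and reading the logs may require elevated permissions.</p>

```shell
#!/bin/bash
# Sketch: show the most recent lines mentioning "error" in a Slurm log file.
# recent_errors is a hypothetical helper name.
recent_errors() {
    local log="$1" count="${2:-20}"
    if [[ ! -r "$log" ]]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # -i: case-insensitive match; tail keeps only the last $count matching lines
    grep -i "error" "$log" | tail -n "$count"
}

# Usage:
#   recent_errors /var/log/slurm/slurmd.log
#   recent_errors /var/log/slurm/slurmctld.log 5
```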
<p>For other errors, feel free to contact the author.</p>
<h2 id="备注">Notes</h2>
<p>Details may differ between Ubuntu releases; this article applies to Ubuntu 24.04.</p>
<h2 id="参考资料">References</h2>
<p>https://wxyhgk.com/article/ubuntu-slurm</p><br><br>
Source: https://www.cnblogs.com/luk/p/18673674