中年米罗 posted on 2019-6-18 15:16:00

Guide to installing and deploying Slurm on Ubuntu

<h6>Install munge</h6>
<p>https://www.cnblogs.com/haibaraai0913/p/11016885.html</p>
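<p>munge provides the authenticated communication mechanism between Slurm components. It must be installed and running on every node.</p>
<h6>Build and install Slurm from source (all nodes)</h6>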
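<p>Before building Slurm it can help to confirm that munge actually works on each node. The sketch below assumes the <code>munge</code> and <code>unmunge</code> commands are on the PATH; the function name <code>munge_selftest</code> is made up for this example.</p>

```shell
# Hypothetical helper: encode an empty credential and decode it again locally.
munge_selftest() {
    command -v munge >/dev/null 2>&1 || { echo "munge is not installed" >&2; return 1; }
    # munge -n encodes a credential with no payload; unmunge decodes and validates it.
    if munge -n | unmunge >/dev/null; then
        echo "munge OK"
    else
        echo "munge credential check failed (is munged running?)" >&2
        return 1
    fi
}
```

<p>Run <code>munge_selftest</code> on every node. A cross-node check along the lines of <code>munge -n | ssh node1 unmunge</code> additionally confirms that the nodes share the same key.</p>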
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)"># Switch to root
</span><span style="color: rgba(0, 0, 255, 1)">sudo</span> <span style="color: rgba(0, 0, 255, 1)">su</span><span style="color: rgba(0, 0, 0, 1)">
# Download the source tarball
</span><span style="color: rgba(0, 0, 255, 1)">wget</span> https:<span style="color: rgba(0, 128, 0, 1)">//</span><span style="color: rgba(0, 128, 0, 1)">download.schedmd.com/slurm/slurm-19.05.0.tar.bz2</span>
<span style="color: rgba(0, 0, 0, 1)"># Extract the archive
</span><span style="color: rgba(0, 0, 255, 1)">tar</span> -xaf slurm*<span style="color: rgba(0, 0, 255, 1)">tar</span><span style="color: rgba(0, 0, 0, 1)">.bz2
# Change into the source directory
cd slurm</span>-<span style="color: rgba(128, 0, 128, 1)">19.05</span>.<span style="color: rgba(128, 0, 128, 1)">0</span><span style="color: rgba(0, 0, 0, 1)">
# Configure, build, and install
.</span>/configure --enable-debug --prefix=/opt/slurm --sysconfdir=/opt/slurm/<span style="color: rgba(0, 0, 0, 1)">etc
</span><span style="color: rgba(0, 0, 255, 1)">make</span> &amp;&amp; <span style="color: rgba(0, 0, 255, 1)">make</span> <span style="color: rgba(0, 0, 255, 1)">install</span></pre>
</div>
<p>An error that may appear during the build: /usr/bin/env: "python": No such file or directory</p>
<p><img src="https://img2018.cnblogs.com/blog/1714157/201906/1714157-20190618143544656-1158749457.png"></p>
<p>Fix:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)"># Create a symlink so that "python" resolves to python3
</span><span style="color: rgba(0, 0, 255, 1)">ln</span> -s /usr/bin/python3 /usr/bin/python</pre>
</div>
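<p>If the link already exists, the plain <code>ln -s</code> above fails. A guarded variant is sketched here; <code>ensure_link</code> is a hypothetical helper name, not part of any package.</p>

```shell
# Hypothetical helper: create LINK -> TARGET only when LINK does not exist yet.
ensure_link() {
    target=$1
    link=$2
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "exists: $link"
    else
        ln -s "$target" "$link" && echo "created: $link -> $target"
    fi
}

# Intended call on a build node (as root):
#   ensure_link /usr/bin/python3 /usr/bin/python
```

<p>The existence test comes first so that an existing <code>/usr/bin/python</code> (for example one pointing at Python 2) is left untouched.</p>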
<h6>Create the user and change file ownership (all nodes)</h6>
<div class="cnblogs_code">
<pre># Create the user with a home directory and a login shell
useradd slurm -m -s /bin/bash
# Set the user's password
passwd slurm
# Create the required directories
mkdir /opt/slurm/log
mkdir /opt/slurm/spool
mkdir /opt/slurm/spool/slurm
mkdir /opt/slurm/run
# Change the directory owner
chown -R slurm:slurm /opt/slurm</pre>
</div>
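<p>The directory steps above can also be written so that they are safe to repeat on a node where some of the directories already exist. This is only a sketch: <code>make_slurm_dirs</code> is a made-up name, and the prefix argument is an assumption added to make the function easy to try in a scratch directory.</p>

```shell
# Hypothetical helper: create the log, spool, and run directories under PREFIX.
# mkdir -p creates missing parents and does not fail on existing directories.
make_slurm_dirs() {
    prefix=${1:-/opt/slurm}
    for d in log spool/slurm run; do
        mkdir -p "$prefix/$d" || return 1
    done
}

# Intended use on each node (as root), followed by the ownership change:
#   make_slurm_dirs /opt/slurm && chown -R slurm:slurm /opt/slurm
```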
<h6>Configure the hostname</h6>
<p>Change the local hostname</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)"># Set the hostname temporarily (master node: manager; compute nodes: node1, node2)
</span><span style="color: rgba(0, 0, 255, 1)">hostname</span><span style="color: rgba(0, 0, 0, 1)"> manager
# Set the hostname permanently
vim </span>/etc/<span style="color: rgba(0, 0, 255, 1)">hostname</span> # Edit the hostname and save the file; it takes effect after a reboot.</pre>
</div>
<p>Edit hosts</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)"># Open the hosts file
vim </span>/etc/<span style="color: rgba(0, 0, 0, 1)">hosts
# Add the following lines and save the file
</span><span style="color: rgba(128, 0, 128, 1)">192.168</span>.<span style="color: rgba(128, 0, 128, 1)">231.128</span><span style="color: rgba(0, 0, 0, 1)"> manager
</span><span style="color: rgba(128, 0, 128, 1)">192.168</span>.<span style="color: rgba(128, 0, 128, 1)">231.129</span><span style="color: rgba(0, 0, 0, 1)"> node1
</span><span style="color: rgba(128, 0, 128, 1)">192.168</span>.<span style="color: rgba(128, 0, 128, 1)">231.130</span> node2</pre>
</div>
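<p>Since the same three entries have to be added on every node, the edit can be scripted instead of done by hand in vim. The following is a sketch under the assumption that a hostname should appear at most once in the file; <code>add_host_entry</code> is a hypothetical helper name.</p>

```shell
# Hypothetical helper: append "IP NAME" to FILE unless NAME is already mapped there.
add_host_entry() {
    file=$1 ip=$2 name=$3
    # Look for NAME as a whole field (after the address) on any non-comment line.
    if awk -v n="$name" '!/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } END { exit !found }' "$file"; then
        echo "already present: $name"
    else
        printf '%s %s\n' "$ip" "$name" >> "$file"
        echo "added: $name"
    fi
}

# Intended use on each node (as root):
#   add_host_entry /etc/hosts 192.168.231.128 manager
#   add_host_entry /etc/hosts 192.168.231.129 node1
#   add_host_entry /etc/hosts 192.168.231.130 node2
```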
<h6>Configuration (master node)</h6>
<div class="cnblogs_code">
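<pre><span style="color: rgba(0, 0, 0, 1)"># Copy the etc directory from the source tree (the path below assumes the source was unpacked under /opt/package)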
</span><span style="color: rgba(0, 0, 255, 1)">cp</span> -r /opt/package/slurm-<span style="color: rgba(128, 0, 128, 1)">19.05</span>.<span style="color: rgba(128, 0, 128, 1)">0</span>/etc/ /opt/slurm/etc/<span style="color: rgba(0, 0, 0, 1)">
# Change the directory owner
</span><span style="color: rgba(0, 0, 255, 1)">chown</span> -R slurm:slurm /opt/slurm/<span style="color: rgba(0, 0, 0, 1)">etc
# Create slurm.conf from the example configuration file
cp </span>/opt/slurm/etc/slurm.conf.example /opt/slurm/etc/<span style="color: rgba(0, 0, 0, 1)">slurm.conf
# Open the configuration file for editing
vim </span>/opt/slurm/etc/slurm.conf </pre>
</div>
<p>Configuration file:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)">#
# Example slurm.conf </span><span style="color: rgba(0, 0, 255, 1)">file</span><span style="color: rgba(0, 0, 0, 1)">. Please run configurator.html
# (</span><span style="color: rgba(0, 0, 255, 1)">in</span> doc/html) to build a configuration <span style="color: rgba(0, 0, 255, 1)">file</span><span style="color: rgba(0, 0, 0, 1)"> customized
# </span><span style="color: rgba(0, 0, 255, 1)">for</span><span style="color: rgba(0, 0, 0, 1)"> your environment.
#
#
# slurm.conf </span><span style="color: rgba(0, 0, 255, 1)">file</span><span style="color: rgba(0, 0, 0, 1)"> generated by configurator.html.
#
# See the slurm.conf </span><span style="color: rgba(0, 0, 255, 1)">man</span> page <span style="color: rgba(0, 0, 255, 1)">for</span> <span style="color: rgba(0, 0, 255, 1)">more</span><span style="color: rgba(0, 0, 0, 1)"> information.
#
ClusterName</span>=<span style="color: rgba(0, 0, 0, 1)">linux # cluster name
ControlMachine</span>=<span style="color: rgba(0, 0, 0, 1)">manager # hostname of the master node
ControlAddr</span>=<span style="color: rgba(128, 0, 128, 1)">192.168</span>.<span style="color: rgba(128, 0, 128, 1)">231.128</span><span style="color: rgba(0, 0, 0, 1)"> # master node address on the LAN
#BackupController</span>=<span style="color: rgba(0, 0, 0, 1)">
#BackupAddr</span>=<span style="color: rgba(0, 0, 0, 1)">
#
SlurmUser</span>=<span style="color: rgba(0, 0, 0, 1)">slurm # administrative account for the master node service
#SlurmdUser</span>=<span style="color: rgba(0, 0, 0, 1)">root
SlurmctldPort</span>=<span style="color: rgba(128, 0, 128, 1)">6817</span><span style="color: rgba(0, 0, 0, 1)"> # default port of the master node service (slurmctld)
SlurmdPort</span>=<span style="color: rgba(128, 0, 128, 1)">6818</span><span style="color: rgba(0, 0, 0, 1)"> # default port of the compute node service (slurmd)
AuthType</span>=auth/<span style="color: rgba(0, 0, 0, 1)">munge # authentication between components, here munge
#JobCredentialPrivateKey</span>=<span style="color: rgba(0, 0, 0, 1)">
#JobCredentialPublicCertificate</span>=<span style="color: rgba(0, 0, 0, 1)">
StateSaveLocation</span>=/opt/slurm/spool/slurm/<span style="color: rgba(0, 0, 0, 1)">ctld # directory where the master node saves its state
SlurmdSpoolDir</span>=/opt/slurm/spool/slurm/<span style="color: rgba(0, 0, 0, 1)">d # directory for compute node state information
SwitchType</span>=switch/<span style="color: rgba(0, 0, 0, 1)">none
MpiDefault</span>=<span style="color: rgba(0, 0, 0, 1)">none
SlurmctldPidFile</span>=/opt/slurm/run/<span style="color: rgba(0, 0, 0, 1)">slurmctld.pid # PID file of the master service
SlurmdPidFile</span>=/opt/slurm/run/<span style="color: rgba(0, 0, 0, 1)">slurmd.pid # PID file of the compute node daemon
ProctrackType</span>=proctrack/<span style="color: rgba(0, 0, 0, 1)">pgid # how processes are tracked and associated with jobs
#PluginDir</span>=<span style="color: rgba(0, 0, 0, 1)">
#FirstJobId</span>=<span style="color: rgba(0, 0, 0, 1)">
ReturnToService</span>=<span style="color: rgba(128, 0, 128, 1)">0</span><span style="color: rgba(0, 0, 0, 1)">
#MaxJobCount</span>=<span style="color: rgba(0, 0, 0, 1)">
#PlugStackConfig</span>=<span style="color: rgba(0, 0, 0, 1)">
#PropagatePrioProcess</span>=<span style="color: rgba(0, 0, 0, 1)">
#PropagateResourceLimits</span>=<span style="color: rgba(0, 0, 0, 1)">
#PropagateResourceLimitsExcept</span>=<span style="color: rgba(0, 0, 0, 1)">
#Prolog</span>=<span style="color: rgba(0, 0, 0, 1)">
#Epilog</span>=<span style="color: rgba(0, 0, 0, 1)">
#SrunProlog</span>=<span style="color: rgba(0, 0, 0, 1)">
#SrunEpilog</span>=<span style="color: rgba(0, 0, 0, 1)">
#TaskProlog</span>=<span style="color: rgba(0, 0, 0, 1)">
#TaskEpilog</span>=<span style="color: rgba(0, 0, 0, 1)">
#TaskPlugin</span>=<span style="color: rgba(0, 0, 0, 1)">
#TrackWCKey</span>=<span style="color: rgba(0, 0, 0, 1)">no
#TreeWidth</span>=<span style="color: rgba(128, 0, 128, 1)">50</span><span style="color: rgba(0, 0, 0, 1)">
#TmpFS</span>=<span style="color: rgba(0, 0, 0, 1)">
#UsePAM</span>=<span style="color: rgba(0, 0, 0, 1)">
#
# TIMERS
SlurmctldTimeout</span>=<span style="color: rgba(128, 0, 128, 1)">300</span><span style="color: rgba(0, 0, 0, 1)">
SlurmdTimeout</span>=<span style="color: rgba(128, 0, 128, 1)">300</span><span style="color: rgba(0, 0, 0, 1)">
InactiveLimit</span>=<span style="color: rgba(128, 0, 128, 1)">0</span><span style="color: rgba(0, 0, 0, 1)">
MinJobAge</span>=<span style="color: rgba(128, 0, 128, 1)">300</span><span style="color: rgba(0, 0, 0, 1)">
KillWait</span>=<span style="color: rgba(128, 0, 128, 1)">30</span><span style="color: rgba(0, 0, 0, 1)">
Waittime</span>=<span style="color: rgba(128, 0, 128, 1)">0</span><span style="color: rgba(0, 0, 0, 1)">
#
# SCHEDULING
SchedulerType</span>=sched/<span style="color: rgba(0, 0, 0, 1)">backfill
#SchedulerAuth</span>=<span style="color: rgba(0, 0, 0, 1)">
#SelectType</span>=<span style="color: rgba(0, 0, 255, 1)">select</span>/<span style="color: rgba(0, 0, 0, 1)">linear
FastSchedule</span>=<span style="color: rgba(128, 0, 128, 1)">1</span><span style="color: rgba(0, 0, 0, 1)">
#PriorityType</span>=priority/<span style="color: rgba(0, 0, 0, 1)">multifactor
#PriorityDecayHalfLife</span>=<span style="color: rgba(128, 0, 128, 1)">14</span>-<span style="color: rgba(128, 0, 128, 1)">0</span><span style="color: rgba(0, 0, 0, 1)">
#PriorityUsageResetPeriod</span>=<span style="color: rgba(128, 0, 128, 1)">14</span>-<span style="color: rgba(128, 0, 128, 1)">0</span><span style="color: rgba(0, 0, 0, 1)">
#PriorityWeightFairshare</span>=<span style="color: rgba(128, 0, 128, 1)">100000</span><span style="color: rgba(0, 0, 0, 1)">
#PriorityWeightAge</span>=<span style="color: rgba(128, 0, 128, 1)">1000</span><span style="color: rgba(0, 0, 0, 1)">
#PriorityWeightPartition</span>=<span style="color: rgba(128, 0, 128, 1)">10000</span><span style="color: rgba(0, 0, 0, 1)">
#PriorityWeightJobSize</span>=<span style="color: rgba(128, 0, 128, 1)">1000</span><span style="color: rgba(0, 0, 0, 1)">
#PriorityMaxAge</span>=<span style="color: rgba(128, 0, 128, 1)">1</span>-<span style="color: rgba(128, 0, 128, 1)">0</span><span style="color: rgba(0, 0, 0, 1)">
#
# LOGGING
SlurmctldDebug</span>=<span style="color: rgba(128, 0, 128, 1)">3</span><span style="color: rgba(0, 0, 0, 1)">
SlurmctldLogFile</span>=/opt/slurm/log/<span style="color: rgba(0, 0, 0, 1)">slurmctld.log # master node log file
SlurmdDebug</span>=<span style="color: rgba(128, 0, 128, 1)">3</span><span style="color: rgba(0, 0, 0, 1)">
SlurmdLogFile</span>=/opt/slurm/log/<span style="color: rgba(0, 0, 0, 1)">slurmd.log # compute node log file
JobCompType</span>=jobcomp/<span style="color: rgba(0, 0, 0, 1)">none
#JobCompLoc</span>=<span style="color: rgba(0, 0, 0, 1)">
#
# ACCOUNTING
#JobAcctGatherType</span>=jobacct_gather/<span style="color: rgba(0, 0, 0, 1)">linux
#JobAcctGatherFrequency</span>=<span style="color: rgba(128, 0, 128, 1)">30</span><span style="color: rgba(0, 0, 0, 1)">
#
#AccountingStorageType</span>=accounting_storage/<span style="color: rgba(0, 0, 0, 1)">slurmdbd
#AccountingStorageHost</span>=<span style="color: rgba(0, 0, 0, 1)">
#AccountingStorageLoc</span>=<span style="color: rgba(0, 0, 0, 1)">
#AccountingStoragePass</span>=<span style="color: rgba(0, 0, 0, 1)">
#AccountingStorageUser</span>=<span style="color: rgba(0, 0, 0, 1)">
#
# COMPUTE NODES
# Node definition: node name, CPUs, CoresPerSocket, ThreadsPerCore (check with lscpu); RealMemory is the memory made available to Slurm; Procs is the actual number of CPUs (see </span>/proc/cpuinfo). State=<span style="color: rgba(0, 0, 0, 1)">UNKNOWN is the state right after the cluster starts; it changes to idle afterwards
NodeName</span>=manager,node1,node2 Procs=<span style="color: rgba(128, 0, 128, 1)">1</span> State=<span style="color: rgba(0, 0, 0, 1)">UNKNOWN
# Two partitions, control and compute. Default</span>=YES marks the partition that jobs go to when none is requested; here that is compute, so node1 and node2 do the computing<span style="color: rgba(0, 0, 0, 1)">
PartitionName</span>=control Nodes=manager Default=NO MaxTime=INFINITE State=<span style="color: rgba(0, 0, 0, 1)">UP
PartitionName</span>=compute Nodes=node1,node2 Default=Yes MaxTime=INFINITE State=UP</pre>
</div>
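<p>The node definition in the file above uses only <code>Procs=1</code>. To fill in the socket, core, and thread counts mentioned in its comment, the values can be read from <code>lscpu</code>. The sketch below assumes the English field labels that <code>lscpu</code> prints under <code>LC_ALL=C</code>; <code>node_line_from_lscpu</code> is a made-up helper name, and RealMemory is left out because it is not part of the lscpu output.</p>

```shell
# Hypothetical helper: read lscpu output on stdin and print a NodeName line for NAME.
node_line_from_lscpu() {
    name=$1
    awk -F: -v name="$name" '
        # Strip leading blanks from label and value (newer lscpu versions indent some labels).
        { gsub(/^[ \t]+/, "", $1); gsub(/^[ \t]+/, "", $2) }
        $1 == "CPU(s)"             { cpus = $2 }
        $1 == "Socket(s)"          { sockets = $2 }
        $1 == "Core(s) per socket" { cores = $2 }
        $1 == "Thread(s) per core" { threads = $2 }
        END {
            printf "NodeName=%s CPUs=%s Sockets=%s CoresPerSocket=%s ThreadsPerCore=%s State=UNKNOWN\n",
                   name, cpus, sockets, cores, threads
        }'
}

# Intended use on a compute node:
#   LC_ALL=C lscpu | node_line_from_lscpu node1
```

<p>Running <code>/opt/slurm/sbin/slurmd -C</code> on a node prints the hardware configuration that slurmd itself detects, which is a useful cross-check for the generated line.</p>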
<p>Distribute the configuration file:</p>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)"># Copy the configuration directory from the master node to the compute nodes (run from /opt/slurm)
</span><span style="color: rgba(0, 0, 255, 1)">scp</span> -r etc/ slurm@<span style="color: rgba(128, 0, 128, 1)">192.168</span>.<span style="color: rgba(128, 0, 128, 1)">231.129</span>:/opt/slurm/
<span style="color: rgba(0, 0, 255, 1)">scp</span> -r etc/ slurm@<span style="color: rgba(128, 0, 128, 1)">192.168</span>.<span style="color: rgba(128, 0, 128, 1)">231.130</span>:/opt/slurm/</pre>
</div>
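<p>With more than a couple of compute nodes, the copy is easier to maintain as a loop. This is a sketch only: <code>distribute_conf</code> and the <code>COPY</code> override (used for dry runs) are names invented for this example, and the target path <code>/opt/slurm/</code> follows the commands above.</p>

```shell
# Hypothetical helper: copy SRC to /opt/slurm/ on each listed host as user slurm.
# Set COPY=echo to print the arguments instead of copying (dry run).
distribute_conf() {
    src=$1
    shift
    for host in "$@"; do
        # Left unquoted on purpose so that "scp -r" splits into command and option.
        ${COPY:-scp -r} "$src" "slurm@$host:/opt/slurm/" || {
            echo "copy to $host failed" >&2
            return 1
        }
    done
}

# Intended use on the master node, from /opt/slurm:
#   distribute_conf etc/ 192.168.231.129 192.168.231.130
```

<p>The loop stops at the first host that cannot be reached, so a failed copy is not silently skipped.</p>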
<h6>Start the cluster</h6>
<div class="cnblogs_code">
<pre><span style="color: rgba(0, 0, 0, 1)"># On the master node, run as root
</span>/opt/slurm/sbin/slurmctld -<span style="color: rgba(0, 0, 0, 1)">c
</span>/opt/slurm/sbin/slurmd -<span style="color: rgba(0, 0, 0, 1)">c
# On each compute node, run as root
</span>/opt/slurm/sbin/slurmd -c</pre>
</div>
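<p>Once the daemons are running, a quick check is to list the partitions and run a trivial job on both compute nodes. The sketch assumes the client tools were installed under <code>/opt/slurm/bin</code> (following the <code>--prefix</code> used above) and that the partition is named <code>compute</code> as in the configuration file; <code>check_cluster</code> is a hypothetical helper name.</p>

```shell
# Hypothetical helper: show partition state, then run "hostname" on two compute nodes.
check_cluster() {
    bindir=${1:-/opt/slurm/bin}
    if [ ! -x "$bindir/sinfo" ]; then
        echo "sinfo not found in $bindir" >&2
        return 1
    fi
    # sinfo lists partitions and node states; srun -N2 requests two nodes.
    "$bindir/sinfo" && "$bindir/srun" -p compute -N2 hostname
}

# Intended use on the master node:
#   check_cluster
```

<p>If the nodes are healthy, <code>sinfo</code> should report them as idle and the job should print the hostnames of the two compute nodes.</p>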
Source: https://www.cnblogs.com/haibaraai0913/p/11045295.html