Overview
MPI commands were failing in our GPU compute Kubernetes cluster; the troubleshooting process is documented below.
Background
Two problems were reported:
1. Running mpirun -n 1 echo hello in a TenC GPU container works fine, but the same command fails in a GPU container on the compute platform.
2. The GPU containers cannot run mpirun commands that launch processes on a remote host.
Troubleshooting
Problem 1
According to the Intel MPI official documentation, setting I_MPI_HYDRA_TOPOLIB=ipl should be enough to fix this.
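A minimal way to apply this is to set the variable for a single run, for example (a sketch; the value is taken verbatim from the documentation hint above, while the tests further below ended up pointing the variable at hwloc instead):
I_MPI_HYDRA_TOPOLIB=ipl mpirun -n 1 echo hello   # variable applies to this invocation only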
A quick look with gdb: without this environment variable set, the compute-platform GPU container fails with the error below.
# gdb --args mpiexec -n 1 echo hello
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mpiexec...done.
(gdb) r
Starting program: /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/mpiexec -n 1 echo hello
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
ipl_detect_machine_topology ()
at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1704
1704 ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c: No such file or directory.
(gdb) bt
#0 ipl_detect_machine_topology ()
at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1704
#1 0x0000000000447d62 in ipl_processor_info (info=0x6dc940, pid=0x6,
detect_platform_only=10)
at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1943
#2 0x000000000044a112 in ipl_entrance (detect_platform_only=7194944)
at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_main.c:19
#3 0x000000000041e1c2 in i_read_default_env ()
at ../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec_params.h:241
#4 0x000000000041bc7e in mpiexec_get_parameters (t_argv=0x6dc940)
at ../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1350
#5 0x0000000000404a77 in main (argc=7194944, argv=0x6)
at ../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1743
The pciutils version in the hub.oa.com/gameai/ddrl-gpu:impi-test1 image was rather old (3.2.1). Following the suggestion, it was upgraded; the upgraded image is hub.oa.com/gameai/ddrl-gpu:latest. hwloc is also installed in that image and can be verified with the lstopo command (see the sketch after the apt output below).
# apt-get -s install pciutils
Reading package lists... Done
Building dependency tree
Reading state information... Done
pciutils is already the newest version (1:3.5.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
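Since hwloc ships the lstopo tool, topology discovery itself can be sanity-checked inside the container, roughly like this (a sketch, assuming lstopo/lstopo-no-graphics are on the PATH):
lstopo --version
lstopo-no-graphics   # prints the detected machine topology as text; no X display needed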
Test results:
- After the upgrade the error persisted, which rules out the pciutils version as the cause.
- Setting the environment variable export I_MPI_HYDRA_TOPOLIB=/hwloc-2.1.0/hwloc makes the command succeed, but the variable still has to be set explicitly.
Comparing the environments of the TenC GPU container and the compute-platform GPU container revealed no real differences (roughly only the install location of the Nvidia libraries differs). Intel's documentation also does not insist on using hwloc; both hwloc and ipl are tools for discovering the physical hardware topology, with ipl being Intel MPI's own implementation. In my view, specifying the library explicitly via the environment variable is good enough. Problem 1 is therefore provisionally resolved, though it can be revisited.
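If the variable has to be set anyway, it can be baked into the image so users do not need to remember it. A minimal sketch, assuming an Ubuntu-based image whose login shells source /etc/profile.d (the file name impi_topolib.sh is made up for illustration; non-login shells spawned over ssh may not pick it up):
echo 'export I_MPI_HYDRA_TOPOLIB=/hwloc-2.1.0/hwloc' > /etc/profile.d/impi_topolib.sh   # value that worked in the tests above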
Problem 2
Test the network between two compute-platform GPU containers. After the containers start, set up passwordless SSH login between the two containers.
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
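Note that the two commands above only authorize the key for the local container. Unless the home directory is shared, the public key also has to be appended to the peer container's authorized_keys, for example (a sketch, assuming a root account on the peer and that password authentication is initially allowed):
ssh-copy-id root@9.73.155.110   # or copy ~/.ssh/id_rsa.pub into the peer's ~/.ssh/authorized_keys by hand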
Running the command provided by the business team, mpirun -verbose -n 1 -host 9.73.155.16 (the peer's address) echo hello, fails; details below.
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] Launch arguments: /usr/bin/ssh -q -x 9.73.155.110 /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host fbee39ba-8504-4522-8490-4630079153e8 --upstream-port 43143 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:114): unable to run proxy on 9.73.155.110 (pid 8928)
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:152): check exit codes error
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:205): poll for event error
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:731): error waiting for event
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1919): error setting up the boostrap proxies
Checking the documentation shows that -hosts takes hostnames rather than IP addresses, so /etc/hosts was configured accordingly in both containers A and B.
# cat /etc/hosts
# Kubernetes-managed hosts file.
9.73.155.16 fbee39ba-8504-4522-8490-4630079153e8
9.73.155.110 31afc6a4-bb14-41aa-b20a-3e295ecb8650
Run another command to confirm that the command is sent from container A to container B, with ifconfig executing on container B.
# mpirun -verbose -n 1 -hosts 31afc6a4-bb14-41aa-b20a-3e295ecb8650 ifconfig
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] Launch arguments: /usr/bin/ssh -q -x 31afc6a4-bb14-41aa-b20a-3e295ecb8650 /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host fbee39ba-8504-4522-8490-4630079153e8 --upstream-port 41211 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 9.73.155.110  netmask 255.255.254.0  broadcast 0.0.0.0
        ether 52:54:00:f0:b9:87  txqueuelen 1000  (Ethernet)
        RX packets 16387  bytes 3270010 (3.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1035  bytes 136208 (136.2 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 124  bytes 21004 (21.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 124  bytes 21004 (21.0 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Replacing the IP with the hostname is enough; alternatively, pass a hostfile to mpirun. Communication between jobmaster and learner, which additionally goes through an LB, has not been tested yet.
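For reference, the hostfile variant might look roughly like this (a sketch; the file name hosts.txt is illustrative, and the hostnames are the ones configured in /etc/hosts above):
cat > hosts.txt <<EOF
fbee39ba-8504-4522-8490-4630079153e8
31afc6a4-bb14-41aa-b20a-3e295ecb8650
EOF
mpirun -n 2 -ppn 1 -f hosts.txt hostname   # one process per container, launched via the hostfile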