
MPI Communication Problems on the Compute Platform

Overview

Our GPU compute Kubernetes cluster reports errors when running MPI commands. The troubleshooting process is described below.

Background

Running mpirun -n 1 echo hello in an ordinary GPU container works fine, but the same command fails in a GPU container on the compute platform.

[Screenshot: error from mpirun -n 1 echo hello in the compute-platform GPU container]

The GPU container also cannot run remote commands through mpirun.

[Screenshot: failure when mpirun launches a remote command from the GPU container]

Troubleshooting

Problem 1

According to the official Intel MPI documentation, setting I_MPI_HYDRA_TOPOLIB=ipl should be enough.

A quick look with gdb shows that, without this environment variable set, the following error occurs in the compute-platform GPU container.

# gdb --args mpiexec -n 1 echo hello
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mpiexec...done.
(gdb) r
Starting program: /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/mpiexec -n 1 echo hello
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
ipl_detect_machine_topology ()
    at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1704
1704	../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c: No such file or directory.
(gdb) bt
#0  ipl_detect_machine_topology ()
    at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1704
#1  0x0000000000447d62 in ipl_processor_info (info=0x6dc940, pid=0x6,
    detect_platform_only=10)
    at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_processor.c:1943
#2  0x000000000044a112 in ipl_entrance (detect_platform_only=7194944)
    at ../../../../../src/pm/i_hydra/../../intel/ipl/include/../src/ipl_main.c:19
#3  0x000000000041e1c2 in i_read_default_env ()
    at ../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec_params.h:241
#4  0x000000000041bc7e in mpiexec_get_parameters (t_argv=0x6dc940)
    at ../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1350
#5  0x0000000000404a77 in main (argc=7194944, argv=0x6)
    at ../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1743
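
The backtrace shows the crash happens inside ipl's topology detection. As a quick sanity check, the variable can be overridden inside the same gdb session and the program re-run, trying the values discussed in this post (ipl, or the hwloc variant); this is only a sketch of the check, not a fix:

(gdb) set environment I_MPI_HYDRA_TOPOLIB ipl
(gdb) run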

The image hub.oa.com/gameai/ddrl-gpu:impi-test1 ships a rather old pciutils (only 3.2.1). Following db's suggestion, it was upgraded; the upgraded image is hub.oa.com/gameai/ddrl-gpu:latest. hwloc is also installed in that image and can be tested with the lstopo command.

# apt-get -s install pciutils
Reading package lists... Done
Building dependency tree
Reading state information... Done
pciutils is already the newest version (1:3.5.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
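
Since hwloc is already installed in the latest image, topology detection inside the container can also be spot-checked directly with lstopo, as mentioned above (a minimal check; lstopo-no-graphics is the text-only variant shipped with hwloc):

# confirm hwloc is present and can read the container's topology
lstopo --version
lstopo-no-graphics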

The test results are as follows.

  1. After the upgrade the error still occurs, so the pciutils version is ruled out.
  2. Setting export I_MPI_HYDRA_TOPOLIB=/hwloc-2.1.0/hwloc makes the command succeed, but the environment variable still has to be set explicitly.

Comparing the environments of the TenC GPU container and the compute-platform GPU container turned up no real differences (roughly, only the install location of the Nvidia libraries differs). The official documentation does not insist on hwloc either; both libraries are tools for detecting the physical topology, with ipl being Intel MPI's own implementation. In my view, simply specifying the library through the environment variable is good enough, so problem 1 is tentatively resolved, though it can be revisited.
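
A minimal sketch of that approach, using the value that worked in the test above (the profile.d filename is only illustrative, and the hwloc path may differ between images):

# make the topology library explicit for login shells in the image, then verify
echo 'export I_MPI_HYDRA_TOPOLIB=/hwloc-2.1.0/hwloc' >> /etc/profile.d/impi_topolib.sh
. /etc/profile.d/impi_topolib.sh
mpirun -n 1 echo hello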

Problem 2

Next, test the network between two compute-platform GPU containers. After the containers start, set up passwordless SSH login between them.

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
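
Note that the two commands above only append the key to the local authorized_keys. For two separate containers, each side's public key also has to be appended to the other side's ~/.ssh/authorized_keys, for example (peer IP taken from this test; the root user is assumed here):

# on container A: install A's public key on container B, then verify passwordless login
ssh-copy-id root@9.73.155.110
ssh -o BatchMode=yes 9.73.155.110 hostname
# repeat in the opposite direction on container B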

Running the command provided by the business team, mpirun -verbose -n 1 -host 9.73.155.16 (the peer's address) echo hello, fails; details below.

[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] Launch arguments: /usr/bin/ssh -q -x 9.73.155.110 /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host fbee39ba-8504-4522-8490-4630079153e8 --upstream-port 43143 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:114): unable to run proxy on 9.73.155.110 (pid 8928)
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:152): check exit codes error
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:205): poll for event error
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:731): error waiting for event
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1919): error setting up the boostrap proxies

Checking the documentation shows that -hosts should use hostnames rather than IP addresses, so /etc/hosts was configured on both containers A and B.

# cat /etc/hosts
# Kubernetes-managed hosts file.
9.73.155.16	 fbee39ba-8504-4522-8490-4630079153e8
9.73.155.110 31afc6a4-bb14-41aa-b20a-3e295ecb8650
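
Before retrying mpirun, it is worth confirming that the hostname resolves and that passwordless ssh by hostname works (hostnames as configured above):

# check name resolution and ssh by hostname from container A
getent hosts 31afc6a4-bb14-41aa-b20a-3e295ecb8650
ssh -q 31afc6a4-bb14-41aa-b20a-3e295ecb8650 hostname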

Run another command to make sure the request really goes from container A to container B, and have container B execute ifconfig.

# mpirun -verbose -n 1 -hosts 31afc6a4-bb14-41aa-b20a-3e295ecb8650 ifconfig
[mpiexec@fbee39ba-8504-4522-8490-4630079153e8] Launch arguments: /usr/bin/ssh -q -x 31afc6a4-bb14-41aa-b20a-3e295ecb8650 /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_bstrap_proxy --upstream-host fbee39ba-8504-4522-8490-4630079153e8 --upstream-port 41211 --pgid 0 --launcher ssh --launcher-number 0 --base-path /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 9.73.155.110  netmask 255.255.254.0  broadcast 0.0.0.0
        ether 52:54:00:f0:b9:87  txqueuelen 1000  (Ethernet)
        RX packets 16387  bytes 3270010 (3.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1035  bytes 136208 (136.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 124  bytes 21004 (21.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 124  bytes 21004 (21.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Replacing the IP with the hostname is enough; alternatively, pass a hostfile to mpirun. The communication path where jobmaster and learner are additionally separated by an LB has not been tested yet.
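
A minimal sketch of the hostfile variant (the hostnames are the two containers above; the file name is arbitrary, and -ppn 1 places one rank per host):

# run one rank on each of the two containers via a hostfile
cat > hosts.txt <<EOF
fbee39ba-8504-4522-8490-4630079153e8
31afc6a4-bb14-41aa-b20a-3e295ecb8650
EOF
mpirun -n 2 -ppn 1 -f hosts.txt hostname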
