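Macvlan-on-Kubernetes Investigation

Overview

This article gives a brief overview of Macvlan networking modes.

Promiscuous mode

Promiscuous mode is a computer networking term. It means that a machine's NIC accepts all traffic passing through it, regardless of whether the destination address is its own. NICs normally work in non-promiscuous mode, in which the NIC only accepts frames from the network port whose destination address is its own. In promiscuous mode the NIC captures every frame arriving on the interface and hands it to the driver; in other words, it does not check the destination MAC address.

Configuring virtual interfaces

The script below splits two virtual interfaces (VLAN sub-interfaces) off eth0 and creates two Macvlan Docker networks, each using one of them as its parent.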

vconfig add eth0 100
vconfig add eth0 200

vconfig set_flag eth0.100 1 1
vconfig set_flag eth0.200 1 1

ifconfig eth0.100 up
ifconfig eth0.200 up

# Create the Macvlan networks
docker network create -d macvlan --subnet=172.16.10.0/24 --gateway=172.16.10.1 -o parent=eth0.100 mac10
docker network create -d macvlan --subnet=172.16.20.0/24 --gateway=172.16.20.1 -o parent=eth0.200 mac20
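
vconfig is deprecated on recent distributions, and the same sub-interfaces can usually be created with iproute2. The sketch below only prints the equivalent `ip` commands (a dry run), so they can be reviewed before being run as root. The helper name `vlan_cmds` is hypothetical, and mapping `set_flag ... 1 1` to `reorder_hdr on` assumes flag 1 is REORDER_HDR, as the vconfig comments in this article suggest.

```shell
#!/usr/bin/env bash
# Print (do not execute) iproute2 commands roughly equivalent to the vconfig steps above.
# vlan_cmds is a hypothetical helper name; arguments: <parent> <vlan-id>
vlan_cmds() {
  local parent=$1 vid=$2
  echo "ip link add link ${parent} name ${parent}.${vid} type vlan id ${vid} reorder_hdr on"
  echo "ip link set dev ${parent}.${vid} up"
}

vlan_cmds eth0 100
vlan_cmds eth0 200
```

Printing the commands instead of running them keeps the sketch safe to try on any machine; pipe the output to `sh` as root only after checking it.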

vconfig

# Install vlan (vconfig) and load the 8021q module
yum install vconfig
modprobe 8021q
lsmod |grep -i 8021q

# Configure two VLANs on the eth0 interface
# vconfig add eth0 100
Added VLAN with VID == 100 to IF -:eth0:-
# vconfig add eth0 200
Added VLAN with VID == 200 to IF -:eth0:-

# Set the VLAN REORDER_HDR flag; the default is fine
# vconfig set_flag eth0.100 1 1
Set flag on device -:eth0.100:- Should be visible in /proc/net/vlan/eth0.100
# vconfig set_flag eth0.200 1 1
Set flag on device -:eth0.200:- Should be visible in /proc/net/vlan/eth0.200

# Configure the network addresses
ifconfig eth0.100 172.16.1.8 netmask 255.255.255.0 up
ifconfig eth0.200 172.16.2.8 netmask 255.255.255.0 up

# Commands for removing the VLANs
# vconfig rem eth0.100
Removed VLAN -:eth0.100:-
# vconfig rem eth0.200
Removed VLAN -:eth0.200:-

flannel configuration

# cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}

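Routing

The routing table in flannel's default mode is straightforward: routes through flannel.1 lead to Pod subnets on other hosts, while packets for Pods on the local host go through cni0.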

# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.22.0.1      0.0.0.0         UG    0      0        0 eth0
10.244.0.0      10.244.0.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.2.0      10.244.2.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.3.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.4.0      10.244.4.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.5.0      10.244.5.0      255.255.255.0   UG    0      0        0 flannel.1
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.22.0.0      0.0.0.0         255.255.240.0   U     0      0        0 eth0
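
The split between local and remote Pod subnets can be read off the table mechanically. The following is a minimal sketch, assuming the `route -n` column layout shown above (interface name in the eighth column) and the interface names cni0 and flannel.1; `classify_routes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read `route -n` output on stdin and print "<destination> <iface> <scope>"
# for every Pod subnet route. classify_routes is a hypothetical helper name.
classify_routes() {
  awk '$8 == "cni0"      { print $1, $8, "local" }
       $8 == "flannel.1" { print $1, $8, "remote" }'
}

# Usage on a node: route -n | classify_routes
```

Header lines and non-Pod routes (eth0) match neither pattern, so they are skipped.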

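stable cluster

The output shows that the VLAN sub-interfaces created on eth1 for macvlan (eth1.228, eth1.233, eth1.240) all share the parent interface's MAC address.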

# PROMISC: promiscuous mode is enabled
eth1: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST>  mtu 1500
        ether 52:54:00:ce:00:a0  txqueuelen 1000  (Ethernet)
        RX packets 1630423643  bytes 817372419479 (761.2 GiB)
        RX errors 0  dropped 612495  overruns 0  frame 0
        TX packets 1704375141  bytes 664356826008 (618.7 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        
eth1.228: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 52:54:00:ce:00:a0  txqueuelen 0  (Ethernet)
        RX packets 244  bytes 13146 (12.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 71  bytes 7130 (6.9 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1.233: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 52:54:00:ce:00:a0  txqueuelen 0  (Ethernet)
        RX packets 7732818  bytes 3368280421 (3.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9487917  bytes 2180371925 (2.0 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1.240: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 52:54:00:ce:00:a0  txqueuelen 0  (Ethernet)
        RX packets 795537153  bytes 538024953755 (501.0 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 913399169  bytes 370031722903 (344.6 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
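
The numeric `flags=` value printed by ifconfig is a decimal bitmask, so the PROMISC state can also be checked from the number alone. A small sketch, relying on the Linux interface flag IFF_PROMISC being 0x100; `is_promisc` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Report whether the numeric flags value printed by ifconfig has the PROMISC bit set.
# IFF_PROMISC is 0x100 in the Linux interface flags; is_promisc is a hypothetical helper name.
is_promisc() {
  local flags=$1
  if (( flags & 0x100 )); then
    echo yes
  else
    echo no
  fi
}

is_promisc 4419   # eth1 above: prints yes
is_promisc 4163   # eth1.228 above: prints no
```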

Machine details

# master
# lspci |grep -i eth
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 Ethernet controller: Red Hat, Inc. Virtio network device
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
inet 10.9.204.77  netmask 255.255.255.0  broadcast 10.9.204.255
# ifconfig eth1
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 52:54:00:44:a0:57  txqueuelen 1000  (Ethernet)

####################

# node1
# lspci |grep -i eth
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 Ethernet controller: Red Hat, Inc. Virtio network device
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.9.204.78  netmask 255.255.255.0  broadcast 10.9.204.255
# ifconfig eth1
eth1: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST>  mtu 1500
        inet 192.168.1.4  netmask 255.255.255.0  broadcast 192.168.1.255        

####################

# node2
# lspci |grep -i eth
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
[root@node2 ~]# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

####################

# node3
# lspci |grep -i eth
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.9.24.178  netmask 255.255.255.0  broadcast 10.9.24.255
        ether 52:54:0a:09:18:b2  txqueuelen 1000  (Ethernet)

NIC configuration

[root@master ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=eth1
DEVICE=eth1
ONBOOT=yes
IPADDR=10.9.204.8
NETMASK=255.255.255.0
GATEWAY=10.9.204.254
[root@master ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=eth0
DEVICE=eth0
ONBOOT=yes
IPADDR=10.9.204.77
NETMASK=255.255.255.0
GATEWAY=10.9.204.254

Testing

ip link set eth1 promisc on
docker network create -d macvlan --subnet=172.16.5.0/24 --gateway=172.16.5.1 -o parent=eth1 macvlan1

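# Alternative with eth0 as the parent. Run this instead of the eth1 commands above, not in addition:
# Docker rejects a second network named macvlan1, and the subnet is the same.
ip link set eth0 promisc on
docker network create -d macvlan --subnet=172.16.5.0/24 --gateway=172.16.5.1 -o parent=eth0 macvlan1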

docker network ls

docker run -itd --name busybox1 --ip=172.16.5.2 --network macvlan1 busybox
docker run -itd --name busybox2 --ip=172.16.5.3 --network macvlan1 busybox

One macvlan network per NIC is enough.

Rancher Macvlan

Rancher's macvlan CNI is a re-developed plugin called static-macvlan-cni. Roughly speaking, it keeps a Pod's assigned IP unchanged even when the Pod is recreated, so adapting it is not easy.

# k get network-attachment-definitions.k8s.cni.cncf.io -A
NAMESPACE                   NAME                        AGE
cadvisor                    static-macvlan-cni-attach   200d
cattle-pipeline             static-macvlan-cni-attach   201d

samplepod (with a macvlan IP), samplepod1

case 1

Ping samplepod's macvlan IP: samplepod1 -> samplepod

ping 10.9.228.249
tcpdump -i eth0 -c 100 -w macvlan-samplepod.pcap

The host's cni bridge is the default gateway, so the ping packets reach the cni bridge first.

Packet capture on the host.

# tcpdump -i cni0 host 10.254.4.51
16:09:23.924996 IP 10.254.4.51 > 10.9.228.249: ICMP echo request, id 63, seq 3, length 64
16:09:24.948972 IP 10.254.4.51 > 10.9.228.249: ICMP echo request, id 63, seq 4, length 64
16:09:25.972982 IP 10.254.4.51 > 10.9.228.249: ICMP echo request, id 63, seq 5, length 64
16:09:27.317031 ARP, Request who-has 10.254.4.1 tell 10.254.4.51, length 28
16:09:27.317055 ARP, Reply 10.254.4.1 is-at fe:31:7a:9e:6d:e3 (oui Unknown), length 28
16:09:34.076693 IP 10.254.4.51 > 10.9.228.249: ICMP echo request, id 64, seq 1, length 64

Packet capture on the host's promiscuous eth1 interface.

# tcpdump -i eth1.228 host 10.9.228.11
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1.228, link-type EN10MB (Ethernet), capture size 262144 bytes
16:45:35.274412 IP 10.1.136.40 > 10.9.228.11: ICMP echo request, id 43525, seq 27, length 64
16:45:35.274460 IP 10.9.228.11 > 10.1.136.40: ICMP echo reply, id 43525, seq 27, length 64
16:45:36.278306 IP 10.1.136.40 > 10.9.228.11: ICMP echo request, id 43525, seq 28, length 64
16:45:36.278347 IP 10.9.228.11 > 10.1.136.40: ICMP echo reply, id 43525, seq 28, length 64

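None of the group's macvlan subnets runs on a single-NIC host, so that setup is worth trying.

I still have not figured out how the group's macvlan traffic actually flows.

Another finding: if a host lacks the interface that macvlan is configured on (eth1 here), a Pod placed on it cannot start.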

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "38ac8411a5e73d2f78d76027573ab4b2f9ba7d33d6a4342032f057cb8a545396" network for pod "a-0": networkPlugin cni failed to set up pod "a-0_default" network: Multus: [default/a-0]: error adding container to network "static-macvlan-cni-attach": delegateAdd: error invoking DelegateAdd - "static-macvlan-cni": error in getting result from AddNetwork: Static Macvlan: failed to set promisc on: eth1 failed to lookup iface "eth1": Link not found

Changing the subnet causes problems.

Events:
Type     Reason                  Age               From     Message
  ----     ------                  ----              ----     -------
Warning  FailedCreatePodSandBox  26s               kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "93ab4d79b614fc0ce22161d746c726447cc8cc2224a6807a1f50cde72dd42eb5" network for pod "c-0": networkPlugin cni failed to set up pod "c-0_default" network: Multus: [default/c-0]: error adding container to network "static-macvlan-cni-attach": delegateAdd: error invoking DelegateAdd - "static-macvlan-cni": error in getting result from AddNetwork: netplugin failed but error parsing its diagnostic message "ipam.ExecDel: static-ipam CNI_COMMAND is not DEL\n{\n    \"code\": 100,\n    \"msg\": \"failed to change default gateway network is unreachable\"\n}": invalid character 'i' looking for beginning of value
docker network create -d macvlan --subnet=173.16.125.0/24 --gateway=10.9.228.254 -o parent=eth0.125 macvlan-125
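
The gateway in the command above (10.9.228.254) does not belong to the subnet (173.16.125.0/24), which may be related to the "network is unreachable" message; that link is a guess, not something confirmed here. A quick way to check gateway and subnet consistency before creating a network is sketched below. Both helper names (`ip_to_int`, `ip_in_cidr`) are hypothetical, and the code handles plain dotted-quad IPv4 only.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if <ip> lies inside <cidr>, 1 otherwise. Both helper names are hypothetical.
ip_in_cidr() {
  local ip=$1 net=${2%/*} prefix=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}

ip_in_cidr 10.9.228.254 173.16.125.0/24 && echo inside || echo outside   # prints outside
ip_in_cidr 10.9.228.254 10.9.228.0/24 && echo inside || echo outside     # prints inside
```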

macvlan config

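With promiscuous mode enabled on the switch side, a single NIC on one host can carry several macvlan networks in different VLANs. Traffic between different VLANs has to be forwarded by a router; the VLANs cannot reach each other directly.

Within the same macvlan network, endpoints use the widely used bridge mode and communicate directly at layer 2, finding each other through ARP broadcasts.

So the NIC's VLAN sub-interfaces and the subnets need to be configured in advance.

Suppose promiscuous mode is not enabled on the switch and each host NIC carries only one macvlan network, i.e. one NIC maps to one VLAN. That does not look like a big problem. Why?

  1. A typical class C network has only 254 usable IPs, and our hosts basically never need that many
  2. One host currently has four macvlan networks on four VLANs, which already gives plenty of IPs to choose from. If four hosts all use these same four VLANs, that seems quite manageable?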
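
The address budget in the list above is simple arithmetic: a subnet with prefix length p has 2^(32 - p) addresses, minus the network and broadcast addresses. A minimal sketch; `usable_hosts` is a hypothetical helper name and only makes sense for prefixes up to /30.

```shell
#!/usr/bin/env bash
# Usable host addresses in an IPv4 subnet: 2^(32 - prefix) minus network and broadcast.
# usable_hosts is a hypothetical helper name; valid for prefixes up to /30.
usable_hosts() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 2 ))
}

usable_hosts 24                        # prints 254
echo $(( 4 * $(usable_hosts 24) ))     # four /24 VLANs: prints 1016
```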

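The single-macvlan setup I simulated myself should behave as follows.

  1. external -> pod: access from my local Mac should work
  2. pod -> external: a Pod reaching my local Mac should also work
  3. pod <-> pod: through flannel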

Experiment

There are two VLANs, 228 and 229. In theory, subnets in different VLANs can only reach each other through routing.

vlan 228: a 10.9.228.11
vlan 229: d 10.9.229.10

While d pings a's address, the following commands capture the corresponding packets.

tcpdump -i eth1.228 -nn
tcpdump -i eth1.229 -nn

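Why does the ping succeed? Presumably because of the routing rule below: the packet leaves through eth1, that is, onto the eth1.229 virtual interface, crosses the bridge/switch to eth1.228, and is then received on eth1 inside a?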

default via 10.9.229.254 dev eth1

A helper script for entering a container's network namespace.

#!/usr/bin/env bash
# Enter the network namespace of a Pod's container (Docker runtime).
# Usage: <script> <pod_name> [namespace]
function e_net() {
  set -eu
  local container_id pid cmd
  # Resolve the container ID from the Pod status and strip the docker:// prefix
  container_id=$(kubectl get pod "${pod_name}" -n "${namespace}" -o template --template='{{range .status.containerStatuses}}{{.containerID}}{{end}}' | sed 's/docker:\/\/\(.*\)$/\1/')
  # Look up the container's PID on the host
  pid=$(docker inspect -f '{{.State.Pid}}' "${container_id}")
  echo -e "\033[32m Entering pod netns for ${namespace}/${pod_name} \033[0m\n"
  cmd="nsenter -n -t ${pid}"
  echo -e "\033[32m Execute the command: ${cmd} \033[0m"
  ${cmd}
}

# Run the function
pod_name=$1
namespace=${2-"default"}
e_net

kubectl get pod a-0 -n default -o template --template='{{range .status.containerStatuses}}{{.containerID}}{{end}}' | sed 's/docker:\/\/\(.*\)$/\1/'

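Progress

  1. IPs can be allocated, because the plugin handles that, so this part is fine. The NIC can be configured too, but whether the network is actually reachable is another matter
  2. Routing and network setup need support from the infrastructure SRE team; this is not counted in the OKR
  3. The static IP problem needs to be solved

It is fairly certain that Rancher, through the static macvlan plugin, uses the configured master (the NIC name) and vlan to locate the corresponding virtual interface on the host.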

spec:
  cidr: 10.9.228.0/24
  gateway: 10.9.228.254
  master: eth1
  mode: bridge
  podDefaultGateway:
    enable: true
    serviceCidr: 10.255.0.0/16
  ranges:
  - rangeEnd: 10.9.228.250
    rangeStart: 10.9.228.10
  routes:
  - dst: 10.254.0.0/16
    gw: 169.254.1.1
    iface: eth0
  - dst: 10.9.204.0/24
    gw: 169.254.1.1
    iface: eth0
  - dst: 10.9.205.0/24
    gw: 169.254.1.1
    iface: eth0
  - dst: 10.9.206.0/24
    gw: 169.254.1.1
    iface: eth0
  - dst: 10.9.207.0/24
    gw: 169.254.1.1
    iface: eth0
  vlan: 228

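Thoughts

Why are two NICs recommended? If macvlan is virtualized on the only NIC, the other containers on that machine cannot reach the macvlan Pod, because a macvlan interface cannot talk directly to its parent interface on the same host. With two NICs, one of them can be configured as the bridge/gateway for local container traffic, so that packets can be routed back to this host.

Is Macvlan really necessary?

If a reliable layer-4 proxy is available, it actually is not.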

Warning
This article was last updated on December 1, 2021. Its content may be out of date; please use it with caution.