Overview
This post gives a brief overview of Macvlan network modes.
Promiscuous mode
Promiscuous mode is a term from computer networking. It means that a machine's NIC accepts all traffic passing through it, regardless of whether the destination address is its own. Normally a NIC works in non-promiscuous mode and only accepts frames arriving on the port whose destination address points to itself. When a NIC works in promiscuous mode, it captures everything that arrives on the interface and hands it to the driver, i.e. it does not check the destination MAC address.
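For example (a minimal sketch, assuming the interface is called eth0), promiscuous mode can be toggled and checked from the command line:

```bash
# Enable promiscuous mode on eth0 (interface name is an assumption; requires root)
ip link set eth0 promisc on

# The PROMISC flag should now appear in the interface flags,
# e.g. <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP>
ip link show eth0

# Disable it again
ip link set eth0 promisc off
```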
Configuring the virtual NICs
The script below creates two VLAN sub-interfaces on top of eth0 and then creates two Docker Macvlan networks, one on top of each sub-interface.
```bash
# Create two VLAN sub-interfaces on eth0 and set the REORDER_HDR flag
vconfig add eth0 100
vconfig add eth0 200
vconfig set_flag eth0.100 1 1
vconfig set_flag eth0.200 1 1
ifconfig eth0.100 up
ifconfig eth0.200 up
# Create a Macvlan Docker network on each sub-interface
docker network create -d macvlan --subnet=172.16.10.0/24 --gateway=172.16.10.1 -o parent=eth0.100 mac10
docker network create -d macvlan --subnet=172.16.20.0/24 --gateway=172.16.20.1 -o parent=eth0.200 mac20
```
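The block above only creates the networks; attaching containers is a separate step. A sketch of what that might look like (container names, IPs, and the busybox image are illustrative, not from the original):

```bash
# Attach one container to each Macvlan network (illustrative names and IPs)
docker run -itd --name c10 --network mac10 --ip 172.16.10.10 busybox
docker run -itd --name c20 --network mac20 --ip 172.16.20.10 busybox

# Each container gets its own MAC address on the corresponding VLAN sub-interface
docker exec c10 ip addr show eth0
docker exec c20 ip addr show eth0
```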
vconfig
```bash
# Install vlan (vconfig) and load the 8021q module
yum install vconfig
modprobe 8021q
lsmod | grep -i 8021q
# Configure two VLANs on the eth0 interface
# vconfig add eth0 100
Added VLAN with VID == 100 to IF -:eth0:-
# vconfig add eth0 200
Added VLAN with VID == 200 to IF -:eth0:-
# Set the VLAN REORDER_HDR flag (the default is fine)
# vconfig set_flag eth0.100 1 1
Set flag on device -:eth0.100:- Should be visible in /proc/net/vlan/eth0.100
# vconfig set_flag eth0.200 1 1
Set flag on device -:eth0.200:- Should be visible in /proc/net/vlan/eth0.200
# Configure the network information
ifconfig eth0.100 172.16.1.8 netmask 255.255.255.0 up
ifconfig eth0.200 172.16.2.8 netmask 255.255.255.0 up
# Commands to remove the VLANs
# vconfig rem eth0.100
Removed VLAN -:eth0.100:-
# vconfig rem eth0.200
Removed VLAN -:eth0.200:-
```
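vconfig is deprecated on newer distributions; the same sub-interfaces can be managed with iproute2. An equivalent sketch (not part of the original setup):

```bash
# Create the VLAN sub-interfaces with iproute2 instead of vconfig
ip link add link eth0 name eth0.100 type vlan id 100
ip link add link eth0 name eth0.200 type vlan id 200

# Assign addresses and bring them up
ip addr add 172.16.1.8/24 dev eth0.100
ip addr add 172.16.2.8/24 dev eth0.200
ip link set eth0.100 up
ip link set eth0.200 up

# Remove them again
ip link del eth0.100
ip link del eth0.200
```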
flannel configuration
```bash
# cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
```
Routing
The routing table in flannel's default mode is quite clear: apart from cni0, everything that goes through flannel.1 is routed to other hosts, while packets for this host's own pod subnet go out via cni0.
```bash
# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.22.0.1      0.0.0.0         UG    0      0        0 eth0
10.244.0.0      10.244.0.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.2.0      10.244.2.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.3.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.4.0      10.244.4.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.5.0      10.244.5.0      255.255.255.0   UG    0      0        0 flannel.1
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
172.22.0.0      0.0.0.0         255.255.240.0   U     0      0        0 eth0
```
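To see how those flannel.1 routes are actually carried (the VXLAN device details and the per-node forwarding entries), the following commands can help; they are not from the original notes:

```bash
# Show the VXLAN device that backs flannel.1 (VNI, local IP, UDP port, etc.)
ip -d link show flannel.1

# Per-node MAC -> VTEP mappings that flanneld programs for the other hosts
bridge fdb show dev flannel.1

# ARP entries for the other nodes' flannel.1 addresses
ip neigh show dev flannel.1
```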
stable cluster
As the output below shows, the sub-interfaces virtualised on top of eth1 (the parents used for macvlan) all share the parent NIC's MAC address.
```bash
# PROMISC indicates that promiscuous mode is enabled
eth1: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 1500
ether 52:54:00:ce:00:a0 txqueuelen 1000 (Ethernet)
RX packets 1630423643 bytes 817372419479 (761.2 GiB)
RX errors 0 dropped 612495 overruns 0 frame 0
TX packets 1704375141 bytes 664356826008 (618.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1.228: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 52:54:00:ce:00:a0 txqueuelen 0 (Ethernet)
RX packets 244 bytes 13146 (12.8 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 71 bytes 7130 (6.9 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1.233: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 52:54:00:ce:00:a0 txqueuelen 0 (Ethernet)
RX packets 7732818 bytes 3368280421 (3.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 9487917 bytes 2180371925 (2.0 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1.240: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 52:54:00:ce:00:a0 txqueuelen 0 (Ethernet)
RX packets 795537153 bytes 538024953755 (501.0 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 913399169 bytes 370031722903 (344.6 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
```
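A quicker way to confirm that eth1 and its VLAN sub-interfaces share the same MAC (a small sketch using iproute2, not from the original output):

```bash
# Brief view: name, state, and MAC for eth1 and its VLAN sub-interfaces
ip -br link show | grep eth1
```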
Machine details
```bash
# master
# lspci |grep -i eth
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 Ethernet controller: Red Hat, Inc. Virtio network device
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.9.204.77 netmask 255.255.255.0 broadcast 10.9.204.255
# ifconfig eth1
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
ether 52:54:00:44:a0:57 txqueuelen 1000 (Ethernet)
####################
# node1
# lspci |grep -i eth
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 Ethernet controller: Red Hat, Inc. Virtio network device
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.9.204.78 netmask 255.255.255.0 broadcast 10.9.204.255
# ifconfig eth1
eth1: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST> mtu 1500
inet 192.168.1.4 netmask 255.255.255.0 broadcast 192.168.1.255
####################
# node2
# lspci |grep -i eth
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
[root@node2 ~]# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
####################
# node3
# lspci |grep -i eth
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.9.24.178 netmask 255.255.255.0 broadcast 10.9.24.255
ether 52:54:0a:09:18:b2 txqueuelen 1000 (Ethernet)
```
NIC configuration
```bash
[root@master ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth1
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=eth1
DEVICE=eth1
ONBOOT=yes
IPADDR=10.9.204.8
NETMASK=255.255.255.0
GATEWAY=10.9.204.254
[root@master ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=eth0
DEVICE=eth0
ONBOOT=yes
IPADDR=10.9.204.77
NETMASK=255.255.255.0
GATEWAY=10.9.204.254
```
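The files above only cover the physical NICs. On CentOS 7 a VLAN sub-interface such as eth1.228 can be persisted with its own ifcfg file; a sketch (the file contents are an assumption, not taken from these machines):

```bash
# Persist a VLAN sub-interface on CentOS 7 (illustrative contents)
cat > /etc/sysconfig/network-scripts/ifcfg-eth1.228 <<'EOF'
DEVICE=eth1.228
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
EOF
systemctl restart network
```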
Testing
```bash
ip link set eth1 promisc on
docker network create -d macvlan --subnet=172.16.5.0/24 --gateway=172.16.5.1 -o parent=eth1 macvlan1
ip link set eth0 promisc on
# The next create repeats the one above (same name and parent); it will fail
# because a network named macvlan1 already exists.
docker network create -d macvlan --subnet=172.16.5.0/24 --gateway=172.16.5.1 -o parent=eth1 macvlan1
docker network ls
docker run -itd --name busybox1 --ip=172.16.5.2 --network macvlan1 busybox
docker run -itd --name busybox2 --ip=172.16.5.3 --network macvlan1 busybox
```
One macvlan network per NIC is enough.
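To check that the two containers created above can actually reach each other over the macvlan1 network (a sketch; the host itself is expected not to reach them, since a macvlan parent interface cannot talk to its own children):

```bash
# Container-to-container ping within the same macvlan network
docker exec busybox1 ping -c 3 172.16.5.3
docker exec busybox2 ping -c 3 172.16.5.2

# From the host, pinging 172.16.5.2/3 directly is expected to fail.
```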
Rancher Macvlan
Rancher's macvlan CNI is a re-developed plugin called static-macvlan-cni. Roughly speaking, the assigned IP does not change even when a pod is recreated, so adapting it is not that easy.
```bash
# k get network-attachment-definitions.k8s.cni.cncf.io -A
NAMESPACE NAME AGE
cadvisor static-macvlan-cni-attach 200d
cattle-pipeline static-macvlan-cni-attach 201d
```
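To look at the delegate CNI config behind one of those attachment definitions (a command built from the names shown above, not from the original notes):

```bash
kubectl get network-attachment-definitions.k8s.cni.cncf.io static-macvlan-cni-attach \
  -n cadvisor -o yaml
```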
samplepod: a pod with a macvlan IP
samplepod1
Case 1: ping samplepod's macvlan IP, samplepod1 -> samplepod:
ping 10.9.228.249
tcpdump -i eth0 -c 100 -w macvlan-samplepod.pcap
The host's cni0 acts as the default gateway, so the ping packets first reach the cni0 bridge.
Packet capture on the host:
```bash
# tcpdump -i cni0 host 10.254.4.51
16:09:23.924996 IP 10.254.4.51 > 10.9.228.249: ICMP echo request, id 63, seq 3, length 64
16:09:24.948972 IP 10.254.4.51 > 10.9.228.249: ICMP echo request, id 63, seq 4, length 64
16:09:25.972982 IP 10.254.4.51 > 10.9.228.249: ICMP echo request, id 63, seq 5, length 64
16:09:27.317031 ARP, Request who-has 10.254.4.1 tell 10.254.4.51, length 28
16:09:27.317055 ARP, Reply 10.254.4.1 is-at fe:31:7a:9e:6d:e3 (oui Unknown), length 28
16:09:34.076693 IP 10.254.4.51 > 10.9.228.249: ICMP echo request, id 64, seq 1, length 64
```
Packet capture on the host's promiscuous eth1 sub-interface:
```bash
# tcpdump -i eth1.228 host 10.9.228.11
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1.228, link-type EN10MB (Ethernet), capture size 262144 bytes
16:45:35.274412 IP 10.1.136.40 > 10.9.228.11: ICMP echo request, id 43525, seq 27, length 64
16:45:35.274460 IP 10.9.228.11 > 10.1.136.40: ICMP echo reply, id 43525, seq 27, length 64
16:45:36.278306 IP 10.1.136.40 > 10.9.228.11: ICMP echo request, id 43525, seq 28, length 64
16:45:36.278347 IP 10.9.228.11 > 10.1.136.40: ICMP echo reply, id 43525, seq 28, length 64
```
None of the group's macvlan subnets run on a single NIC; that is worth trying out.
I still have not figured out how the group's macvlan traffic actually flows.
Another finding: if the host does not have macvlan configured, the Pod cannot come up on it:
```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "38ac8411a5e73d2f78d76027573ab4b2f9ba7d33d6a4342032f057cb8a545396" network for pod "a-0": networkPlugin cni failed to set up pod "a-0_default" network: Multus: [default/a-0]: error adding container to network "static-macvlan-cni-attach": delegateAdd: error invoking DelegateAdd - "static-macvlan-cni": error in getting result from AddNetwork: Static Macvlan: failed to set promisc on: eth1 failed to lookup iface "eth1": Link not found
```
Changing the subnet also causes problems:
```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 26s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "93ab4d79b614fc0ce22161d746c726447cc8cc2224a6807a1f50cde72dd42eb5" network for pod "c-0": networkPlugin cni failed to set up pod "c-0_default" network: Multus: [default/c-0]: error adding container to network "static-macvlan-cni-attach": delegateAdd: error invoking DelegateAdd - "static-macvlan-cni": error in getting result from AddNetwork: netplugin failed but error parsing its diagnostic message "ipam.ExecDel: static-ipam CNI_COMMAND is not DEL\n{\n \"code\": 100,\n \"msg\": \"failed to change default gateway network is unreachable\"\n}": invalid character 'i' looking for beginning of value
```
```bash
docker network create -d macvlan --subnet=173.16.125.0/24 --gateway=10.9.228.254 -o parent=eth0.125 macvlan-125
```
macvlan config
With promiscuous mode enabled on the switch side, a single NIC on a host can carry several macvlan networks on different VLANs; traffic between different VLANs has to be forwarded by a router and cannot flow directly.
Within the same macvlan network the popular bridge mode is used, so endpoints talk directly at layer 2 via ARP flooding.
So the NIC's VLAN sub-interfaces and the subnets have to be prepared in advance.
Suppose promiscuous mode is not enabled on the switch and each host NIC carries just one macvlan network, i.e. one NIC per VLAN; that does not feel like much of a problem. Why?
- A typical class C network only offers 254 usable IPs, and our hosts basically never need that many.
- Right now one host carries four macvlans on four VLANs, which already gives plenty of addresses to choose from; if four hosts all carried the same four VLANs, that still seems manageable.
In my own single-macvlan simulation, the situation looks like this:
- external -> pod: reached via the local MAC, this should work
- pod -> external: the pod reaching my local MAC should also work
- pod <-> pod: via flannel
Experiment
Two VLANs, 228 and 229. In theory, subnets on different VLANs need to be forwarded through a router.
```
vlan 228: a 10.9.228.11
vlan 229: d 10.9.229.10
```
Pinging a's address from d, packets like these can be captured with the commands below.
```bash
tcpdump -i eth1.228 -nn
tcpdump -i eth1.229 -nn
```
Why does the ping get through? Is it because of the route below: the packet leaves via eth1 (i.e. the eth1.229 sub-interface), crosses the bridge/switch onto eth1.228, and is then received by eth1 inside a?
```
default via 10.9.229.254 dev eth1
```
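One way to check that guess (commands not in the original; run on d, the 229-side host):

```bash
# Which interface and gateway the kernel would actually use for a's address
ip route get 10.9.228.11

# Whether the neighbour/gateway is resolved on the 229 sub-interface
ip neigh show dev eth1.229
```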
A helper script for entering a container's network namespace.
```bash
#!/usr/bin/env bash
# Enter the network namespace of a pod's container via nsenter.
function e_net() {
  set -eu
  pod=`kubectl get pod ${pod_name} -n ${namespace} -o template --template='{{range .status.containerStatuses}}{{.containerID}}{{end}}' | sed 's/docker:\/\/\(.*\)$/\1/'`
  pid=`docker inspect -f {{.State.Pid}} $pod`
  echo -e "\033[32m Entering pod netns for ${namespace}/${pod_name} \033[0m\n"
  cmd="nsenter -n -t ${pid}"
  echo -e "\033[32m Execute the command: ${cmd} \033[0m"
  ${cmd}
}

# Run the function
pod_name=$1
namespace=${2-"default"}
e_net
```
```bash
kubectl get pod a-0 -n default -o template --template='{{range .status.containerStatuses}}{{.containerID}}{{end}}' | sed 's/docker:\/\/\(.*\)$/\1/'
```
Progress
- The IP can be allocated (the plugin handles that, no big deal) and the NIC can be set up, but whether the network actually works is another matter
- Routing and the underlying network need cooperation from the infrastructure SRE team; this is not part of the OKR
- The static IP issue still needs to be solved
It is basically confirmed that Rancher uses the static macvlan plugin to find the corresponding virtual NIC on the host via the specified master (i.e. the NIC name) and the vlan:
```yaml
spec:
  cidr: 10.9.228.0/24
  gateway: 10.9.228.254
  master: eth1
  mode: bridge
  podDefaultGateway:
    enable: true
    serviceCidr: 10.255.0.0/16
  ranges:
  - rangeEnd: 10.9.228.250
    rangeStart: 10.9.228.10
  routes:
  - dst: 10.254.0.0/16
    gw: 169.254.1.1
    iface: eth0
  - dst: 10.9.204.0/24
    gw: 169.254.1.1
    iface: eth0
  - dst: 10.9.205.0/24
    gw: 169.254.1.1
    iface: eth0
  - dst: 10.9.206.0/24
    gw: 169.254.1.1
    iface: eth0
  - dst: 10.9.207.0/24
    gw: 169.254.1.1
    iface: eth0
  vlan: 228
```
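Combined with the earlier sandbox error (failed to set promisc on eth1 / Link not found), this suggests the host has to provide the master NIC and the matching VLAN sub-interface up front. A sketch of that preparation (an assumption, not Rancher's actual provisioning code):

```bash
# master=eth1, vlan=228 from the spec above
ip link set eth1 promisc on
ip link add link eth1 name eth1.228 type vlan id 228
ip link set eth1.228 up
```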
Thoughts
Why recommend two NICs? If the virtualisation is done on a single NIC and macvlan takes that NIC over, the other containers on the machine cannot reach this Pod at all. With two NICs, one of them can be configured as the bridge/gateway for local container traffic, so packets can be forwarded back to this host. A sketch of the usual single-NIC workaround follows.
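For completeness, the usual single-NIC workaround is to give the host its own macvlan interface on the same parent and route the container addresses through it; a sketch (interface name and IPs are hypothetical):

```bash
# Create a host-side macvlan "shim" on the same parent (names/IPs hypothetical)
ip link add mac-shim link eth0 type macvlan mode bridge
ip addr add 172.16.10.250/32 dev mac-shim
ip link set mac-shim up

# Send traffic for a macvlan container through the shim instead of the parent NIC,
# so the host and the container can reach each other.
ip route add 172.16.10.10/32 dev mac-shim
```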
Is Macvlan really necessary?
With a reliable layer 4 proxy, not really.