An EC2 instance has two network interfaces in two subnets, each with its own EIP. What to do when one EIP cannot be pinged?

Problem


- An EC2 instance (RHEL 7.5) has two network interfaces. Each interface sits in its own public subnet and has its own public IP.
- Pinging both IP addresses from another network shows that one IP responds and the other does not.

[ec2-user@ip-172-31-30-14 ~]$ ping 52.80.82.70 -c2
PING 52.80.82.70 (52.80.82.70) 56(84) bytes of data.
64 bytes from 52.80.82.70: icmp_seq=1 ttl=63 time=1.71 ms
64 bytes from 52.80.82.70: icmp_seq=2 ttl=63 time=1.80 ms

[ec2-user@ip-172-31-30-14 ~]$ ping 52.81.2.166 -c2
PING 52.81.2.166 (52.81.2.166) 56(84) bytes of data.

--- 52.81.2.166 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1004ms

Steps to reproduce

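1. Create a VPC with IPv4 CIDR 10.0.0.0/16.
2. Create two subnets in this VPC, with CIDRs 10.0.0.0/24 and 10.0.2.0/24.
3. Create an Internet Gateway (igw) and attach it to the VPC. In the VPC's main route table, add a 0.0.0.0/0 route pointing to the igw.
4. To make the problem easy to reproduce, create a Security Group that allows all traffic from and to any address.
5. Launch an instance from a RHEL 7.5 image with two network interfaces (ENIs), one in each subnet. Do not assign public IPs yet.
6. Allocate two Elastic IPs (EIPs) and associate one with each of the two interfaces.
7. Log in to the OS. By default RHEL 7 shows only one connection, so the second interface has to be activated manually.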

[root@ip-10-0-0-29 ~]# nmcli con
NAME         UUID                                  TYPE      DEVICE
System eth0  5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03  ethernet  eth0

[root@ip-10-0-0-29 ~]# nmcli dev con eth1
Device 'eth1' successfully activated with 'bba19bdc-c101-45ba-a6ec-fc4a8988ad1b'.

[root@ip-10-0-0-29 ~]# ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP group default qlen 1000
    inet 10.0.0.29/24 brd 10.0.0.255 scope global noprefixroute dynamic eth0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP group default qlen 1000
    inet 10.0.2.37/24 brd 10.0.2.255 scope global noprefixroute dynamic eth1

8. Start a tcpdump capture on RHEL 7, then ping both of the instance's EIPs from another network. RHEL 7 answers the ping requests to 10.0.0.29 (52.80.82.70) but does not reply to the requests to 10.0.2.37 (52.81.2.166).

[root@ip-10-0-0-29 ~]# tcpdump -i any -n icmp
07:05:45.994003 IP 10.0.0.29 > 54.202.202.11: ICMP 10.0.0.29 udp port scp-config unreachable, length 52
07:05:56.751808 IP 52.81.15.146 > 10.0.0.29: ICMP echo request, id 31537, seq 1, length 64
07:05:56.751839 IP 10.0.0.29 > 52.81.15.146: ICMP echo reply, id 31537, seq 1, length 64
07:05:57.753706 IP 52.81.15.146 > 10.0.0.29: ICMP echo request, id 31537, seq 2, length 64
07:05:57.753733 IP 10.0.0.29 > 52.81.15.146: ICMP echo reply, id 31537, seq 2, length 64
^^^^ ping to 52.80.82.70 succeeds ^^^^
vvvv ping to 52.81.2.166 fails vvvv
07:06:01.368063 IP 52.81.15.146 > 10.0.2.37: ICMP echo request, id 31538, seq 1, length 64
07:06:02.389995 IP 52.81.15.146 > 10.0.2.37: ICMP echo request, id 31538, seq 2, length 64

So the problem appears to be that the operating system does not send a reply.
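Spotting the missing replies by eye gets tedious on longer captures. The following is a minimal sketch, not part of the original investigation, that pairs echo requests with echo replies in tcpdump's text output. The function name `unanswered_pings` is made up for illustration, and the field positions assume the `tcpdump -n icmp` line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: list ICMP echo requests that never got an echo reply.
# Reads tcpdump text lines on stdin and prints "<pinged address> id <id> seq <seq>".
unanswered_pings() {
  awk '
    /ICMP echo request/ {
      dst = $5; sub(/:$/, "", dst)         # "10.0.2.37:" -> "10.0.2.37"
      key = dst " id " $10 " seq " $12     # id and seq still carry trailing commas
      gsub(/,/, "", key)
      order[++n] = key; pending[key] = 1
    }
    /ICMP echo reply/ {
      key = $3 " id " $10 " seq " $12      # the reply source is the pinged address
      gsub(/,/, "", key)
      delete pending[key]
    }
    END { for (i = 1; i <= n; i++) if (order[i] in pending) print order[i] }
  '
}

# Demo on lines copied from the capture above.
unanswered_pings <<'EOF'
07:05:56.751808 IP 52.81.15.146 > 10.0.0.29: ICMP echo request, id 31537, seq 1, length 64
07:05:56.751839 IP 10.0.0.29 > 52.81.15.146: ICMP echo reply, id 31537, seq 1, length 64
07:06:01.368063 IP 52.81.15.146 > 10.0.2.37: ICMP echo request, id 31538, seq 1, length 64
EOF
# prints: 10.0.2.37 id 31538 seq 1
```

On a live system the same function could be fed from a saved capture, for example `tcpdump -n -r capture.pcap icmp | unanswered_pings`, where `capture.pcap` is a placeholder file name.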

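Background

Linux reverse path filtering

RHEL 7 enables rp_filter in Strict mode by default. In this mode, the kernel drops an incoming packet if the interface it arrived on is not the interface the kernel would use to route a reply back to the packet's source address.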

[root@ip-10-0-0-29 ~]# sysctl -a | grep -F ".rp_filter"
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 1
net.ipv4.conf.eth1.rp_filter = 1
net.ipv4.conf.lo.rp_filter = 0

kernel-doc/networking/ip-sysctl.txt

rp_filter - INTEGER
        0 - No source validation.
        1 - Strict mode as defined in RFC3704 Strict Reverse Path
            Each incoming packet is tested against the FIB and if the interface
            is not the best reverse path the packet check will fail.
            By default failed packets are discarded.
        2 - Loose mode as defined in RFC3704 Loose Reverse Path
            Each incoming packet's source address is also tested against the FIB
            and if the source address is not reachable via any interface
            the packet check will fail.

        Current recommended practice in RFC3704 is to enable strict mode
        to prevent IP spoofing from DDos attacks. If using asymmetric routing
        or other complicated routing, then loose mode is recommended.

        The max value from conf/{all,interface}/rp_filter is used
        when doing source validation on the {interface}.

        Default value is 0. Note that some distributions enable it
        in startup scripts.
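The "max value" rule quoted above is easy to overlook: lowering only the per-interface value has no effect while conf/all stays at 1. Here is a small sketch of that rule. The function names are invented for illustration, and the second function assumes the standard /proc/sys layout on Linux.

```shell
#!/bin/sh
# Effective rp_filter for an interface = max(conf/all, conf/<interface>),
# as stated in the kernel documentation quoted above.
effective_rp_filter() {
  all=$1
  iface=$2
  if [ "$all" -gt "$iface" ]; then echo "$all"; else echo "$iface"; fi
}

# Hypothetical convenience wrapper: read the live values for one device.
show_effective_rp_filter() {
  dev=$1
  base=/proc/sys/net/ipv4/conf
  effective_rp_filter "$(cat "$base/all/rp_filter")" "$(cat "$base/$dev/rp_filter")"
}

effective_rp_filter 1 0   # all=1, eth1=0 -> prints 1 (still strict)
effective_rp_filter 0 2   # all=0, eth1=2 -> prints 2 (loose)
```

On the instance itself, `show_effective_rp_filter eth1` would report the mode the kernel actually applies to eth1.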

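In this example, the ping to 52.81.2.166 arrives on eth1 (10.0.2.37). To answer it, the routing table tells the system to send the reply out of eth0 via 10.0.0.1. The inbound and outbound paths differ, so the system drops the packet.
In the successful case, the ping to 52.80.82.70 arrives on eth0 (10.0.0.29). According to the routing table the reply also leaves through eth0, so there is no problem.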

[root@ip-10-0-0-29 ~]# ip r
default via 10.0.0.1 dev eth0 proto dhcp metric 100
default via 10.0.2.1 dev eth1 proto dhcp metric 101
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.29 metric 100
10.0.2.0/24 dev eth1 proto kernel scope link src 10.0.2.37 metric 101
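With two default routes, the one with the lower metric wins, which is why every reply to an internet address leaves through eth0. The sketch below makes that selection explicit by parsing `ip r` output. The function name `preferred_default_dev` is hypothetical, and the parsing assumes the output format shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the device of the default route with the lowest metric.
# Reads `ip r` output on stdin.
preferred_default_dev() {
  awk '
    $1 == "default" {
      dev = ""; metric = 0                 # a route without "metric" counts as 0
      for (i = 2; i < NF; i++) {
        if ($i == "dev")    dev = $(i + 1)
        if ($i == "metric") metric = $(i + 1)
      }
      if (best == "" || metric + 0 < bestmetric) { best = dev; bestmetric = metric + 0 }
    }
    END { if (best != "") print best }
  '
}

# Demo on the routing table above.
preferred_default_dev <<'EOF'
default via 10.0.0.1 dev eth0 proto dhcp metric 100
default via 10.0.2.1 dev eth1 proto dhcp metric 101
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.29 metric 100
10.0.2.0/24 dev eth1 proto kernel scope link src 10.0.2.37 metric 101
EOF
# prints: eth0
```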

So would turning rp_filter off help?
Edit /etc/sysctl.conf as follows:

[root@ip-10-0-0-29 ~]# sysctl -p
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.eth0.rp_filter = 0
net.ipv4.conf.eth1.rp_filter = 0

After the change, ping again from the client while capturing on RHEL 7.

[root@ip-10-0-0-29 ~]# tcpdump -i any -n icmp
07:49:52.964673 IP 52.81.15.146 > 10.0.0.29: ICMP echo request, id 31691, seq 1, length 64
07:49:52.964701 IP 10.0.0.29 > 52.81.15.146: ICMP echo reply, id 31691, seq 1, length 64
07:49:53.966666 IP 52.81.15.146 > 10.0.0.29: ICMP echo request, id 31691, seq 2, length 64
07:49:53.966693 IP 10.0.0.29 > 52.81.15.146: ICMP echo reply, id 31691, seq 2, length 64

07:50:00.948659 IP 52.81.15.146 > 10.0.2.37: ICMP echo request, id 31692, seq 1, length 64
07:50:00.948693 IP 10.0.2.37 > 52.81.15.146: ICMP echo reply, id 31692, seq 1, length 64
07:50:01.973014 IP 52.81.15.146 > 10.0.2.37: ICMP echo request, id 31692, seq 2, length 64
07:50:01.973046 IP 10.0.2.37 > 52.81.15.146: ICMP echo reply, id 31692, seq 2, length 64

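The capture shows that RHEL 7 now replies to the ping requests arriving on both interfaces. From the client's point of view, however, ping 52.80.82.70 still succeeds and ping 52.81.2.166 still fails. A capture on the client shows that no reply at all arrives when pinging 52.81.2.166.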

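Elastic IP in AWS

One more thing to keep in mind is the EIP.
I do not know how AWS implements the Private IP <-> Public IP translation, but from observation, when a packet from EC2 to the internet passes through the IGW, its source IP address is rewritten from the Private IP to the Public IP while the protocol and port stay unchanged. When another network reaches EC2 through the EIP, the IGW likewise rewrites the destination IP address from the Public IP to the Private IP.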

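So even with rp_filter set to 0, once EC2 sends the reply out through eth0, its source IP address would presumably become 52.80.82.70 rather than 52.81.2.166. Because the client receives no reply at all, this cannot be verified.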
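The one-to-one translation described above can be written down as a toy lookup. This is only a model of the observed behaviour for the two address pairs in this article, not AWS's implementation, and the function names `igw_snat` and `igw_dnat` are invented. It says nothing about what happens to a reply that leaves through the "wrong" interface.

```shell
#!/bin/sh
# Toy model of the observed 1:1 address translation at the IGW.
# Outbound: private source address -> public address.
igw_snat() {
  case $1 in
    10.0.0.29) echo 52.80.82.70 ;;
    10.0.2.37) echo 52.81.2.166 ;;
    *) return 1 ;;                  # no EIP associated with this address
  esac
}

# Inbound: public destination address -> private address.
igw_dnat() {
  case $1 in
    52.80.82.70) echo 10.0.0.29 ;;
    52.81.2.166) echo 10.0.2.37 ;;
    *) return 1 ;;
  esac
}

igw_dnat 52.81.2.166   # prints 10.0.2.37, the address seen in the tcpdump capture
```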

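Solution

The proper fix is to make packets leave through the interface they arrived on. At the OS level, this can be done with policy routing.

1. Install NetworkManager-dispatcher-routing-rules (the rhel-server-optional repository must be enabled first). It makes the policy routes persist across reboots.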

[root@ip-10-0-0-29 ~]# yum install NetworkManager-dispatcher-routing-rules

2. Write the following configuration.

[root@ip-10-0-0-29 ~]# echo '500 eth1' >> /etc/iproute2/rt_tables
[root@ip-10-0-0-29 ~]# cat /etc/sysconfig/network-scripts/rule-eth1
from 10.0.2.0/24 table eth1
to 10.0.2.0/24 table eth1
[root@ip-10-0-0-29 ~]# cat /etc/sysconfig/network-scripts/route-eth1
default via 10.0.2.1 dev eth1 table eth1
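Before making the change persistent, it can help to see what these files amount to in terms of ip(8) commands. The sketch below only prints the equivalent commands (a dry run). The function name `policy_route_cmds` is hypothetical, and the mapping from the rule-/route- files to these commands is my reading of the configuration above.

```shell
#!/bin/sh
# Hypothetical helper: print the ip(8) commands that correspond to the
# rule-<dev> and route-<dev> files for one interface. Nothing is executed.
policy_route_cmds() {
  dev=$1      # interface, e.g. eth1
  cidr=$2     # subnet of that interface, e.g. 10.0.2.0/24
  gw=$3       # subnet gateway, e.g. 10.0.2.1
  table=$4    # routing table name or number, e.g. eth1 or 500
  echo "ip route add default via $gw dev $dev table $table"
  echo "ip rule add from $cidr table $table"
  echo "ip rule add to $cidr table $table"
}

policy_route_cmds eth1 10.0.2.0/24 10.0.2.1 eth1
```

Piping the output to `sh` as root would apply the rules to the running system only. They are lost on reboot, and the table name `eth1` resolves only after the rt_tables entry above has been added.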

3. Start NetworkManager-dispatcher and restart the network connection.

[root@ip-10-0-0-29 ~]# systemctl enable NetworkManager-dispatcher.service
[root@ip-10-0-0-29 ~]# systemctl restart NetworkManager-dispatcher.service
[root@ip-10-0-0-29 ~]# nmcli con reload
[root@ip-10-0-0-29 ~]# nmcli con down eth1; nmcli con up eth1

4. Check that the routes are in effect.

[root@ip-10-0-0-29 ~]# ip rule
0:      from all lookup local
32764:  from all to 10.0.2.0/24 lookup eth1
32765:  from 10.0.2.0/24 lookup eth1
32766:  from all lookup main
32767:  from all lookup default

[root@ip-10-0-0-29 ~]# ip route list table eth1
default via 10.0.2.1 dev eth1
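The same check can be scripted, for instance in a post-boot sanity test. This is a minimal sketch with an invented function name, `has_source_rule`, that looks for the source-based rule in `ip rule` output. It assumes the output format shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if `ip rule` output (on stdin) contains
# "from <cidr> lookup <table>".
has_source_rule() {
  cidr=$1
  table=$2
  awk -v cidr="$cidr" -v table="$table" '
    $2 == "from" && $3 == cidr && $4 == "lookup" && $5 == table { found = 1 }
    END { exit !found }           # exit status 0 if found, 1 otherwise
  '
}

# Demo on the output above.
if has_source_rule 10.0.2.0/24 eth1 <<'EOF'
0:      from all lookup local
32764:  from all to 10.0.2.0/24 lookup eth1
32765:  from 10.0.2.0/24 lookup eth1
32766:  from all lookup main
32767:  from all lookup default
EOF
then echo "rule present"; else echo "rule missing"; fi
```

On the instance the live check would be `ip rule | has_source_rule 10.0.2.0/24 eth1`.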

Ping from the client again. This time both EIPs respond.

[ec2-user@ip-172-31-30-14 ~]$ ping 52.80.82.70 -c1 > /dev/null && ping 52.81.2.166 -c1 > /dev/null; echo $?
0

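How Amazon Linux handles it

On Amazon Linux, the system creates the policy route automatically when a network interface is added, so adding an interface needs no further work.

When an interface is added, the following log entries appear in /var/log/messages.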

Nov 28 09:39:06 ip-10-0-0-220 systemd-udevd: Network interface NamePolicy= disabled on kernel command line, ignoring.
Nov 28 09:39:06 ip-10-0-0-220 systemd: Started Enable elastic network interfaces eth1.
Nov 28 09:39:06 ip-10-0-0-220 systemd: Starting Enable elastic network interfaces eth1...
Nov 28 09:39:06 ip-10-0-0-220 ec2net: [plug_interface] eth1 plugged
Nov 28 09:39:06 ip-10-0-0-220 ec2net: [rewrite_primary] Rewriting configs for eth1
Nov 28 09:39:06 ip-10-0-0-220 ec2net: [get_meta] Trying to get http://169.254.169.254/latest/meta-data/network/interfaces/macs/06:e3:59:61:01:34/subnet-ipv4-cidr-block

Nov 28 09:39:11 ip-10-0-0-220 ec2net: [activate_primary] Activating eth1
Nov 28 09:39:11 ip-10-0-0-220 dhclient[3515]: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 5 (xid=0x1fb19cc6)
Nov 28 09:39:11 ip-10-0-0-220 dhclient[3515]: DHCPREQUEST on eth1 to 255.255.255.255 port 67 (xid=0x1fb19cc6)
Nov 28 09:39:11 ip-10-0-0-220 dhclient[3515]: DHCPOFFER from 10.0.2.1
Nov 28 09:39:11 ip-10-0-0-220 dhclient[3515]: DHCPACK from 10.0.2.1 (xid=0x1fb19cc6)
Nov 28 09:39:11 ip-10-0-0-220 ec2net: [get_meta] Trying to get http://169.254.169.254/latest/meta-data/network/interfaces/macs/06:e3:59:61:01:34/local-ipv4s
Nov 28 09:39:11 ip-10-0-0-220 ec2net: [rewrite_rules] Rewriting rules for eth1
Nov 28 09:39:11 ip-10-0-0-220 dhclient[3515]: bound to 10.0.2.142 -- renewal in 1520 seconds.
Nov 28 09:39:11 ip-10-0-0-220 ec2ifup: Determining IP information for eth1... done.
Nov 28 09:39:11 ip-10-0-0-220 ec2net: [get_meta] Trying to get http://169.254.169.254/latest/meta-data/network/interfaces/macs/06:e3:59:61:01:34/local-ipv4s
Nov 28 09:39:11 ip-10-0-0-220 ec2net: [rewrite_aliases] Rewriting aliases of eth1
Nov 28 09:39:12 ip-10-0-0-220 ec2ifup: Determining IPv6 information for eth1... done.

The routing rule and routing table have been written automatically as well.

[root@ip-10-0-0-220 ~]# ip rule
0:      from all lookup local
32765:  from 10.0.2.142 lookup 10001
32766:  from all lookup main
32767:  from all lookup default
[root@ip-10-0-0-220 ~]# ip route list table 10001
default via 10.0.2.1 dev eth1
10.0.2.0/24 dev eth1 proto kernel scope link src 10.0.2.142

This functionality appears to be provided by ec2-net-utils.

[root@ip-10-0-0-220 ~]# rpm -ql ec2-net-utils-1.1-1.1.amzn2.noarch
/etc/dhcp/dhclient.d/ec2dhcp.sh
/etc/modprobe.d/ixgbevf.conf
/etc/sysconfig/network-scripts/ec2net-functions
/etc/sysconfig/network-scripts/ec2net.hotplug
/etc/udev/rules.d/53-ec2-network-interfaces.rules
/etc/udev/rules.d/75-persistent-net-generator.rules
/usr/lib/systemd/system/ec2net-ifup@.service
/usr/lib/systemd/system/ec2net-scan.service
/usr/lib/udev/rule_generator.functions
/usr/lib/udev/write_net_rules
/usr/sbin/ec2ifdown
/usr/sbin/ec2ifscan
/usr/sbin/ec2ifup
/usr/share/man/man8/ec2ifdown.8.gz
/usr/share/man/man8/ec2ifscan.8.gz
/usr/share/man/man8/ec2ifup.8.gz

/etc/sysconfig/network-scripts/ec2net-functions shows that rewrite_primary() writes the route file. rewrite_primary() is called when the interface is brought up.

rewrite_primary() {
### snip ###
  cat <<- EOF > ${route_file}
        default via ${gateway} dev ${INTERFACE} table ${RTABLE}
        default via ${gateway} dev ${INTERFACE} metric ${RTABLE}
        ${cidr} dev ${INTERFACE} proto kernel scope link src ${primary_ipv4} table ${RTABLE}
EOF
### snip ###
}
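The excerpt does not show where ${RTABLE} comes from. The observed table 10001 for eth1 is consistent with "10000 plus the interface number", so the sketch below illustrates that derivation. This is an assumption inferred from the output above, not a quote from ec2net-functions, and `rtable_for` is a made-up name.

```shell
#!/bin/sh
# Assumed derivation (not taken from the excerpt): table number = 10000 + N for ethN.
rtable_for() {
  n=${1#eth}            # strip the "eth" prefix: eth1 -> 1
  echo $((n + 10000))
}

rtable_for eth1   # prints 10001, matching "lookup 10001" in the ip rule output
```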

rewrite_rules() {
### snip ###
  logger --tag ec2net "[rewrite_rules] Rewriting rules for ${INTERFACE}"
  # Retrieve a list of IP rules for the route table that belongs
  # to this interface. Treat this as the stale list. For each IP
  # address obtained from metadata, cross the corresponding rule
  # off the stale list if present. Otherwise, add a rule sending
  # outbound traffic from that IP to the interface route table.
  # Then, remove all other rules found in the stale list.
  declare -A rules
  for rule in $(/sbin/ip -4 rule list \
                |grep "from .* lookup ${RTABLE}" \
                |awk '{print $1$3}'); do
    split=(${rule//:/ })
    rules[${split[1]}]=${split[0]}
  done
  for ip in ${ips[@]}; do
    if [[ ${rules[${ip}]} ]]; then
      unset rules[${ip}]
    else
      /sbin/ip -4 rule add from ${ip} lookup ${RTABLE}
    fi
  done
  for rule in "${!rules[@]}"; do
    /sbin/ip -4 rule delete pref "${rules[${rule}]}"
  done
### snip ###
}
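The first loop in rewrite_rules() is dense, so here is a standalone sketch of just its parsing step: how `awk '{print $1$3}'` turns an `ip rule` line into a "pref:ip" pair that is then split on the colon. The function name `parse_rules` is invented, and it reads saved `ip rule` output from stdin instead of calling /sbin/ip.

```shell
#!/bin/sh
# Hypothetical illustration of the parsing done at the top of rewrite_rules().
# Reads `ip rule` output on stdin; prints one "ip=<addr> pref=<priority>" per
# rule that points at the given table.
parse_rules() {
  rtable=$1
  grep "from .* lookup ${rtable}" \
    | awk '{print $1$3}' \
    | while IFS=: read -r pref ip; do   # "32765:10.0.2.142" -> pref, ip
        echo "ip=${ip} pref=${pref}"
      done
}

# Demo on the Amazon Linux output above.
parse_rules 10001 <<'EOF'
0:      from all lookup local
32765:  from 10.0.2.142 lookup 10001
32766:  from all lookup main
32767:  from all lookup default
EOF
# prints: ip=10.0.2.142 pref=32765
```

The original script stores these pairs in an associative array keyed by IP. Addresses still reported by the metadata service are crossed off, and whatever remains is deleted by its pref value.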
