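First of all, many thanks to the openGauss experts for responding to, diagnosing, and fixing this issue so quickly, including but not limited to (WeChat nicknames) @来杯拿铁 @没睡午觉 @不吃糖 @行尘 @Resolution.
In our production deployments, one site kept failing around the step where openGauss is expanded into a two-node cluster. After restoring the environment using the prerequisite steps, we re-ran the expansion operations manually one by one and finally traced the failure to the step that configures the openGauss virtual IP (VIP). This post records the troubleshooting process.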
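Our expansion procedure is:
1. Deploy two standalone openGauss instances.
2. Expand the cluster with the openGauss expansion command gs_expansion.
3. Install the CM tool.
4. Use CM to configure the VIP resource.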
The problem occurs in step 4.
The test environment is as follows:
[root@gauss1 expansion]# su - omm -c "gs_om -t status --detail"
[ CMServer State ]
node       node_ip        instance                                 state
--------------------------------------------------------------------------
1  gauss1  10.125.11.86   1  /usr/bin/gaussdb/cmserver/cm_server   Primary
2  gauss2  10.125.11.203  2  /usr/bin/gaussdb/cmserver/cm_server   Standby

[ Cluster State ]
cluster_state  : Normal
redistributing : No
balanced       : Yes
current_az     : AZ_ALL

[ Datanode State ]
node       node_ip        instance                        state
-------------------------------------------------------------------------
1  gauss1  10.125.11.86   6001 /usr/bin/gaussdb/data/dn   P Primary Normal
2  gauss2  10.125.11.203  6002 /usr/bin/gaussdb/data/dn   S Standby Normal
The network configuration of the primary host is:
[root@gauss1 expansion]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens7hhhhhhhhhhh: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 0c:da:41:1d:89:34 brd ff:ff:ff:ff:ff:ff
    inet 10.125.11.86/22 brd 10.125.11.255 scope global dynamic noprefixroute ens7hhhhhhhhhhh
       valid_lft 19602sec preferred_lft 19602sec
    inet6 fe80::eda:41ff:fe1d:8934/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
The commands originally executed in step 4 were:
su - omm -c "cm_ctl res --del --res_name=\"VIP_AZ1\""
su - omm -c "cm_ctl res --add --res_name=\"VIP_AZ1\" --res_attr=\"resources_type=VIP,float_ip=10.125.11.87\""
su - omm -c "cm_ctl res --edit --res_name=\"VIP_AZ1\" --add_inst=\"node_id=1,res_instance_id=6001\" --inst_attr=\"base_ip=10.125.11.86\""
su - omm -c "cm_ctl res --edit --res_name=\"VIP_AZ1\" --add_inst=\"node_id=2,res_instance_id=6002\" --inst_attr=\"base_ip=10.125.11.203\""
scp /usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json root@10.125.11.203:/usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json
ssh root@10.125.11.203 "chown omm:dbgrp /usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json"
su - omm -c "cm_ctl stop && cm_ctl start -t 60"
su - omm -c "cm_ctl show"
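However, every time we configured the VIP and restarted the database, the service IP on both the primary and standby nodes disappeared and we could no longer log in. Fortunately the hosts have multiple NICs, so we logged in through a backup IP and found that the openGauss VIP had taken the place of the original service IP.
The network configuration of the test environment had changed to: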
[root@gauss1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens7hhhhhhhhhhh: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 0c:da:41:1d:89:34 brd ff:ff:ff:ff:ff:ff
    inet 10.125.11.87/22 brd 10.125.11.255 scope global dynamic noprefixroute ens7hhhhhhhhhhh
       valid_lft 21571sec preferred_lft 21571sec
    inet6 fe80::eda:41ff:fe1d:8934/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
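The before/after comparison can be turned into a quick check that the service IP is still present after the VIP is configured. The following is a minimal sketch, not part of openGauss or CM; the helper name has_ipv4 is hypothetical, and it assumes the one-line output format of `ip -o -4 addr show`, where the fourth field is the address with its prefix length.

```shell
#!/bin/sh
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and
# succeeds (exit 0) only if the IPv4 address given as $1 is configured.
has_ipv4() {
    awk -v want="$1" '
        { split($4, a, "/"); if (a[1] == want) found = 1 }
        END { exit !found }
    '
}

# Service (base) IP of the primary node in this article.
base_ip="10.125.11.86"

# Only run the live check where the ip command is available.
if command -v ip >/dev/null 2>&1; then
    if ip -o -4 addr show | has_ipv4 "$base_ip"; then
        echo "base IP $base_ip is still configured"
    else
        echo "base IP $base_ip is missing" >&2
    fi
fi
```

Running this before and after the VIP step on each node would have flagged the overwritten address immediately, without having to log in through the backup NIC first.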
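Since this step is a CM function, we checked the cm_agent log and found the following openGauss log entries (lines starting with # are our annotations):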
2024-11-21 02:26:53.631 tid=119525 LOG: [ClearDnIpConn] instId(6001) g_dnConn[0] is null.
# Validate the VIP argument
2024-11-21 02:26:54.706 tid=119525 LOG: [[ClearDnIpConn]: 6001] [ClearConn] sqlCommands[select gs_validate_ext_listen_ip('normal', setting::cstring, '10.125.11.87') from pg_settings where name = 'pgxc_node_name' limit 1;] listen_ip validate ok.
# Check whether the cm_server and cm_agent processes are healthy
2024-11-21 02:26:55.432 tid=119531 LOG: cm_agent connect to cm_server standy successfully.
2024-11-21 02:26:55.432 tid=119531 LOG: cm_agent connect to cm_server standy successfully
2024-11-21 02:26:55.432 tid=119531 ERROR: connect to cm server failed! The 1st of cm server node id is = 2
2024-11-21 02:26:55.633 tid=119531 LOG: (client) MSG_CM_SSL_CONN_ACK receive ssl require msg 1
2024-11-21 02:26:55.633 tid=119531 LOG: begin to create ssl connection
2024-11-21 02:26:55.652 tid=119531 LOG: create ssl connection success.
2024-11-21 02:26:55.652 tid=119531 LOG: cm_agent connect to cm_server primary successfully: host=10.125.11.86 port=15000 localhost=10.125.11.86 connect_timeout=3 node_id=1 node_name=gauss1 remote_type=7
2024-11-21 02:26:55.652 tid=119531 LOG: [ConnCmsPMain] agent connect to server takes 3172042.
2024-11-21 02:26:55.681 tid=119532 LOG: binary_upgrade: usr/bin/gaussdb/tmp is not exist!
2024-11-21 02:26:56.481 tid=119535 ProcessCmsMsg LOG: notify msg from cm_server, data_dir :/usr/bin/gaussdb/data/dn nodetype is 2, role is 2.
2024-11-21 02:26:56.483 tid=119535 ProcessCmsMsg LOG: exec notify command:gs_ctl notify -M standby -D usr/bin/gaussdb/data/dn -w -t 1 >> "/var/log/vdi/gaussdb/omm/cm/cm_agent/system_call-current.log" 2>&1 &
2024-11-21 02:26:56.483 tid=119535 ProcessCmsMsg LOG: [ProcessRecvCmsMsgMain] lock=0, wait=4000, pop=0, unlock=0, process=2, free=0, msgType=15.
2024-11-21 02:26:57.090 tid=119535 ProcessCmsMsg LOG: instId(0: 6001) successfully connect to datanode: /usr/bin/gaussdb/data/dn.
2024-11-21 02:26:57.095 tid=119535 ProcessCmsMsg LOG: instId(6001) process_lock_no_primary_command(select * from pg_catalog.disable_conn('prohibit_connection', '', 0);) succeed!
# Start setting the VIP
2024-11-21 02:27:00.087 tid=119535 ProcessCmsMsg LOG: failover msg from cm_server, data_dir :/usr/bin/gaussdb/data/dn nodetype is 2
2024-11-21 02:27:00.087 tid=119535 ProcessCmsMsg LOG: [process_failover_command] set floatIp oper=1.
# Look up the NIC that carries the database IP
2024-11-21 02:27:00.483 tid=119524 CheckNetWork LOG: ip is 10.125.11.86, family is 2, netName is ens7hhhhhhhhhhh, netmask is 255.255.252.0.
2024-11-21 02:27:00.483 tid=119524 CheckNetWork LOG: [DoUpNetworkOper] Ip: 10.125.11.87 oper=[1: NETWORK_OPER_UP], state=[2: NETWORK_STATE_DOWN], GetNicCmd(timeout -s SIGKILL 2s sudo /usr/sbin/ifconfig ens7hhhhhhhhhhh:15400 10.125.11.87 netmask 255.255.252.0 up).
# Use ifconfig to put the VIP on that NIC; ens7hhhhhhhhhhh:15400 is the network alias openGauss uses for the VIP
2024-11-21 02:27:00.513 tid=119524 CheckNetWork LOG: [DoUpNetworkOper] successfully to execute the cmd(timeout -s SIGKILL 2s sudo /usr/sbin/ifconfig ens7hhhhhhhhhhh:15400 10.125.11.87 netmask 255.255.252.0 up).
2024-11-21 02:27:00.514 tid=119524 CheckNetWork LOG: [CheckArpingCmdRes] it will notify switch, and cmd is arping -w 1 -A -I ens7hhhhhhhhhhh 10.125.11.87.
2024-11-21 02:27:01.540 tid=119524 CheckNetWork LOG: [CheckArpingCmdRes] success to execute the cmd(arping -w 1 -A -I ens7hhhhhhhhhhh 10.125.11.87).
# The VIP is set successfully, but the database can no longer be checked through the original service IP
2024-11-21 02:27:02.485 tid=119527 ERROR: get connect failed for dn(/usr/bin/gaussdb/data/dn/postmaster.pid) phony dead check, errmsg is could not connect to server: Network is unreachable
Is the server running on host "10.125.11.86" and accepting
TCP/IP connections on port 15401?
2024-11-21 02:27:02.485 tid=119527 LOG: has found 1 times for instance(dn_6001) phony dead check.
2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
2024-11-21 02:27:02.732 tid=119529 StartAndStop WARNING: nic related with cmserver not up.
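When scanning a long cm_agent log, it can help to pull out only the network commands that CM actually ran. The sketch below is one possible way to do that with sed; the helper name extract_nic_cmds is hypothetical, and it assumes the commands are logged in the `GetNicCmd(...).` form seen in the log above.

```shell
#!/bin/sh
# Hypothetical helper: print every command that cm_agent logged inside
# GetNicCmd(...). It reads log text on stdin, so any log path can be used.
extract_nic_cmds() {
    sed -n 's/.*GetNicCmd(\(.*\))\.$/\1/p'
}

# Example using the log line from this article.
sample='2024-11-21 02:27:00.483 tid=119524 CheckNetWork LOG: [DoUpNetworkOper] Ip: 10.125.11.87 oper=[1: NETWORK_OPER_UP], state=[2: NETWORK_STATE_DOWN], GetNicCmd(timeout -s SIGKILL 2s sudo /usr/sbin/ifconfig ens7hhhhhhhhhhh:15400 10.125.11.87 netmask 255.255.252.0 up).'
printf '%s\n' "$sample" | extract_nic_cmds
```

Against a real log the call would be `extract_nic_cmds < <cm_agent log file>`, where the log file path depends on the installation.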
The logs make it clear that the root cause is this command:
timeout -s SIGKILL 2s sudo /usr/sbin/ifconfig ens7hhhhhhhhhhh:15400 10.125.11.87 netmask 255.255.252.0 up
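Repeated tests confirmed that running this command is what overwrites the IP. Comparing the network information before and after shows that the command tries to create a network alias named ens7hhhhhhhhhhh:15400 on interface ens7hhhhhhhhhhh and assign the VIP to it.
After the command runs, however, the alias name ends up as ens7hhhhhhhhhhh. Looking into this further, we found the following explanation online:
On Linux, the length of a network interface name is limited. The limit comes from the implementation of the kernel and the network stack: interface names are restricted to at most 15 characters, because the kernel's interface name buffer (IFNAMSIZ) is defined as 16 bytes, one of which holds the string terminator \0.
Our network device name, ens7hhhhhhhhhhh, already uses all 15 characters. When the alias ens7hhhhhhhhhhh:15400 is set, the name is truncated to its first 15 characters, ens7hhhhhhhhhhh, which collides with the entry that holds the original service IP and overwrites it.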
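The truncation can be illustrated without touching any real interface. The following sketch only simulates the behavior described above with string operations; the helper names fit_ifname and alias_collides are hypothetical, and the value 16 is the kernel's IFNAMSIZ.

```shell
#!/bin/sh
# IFNAMSIZ is 16 in the Linux kernel: 15 visible characters plus the NUL byte.
IFNAMSIZ=16

# Hypothetical helper: print the name as it would fit into an IFNAMSIZ buffer,
# that is, cut to at most IFNAMSIZ - 1 characters.
fit_ifname() {
    printf "%.$((IFNAMSIZ - 1))s" "$1"
}

# Hypothetical helper: succeed (exit 0) if the alias "<nic>:<suffix>" is
# truncated back to the bare NIC name and therefore collides with it.
alias_collides() {
    [ "$(fit_ifname "$1:$2")" = "$1" ]
}

nic="ens7hhhhhhhhhhh"
if alias_collides "$nic" 15400; then
    echo "alias ${nic}:15400 is cut to $(fit_ifname "${nic}:15400") and collides with the base interface"
fi
```

A check like this could serve as a pre-flight test before configuring a VIP through ifconfig: any NIC whose name is exactly 15 characters long triggers the collision, while a shorter name such as eth0 keeps at least part of its alias suffix.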
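Digging further:
ifconfig is a command from the net-tools package, while ip comes from the iproute2 package. net-tools has been largely unmaintained by the Linux community since around 2001 and is considered deprecated in favor of iproute2; some newer Linux distributions no longer ship net-tools at all.
Our tests show that the ip addr command is not affected by this limitation, because its syntax is different: the address is added to the device directly and no alias interface name is required.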
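For comparison, the sketch below builds an `ip addr add` command equivalent to the failing ifconfig call. It is only an illustration of the syntax difference, not the command that the patched CM actually runs; the helper names mask_to_prefix and vip_add_cmd are hypothetical. Since `ip addr add` takes a prefix length rather than a dotted netmask, the netmask is converted first.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted netmask to a prefix length.
# It validates each octet but does not check that the mask is contiguous.
mask_to_prefix() {
    prefix=0
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    for octet in "$@"; do
        case $octet in
            255) prefix=$((prefix + 8)) ;;
            254) prefix=$((prefix + 7)) ;;
            252) prefix=$((prefix + 6)) ;;
            248) prefix=$((prefix + 5)) ;;
            240) prefix=$((prefix + 4)) ;;
            224) prefix=$((prefix + 3)) ;;
            192) prefix=$((prefix + 2)) ;;
            128) prefix=$((prefix + 1)) ;;
            0) ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$prefix"
}

# Hypothetical helper: print (not execute) the ip command for <nic> <ip> <netmask>.
vip_add_cmd() {
    plen=$(mask_to_prefix "$3") || return 1
    echo "ip addr add $2/$plen dev $1"
}

vip_add_cmd ens7hhhhhhhhhhh 10.125.11.87 255.255.252.0
```

The printed command adds the VIP as an additional address on the device itself, so no alias name of the form nic:port has to fit into the 15-character limit. Executing it requires root privileges.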
After we reported the problem, the openGauss team responded right away and provided a temporary patch for the version we use. The fix:
https://gitee.com/opengauss/CM/commit/ce7440bc5c2f374364e08c9cc82b1df639e978d5
After the update, the command type used to set the VIP is configurable, and the VIP setup commands become:
su - omm -c "cm_ctl res --del --res_name=\"VIP_AZ1\""
su - omm -c "cm_ctl res --add --res_name=\"VIP_AZ1\" --res_attr=\"resources_type=VIP,float_ip=10.125.11.87,cmd=ip,netmask=255.255.252.0\""
su - omm -c "cm_ctl res --edit --res_name=\"VIP_AZ1\" --add_inst=\"node_id=1,res_instance_id=6001\" --inst_attr=\"base_ip=10.125.11.86\""
su - omm -c "cm_ctl res --edit --res_name=\"VIP_AZ1\" --add_inst=\"node_id=2,res_instance_id=6002\" --inst_attr=\"base_ip=10.125.11.203\""
scp /usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json root@10.125.11.203:/usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json
ssh root@10.125.11.203 "chown omm:dbgrp /usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json"
su - omm -c "cm_ctl stop && cm_ctl start -t 60"
su - omm -c "cm_ctl show"
With this change the VIP can be set correctly and the problem is solved.
Thanks again to everyone for the strong support!