
openGauss: Troubleshooting a Problem Triggered by Virtual IP Configuration

openGauss 2024-12-28
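    First of all, many thanks to the openGauss experts who responded promptly, located the problem, and fixed it, including but not limited to (WeChat nicknames):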
    @来杯拿铁 @没睡午觉 @不吃糖 @行尘 @Resolution

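    In a real deployment, one of our sites kept failing around the point where openGauss is expanded from a single node to a two-node (primary/standby) cluster. After restoring the environment with the earlier steps, we ran the expansion operations manually one at a time and eventually traced the failure to the step that sets the openGauss virtual IP (VIP). This article records the troubleshooting process.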

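    Our expansion procedure is:

      1. Deploy two standalone openGauss instances
      2. Expand the cluster with the openGauss expansion command gs_expansion
      3. Install the CM tool
      4. Use CM to configure the VIP resource

      The problem occurred in step 4.
      The test environment was as follows: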

        [root@gauss1 expansion]# su - omm -c "gs_om -t status --detail"
        [ CMServer State ]


        node       node_ip         instance                                 state
        --------------------------------------------------------------------------
        1  gauss1  10.125.11.86    1    /usr/bin/gaussdb/cmserver/cm_server Primary
        2  gauss2  10.125.11.203   2    /usr/bin/gaussdb/cmserver/cm_server Standby


        [ Cluster State ]


        cluster_state : Normal
        redistributing : No
        balanced : Yes
        current_az : AZ_ALL


        [ Datanode State ]


        node       node_ip         instance                      state
        -------------------------------------------------------------------------
        1  gauss1  10.125.11.86    6001 /usr/bin/gaussdb/data/dn P Primary Normal
        2  gauss2  10.125.11.203   6002 /usr/bin/gaussdb/data/dn S Standby Normal

        The host's network information was:

          [root@gauss1 expansion]# ip a
          1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
              link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
              inet 127.0.0.1/8 scope host lo
                 valid_lft forever preferred_lft forever
              inet6 ::1/128 scope host
                 valid_lft forever preferred_lft forever
          2: ens7hhhhhhhhhhh: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
              link/ether 0c:da:41:1d:89:34 brd ff:ff:ff:ff:ff:ff
              inet 10.125.11.86/22 brd 10.125.11.255 scope global dynamic noprefixroute ens7hhhhhhhhhhh
                 valid_lft 19602sec preferred_lft 19602sec
              inet6 fe80::eda:41ff:fe1d:8934/64 scope link noprefixroute
                 valid_lft forever preferred_lft forever

          The original commands for step 4 were:

            su - omm -c "cm_ctl res --del --res_name=\"VIP_AZ1\"" 
            su - omm -c "cm_ctl res --add --res_name=\"VIP_AZ1\" --res_attr=\"resources_type=VIP,float_ip=10.125.11.87\""
            su - omm -c "cm_ctl res --edit --res_name=\"VIP_AZ1\" --add_inst=\"node_id=1,res_instance_id=6001\" --inst_attr=\"base_ip=10.125.11.86\""
            su - omm -c "cm_ctl res --edit --res_name=\"VIP_AZ1\" --add_inst=\"node_id=2,res_instance_id=6002\" --inst_attr=\"base_ip=10.125.11.203\""


            scp /usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json root@10.125.11.203:/usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json
            ssh root@10.125.11.203 "chown omm:dbgrp /usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json"
            su - omm -c "cm_ctl stop&&cm_ctl start -t 60"
            su - omm -c "cm_ctl show"

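            However, every time we restarted the database after configuring the VIP, the service IPs of our primary and standby nodes disappeared and we could no longer log in. Fortunately the hosts have multiple NICs, so we logged in through a backup IP and found that the openGauss VIP had taken the place of the original service IP.
            The network information in the test environment had changed to: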

              [root@gauss1 ~]# ip a
              1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
                  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
                  inet 127.0.0.1/8 scope host lo
                     valid_lft forever preferred_lft forever
                  inet6 ::1/128 scope host
                     valid_lft forever preferred_lft forever
              2: ens7hhhhhhhhhhh: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
                  link/ether 0c:da:41:1d:89:34 brd ff:ff:ff:ff:ff:ff
                  inet 10.125.11.87/22 brd 10.125.11.255 scope global dynamic noprefixroute ens7hhhhhhhhhhh
                     valid_lft 21571sec preferred_lft 21571sec
                  inet6 fe80::eda:41ff:fe1d:8934/64 scope link noprefixroute
                     valid_lft forever preferred_lft forever
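
A check along the following lines could catch this symptom right after the VIP is configured. It is a minimal sketch and not part of openGauss or CM: the function names has_ipv4 and check_vip_state are hypothetical, and it assumes the text on standard input looks like the output of ip a or ip -o -4 addr show.

```shell
# Hypothetical helpers (not part of CM): verify that the base (service) IP
# is still present next to the floating IP after the VIP has been configured.

# has_ipv4 <ip>: succeeds if stdin contains an "inet <ip>/<prefix>" entry.
has_ipv4() {
    # -F matches the string literally, so the dots need no escaping.
    grep -Fq "inet $1/"
}

# check_vip_state <base_ip> <float_ip>: reads address text on stdin.
# Returns 1 if the base IP is gone, 2 if the floating IP is absent, 0 otherwise.
check_vip_state() {
    local base_ip="$1" float_ip="$2" addrs
    addrs=$(cat)
    if ! printf '%s\n' "$addrs" | has_ipv4 "$base_ip"; then
        echo "base IP $base_ip is missing"
        return 1
    fi
    if ! printf '%s\n' "$addrs" | has_ipv4 "$float_ip"; then
        echo "float IP $float_ip is not configured"
        return 2
    fi
    echo "base IP and float IP are both present"
}

# Example with text shaped like the broken state shown above.
broken='    inet 127.0.0.1/8 scope host lo
    inet 10.125.11.87/22 brd 10.125.11.255 scope global dynamic noprefixroute ens7hhhhhhhhhhh'
printf '%s\n' "$broken" | check_vip_state 10.125.11.86 10.125.11.87 || true
```

On a live host the same function could be fed from `ip -o -4 addr show`, for example through a second NIC or a console session, since the service IP may be unreachable at that point.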

              Since this step is performed by CM, we turned to the cm_agent log and found the following openGauss log entries:

                2024-11-21 02:26:53.631 tid=119525  LOG: [ClearDnIpConn] instId(6001) g_dnConn[0] is null.
                # Validate that the VIP argument is legal
                2024-11-21 02:26:54.706 tid=119525 LOG: [[ClearDnIpConn]: 6001] [ClearConn] sqlCommands[select gs_validate_ext_listen_ip('normal', setting::cstring, '10.125.11.87') from pg_settings where name = 'pgxc_node_name' limit 1;] listen_ip validate ok.
                # Check that the cm_server and cm_agent processes are working normally
                2024-11-21 02:26:55.432 tid=119531 LOG: cm_agent connect to cm_server standy successfully.
                2024-11-21 02:26:55.432 tid=119531 LOG: cm_agent connect to cm_server standy successfully
                2024-11-21 02:26:55.432 tid=119531 ERROR: connect to cm server failed! The 1st of cm server node id is = 2
                2024-11-21 02:26:55.633 tid=119531 LOG: (client) MSG_CM_SSL_CONN_ACK receive ssl require msg 1
                2024-11-21 02:26:55.633 tid=119531 LOG: begin to create ssl connection
                2024-11-21 02:26:55.652 tid=119531 LOG: create ssl connection success.
                2024-11-21 02:26:55.652 tid=119531 LOG: cm_agent connect to cm_server primary successfully: host=10.125.11.86 port=15000 localhost=10.125.11.86 connect_timeout=3 node_id=1 node_name=gauss1 remote_type=7
                2024-11-21 02:26:55.652 tid=119531 LOG: [ConnCmsPMain] agent connect to server takes 3172042.
                2024-11-21 02:26:55.681 tid=119532 LOG: binary_upgrade: /usr/bin/gaussdb/tmp is not exist!
                2024-11-21 02:26:56.481 tid=119535 ProcessCmsMsg LOG: notify msg from cm_server, data_dir :/usr/bin/gaussdb/data/dn nodetype is 2, role is 2.
                2024-11-21 02:26:56.483 tid=119535 ProcessCmsMsg LOG: exec notify command:gs_ctl notify -M standby -D /usr/bin/gaussdb/data/dn -w -t 1 >> "/var/log/vdi/gaussdb/omm/cm/cm_agent/system_call-current.log" 2>&1 &
                2024-11-21 02:26:56.483 tid=119535 ProcessCmsMsg LOG: [ProcessRecvCmsMsgMain] lock=0, wait=4000, pop=0, unlock=0, process=2, free=0, msgType=15.
                2024-11-21 02:26:57.090 tid=119535 ProcessCmsMsg LOG: instId(0: 6001) successfully connect to datanode: /usr/bin/gaussdb/data/dn.
                2024-11-21 02:26:57.095 tid=119535 ProcessCmsMsg LOG: instId(6001) process_lock_no_primary_command(select * from pg_catalog.disable_conn('prohibit_connection', '', 0);) succeed!


                # Start setting the VIP
                2024-11-21 02:27:00.087 tid=119535 ProcessCmsMsg LOG: failover msg from cm_server, data_dir :/usr/bin/gaussdb/data/dn nodetype is 2
                2024-11-21 02:27:00.087 tid=119535 ProcessCmsMsg LOG: [process_failover_command] set floatIp oper=1.


                # Get information about the NIC that holds the database IP
                2024-11-21 02:27:00.483 tid=119524 CheckNetWork LOG: ip is 10.125.11.86, family is 2, netName is ens7hhhhhhhhhhh, netmask is 255.255.252.0.
                2024-11-21 02:27:00.483 tid=119524 CheckNetWork LOG: [DoUpNetworkOper] Ip: 10.125.11.87 oper=[1: NETWORK_OPER_UP], state=[2: NETWORK_STATE_DOWN], GetNicCmd(timeout -s SIGKILL 2s sudo /usr/sbin/ifconfig ens7hhhhhhhhhhh:15400 10.125.11.87 netmask 255.255.252.0 up).


                # Use ifconfig to set the VIP on that NIC; ens7hhhhhhhhhhh:15400 is the network alias openGauss uses for the VIP
                2024-11-21 02:27:00.513 tid=119524 CheckNetWork LOG: [DoUpNetworkOper] successfully to execute the cmd(timeout -s SIGKILL 2s sudo /usr/sbin/ifconfig ens7hhhhhhhhhhh:15400 10.125.11.87 netmask 255.255.252.0 up).
                2024-11-21 02:27:00.514 tid=119524 CheckNetWork LOG: [CheckArpingCmdRes] it will notify switch, and cmd is arping -w 1 -A -I ens7hhhhhhhhhhh 10.125.11.87.
                2024-11-21 02:27:01.540 tid=119524 CheckNetWork LOG: [CheckArpingCmdRes] success to execute the cmd(arping -w 1 -A -I ens7hhhhhhhhhhh 10.125.11.87).


                # The VIP is set successfully, but the database can no longer be checked through the original service IP
                2024-11-21 02:27:02.485 tid=119527 ERROR: get connect failed for dn(/usr/bin/gaussdb/data/dn/postmaster.pid) phony dead check, errmsg is could not connect to server: Network is unreachable
                Is the server running on host "10.125.11.86" and accepting
                TCP/IP connections on port 15401?


                2024-11-21 02:27:02.485 tid=119527 LOG: has found 1 times for instance(dn_6001) phony dead check.
                2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
                2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
                2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
                2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
                2024-11-21 02:27:02.541 tid=119524 CheckNetWork WARNING: can't find nic related with 10.125.11.86, cnt=[1: 1].
                2024-11-21 02:27:02.732 tid=119529 StartAndStop WARNING: nic related with cmserver not up.
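
To narrow a long cm_agent log down to the lines that matter here, a filter like the one below may help. This is only a sketch: the helper names filter_vip_ops and extract_nic_cmds are hypothetical, and the patterns are taken from the log excerpts above, so they may need adjusting for other CM versions.

```shell
# Hypothetical helpers for reading a cm_agent log (stdin -> stdout).

# filter_vip_ops: keep only lines about VIP and NIC operations.
filter_vip_ops() {
    grep -E '\[DoUpNetworkOper\]|\[CheckArpingCmdRes\]|find nic related with|set floatIp'
}

# extract_nic_cmds: print the commands that cm_agent reports as executed,
# i.e. the text between "to execute the cmd(" and the closing ")".
extract_nic_cmds() {
    sed -n 's/.*to execute the cmd(\(.*\))\.*[[:space:]]*$/\1/p'
}

# Example using two lines from the excerpt above.
filter_vip_ops <<'EOF' | extract_nic_cmds
2024-11-21 02:27:00.513 tid=119524 CheckNetWork LOG: [DoUpNetworkOper] successfully to execute the cmd(timeout -s SIGKILL 2s sudo /usr/sbin/ifconfig ens7hhhhhhhhhhh:15400 10.125.11.87 netmask 255.255.252.0 up).
2024-11-21 02:26:55.652 tid=119531 LOG: create ssl connection success.
EOF
```

Against a real log file the usage would be `filter_vip_ops < cm_agent.log`, where the file name is a placeholder for the actual cm_agent log path.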

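                Judging from the logs, the root cause is clearly the following command:

                  timeout -s SIGKILL 2s sudo /usr/sbin/ifconfig ens7hhhhhhhhhhh:15400 10.125.11.87 netmask 255.255.252.0 up

                  Repeated tests confirmed that the IP is overwritten as soon as this command runs. Comparing the network information before and after shows that the command tries to add an IP on interface ens7hhhhhhhhhhh under the alias ens7hhhhhhhhhhh:15400,
                  but after it runs the address ends up on ens7hhhhhhhhhhh itself. Searching online for this behavior turned up the following explanation: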

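                    In Linux, the length of a network interface name is indeed limited. The limit comes from the implementation of the kernel and the network stack.
                    Specifically, an interface name is limited to 15 characters, because the kernel's interface-name buffer (IFNAMSIZ) is defined as 16 bytes, one of which is reserved for the string terminator \0.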

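                    Our network device name, ens7hhhhhhhhhhh, already uses all 15 characters. When the alias ens7hhhhhhhhhhh:15400 is set, the name is truncated to its first 15 characters, ens7hhhhhhhhhhh, which is the device that holds the original service IP, so that address is overwritten.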
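
The truncation can be modeled in a few lines of shell. This is an illustration of the 15-character limit described above, not a reproduction of what ifconfig or the kernel does internally; the helper names truncate_ifname and alias_fits are hypothetical.

```shell
# Illustrative model of a 16-byte interface-name buffer
# (15 visible characters plus the terminating NUL).
IFNAMSIZ=16

# truncate_ifname <name>: print the name as it would look after being cut
# to IFNAMSIZ-1 characters.
truncate_ifname() {
    printf '%s\n' "$1" | cut -c "1-$((IFNAMSIZ - 1))"
}

# alias_fits <nic> <suffix>: succeeds only if "<nic>:<suffix>" fits in the
# buffer without truncation.
alias_fits() {
    local full="$1:$2"
    [ "${#full}" -lt "$IFNAMSIZ" ]
}

nic="ens7hhhhhhhhhhh"
truncate_ifname "$nic:15400"
if alias_fits "$nic" 15400; then
    echo "alias fits"
else
    echo "alias would be truncated"
fi
```

A pre-deployment script could call alias_fits for each NIC that will carry a VIP and refuse to proceed with the ifconfig-based setup when the alias does not fit.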
                    Further investigation showed:

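                      ifconfig is a command from the net-tools package, while ip is a command from the iproute2 package.
                      According to commonly cited accounts, the Linux community largely stopped maintaining net-tools around 2001, and newer releases of some Linux distributions have dropped net-tools entirely.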

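                      Our tests showed that configuring the address with the ip addr command does not hit this limit, because its syntax is different: the address is added to the existing device rather than to a separately named alias interface.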
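
For comparison, an iproute2-style command can be built from the same float IP and netmask. The article does not show the exact command CM runs after the patch, so the command string below is an assumption about a typical iproute2 equivalent, and the helper names mask_to_prefix and build_vip_add_cmd are hypothetical. The sketch only prints the command and does not execute it.

```shell
# mask_to_prefix <netmask>: convert a dotted netmask such as 255.255.252.0
# to a prefix length. It rejects octets that cannot appear in a netmask, but
# does not check the octet count or that the mask bits are contiguous.
mask_to_prefix() {
    local bits=0 octet
    for octet in $(printf '%s' "$1" | tr '.' ' '); do
        case "$octet" in
            255) bits=$((bits + 8)) ;;
            254) bits=$((bits + 7)) ;;
            252) bits=$((bits + 6)) ;;
            248) bits=$((bits + 5)) ;;
            240) bits=$((bits + 4)) ;;
            224) bits=$((bits + 3)) ;;
            192) bits=$((bits + 2)) ;;
            128) bits=$((bits + 1)) ;;
            0) ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$bits"
}

# build_vip_add_cmd <float_ip> <netmask> <nic>: print an "ip addr add"
# command that attaches the float IP directly to the NIC, with no alias name.
build_vip_add_cmd() {
    local prefix
    prefix=$(mask_to_prefix "$2") || return 1
    echo "ip addr add $1/$prefix dev $3"
}

build_vip_add_cmd 10.125.11.87 255.255.252.0 ens7hhhhhhhhhhh
```

Because the address is attached to the device by name, no `device:alias` string has to fit into the 15-character limit.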
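                      After we reported the problem, the openGauss team responded right away and provided a temporary patch for the version we use:
                      https://gitee.com/opengauss/CM/commit/ce7440bc5c2f374364e08c9cc82b1df639e978d5
                      With the patch, the command type used to set the VIP is configurable, and the VIP setup commands become: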

                        su - omm -c "cm_ctl res --del --res_name=\"VIP_AZ1\""
                        su - omm -c "cm_ctl res --add --res_name=\"VIP_AZ1\" --res_attr=\"resources_type=VIP,float_ip=10.125.11.87,cmd=ip,netmask=255.255.252.0\""
                        su - omm -c "cm_ctl res --edit --res_name=\"VIP_AZ1\" --add_inst=\"node_id=1,res_instance_id=6001\" --inst_attr=\"base_ip=10.125.11.86\""
                        su - omm -c "cm_ctl res --edit --res_name=\"VIP_AZ1\" --add_inst=\"node_id=2,res_instance_id=6002\" --inst_attr=\"base_ip=10.125.11.203\""


                        scp /usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json root@10.125.11.203:/usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json
                        ssh root@10.125.11.203 "chown omm:dbgrp /usr/bin/gaussdb/cmserver/cm_agent/cm_resource.json"
                        su - omm -c "cm_ctl stop&&cm_ctl start -t 60"
                        su - omm -c "cm_ctl show"

                        With this change the VIP was configured correctly and the problem was solved.
                        Thanks again to everyone for the strong support!



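                        This article is reposted from openGauss. If you believe it infringes your rights, please email contact@modb.pro to report it and provide relevant evidence; once verified, 墨天轮 (modb.pro) will remove the content immediately.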
