暂无图片
暂无图片
6
暂无图片
暂无图片
暂无图片

达梦数据库守护集群自动切换故障排查,原因不可原谅!

原创 陈举超 2024-06-15
2069

说明:

不是达梦数据库的问题,是XXX人员没按规范操作导致的,本次案例主要介绍达梦数据库守护集群发生自动切换后如何排查切换原因,从本次案例可以看到,达梦守护集群的故障自动发现、故障自动切换功能还是比较完善的

问题现象:

一套非生产达梦数据库,守护集群架构,无监控,某天检查参数时突然发现原主库变成备库,原备库变成了主库,也就是发生了主、备切换。此类问题可大可小,如果是因为数据库BUG导致的自动切换,就是非常严重的问题,需要尽快定位原因并解决问题,最终发现,并不是数据库的问题。

问题分析:

第一步:通过确认监控器确定自动切换时间等信息;第二步:检查主库实例日志、dmwatcher日志、core日志、OS日志等分析自动切换原因;第三步:确认影响及解决方案。

检查监控器日志:

达梦数据库守护集群,通常只有在主库故障时才会自动切换,由于自动切换是由确认监控器发起的,所以第一步先检查确认监控器日志:

通过日志可以看到:

时间:2024-06-03 13:28:02 监控器发现 CJC01 实例故障,WSTATUS 为 ERROR,RECEIVE TIMEOUT CJC01,dwmon_tcp_recv failed。时间:2024-06-03 13:28:08[!!! 实例CJC01[PRIMARY, OPEN, ISTAT_SAME:TRUE]故障,实例CJC02[STANDBY, OPEN, ISTAT_SAME:TRUE]符合自动接管条件 !!!]时间:2024-06-03 13:28:12使用实例CJC02接管成功,主、备数据库自动切换完成,新主库变为CJC02。

详细监控器日志如下:

dmdba@cjc-db-01:/dm8/mal/log$vi dmmonitor_20240525105444.log[monitor]         2024-06-03 13:28:02: dwmon_check_watch_error_low, receive dmwatcher(CJC01) msg timeout(far_err_time:10), close port(ip:192.168.0.11, port:55339, vio:4, n_fixed:1)![monitor]         2024-06-03 13:28:02: dwmon tcp port vio(4) close, inst_name:CJC01, ip:192.168.0.11, port:55339, n_fixed:1.[monitor]         2024-06-03 13:28:02: <RECEIVE TIMEOUT CJC01>[monitor]         2024-06-03 13:28:02: dwmon_tcp_recv failed, close port, vio:0, mid:1991408620, errno:115, code:-6007[monitor]         2024-06-03 13:28:02: dwmon tcp port vio(0) close, inst_name:, ip:192.168.0.11, port:55339, n_fixed:1.[monitor]         2024-06-03 13:28:02: 接收守护进程(CJC01)消息超时                  WTIME                WSTATUS        INST_OK   INAME            ISTATUS     IMODE     RSTAT    N_OPEN   FLSN            CLSN                  2024-06-03 13:27:14  ERROR          OK        CJC01          OPEN        PRIMARY   VALID    10       11554054        11554054[monitor]         2024-06-03 13:28:02: </RECEIVE TIMEOUT CJC01>[monitor]         2024-06-03 13:28:02: dwmon_inst_can_takeover_by_prim_info, check_pstat = 1, n_pinfo = 1, but ok_cnt = 0, return FALSE![monitor]         2024-06-03 13:28:02: [!!! 实例CJC01[PRIMARY, OPEN, ISTAT_SAME:TRUE]仍然处于活动状态,或者模式/状态或LSN值不符合被其他实例自动接管条件 !!!][CJC02]          2024-06-03 13:27:33.500 [INFO] dmwatcher P0000007887 T0000000000000007891  没有收到远程守护进程(CJC01)消息,原状态为(OPEN),距上次接收消息时间间隔(11)s, 设置远程守护进程为ERROR状态[monitor]         2024-06-03 13:28:08: dwmon_inst_can_takeover_by_prim_info, primary name: CJC01, instance name: CJC02, n_pinfo = 1, ok_cnt = 1, return TRUE![monitor]         2024-06-03 13:28:08: [!!! 实例CJC01[PRIMARY, OPEN, ISTAT_SAME:TRUE]故障,实例CJC02[STANDBY, OPEN, ISTAT_SAME:TRUE]符合自动接管条件 !!!][monitor]         2024-06-03 13:28:08: 检测到PRIMARY实例故障,开始对组(GRPCJC)执行自动接管[monitor]         2024-06-03 13:28:08: <AUTO TAKEOVER CJC02>[monitor]         2024-06-03 13:28:08: 通知组(GRPCJC)当前活动的守护进程设置MID[monitor]         2024-06-03 13:28:08: Begin to wait site(CJC02) complete...[monitor]         2024-06-03 13:28:08: [!!! dwmon_tcp_msg_send, get master tcp port failed, send cmd msg(cmd:5, name_sendto:GRPCJC) to dmwatcher() !!!][monitor]         2024-06-03 13:28:08: Wait site(CJC02) finished, code=0![monitor]         2024-06-03 13:28:08: 通知组(GRPCJC)当前活动的守护进程设置MID成功[monitor]         2024-06-03 13:28:08: 开始使用实例CJC02接管[monitor]         2024-06-03 13:28:08: 通知守护进程CJC02切换TAKEOVER状态[monitor]         2024-06-03 13:28:09: 守护进程(CJC02)状态切换 [OPEN-->TAKEOVER][monitor]         2024-06-03 13:28:10: 切换守护进程CJC02为TAKEOVER状态成功[monitor]         2024-06-03 13:28:10: 实例CJC02开始执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句[monitor]         2024-06-03 13:28:10: 实例CJC02执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句成功[monitor]         2024-06-03 13:28:10: 实例CJC02开始执行SP_APPLY_KEEP_PKG()语句[monitor]         2024-06-03 13:28:10: 实例CJC02执行SP_APPLY_KEEP_PKG()语句成功[monitor]         2024-06-03 13:28:10: 实例CJC02开始执行ALTER DATABASE MOUNT语句[monitor]         2024-06-03 13:28:10: Begin to wait site(CJC02) complete...[monitor]         2024-06-03 13:28:10: Wait site(CJC02) finished, code=0![monitor]         2024-06-03 13:28:10: 实例CJC02执行ALTER DATABASE MOUNT语句成功[monitor]         2024-06-03 13:28:10: 实例CJC02开始执行ALTER DATABASE PRIMARY语句[monitor]         2024-06-03 13:28:10: 实例CJC02执行ALTER DATABASE PRIMARY语句成功[monitor]         2024-06-03 13:28:10: 通知实例CJC02修改所有归档状态无效[monitor]         2024-06-03 13:28:10: 修改所有实例归档为无效状态成功[monitor]         2024-06-03 13:28:10: 实例CJC02开始执行ALTER DATABASE OPEN FORCE语句[monitor]         2024-06-03 13:28:10: ohis_inst_info_copy_low, inst(CJC02) apply info changed, old info[p_db_magic:694012907, n_apply_ep:1], new info to set[p_db_magic:1894906507, n_apply_ep:0]![monitor]         2024-06-03 13:28:10: 实例CJC02执行ALTER DATABASE OPEN FORCE语句成功[monitor]         2024-06-03 13:28:10: 实例CJC02开始执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句[monitor]         2024-06-03 13:28:11: 实例CJC02执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句成功[monitor]         2024-06-03 13:28:11: 通知守护进程CJC02切换OPEN状态[monitor]         2024-06-03 13:28:12: 守护进程(CJC02)状态切换 [TAKEOVER-->OPEN][monitor]         2024-06-03 13:28:12: 切换守护进程CJC02为OPEN状态成功[monitor]         2024-06-03 13:28:12: 通知组(GRPCJC)的守护进程执行清理操作[monitor]         2024-06-03 13:28:12: Notify instance(CJC02) to clear monitor info and wait complete![monitor]         2024-06-03 13:28:12: dwmon_cmd_msg_send_low failed, get tcp_port failed(inst_name:, ip:192.168.0.11, port:55339)![monitor]         2024-06-03 13:28:12: [!!! dwmon_tcp_msg_send to master tcp_port failed, code:-6010, (inst_name:, ip:192.168.0.11, port:55339, vio:0) !!!][monitor]         2024-06-03 13:28:12: Begin to wait site(CJC02) complete...[monitor]         2024-06-03 13:28:12: 清理守护进程(CJC02)请求成功[monitor]         2024-06-03 13:28:12: Wait site(CJC02) finished, code=0![monitor]         2024-06-03 13:28:12: 使用实例CJC02接管成功[monitor]         2024-06-03 13:28:12: </AUTO TAKEOVER CJC02>[monitor]         2024-06-03 13:28:12: 组(GRPCJC)使用实例CJC02自动接管成功[monitor]         2024-06-03 13:28:49:                  GROUP            OGUID       MON_CONFIRM MODE         MPP_FLAG                  GRPCJC         666666      TRUE        AUTO         FALSE                  <<DATABASE GLOBAL INFO:>>                  DW_IP        MAL_DW_PORT  WTIME                WTYPE     WCTLSTAT  WSTATUS        INAME            INST_OK   N_EP  N_OK  ISTATUS     IMODE     DSC_STATUS     RTYPE     RSTAT                  192.168.0.12  55339        2024-06-03 13:28:18  GLOBAL    VALID     OPEN           CJC02          OK        1     1     OPEN        PRIMARY   DSC_OPEN       REALTIME  VALID                      EP INFO:                  INST_IP      INST_PORT  INST_OK   INAME            ISTATUS     IMODE     DSC_SEQNO  DSC_CTL_NODE     RTYPE     RSTAT       FSEQ            FLSN            CSEQ            CLSN            DW_STAT_FLAG                  192.168.0.12  15239      OK        CJC02          OPEN        PRIMARY   0          0                REALTIME  VALID       688606          11554511        688606          11554511        NONE                  ERROR INSTANCES:                  <<DATABASE GLOBAL INFO:>>                  DW_IP        MAL_DW_PORT  WTIME                WTYPE     WCTLSTAT  WSTATUS        INAME            INST_OK   N_EP  N_OK  ISTATUS     IMODE     DSC_STATUS     RTYPE     RSTAT                  192.168.0.11  55339        2024-06-03 13:27:14  GLOBAL    VALID     ERROR          CJC01          OK        1     1     OPEN        PRIMARY   DSC_OPEN       REALTIME  VALID                  EP INFO:                  INST_IP      INST_PORT  INST_OK   INAME            ISTATUS     IMODE     DSC_SEQNO  DSC_CTL_NODE     RTYPE     RSTAT       FSEQ            FLSN            CSEQ            CLSN            DW_STAT_FLAG                  192.168.0.11  15239      OK        CJC01          OPEN        PRIMARY   0          0                REALTIME  VALID       688604          11554054        688604          11554054        NONE[monitor]         2024-06-03 13:29:49:                  GROUP            OGUID       MON_CONFIRM MODE         MPP_FLAG                  GRPCJC         666666      TRUE        AUTO         FALSE                  <<DATABASE GLOBAL INFO:>>                  DW_IP        MAL_DW_PORT  WTIME                WTYPE     WCTLSTAT  WSTATUS        INAME            INST_OK   N_EP  N_OK  ISTATUS     IMODE     DSC_STATUS     RTYPE     RSTAT                  192.168.0.12  55339        2024-06-03 13:29:18  GLOBAL    VALID     OPEN           CJC02          OK        1     1     OPEN        PRIMARY   DSC_OPEN       REALTIME  VALID
EP INFO: INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG 192.168.0.12 15239 OK CJC02 OPEN PRIMARY 0 0 REALTIME VALID 688606 11554511 688606 11554511 NONE ERROR INSTANCES: <<DATABASE GLOBAL INFO:>> DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT 192.168.0.11 55339 2024-06-03 13:27:14 GLOBAL VALID ERROR CJC01 OK 1 1 OPEN PRIMARY DSC_OPEN REALTIME VALID EP INFO: INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG 192.168.0.11 15239 OK CJC01 OPEN PRIMARY 0 0 REALTIME VALID 688604 11554054 688604 11554054 NONE

检查主库实例日志:

检查CJC01原主库实core日志:

检查core目录,没有新文件生成,看来不是数据库BUG导致的自动切换。

检查CJC01原主库实例日志:

通过日志可以看到: 

时间:2024-06-03 13:27:30.770主库实例出现异常,socket_err_should_retry errno:110;时间:2024-06-03 13:27:32.077mal通信出现异常, mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:10时间:2024-06-03 13:28:08.129主库无法和备库通信,Can't connect to DM server on '192.168.0.12' port(36339) errno(115)时间:2024-06-03 13:29:07.746主库向备库发送arch失败,[realtime arch] mal send to site(CJC02, mal_seqno:1) failed时间:2024-06-03 15:05:31.897主库自动重启, [for dem]SYSTEM SHUTDOWN ABORT...,utsk_dw_sql_exec dmserver shutdown abort!!!

详细实例日志如下:

2024-06-03 13:26:07.692 [INFO] database P0000006806 T0000000000000006859  ckpt2_log_adjust: ckpt_lsn(11540754), ckpt_fil(0), ckpt_off(864264192), cur_lsn(11554054), l_next_seq(688605), g_next_seq(688605), cur_free(867622912), total_space(2147475456), used_space(3358720), free_space(2144116736), n_ep(1), db_open_id(10)2024-06-03 13:26:07.693 [INFO] database P0000006806 T0000000000000006859  checkpoint end, 0 pages flushed, used_space[3358720], free_space[2144116736].2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006785  socket_err_should_retry errno:1102024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006786  socket_err_should_retry errno:1102024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006783  socket_err_should_retry errno:1102024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006782  socket_err_should_retry errno:1102024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006780  socket_err_should_retry errno:1102024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006781  socket_err_should_retry errno:1102024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006778  socket_err_should_retry errno:1102024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006779  socket_err_should_retry errno:1102024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006784  socket_err_should_retry errno:1102024-06-03 13:27:31.330 [INFO] database P0000006806 T0000000000000026999  socket_err_should_retry errno:1102024-06-03 13:27:31.330 [INFO] database P0000006806 T0000000000000027000  socket_err_should_retry errno:1102024-06-03 13:27:31.809 [INFO] database P0000006806 T0000000000000026993  socket_err_should_retry errno:1102024-06-03 13:27:31.809 [INFO] database P0000006806 T0000000000000026992  socket_err_should_retry errno:1102024-06-03 13:27:31.809 [INFO] database P0000006806 T0000000000000026996  socket_err_should_retry errno:1102024-06-03 13:27:31.809 [INFO] database P0000006806 T0000000000000026994  socket_err_should_retry errno:1102024-06-03 13:27:31.810 [INFO] database P0000006806 T0000000000000026998  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006809  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006807  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006805  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006804  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006836  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006837  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006835  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006803  socket_err_should_retry errno:1102024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006838  socket_err_should_retry errno:1102024-06-03 13:27:32.077 [INFO] database P0000006806 T0000000000000006852  mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:102024-06-03 13:27:33.077 [INFO] database P0000006806 T0000000000000006852  mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:12024-06-03 13:27:33.077 [INFO] database P0000006806 T0000000000000006852  mal_site_link_check from mal_site(0) to mal_site(1) failed, call mal_site_port_close!2024-06-03 13:27:33.077 [WARNING] database P0000006806 T0000000000000006852  MAL mgr thrd site(0) lost connect to site(1), ctl_handle(48), data_handle(49), dsc_handle(0)2024-06-03 13:27:33.077 [WARNING] database P0000006806 T0000000000000006846  mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again2024-06-03 13:27:33.077 [WARNING] database P0000006806 T0000000000000006847  mal_site_letter_recv code=-6007, errno=107, site(0) recv from site(1) failed, socket handle = 02024-06-03 13:27:33.078 [INFO] database P0000006806 T0000000000000006846  send  CMD_MAL_LINK_CHECK(350): (mal_id:87333158777, stmt_id:4320122, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1)2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006847  MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)2024-06-03 13:27:33.077 [WARNING] database P0000006806 T0000000000000006845  mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006848  mal_site_letter_recv code=-6007, errno=0, site(0) recv from site(1) failed, socket handle = 02024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006847  site(0) ctl_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006848  MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006848  site(0) data_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again2024-06-03 13:27:33.249 [INFO] database P0000006806 T0000000000000026991  socket_err_should_retry errno:1102024-06-03 13:27:34.369 [INFO] database P0000006806 T0000000000000006839  socket_err_should_retry errno:1102024-06-03 13:27:35.329 [INFO] database P0000006806 T0000000000000006787  socket_err_should_retry errno:1102024-06-03 13:27:35.409 [INFO] database P0000006806 T0000000000000026997  socket_err_should_retry errno:1102024-06-03 13:27:35.409 [INFO] database P0000006806 T0000000000000026995  socket_err_should_retry errno:1102024-06-03 13:27:35.490 [INFO] database P0000006806 T0000000000000025591  socket_err_should_retry errno:1102024-06-03 13:27:35.490 [INFO] database P0000006806 T0000000000000025594  socket_err_should_retry errno:1102024-06-03 13:27:35.490 [INFO] database P0000006806 T0000000000000025587  socket_err_should_retry errno:1102024-06-03 13:27:35.489 [INFO] database P0000006806 T0000000000000025592  socket_err_should_retry errno:1102024-06-03 13:27:35.489 [INFO] database P0000006806 T0000000000000025598  socket_err_should_retry errno:1102024-06-03 13:28:08.129 [ERROR] database P0000006806 T0000000000000006852  Can't connect to DM server on '192.168.0.12' port(36339) errno(115)2024-06-03 13:29:07.740 [INFO] database P0000006806 T0000000000000006894  checkpoint requested by CKPT_INTERVAL, rlog free space[2144116736], used space[3358720]2024-06-03 13:29:07.740 [INFO] database P0000006806 T0000000000000006894  checkpoint generate by ckpt_interval2024-06-03 13:29:07.740 [INFO] database P0000006806 T0000000000000006859  checkpoint begin, used_space[3358720], free_space[2144116736]...2024-06-03 13:29:07.746 [ERROR] database P0000006806 T0000000000000006860  [realtime arch] mal send to site(CJC02, mal_seqno:1) failed, begin lsn:11554054, end lsn:11554054, code:-60102024-06-03 13:29:07.746 [ERROR] database P0000006806 T0000000000000006860  send realtime archive to instance[CJC02] failed, code = -6010, begin_lsn = 11554054, end_lsn = 11554054!2024-06-03 13:29:07.746 [INFO] database P0000006806 T0000000000000006860  rlog4_process_arch_failed, need_suspend:12024-06-03 13:29:07.747 [INFO] database P0000006806 T0000000000000006860  Send archive log to remote instance failed, switch all ep to SUSPEND status success!2024-06-03 14:10:39.591 [INFO] database P0000006806 T0000000000000006865  (1) License daily checking by time trigger!2024-06-03 15:05:31.897 [INFO] database P0000006806 T0000000000000006897  utsk_dw_sql_exec, exec sql SHUTDOWN ABORT!2024-06-03 15:05:31.897 [FATAL] database P0000006806 T0000000000000006897  [for dem]SYSTEM SHUTDOWN ABORT.2024-06-03 15:05:31.897 [FATAL] database P0000006806 T0000000000000006897  utsk_dw_sql_exec dmserver shutdown abort!!!2024-06-03 15:05:31.897 [INFO] database P0000006806 T0000000000000006897  total 3 rfil opened!2024-06-03 15:05:44.947 [INFO] database P0000008699 T0000000000000008699  INI parameter DW_PORT changed, the original value 0, new value 155392024-06-03 15:05:44.955 [INFO] database P0000008699 T0000000000000008699  INI parameter DPC_2PC changed, the original value 1, new value 02024-06-03 15:05:44.975 [INFO] database P0000008699 T0000000000000008699  version info: enterprise2024-06-03 15:05:44.977 [INFO] database P0000008699 T0000000000000008699  os_sema2_create_low, create and inc sema success, key:226172672, sem_id:131072, sem_value:1!2024-06-03 15:05:45.845 [INFO] database P0000008699 T0000000000000008699  DM Database Server 64 V8 03134284094-20231220-212751-20067 startup...2024-06-03 15:05:45.846 [INFO] database P0000008699 T0000000000000008699  INI parameter ROLLSEG_POOLS changed, the original value 19, new value 12024-06-03 15:05:46.257 [WARNING] database P0000008699 T0000000000000008699  fail to load libraster.so, libodbc.so.2: cannot open shared object file: No such file or directory2024-06-03 15:05:46.257 [INFO] database P0000008699 T0000000000000008699  fil_sys_init2024-06-03 15:05:46.409 [INFO] database P0000008699 T0000000000000008699  Database mode = 1, oguid = 6666662024-06-03 15:05:46.419 [INFO] database P0000008699 T0000000000000008699  Initialize temp tablespace, file path: /dm8/data/CJC/TEMP.DBF, code: 02024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008730  dm_mal_recv_thd started, src_site:0, dest_site:0, port_data2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008735  dm_mal_recv_thd started, src_site:0, dest_site:1, port_ctl2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008733  dm_mal_tsk_thd started, src_site:0, dest_site:1, port_data2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008729  dm_mal_tsk_thd started, src_site:0, dest_site:0, pot_ctl2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008736  dm_mal_recv_thd started, src_site:0, dest_site:1, port_data2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008734  dm_mal_tsk_thd started, src_site:0, dest_site:1, pot_ctl2024-06-03 15:05:46.452 [INFO] database P0000008699 T0000000000000008739  mal_lsnr_thread created success, port_num=363392024-06-03 15:05:47.452 [INFO] database P0000008699 T0000000000000008740  mal_site_ctl_link_create startup from mal_site(0) to mal_site(0)!2024-06-03 15:05:47.453 [INFO] database P0000008699 T0000000000000008740  mal_site_magic_gen site_magic[46784], src_site:0, dst_site:02024-06-03 15:05:47.453 [INFO] database P0000008699 T0000000000000008740  site[0] mal_site_ctl_port_set to site[0, IP: 192.168.0.11, port_num: 36339], socket handle = 10, site_magic = 46784, link_seq = 1

检查主库操作系统history:

前面通过监控器日志、实例日志等可以看到达梦主、备库自动切换过程,但是没有主库故障原因的相关日志,继续查看OS日志、history等:

在CJC01实例服务器上检查history:

终于找到原因,XXX人员未按操作规范,私自在服务器上执行重启网络服务、ifdown ethxxx等操作,时间完全吻合。

history部分日志如下:

xxx 2024-06-03-13:26:14 [root]service network restartxxx 2024-06-03-13:27:15 [root]ifdown ethxxxxxx 2024-06-03-13:38:38 [root]ip axxx 2024-06-03-13:39:15 [root]route -nxxx 2024-06-03-13:39:39 [root]historyxxx 2024-06-03-15:04:55 [root]ip axxx 2024-06-03-15:05:29 [root]ifup ethxxxxxx 2024-06-03-15:05:33 [root]ip a

重启主库网卡期间,由于监控器和备库都无法和主库通信,而监控器和备库之间可以通信,所以会认为主库故障,自动将主库踢出集群,并将备库提升为主库。

自动切换后,有哪些影响?

1.由于自动切换时间很短,应用无感知,无影响;

2.检查达梦数据库自动备份执行情况:

自动备份是通过主库内部JOB实现的,每晚自动备份到主库服务器指定目录下,检查主库,自动切换后,没有生成备份文件,那么是否可以认为达梦备份失败了?

检查JOB执行情况,备份无异常,正常执行。

SQL> set pagesize 1000SQL> SELECT EXEC_ID,NAME,START_TIME,END_TIME,ERRCODE,JOBID FROM SYSJOB.SYSJOBHISTORIES2 ORDER BY START_TIME;行号     EXEC_ID     NAME                     START_TIME          END_TIME            ERRCODE     JOBID      ---------- ----------- ------------------------ ------------------- ------------------- ----------- -----------.....79         xxxxxxx07   DM_BAK_FULL_CJC        2024-06-03 22:00:17 2024-06-03 22:00:23 0           yyyyyyyyy480         xxxxxxx08   DM_BAK_FULL_CJC_DELETE 2024-06-03 23:00:17 2024-06-03 23:00:17 0           yyyyyyyyy581         xxxxxxx09   DM_BAK_FULL_CJC        2024-06-04 22:00:17 2024-06-04 22:00:22 0           yyyyyyyyy482         xxxxxxx10   DM_BAK_FULL_CJC_DELETE 2024-06-04 23:00:17 2024-06-04 23:00:17 0           yyyyyyyyy583         xxxxxxx11   DM_BAK_FULL_CJC        2024-06-05 22:00:18 2024-06-05 22:00:22 0           yyyyyyyyy484         xxxxxxx12   DM_BAK_FULL_CJC_DELETE 2024-06-05 23:00:18 2024-06-05 23:00:18 0           yyyyyyyyy5.....

检查备库,对应目录下发现了备份文件,也就是这种类型的自动备份只能在主库执行,主备切换后,备份任务会自动在新主库上自动接管。

最后为了方便运维,手动执行了回切操作。

###chenjuchao 20240615###

欢迎关注我的公众号《IT小Chen》

       
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论