MySQL从库异常宕机-I/O和SQL线程起不来【bug】

原创冯刚 2022-10-18

1569

前言

上一篇文章介绍了因为bug导致主从数据不一致的案例

MySQL主从数据不一致-BUG：https://www.modb.pro/db/518430

此案例介绍碰到的另外一个bug：因为所在主机宕机，MySQL从库I/O和SQL线程启动报错问题。

1 环境信息

自建MySQL：5.7.22

OS 版本：CentOS Linux release 7.6

2 错误描述

2022-08-17 06:45:04 从库所在主机告警，提示宕机，并伴随MySQL异常告警；

2022-08-17 06:45:53 宕机告警恢复；

2022-08-17 06:46:52 提示MySQL I/O和SQL线程停止；

3 排查

3.1 登录实例，查看从库同步状态

此时I/O和SQL线程都是"NO"。

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: XXX.XXX.XXX.XXX
                  Master_User: XXXXXX
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.003334
          Read_Master_Log_Pos: 102990348
               Relay_Log_File: relay.011586
                Relay_Log_Pos: 111435792
        Relay_Master_Log_File: mysql-bin.003334
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1872
                   Last_Error: Slave failed to initialize relay log info structure from the repository
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 111435579
              Relay_Log_Space: 0
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1872
               Last_SQL_Error: Slave failed to initialize relay log info structure from the repository
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
                  Master_UUID: acd7aa9e-cc1c-11eb-8e2e-0c42a1474c26
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 220817 09:17:54
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 3f6aeb45-13b5-11e9-be3e-7cd30ae4123c:1-154276204,
5157c677-13b5-11e9-9a24-7cd30ae43356:1-2847256991,
f537f1cc-a575-11ea-9d39-52540014ff9c:1-6
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

3.2 查看错误日志

## 下面日志是主机启动后，MySQL实例的启动日志：2022-08-17T06:45:17.529616+08:00 0 [Warning] 'NO_ZERO_DATE', 'NO_ZERO_IN_DATE' and 'ERROR_FOR_DIVISION_BY_ZERO' sql modes should be used with strict mode. They will be merged wi
th strict mode in a future release.
2022-08-17T06:45:17.532546+08:00 0 [Note] /usr/local/mysql57/sbin/mysqld (mysqld 5.7.22-log) starting as process 2277 ...
2022-08-17T06:45:17.550221+08:00 0 [Warning] option 'innodb-sort-buffer-size': unsigned value 134217728 adjusted to 67108864
2022-08-17T06:45:17.552817+08:00 0 [Note] InnoDB: PUNCH HOLE support available
2022-08-17T06:45:17.552844+08:00 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2022-08-17T06:45:17.552850+08:00 0 [Note] InnoDB: Uses event mutexes
2022-08-17T06:45:17.552852+08:00 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2022-08-17T06:45:17.552855+08:00 0 [Note] InnoDB: Compressed tables use zlib 1.2.3
2022-08-17T06:45:17.552877+08:00 0 [Note] InnoDB: Using Linux native AIO
2022-08-17T06:45:17.555735+08:00 0 [Note] InnoDB: Number of pools: 1
2022-08-17T06:45:17.557545+08:00 0 [Note] InnoDB: Using CPU crc32 instructions
2022-08-17T06:45:17.561520+08:00 0 [Note] InnoDB: Initializing buffer pool, total size = 8G, instances = 16, chunk size = 128M
2022-08-17T06:45:17.561541+08:00 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_INTERLEAVE
2022-08-17T06:45:18.001568+08:00 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_DEFAULT
2022-08-17T06:45:18.001601+08:00 0 [Note] InnoDB: Completed initialization of buffer pool
2022-08-17T06:45:18.054769+08:00 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2022-08-17T06:45:18.070395+08:00 0 [Note] InnoDB: Highest supported file format is Barracuda.
2022-08-17T06:45:18.134991+08:00 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 13018018506772
2022-08-17T06:45:18.197995+08:00 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 13018019491644
2022-08-17T06:45:18.198032+08:00 0 [Note] InnoDB: Database was not shutdown normally!
2022-08-17T06:45:18.198036+08:00 0 [Note] InnoDB: Starting crash recovery.
2022-08-17T06:45:18.307661+08:00 0 [Note] InnoDB: Transaction 10209047171 was in the XA prepared state.
2022-08-17T06:45:18.310909+08:00 0 [Note] InnoDB: Transaction 10209047171 was in the XA prepared state.
2022-08-17T06:45:18.311385+08:00 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
2022-08-17T06:45:18.311399+08:00 0 [Note] InnoDB: Trx id counter is 10209047552
2022-08-17T06:45:18.311601+08:00 0 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 
2022-08-17T06:45:19.168448+08:00 0 [Note] InnoDB: Apply batch completed
2022-08-17T06:45:19.168500+08:00 0 [Note] InnoDB: Last MySQL binlog file position 0 358270216, file name mysql_bin.003520
2022-08-17T06:45:19.332107+08:00 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions
2022-08-17T06:45:19.332133+08:00 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2022-08-17T06:45:19.332171+08:00 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2022-08-17T06:45:19.332190+08:00 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2022-08-17T06:45:19.332240+08:00 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2022-08-17T06:45:19.344599+08:00 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2022-08-17T06:45:19.345375+08:00 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2022-08-17T06:45:19.345396+08:00 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2022-08-17T06:45:19.345772+08:00 0 [Note] InnoDB: 5.7.22 started; log sequence number 13018019491644
2022-08-17T06:45:19.346378+08:00 0 [Note] InnoDB: Loading buffer pool(s) from /data/mysql/mysql57_3307/data/ib_buffer_pool
2022-08-17T06:45:19.347415+08:00 0 [Note] Plugin 'FEDERATED' is disabled.
220817  6:45:19 server_audit: INCL_DML_USERS and EXCL_DML_USERS specified simultaneously - both set to empty
220817  6:45:19 server_audit: server_audit_incl_users set to ''.
220817  6:45:19 server_audit: MariaDB Audit Plugin version 1.4.4 STARTED.
220817  6:45:19 server_audit: Could not create file '/xxxxx/auditlog/audit.log'.
2022-08-17T06:45:19.354790+08:00 0 [Note] Recovering after a crash using /xxxxx/binlog/mysql_bin
2022-08-17T06:45:21.272713+08:00 0 [Note] InnoDB: Buffer pool(s) load completed at 220817  6:45:21
2022-08-17T06:45:24.365162+08:00 0 [Note] Starting crash recovery...
2022-08-17T06:45:24.365218+08:00 0 [Note] InnoDB: Starting recovery for XA transactions...
2022-08-17T06:45:24.365224+08:00 0 [Note] InnoDB: Transaction 10209047171 in prepared state after recovery
2022-08-17T06:45:24.365227+08:00 0 [Note] InnoDB: Transaction contains changes to 2 rows
2022-08-17T06:45:24.365230+08:00 0 [Note] InnoDB: 1 transactions in prepared state after recovery
2022-08-17T06:45:24.365232+08:00 0 [Note] Found 1 prepared transaction(s) in InnoDB
2022-08-17T06:45:24.365654+08:00 0 [Note] Crash recovery finished.
2022-08-17T06:45:25.533770+08:00 0 [Warning] Failed to set up SSL because of the following SSL library error: SSL context is not usable without certificate and private key
2022-08-17T06:45:25.533805+08:00 0 [Note] Server hostname (bind-address): '*'; port: 3307
2022-08-17T06:45:25.533832+08:00 0 [Note] IPv6 is available.
2022-08-17T06:45:25.533837+08:00 0 [Note]   - '::' resolves to '::';
2022-08-17T06:45:25.533851+08:00 0 [Note] Server socket created on IP: '::'.
2022-08-17T06:45:25.541233+08:00 0 [Warning] 'user' entry 'dba-docp@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541250+08:00 0 [Warning] 'user' entry 'mysql.session@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541256+08:00 0 [Warning] 'user' entry 'zabbix_monitor@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541291+08:00 0 [Warning] 'db' entry 'performance_schema mysql.session@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541295+08:00 0 [Warning] 'db' entry 'sys mysql.sys@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.541304+08:00 0 [Warning] 'proxies_priv' entry '@ root@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.552222+08:00 0 [Warning] 'tables_priv' entry 'user mysql.session@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.552241+08:00 0 [Warning] 'tables_priv' entry 'sys_config mysql.sys@localhost' ignored in --skip-name-resolve mode.
2022-08-17T06:45:25.559579+08:00 0 [Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name /xxxxx/relaylog/relay.011626, group_relay_log_pos 4217146 group_master_log_name mysql-bin.005810, group_master_log_pos 4216933## 实例启动后，slave进程的报错
2022-08-17T06:45:25.559872+08:00 0 [ERROR] Error looking for file after /xxxxx/relaylog/relay.011627.
2022-08-17T06:45:25.560208+08:00 0 [ERROR] Slave: Failed to initialize the master info structure for channel ''; its record may still be present in 'mysql.slave_master_info' table, consider deleting it.
2022-08-17T06:45:25.560215+08:00 0 [ERROR] Failed to create or recover replication info repositories.
2022-08-17T06:45:25.560221+08:00 0 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2022-08-17T06:45:25.560224+08:00 0 [ERROR] /usr/local/mysql57/sbin/mysqld: Slave failed to initialize relay log info structure from the repository
2022-08-17T06:45:25.560225+08:00 0 [ERROR] Failed to start slave threads for channel ''
2022-08-17T06:45:25.560227+08:00 0 [Note] Some of the channels are not created/initialized properly. Check for additional messages above. You will not be able to start replication on those channels until the issue is resolved and the server restarted.
2022-08-17T06:45:25.581309+08:00 0 [Note] Event Scheduler: Loaded 0 events
2022-08-17T06:45:25.581497+08:00 0 [Note] /usr/local/mysql57/sbin/mysqld: ready for connections.
Version: '5.7.22-log'  socket: '/xxxxx/socketdir/mysql.sock'  port: 3307  MySQL Community Server (GPL)

3.3 尝试手动启动slave

报错：ERROR 1872 (HY000): Slave failed to initialize relay log info structure from the repository

mysql> start slave;
ERROR 1872 (HY000): Slave failed to initialize relay log info structure from the repository

2022-08-17T09:02:30.689881+08:00 551 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2022-08-17T09:02:33.382601+08:00 551 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2022-08-17T09:03:31.630680+08:00 551 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2022-08-17T09:05:21.955292+08:00 564 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.

4 原因

根据报错，推测是MySQL BUG导致。

4.1 bug名称

Bug #83713 : Slave failed to initialize relay log info after OS crash when use MTS and GTID

MySQL BUG链接：https://bugs.mysql.com/bug.php?id=83713

4.2 bug描述

Description:
When i tried to configurate a crash safe slave with MTS and GTID based replication, but after a OS crash
replication failed to be start.

4.3 参数配置

[root@localhost] {18:09:49} ((none)) [5]> show VARIABLES where variable_name in('log_slave_updates', 'gtid_mode', 'enforce_gtid_consistency', 'master_info_repository', 'relay_log_info_repository', 'relay_log_recovery', 'sync_binlog', 'innodb_flush_log_at_trx_commit', 'slave_parallel_workers','sync_master_info', 'sync_relay_log', 'sync_relay_log_info'); 
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| enforce_gtid_consistency       | ON    |
| gtid_mode                      | ON    |
| innodb_flush_log_at_trx_commit | 1     |
| log_slave_updates              | ON    |
| master_info_repository         | TABLE |
| relay_log_info_repository      | TABLE |
| relay_log_recovery             | ON    |
| slave_parallel_workers         | 16    |
| sync_binlog                    | 1     |
| sync_master_info               | 10000 |
| sync_relay_log                 | 10000 |
| sync_relay_log_info            | 10000 |
+--------------------------------+-------+
12 rows in set (0.01 sec)

4.4 复现

STEP1:
run a script to execute update sql on the master continuously

STEP2:
use "kill -9" kill the kvm process of the slave in the physical machine

STEP3:  
start mysql of the slave

按照上面方法，没有复现。下面是具体步骤

步骤1 主库会话持续插入数据

CREATE TABLE `t2` (
  `id` bigint(11) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(50) NOT NULL,
  `sex` char(1) NOT NULL,
  `score` char(2) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;

drop procedure p_1;

delimiter //
create procedure p_1(in num int)
BEGIN
	DECLARE i INT DEFAULT 0;
	DECLARE sex CHAR(1);
	DECLARE score CHAR(2);
	SET score= FLOOR(40 + (RAND() *60));
	SET sex=FLOOR(0 + (RAND() * 2));
	WHILE i < 1000000
	DO
		INSERT INTO t2(name,sex,score) VALUES (concat('f',i),sex,score);
		SET i = i + 1;
	END WHILE;
END
//
delimiter ;
call p_1(1000000);

步骤2 从库主机层面kill实例进程

[ root@mysql10-dba:~ ]# kill -9 53372

步骤3 主库启动MySQL实例，启动slave

[ root@mysql10-dba:~ ]# systemctl start mysql57@7777

4.5 原因/建议

Suggested fix:
The error ocuurs when mts_recovery_groups() reading the corrupted relay log files in MTS group recovery.
But with "gtid_mode=ON" and "relay_log_recovery = 1", the relay log files will be discard later, and so, at the beginning should skip the relay log read.
当mts_recovery_groups()读取MTS组恢复中损坏的中继日志文件时发生错误。
但是对于"gtid_mode=ON"和"relay_log_recovery = 1"，中继日志文件稍后将被丢弃，因此，在开始时应该跳过中继日志读取。
workaround:
The following steps can start slave successfully

    reset slave;
    start slave IO_THREAD;
    stop slave IO_THREAD;
    reset slave;
    start slave;

5 解决方案

[root@localhost:] {09:18:17} ((none)) [1]> reset slave;
Query OK, 0 rows affected (0.01 sec)
[root@localhost:] {09:18:36} ((none)) [3]> start slave;
Query OK, 0 rows affected (0.04 sec)

6 补充

reset slave干的那些事：

删除slave_master_info ，slave_relay_log_info两个表中数据；
删除所有relay log文件，并重新创建新的relay log文件；
不会改变gtid_executed 或者 gtid_purged的值；

官方文档链接描述：

RESET SLAVE makes the replica forget its replication position in the source's binary log. This statement is meant to be used for a clean start: It clears the replication metadata repositories, deletes
all the relay log files, and starts a new relay log file. It also resets to 0 the replication delay specified with the MASTER_DELAY option to CHANGE MASTER TO.

All relay log files are deleted, even if they have not been completely executed by the replication SQL thread. (This is a condition likely to exist on a replica if
you have issued a STOP SLAVE statement or if the replica is highly loaded.)
For a server where GTIDs are in use (gtid_mode is ON), issuing RESET SLAVE has no effect on the GTID execution history. The statement does not change the values of gtid_executed or
gtid_purged, or the mysql.gtid_executed table. If you need to reset the GTID execution history, use RESET MASTER, even if the GTID-enabled server is a replica where binary logging is disabled.
RESET SLAVE requires the RELOAD privilege.

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者