MySQL slave io线程报错一例

原创冯刚 2022-10-21

1858

前言

此案例介绍半同步复制环境中，因为主库binary log index file里面出现两个相同binlog文件名，导致该实例的slave库I/O线程停止。

1 环境信息

自建MySQL：5.7.22

OS 版本：CentOS Linux release 7.6

2 错误描述

早晨接到报警，提示报表库（见上面拓扑图）I/O线程异常。

2.1 登录报表库，查看复制状态

[ root@报表库:~ ]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 132984
Server version: 5.7.22-log MySQL Community Server (GPL)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 备份库地址
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 1
              Master_Log_File: mysql_bin.003427
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay.001627
                Relay_Log_Pos: 241
        Relay_Master_Log_File: mysql_bin.003427
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4
              Relay_Log_Space: 525
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not open log file'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 10002
                  Master_UUID: 85cb9384-4db4-11ec-b654-fa1b79c33900
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 221021 04:29:36
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 8077181f-4db4-11ec-8225-fa417cfd7e00:169899218-205014272
            Executed_Gtid_Set: 3c07e8e0-399a-11ed-9be9-5254006919a1:1-156029,
8077181f-4db4-11ec-8225-fa417cfd7e00:1-205014272,
88958222-4db4-11ec-b637-faa9e751b600:1-243629236
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

根据报错，提示打不开binlog文件，报错时间为：221021 04:29:36。

2.2 查看对应时间报表库错误日志：

2022-10-21T04:29:36.176599+08:00 1 [ERROR] Error reading packet from server for channel '': Could not open log file (server_errno=1236)
2022-10-21T04:29:36.176638+08:00 1 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Could not open log file', Error_code: 1
236
2022-10-21T04:29:36.176642+08:00 1 [Note] Slave I/O thread exiting for channel '', read up to log 'mysql_bin.003427', position 4

可以看到读到mysql_bin.003427文件position 4的时候报错。

3 排查

3.1 登录备份库，查看binlog

[ root@备份库:/xxxxxx/binlog ]# ll
total 247390320
.................................省略..............................................................省略.............................
-rw-r----- 1 mysql mysql 1080514038 Oct 21 04:23 mysql_bin.003425
-rw-r----- 1 mysql mysql 1080022172 Oct 21 04:29 mysql_bin.003426
-rw-r----- 1 mysql mysql 1073745885 Oct 21 10:45 mysql_bin.003427
-rw-r----- 1 mysql mysql 1073742926 Oct 21 14:56 mysql_bin.003428
-rw-r----- 1 mysql mysql 1073749274 Oct 21 16:56 mysql_bin.003429
-rw-r----- 1 mysql mysql  147230918 Oct 21 17:32 mysql_bin.003430
-rw-r----- 1 mysql mysql      12642 Oct 21 16:56 mysql_bin.index
[ root@备份库:/xxxxxxx/binlog ]# cat mysql_bin.index
.................................省略.............................
.................................省略.............................
/xxxxxx/mysql_bin.003425
/xxxxxx/mysql_bin.003426
/xxxxxx/mysql_bin.003427
/xxxxxx/mysql_bin.003427

看到备份库正在写mysql_bin.003427文件，且只有一个。但是mysql_bin.index里面记录了两个mysql_bin.003427文件。

3.2 查看错误日志

2022-10-21T04:29:36.171970+08:00 815671 [ERROR] Failed to open log (file '/xxxxxx/binlog/mysql_bin.003427', errno 2)
2022-10-21T04:29:36.171982+08:00 669961 [ERROR] Failed to open log (file '/xxxxxx/binlog/mysql_bin.003427', errno 2)
2022-10-21T04:29:36.172200+08:00 815671 [Note] Stop asynchronous binlog_dump to slave (server_id: 1)
2022-10-21T04:29:36.172200+08:00 669961 [Note] Stop asynchronous binlog_dump to slave (server_id: 330647)

可以看到报错和从库一致。

3.3 解析备份库binlog

根据2.1步骤，报表库/从库最后一个事务gtid为： '8077181f-4db4-11ec-8225-fa417cfd7e00:205014272'

先不考虑为什么有两个 mysql_bin.003427 文件，先解析备份库binlog，

mysql_bin.003427日志的第一个事物gtid为：'8077181f-4db4-11ec-8225-fa417cfd7e00:205221231'

[ root@备份库:/xxxxxx/binlog ]# mysqlbinlog --base64-output=decode-rows -v -v  mysql_bin.003427|more
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#221021  8:15:54 server id 10002  end_log_pos 123 CRC32 0xd51f2be9 	Start: binlog v 4, server v 5.7.22-log created 221021  8:15:54
# at 123
#221021  8:15:54 server id 10002  end_log_pos 234 CRC32 0x7d2337cf 	Previous-GTIDs
# 8077181f-4db4-11ec-8225-fa417cfd7e00:1-205221230,
# 88958222-4db4-11ec-b637-faa9e751b600:1-243629236
# at 234
#221021  8:15:54 server id 10003  end_log_pos 299 CRC32 0xcaae95a6 	GTID	last_committed=0	sequence_number=1	rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= '8077181f-4db4-11ec-8225-fa417cfd7e00:205221231'/*!*/;
# at 299

上一个日志mysql_bin.003426的最后一个事物gtid为：'8077181f-4db4-11ec-8225-fa417cfd7e00:205014272'

可以看到，两个binlog日志文件差了 205221231-205014272 将近21万个gtid事务。

4 原因

之前有人复现过，主库手动删除在用的binlog日志文件+延迟会复现上面的报错，参考文章：

文章链接，转载注明出处：https://cdn.modb.pro/db/401771

简单说就是手动删除主库在用的binlog文件，会一定概率导致从库I/O线程停止。

但是此次问题不是这种情形。从拓扑图可以看到，出现两个同名binlog的虽然是主库，但是是线上的备份库，只读的，不会有人手动清理binlog，并且日志清理默认是15天。

mysql> show variables like '%expire%';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| disconnect_on_expired_password | ON    |
| expire_logs_days               | 15    |
+--------------------------------+-------+
2 rows in set (0.00 sec)

备份库每天凌晨1点物理备份全库，应该不是这个原因。

还有一个就是，发现该实例一直在新启动一个新的mysql进程（间隔大约是100秒一次），但是一直失败。不知道是不是这个原因（后面通过停止pid为73985的进程解决这个问题，没有再出现启动新进程）

[ root@备份库:/etc/systemd/system ]# ps -ef|grep mysqld
mysql      6708      1  5 Jun24 ?        6-00:05:03 /usr/local/mysql57/sbin/mysqld --defaults-file=/etc/mysql57/my_3306.cnf --basedir=/usr/local/mysql57 --plugin-dir=/usr/local/mysql57/lib64/mysql/plugin
mysql     73985      1  3 14:23 ?        00:00:01 /usr/local/mysql57/sbin/mysqld --defaults-file=/etc/mysql57/my_3306.cnf --basedir=/usr/local/mysql57 --plugin-dir=/usr/local/mysql57/lib64/mysql/plugin
root      74130  69736  0 14:24 pts/2    00:00:00 grep --colour mysqld

错误日志：

2022-10-21T03:50:07.523347+08:00 0 [Note] /usr/local/mysql57/sbin/mysqld (mysqld 5.7.22-log) starting as process 109889 ...
2022-10-21T03:50:07.527315+08:00 0 [Warning] option 'innodb-sort-buffer-size': unsigned value 134217728 adjusted to 67108864
2022-10-21T03:50:07.527829+08:00 0 [Note] InnoDB: PUNCH HOLE support available
2022-10-21T03:50:07.527858+08:00 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2022-10-21T03:50:07.527865+08:00 0 [Note] InnoDB: Uses event mutexes
2022-10-21T03:50:07.527870+08:00 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2022-10-21T03:50:07.527875+08:00 0 [Note] InnoDB: Compressed tables use zlib 1.2.3
2022-10-21T03:50:07.527881+08:00 0 [Note] InnoDB: Using Linux native AIO
2022-10-21T03:50:07.531363+08:00 0 [Note] InnoDB: Number of pools: 1
2022-10-21T03:50:07.531516+08:00 0 [Note] InnoDB: Using CPU crc32 instructions
2022-10-21T03:50:07.535524+08:00 0 [Note] InnoDB: Initializing buffer pool, total size = 8G, instances = 16, chunk size = 128M
2022-10-21T03:50:07.535555+08:00 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_INTERLEAVE
2022-10-21T03:50:08.222880+08:00 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_DEFAULT
2022-10-21T03:50:08.222925+08:00 0 [Note] InnoDB: Completed initialization of buffer pool
2022-10-21T03:50:08.434086+08:00 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2022-10-21T03:50:08.434440+08:00 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 11
2022-10-21T03:50:08.434463+08:00 0 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files.
2022-10-21T03:50:08.434470+08:00 0 [Note] InnoDB: Retrying to lock the first data file
2022-10-21T03:50:09.434584+08:00 0 [ERROR] InnoDB: Unable to lock ./ibdata1 error: 11

主机日志：

Oct 21 14:20:04 systemd[1]: mysql57@3306.service: main process exited, code=exited, status=1/FAILURE
Oct 21 14:20:04 kill[73396]: Usage:
Oct 21 14:20:04 kill[73396]: kill [options] <pid|name> [...]
Oct 21 14:20:04 kill[73396]: Options:
Oct 21 14:20:04 kill[73396]: -a, --all              do not restrict the name-to-pid conversion to processes
Oct 21 14:20:04 kill[73396]: with the same uid as the present process
Oct 21 14:20:04 kill[73396]: -s, --signal <sig>     send specified signal
Oct 21 14:20:04 kill[73396]: -q, --queue <sig>      use sigqueue(2) rather than kill(2)
Oct 21 14:20:04 kill[73396]: -p, --pid              print pids without signaling them
Oct 21 14:20:04 kill[73396]: -l, --list [=<signal>] list signal names, or convert one to a name
Oct 21 14:20:04 kill[73396]: -L, --table            list signal names and numbers
Oct 21 14:20:04 kill[73396]: -h, --help     display this help and exit
Oct 21 14:20:04 kill[73396]: -V, --version  output version information and exit
Oct 21 14:20:04 kill[73396]: For more details see kill(1).
Oct 21 14:20:04 systemd[1]: mysql57@3306.service: control process exited, code=exited status=1
Oct 21 14:20:04 systemd[1]: Unit mysql57@3306.service entered failed state.
Oct 21 14:20:04 systemd[1]: mysql57@3306.service failed.
Oct 21 14:20:04 systemd[1]: mysql57@3306.service holdoff time over, scheduling restart.
Oct 21 14:20:04 systemd[1]: Stopped MySQL Community Server.
Oct 21 14:20:04 systemd[1]: Starting MySQL Community Server...
Oct 21 14:20:05 systemd[1]: Started MySQL Community Server.

5 解决方案

既然备份库的binlog中缺失了gtid，并且缺失的gtid区间为：205014273-205221230 。那就解析线上主库对应时间段的binlog日志文件，确定对应gtid的postion，然后通过mysqlbinlog --start-positon和--stop-position生成sql文件，到报表库执行即可。

执行期间，通过show master status;可以看到Executed_Gtid_Set的事务id一直在增长。

最后start slave即可。

最后修改时间：2022-10-24 08:57:03

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者