
前言
写一个 Event
导致主从停滞的案例
在我的公众号的发消息界面点击 “加我滴群” 即可获取我的群的二维码。
故障描述
集群的主库配置了一个 event
自动执行一些任务,主从切换后,新主无法执行 event
,而旧主却还在执行 event
,导致数据不一致。
这种情况怎么办?
拯救的方法就是手动切回旧主,然后改造高可用软件的配套脚本。
测试故障过程
故障复盘
我这里用 mha 配了一套一主 2 从的集群,且从配置为 read_only
。
root@localhost [test] 22:29:45 > show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID |
+-----------+------+------+-----------+--------------------------------------+
| 50 | | 3306 | 30 | 196bb870-d495-11ed-a359-000c293313f2 |
| 40 | | 3306 | 30 | 3bc6a7ce-d618-11ed-a81a-000c2939d7eb |
+-----------+------+------+-----------+--------------------------------------+
2 rows in set, 1 warning (0.01 sec)
全都开启了事件调度程序。
root@localhost [test] 22:35:23 > show variables like '%event_scheduler%';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| event_scheduler | ON |
+-----------------+-------+
1 row in set (0.01 sec)
在主上创建测试表和 event
create table test.testevent(id int primary key auto_increment,info varchar(64));
CREATE EVENT insert_testevent
ON SCHEDULE EVERY 20 SECOND
DO
insert into test.testevent(info) values(now());
该 event
在主上的状态为 ENABLED
,Originator
是 server id
,主的 server id
为 30
root@localhost [test] 22:36:29 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: ENABLED
Originator: 30
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.02 sec)
在从上的状态为 SLAVESIDE_DISABLED
,表示从自动不会执行 event
的事件,而是使用 io 拉取主上的 binlog 再回放事件。
root@localhost [test] 22:30:36 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: SLAVESIDE_DISABLED
Originator: 30
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
这样看起来是不是没什么问题。
我先启动 mha 开始测试
masterha_manager --conf=/etc/mha/app1.cnf --ignore_last_failover < /dev/null >/var/log/mha/app1/manager.log 2>&1 &
主从结构
192.168.58.30(192.168.58.30:3306) (current master)
+--192.168.58.40(192.168.58.40:3306)
+--192.168.58.50(192.168.58.50:3306)
主从都是正常获取到了 event
的数据。
root@localhost [test] 22:50:17 > select * from testevent;
+----+---------------------+
| id | info |
+----+---------------------+
| 1 | 2024-11-21 22:36:29 |
| 2 | 2024-11-21 22:36:49 |
| 3 | 2024-11-21 22:37:09 |
| 4 | 2024-11-21 22:37:29 |
| 5 | 2024-11-21 22:37:49 |
| 6 | 2024-11-21 22:38:09 |
| 7 | 2024-11-21 22:38:29 |
| 8 | 2024-11-21 22:38:49 |
| 9 | 2024-11-21 22:39:09 |
| 10 | 2024-11-21 22:39:29 |
现在将 192.168.58.30
上的 mysqld 杀掉。看看新主的 event
状态
[root@master ~]# ps -ef|grep mysqld
root 63264 28534 0 22:52 pts/0 00:00:00 grep --color=auto mysqld
mysql 80821 1 4 17:20 ? 00:15:51 /application/mysql_8_3306/bin/mysqld --defaults-file=/etc/my.cnf
[root@master ~]# kill -9 80821
选择了 192.168.58.40
成为新主
Selected 192.168.58.40(192.168.58.40:3306) as a new master.
我们来看一下 event
状态。
root@localhost [test] 23:13:38 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: SLAVESIDE_DISABLED
Originator: 30
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.01 sec)
状态为 SLAVESIDE_DISABLED
,大家可以看到 Originator: 30
,代表它获取的主的 server id
是 30,还在原主上没有调整过来
可以执行 event
里面的语句吗?我们来查一下 testevent
表
| 65 | 2024-11-21 23:10:29 |
+----+---------------------+
65 rows in set (0.00 sec)
root@localhost [test] 23:15:23 >
最新数据为 2024-11-21 23:10:29
。现在时间为 23:15:23
。显然不能。
我们看一下另一个从的状态:
root@localhost [test] 23:17:06 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: SLAVESIDE_DISABLED
Originator: 30
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.01 sec)
也是 SLAVESIDE_DISABLED
,大家可以看到 Originator: 30
也不正常
现在,我们启一下 192.168.58.30
那个原主。把它连接到新主。
change replication source to source_host='192.168.58.40',source_user='replica',source_password='123456',source_auto_position=1,get_source_public_key=1,source_port=3306;
start replica;
好了启动了
root@localhost [(none)] 23:24:01 > show replica status\G
*************************** 1. row ***************************
Replica_IO_State: Waiting for source to send event
Source_Host: 192.168.58.40
Source_User: replica
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: mysql-bin.000001
Read_Source_Log_Pos: 157
Relay_Log_File: mysql-relay.000002
Relay_Log_Pos: 373
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
看下 event
的状态:
root@localhost [(none)] 23:24:23 > use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
root@localhost [test] 23:24:26 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: ENABLED
Originator: 30
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.01 sec)
状态为 ENABLED,这不正常。
而且它还在继续执行 SQL,这就导致了数据不一致,而且不能把 event 切回。
| 86 | 2024-11-21 23:28:09 |
| 87 | 2024-11-21 23:28:29 |
+----+---------------------+
87 rows in set (0.00 sec)
root@localhost [test] 23:28:48 >
手动指定 event 状态
这时有些小伙伴就要问了,为什么不手动指定 event 状态呢?
好,那我们来测试一下手动执行的结果:
root@localhost [test] 23:22:45 > alter event insert_testevent enable;
Query OK, 0 rows affected (0.01 sec)
可以看到我在主上执行了 enable,后果是什么?
我们来看下一主两从的 event 状态:
可以看到新主是启动了,Originator: 40
也是指向自己
root@localhost [test] 23:32:01 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: ENABLED
Originator: 40
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.01 sec)
另外一个从也是正常的:
root@localhost [test] 23:19:09 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: SLAVESIDE_DISABLED
Originator: 40
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
这时,我们来看一下原主:
root@localhost [test] 23:31:51 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: SLAVESIDE_DISABLED
Originator: 40
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
Originator: 40
指向了新主,状态也是 SLAVESIDE_DISABLED
。这时,会有一个什么现象?
Replica_IO_Running: Yes
Replica_SQL_Running: No
······
Last_Errno: 1062
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '3bc6a7ce-d618-11ed-a81a-000c2939d7eb:2' at master log mysql-bin.000001, end_log_pos 657. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
主从报错了。为什么?我们解析一下主的 binlog 看看
mysqlbinlog --base64-output=DECODE-ROWS --stop-position=657 -vv mysql-bin.000001
找到对应的输出结果
#241121 23:32:10 server id 40 end_log_pos 657 CRC32 0x8778cca6 Write_rows: table id 123 flags: STMT_END_F
### INSERT INTO `test`.`testevent`
### SET
### @1=66 /* INT meta=0 nullable=0 is_null=0 */
### @2='2024-11-21 23:32:10' /* VARSTRING(256) meta=256 nullable=1 is_null=0 */
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
可以看到结果是 test
.testevent
表的第一个字段为 66 第二个字段为 '2024-11-21 23:32:10'
无法执行。
为什么?
因为刚才旧主的 event
连上新主时的状态为 enable
,运行状态,已经运行了一段时间了。
root@localhost [test] 23:40:54 > select * from testevent where id=66;
+----+---------------------+
| id | info |
+----+---------------------+
| 66 | 2024-11-21 23:21:29 |
+----+---------------------+
1 row in set (0.01 sec)
所以主从报错了。
这个时候思考一个问题:
我这还是手动指定的 event
的状态,如果是企业几十个,几千个实例,你能发现是为什么主从报错吗?
这个时候,你就要思考一个问题:现在的高可用软件,能否满足所有的应用?
基于应用改造
下面我就来教大家怎么改造,防止出现刚才的故障场景:
首先如何防止出现数据不一致情况:
开启 super_read_only
再启动 mysqld,然后在新主enable
每个event以 event_scheduler=OFF
启动 mysqld
我们直接测试一下这 2 种情况,以下步骤请用脚本完成,我只演示过程。
开启 super_read_only
再启动 mysqld
我这边恢复快照了
还是一样开启 MHA
masterha_manager --conf=/etc/mha/app1.cnf --ignore_last_failover < /dev/null >/var/log/mha/app1/manager.log 2>&1 &
192.168.58.30(192.168.58.30:3306) (current master)
+--192.168.58.40(192.168.58.40:3306)
+--192.168.58.50(192.168.58.50:3306)
kill 主
[root@master ~]# ps -ef|grep mysqld
mysql 65928 1 4 23:51 ? 00:00:21 /application/mysql_8_3306/bin/mysqld --defaults-file=/etc/my.cnf
root 70040 66341 0 23:59 pts/2 00:00:00 grep --color=auto mysqld
[root@master ~]# kill -9 65928
选主完成:
Selected 192.168.58.40(192.168.58.40:3306) as a new master.
这时候就需要
echo 'super_read_only=ON'>> /etc/my.cnf
然后再启动mysqld,看看原主的 event 状态
root@localhost [test] 00:03:17 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: ENABLED
Originator: 30
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.02 sec)
新主上的值
| 58 | 2024-11-21 23:59:49 |
+----+---------------------+
58 rows in set (0.00 sec)
root@localhost [test] 00:04:33 >
旧主上的值
| 58 | 2024-11-21 23:59:49 |
+----+---------------------+
58 rows in set (0.00 sec)
root@localhost [test] 00:05:14 >
为什么会这样?
因为启用 super_read_only
系统变量会阻止其更新 events
数据字典表中的事件“上次执行”时间戳。这会导致事件调度程序在下次尝试执行计划事件时停止,并将消息写入服务器错误日志。
2024-11-21T16:02:49.890898Z 5 [ERROR] [MY-010045] [Server] Event Scheduler: Unable to schedule event: The MySQL server is running with the --super-read-only option so it cannot execute this statement
可以看到就是一条这样的 ERROR
条目。报告无法执行事件。
旧主连接新主
change replication source to source_host='192.168.58.40',source_user='replica',source_password='123456',source_auto_position=1,get_source_public_key=1,source_port=3306;
start replica;
然后我们在新主 enable
事件
root@localhost [test] 00:08:55 > alter event insert_testevent enable;
Query OK, 0 rows affected (0.02 sec)
root@localhost [test] 00:09:06 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: ENABLED
Originator: 40
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
旧主也正常了。
root@localhost [test] 00:09:39 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: SLAVESIDE_DISABLED
Originator: 40
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.01 sec)
高可用软件能否管理 super_read_only
请自行考虑
以 event_scheduler=OFF
启动 mysqld
还是恢复快照,启动MHA之类的不再赘述,直接到启动旧主那一步。
将 event_scheduler=OFF
写入配置文件。
vi /etc/my.cnf
event_scheduler=OFF
启动后可以看到 event_scheduler
值为 OFF
root@localhost [test] 00:18:42 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: ENABLED
Originator: 30
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.01 sec)
root@localhost [test] 00:18:45 > show variables like 'event_scheduler';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| event_scheduler | OFF |
+-----------------+-------+
1 row in set (0.02 sec)
然后在新主执行 enable
alter event test.insert_testevent enable;
可以看到状态已经改过来了。
root@localhost [test] 00:22:29 > show events\G
*************************** 1. row ***************************
Db: test
Name: insert_testevent
Definer: root@localhost
Time zone: SYSTEM
Type: RECURRING
Execute at: NULL
Interval value: 20
Interval field: SECOND
Starts: 2024-11-21 22:36:29
Ends: NULL
Status: SLAVESIDE_DISABLED
Originator: 40
character_set_client: utf8mb4
collation_connection: utf8mb4_0900_ai_ci
Database Collation: utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
root@localhost [test] 00:22:39 > show variables like 'event_scheduler';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| event_scheduler | OFF |
+-----------------+-------+
1 row in set (0.01 sec)
也能正确接受到新主的 event
| 57 | 2024-11-22 00:22:49 |
| 58 | 2024-11-22 00:23:09 |
| 59 | 2024-11-22 00:23:29 |
+----+---------------------+
59 rows in set (0.00 sec)
root@localhost [test] 00:23:30 >
然后再把旧主的 event_scheduler
开启就好了,防止下次切换
往期回顾
INSERT 加什么锁 - 聊聊 INSERT 的加锁情况 MySQL 临时表空间 - 临时表空间(Temporary Tablespaces 及相关的变量 MySQL/PG 数据脱敏 - data_masking 插件和 postgresql_anonymizer 扩展初体验 MySQL/PG 对事务 DDL 的支持程度 - 从隐式提交聊到DDL事务 ONLINE DDL 收尾篇|如何有效实现 ONLINE - 关于 ONLINE DDL 的最佳实践 MySQL/PG 索引对于排序规则的依赖性 - 排序规则与索引的相关性问题,涉及到索引是否会失效 如何正确杀会话, KILL processlist_id/KILL QUERY? - 你知道如何正确杀掉活跃会话吗?
感谢阅读!关注我获取更多精彩内容。




