故障原因
生产环境的故障原因:
(Galera集群MariaDB 10.0.15)
节点1加字段成功,节点3由于没空间,加字段失败,节点2加字段成功。等节点3有空间了,节点3还可以正常加入集群,是以少字段的形式在复制。虽然节点3的错误日志中有记录,表加字段失败了。但是后面还是可以正常加班集群的,也可以正常复制。所以节点3就一直没有这个字段,并且继续复制。
MariaDB 10.0.15 -Galera集群缺字段复制
garela cluster集群中,有一台或2台服务器的表中少一个字段,同步数据的时候不报错。
集群版本:
mysql Ver 15.1 Distrib 10.0.15-MariaDB, for Linux (x86_64) using readline 5.1
在节点1上建张表,并插入几笔数据
mysql> CREATE TABLE `sbtest2` (
-> `id` int(11) NOT NULL AUTO_INCREMENT,
-> `k` int(11) NOT NULL DEFAULT '0',
-> `c` char(120) NOT NULL DEFAULT '',
-> `pad` char(60) NOT NULL DEFAULT '',
-> PRIMARY KEY (`id`),
-> KEY `k_1` (`k`)
-> ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (1.04 sec)
mysql> insert into sbtest2(k,c,pad) select k,c,pad from sbtest1 limit 5;
Query OK, 5 rows affected (0.04 sec)
Records: 5 Duplicates: 0 Warnings: 0
之后业务在1节点添加字段:
alter table sbtest2 add column check_flag varchar(1) default '1';
检查集群的3台服务器,此时数据和表结构是一致的。
此时执行第3节点以单点模式运行,并删除字段check_flag:
alter table sbtest2 drop column check_flag;
后面三节点重新开启集群参数,加入节点:


验证过程
现在开始对表进行DML操作,对比有字段check_flag和无字段check_flag的服务器数据
第1节点服务器(有字段check_flag):
mysql> insert into sbtest2(id,k,c,pad,check_flag) values(100,100,'100k','100c','0');
Query OK, 1 row affected (0.00 sec)
mysql> select * from sbtest2 where id=100;
+-----+-----+------+------+------------+
| id | k | c | pad | check_flag |
+-----+-----+------+------+------------+
| 100 | 100 | 100k | 100c | 0 |
+-----+-----+------+------+------------+
1 row in set (0.00 sec)
第3节点服务器(无字段check_flag):
MariaDB [sbtest]> select * from sbtest2 where id=100;
+-----+-----+------+------+
| id | k | c | pad |
+-----+-----+------+------+
| 100 | 100 | 100k | 100c |
+-----+-----+------+------+
1 row in set (0.00 sec)
第1节点服务器(有字段check_flag):
mysql> update sbtest2 set pad='111c' where id=100;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from sbtest2 where id=100;
+-----+-----+------+------+------------+
| id | k | c | pad | check_flag |
+-----+-----+------+------+------------+
| 100 | 100 | 100k | 111c | 0 |
+-----+-----+------+------+------------+
1 row in set (0.00 sec)
第3节点服务器(无字段check_flag):
MariaDB [sbtest]> select * from sbtest2 where id=100;
+-----+-----+------+------+
| id | k | c | pad |
+-----+-----+------+------+
| 100 | 100 | 100k | 111c |
+-----+-----+------+------+
1 row in set (0.00 sec)
第1节点服务器(有字段check_flag):
mysql> delete from sbtest2 where id=100;
Query OK, 1 row affected (0.00 sec)
mysql> select * from sbtest2 where id=100;
Empty set (0.00 sec)
第3节点服务器(无字段check_flag):
MariaDB [sbtest]> select * from sbtest2 where id=100;
Empty set (0.00 sec)
第1节点服务器(有字段check_flag):
mysql> insert into sbtest2(id,k,c,pad,check_flag) values(110,110,'110k','110c','0');
Query OK, 1 row affected (0.01 sec)
mysql> update sbtest2 set check_flag=1 where id=110;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from sbtest2 where id=110;
+-----+-----+------+------+------------+
| id | k | c | pad | check_flag |
+-----+-----+------+------+------------+
| 110 | 110 | 110k | 110c | 1 |
+-----+-----+------+------+------------+
1 row in set (0.00 sec)
第3节点服务器(无字段check_flag):
MariaDB [sbtest]> select * from sbtest2 where id=110;
+-----+-----+------+------+
| id | k | c | pad |
+-----+-----+------+------+
| 110 | 110 | 110k | 110c |
+-----+-----+------+------+
1 row in set (0.00 sec)
反过来操作,在无字段check_flag的服务器上操作:
第3节点服务器(无字段check_flag):
MariaDB [sbtest]> select * from sbtest2 where id=110;
+-----+-----+------+------+
| id | k | c | pad |
+-----+-----+------+------+
| 110 | 110 | 110k | 110c |
+-----+-----+------+------+
1 row in set (0.00 sec)
MariaDB [sbtest]> update sbtest2 set check_flag='0' where id=110;
ERROR 1054 (42S22): Unknown column 'check_flag' in 'field list'
MariaDB [sbtest]> delete from sbtest2 where check_flag='0';
ERROR 1054 (42S22): Unknown column 'check_flag' in 'where clause'
第3节点服务器(无字段check_flag):
MariaDB [sbtest]> insert into sbtest2(id,k,c,pad) values(120,120,'110k','110c');
Query OK, 1 row affected (0.01 sec)
MariaDB [sbtest]> select * from sbtest2 where id=120;
+-----+-----+------+------+
| id | k | c | pad |
+-----+-----+------+------+
| 120 | 120 | 110k | 110c |
+-----+-----+------+------+
1 row in set (0.00 sec)
第1节点服务器(有字段check_flag):
mysql> select * from sbtest2 where id=120;
+-----+-----+------+------+------------+
| id | k | c | pad | check_flag |
+-----+-----+------+------+------------+
| 120 | 120 | 110k | 110c | 1 |
+-----+-----+------+------+------------+
1 row in set (0.00 sec)
在有字段check_flag的服务器只操作check_flag字段
第1节点服务器(有字段check_flag):
mysql> update sbtest2 set check_flag='0' where id=120;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select * from sbtest2 where id=120;
+-----+-----+------+------+------------+
| id | k | c | pad | check_flag |
+-----+-----+------+------+------------+
| 120 | 120 | 110k | 110c | 0 |
+-----+-----+------+------+------------+
1 row in set (0.00 sec)
mysql> delete from sbtest2 where check_flag='0';
Query OK, 1 row affected (0.00 sec)
mysql> select * from sbtest2 where id=120;
Empty set (0.00 sec)
第3节点服务器(无字段check_flag):
MariaDB [sbtest]> select * from sbtest2 where id=120;
Empty set (0.00 sec)
数据也成功同步
测试结论
从以上测试可以看出,只要在无字段check_flag的服务器上执行DML,不相关check_flag字段就不会报错。
而在有字段check_flag的服务器上执行DML,数据同样会复制到无字段check_flag的服务器上,并且不报错。
这是可能导致系统运行很长时间,都没有发现数据不一致的情况。
MySQL 8.0.25-MGR集群测试
第3节点 单独执行dml语句之后,MGR集群会拒绝它加入集群,并且报错:
2021-06-23T02:18:29.463798Z 0 [ERROR] [MY-011526] [Repl] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: b2347c8c-c373-11eb-a6bc-e2cf689cd558:1-24, be1ba1b2-c376-11eb-9484-000c2902fb88:1 > Group transactions: b2347c8c-c373-11eb-a6bc-e2cf689cd558:1-24'
测试结论:
1.MGR在节点加字段报错的时候,就会把节点提出集群。
2.本次gelera节点发生3节点字段添加没成功,在生产上的原因是由于3节点磁盘满了。等磁盘有空间后,会继续同步,且不会引起报错,导致数据不一致。




