Query/DML Anomalies
Information to Collect
Cluster version, sql_mode, table creation statements, the statements used to import the relevant table data, the statement that produced the abnormal result, reproduction steps, the statements the sqlnode sends to the datanode, the warnings raised by the statement, the error_log, and so on.
When statements are involved, provide them as text whenever possible; avoid providing only screenshots.
Collection Methods
Cluster version:
Run show variables like '%commit%hash%'; and provide the commit_hash value.
Provide the name of the installation package.
If the version includes the rapid engine, add an extra note saying so.
sql_mode:
Collect both the current session's sql_mode and the global sql_mode:
show variables like 'sql_mode';
show global variables like 'sql_mode';
Table creation statements:
-- frontend table structure
show create table xxx;
-- backend table structure
-- first query information_schema.greatdb_table_distribution to find the shard that holds the backend table (see the sketch below)
-- log in to the primary node of that shard and query the table structure (the backend table has an extra __hidden_pk__ field)
-- if the backend table structure is abnormal, paste the abnormal backend structure as well
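A minimal sketch of the lookup, assuming the table in question is test.t1 (the same names used in the examples further below):
-- on the sqlnode: find the shard(s) holding the backend table
select SCHEMA_NAME, TABLE_NAME, SHARD_NAME, PARTITION_NAME
from information_schema.greatdb_table_distribution
where SCHEMA_NAME='test' and TABLE_NAME='t1';
-- then log in to the primary node of that shard and dump the backend table structure
show create table test.t1;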
Table data:
If the statement fails, export the data involved; the format is not restricted: it can be an SQL file or a CSV file. If a CSV file is provided, attach the corresponding import statement.
When the data volume is large, it is enough to provide only the data involved in the query (a CSV export/import sketch follows below).
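For the CSV option, a minimal sketch using standard MySQL export/import syntax; the path /tmp/t1.csv and table test.t1 are hypothetical, and SELECT ... INTO OUTFILE requires secure_file_priv to permit the target directory:
-- export the involved data to csv
select * from test.t1 into outfile '/tmp/t1.csv'
fields terminated by ',' enclosed by '"' lines terminated by '\n';
-- corresponding import statement to attach alongside the csv file
load data infile '/tmp/t1.csv' into table test.t1
fields terminated by ',' enclosed by '"' lines terminated by '\n';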
Statement with abnormal result:
Paste the statement that executed abnormally, annotate the cause of the error, and attach the expected correct result. If the result set is incorrect, also attach the incorrect result set.
Reproduction steps:
Present them as SQL statements. Taking a failed index addition as an example:
create table t1(c1 int primary key, c2 int);
insert into t1 values (1,1),(2,2)....;
-- mark the abnormal statement
alter table t1 add key(c2); -- error
Statements sent from sqlnode to datanode:
Locate the exact statements by enabling general_log on the datanode. The first step is to identify which nodes the statement affects; this works well for normal tables, whereas for partition tables, if the number of partitions reaches the number of shards, all shards may be involved and locating the affected node is no longer meaningful. Locate the affected node with the following statements (a combined lookup is sketched after the output):
mysql> show create table t1 \G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`c1` int NOT NULL,
`c2` int DEFAULT NULL,
PRIMARY KEY (`c1`)
) ENGINE=GreatDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
mysql> select * from information_schema.greatdb_table_distribution;
+-------------+------------+----------------+-----------------+------------+----------------+--------------+
| SCHEMA_NAME | TABLE_NAME | BACKEND_ENGINE | DISTRIBUTE_MODE | SHARD_NAME | PARTITION_NAME | PARTITION_ID |
+-------------+------------+----------------+-----------------+------------+----------------+--------------+
| test | t1 | InnoDB | NORMAL | sd1 | | 0 |
+-------------+------------+----------------+-----------------+------------+----------------+--------------+
mysql> select node_id, node_name, shard_id, shard_name, host, port, node_type, node_state from information_schema.greatdb_datanodes;
+---------+-----------+----------+------------+-----------+------+-----------+--------------+
| node_id | node_name | shard_id | shard_name | host | port | node_type | node_state |
+---------+-----------+----------+------------+-----------+------+-----------+--------------+
| 11 | n1 | 11 | sd1 | 127.0.0.1 | 3337 | NODE_MGR | STATE_ACTIVE |
| 20 | n2 | 20 | sd2 | 127.0.0.1 | 3338 | NODE_MGR | STATE_ACTIVE |
| 29 | n3 | 29 | sd3 | 127.0.0.1 | 3339 | NODE_MGR | STATE_ACTIVE |
+---------+-----------+----------+------------+-----------+------+-----------+--------------+
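To get the host and port of the affected shard's nodes in one step, the two views above can be joined on shard_name; a sketch using the same example table:
select d.node_name, d.host, d.port, d.node_type, d.node_state
from information_schema.greatdb_table_distribution t
join information_schema.greatdb_datanodes d on d.shard_name = t.shard_name
where t.schema_name='test' and t.table_name='t1';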
Enable general_log on the datanode
-- log in to the corresponding datanode and check the general_log variables; it is OFF by default
mysql> show variables like 'general_log%';
+------------------+---------------------------------------------------------------------+
| Variable_name | Value |
+------------------+---------------------------------------------------------------------+
| general_log | OFF |
| general_log_file | /ssd-dir/data/compile/artifacts/mysqldata/greatlh-OptiPlex-5060.log |
+------------------+---------------------------------------------------------------------+
-- enable general_log
set global general_log=ON;
After enabling general_log, turn it off again promptly once the failing statement has been executed, to reduce irrelevant general_log content and make filtering easier (the full sequence is sketched below).
set global general_log=OFF;
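Put together, the capture on the datanode brackets the problematic statement run on the sqlnode (the update statement is the one used in the next example):
-- on the affected datanode
set global general_log=ON;
-- on the sqlnode, run the problematic statement, e.g. update t1 set c2=c2+1 where c2>=2;
-- back on the datanode, as soon as the statement has finished
set global general_log=OFF;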
Process the general_log to filter out irrelevant content. Taking the sqlnode executing update t1 set c2=c2+1 where c2>=2 as an example, part of the datanode general_log looks like this:
....
2021-11-30T02:38:15.382336Z 117 Query SELECT m.`MEMBER_STATE`, m.`MEMBER_ROLE`, @@gtid_executed, @@server_uuid, cs.`Received_transaction_set`, @@gdb_sqld_version FROM `performance_schema`.`replication_group_members` AS m LEFT JOIN `performance_schema`.`replication_connection_status` AS cs ON cs.`Channel_name` = 'group_replication_applier' WHERE (m.`MEMBER_ID`=@@server_uuid OR m.`MEMBER_ID`='')
2021-11-30T02:38:17.281634Z 60 Query begin;
2021-11-30T02:38:17.281784Z 60 Query SELECT `c1`, `c2`, `_hidden_pk_` FROM `test`.`t1`WHERE ( (`c2` >= 2) ) FOR UPDATE
2021-11-30T02:38:17.282533Z 60 Query UPDATE /*gdb update*/ `test`.`t1` AS m, (SELECT 1 AS `_hidden_pk_`, 1 AS `c1`, 6 AS `c2` UNION ALL SELECT 2, 2, 3 UNION ALL SELECT 3, 3, 4 UNION ALL SELECT 4, 4, 5) AS r SET m.`c1` = r.`c1`, m.`c2` = r.`c2` WHERE m.`_hidden_pk_` = r.`_hidden_pk_`
2021-11-30T02:38:17.283957Z 60 Query commit
2021-11-30T02:38:17.284043Z 94 Query SELECT `TABLE_ROWS`, `DATA_LENGTH`, `MAX_DATA_LENGTH`, `INDEX_LENGTH`, `DATA_FREE` FROM information_schema.tables WHERE `TABLE_SCHEMA` = 'test' AND `TABLE_NAME`='t1'
2021-11-30T02:38:17.383310Z 117 Query SELECT m.`MEMBER_STATE`, m.`MEMBER_ROLE`, @@gtid_executed, @@server_uuid, cs.`Received_transaction_set`, @@gdb_sqld_version FROM `performance_schema`.`replication_group_members` AS m LEFT JOIN `performance_schema`.`replication_connection_status` AS cs ON cs.`Channel_name` = 'group_replication_applier' WHERE (m.`MEMBER_ID`=@@server_uuid OR m.`MEMBER_ID`='')
....
First, find the corresponding executed statements by their content and execution time (if direct update is not triggered, greatdb first selects and then updates; if direct update is enabled, this is consolidated into a single update statement):
2021-11-30T02:38:17.281784Z 60 Query SELECT `c1`, `c2`, `_hidden_pk_` FROM `test`.`t1`WHERE ( (`c2` >= 2) ) FOR UPDATE
2021-11-30T02:38:17.282533Z 60 Query UPDATE /*gdb update*/ `test`.`t1` AS m, (SELECT 1 AS `_hidden_pk_`, 1 AS `c1`, 6 AS `c2` UNION ALL SELECT 2, 2, 3 UNION ALL SELECT 3, 3, 4 UNION ALL SELECT 4, 4, 5) AS r SET m.`c1` = r.`c1`, m.`c2` = r.`c2` WHERE m.`_hidden_pk_` = r.`_hidden_pk_`
Then, using the second field of the log line, 60 (the session_id), find the surrounding context in the log. The consolidated complete log content is:
2021-11-30T02:38:17.281634Z 60 Query begin;
2021-11-30T02:38:17.281784Z 60 Query SELECT `c1`, `c2`, `_hidden_pk_` FROM `test`.`t1`WHERE ( (`c2` >= 2) ) FOR UPDATE
2021-11-30T02:38:17.282533Z 60 Query UPDATE /*gdb update*/ `test`.`t1` AS m, (SELECT 1 AS `_hidden_pk_`, 1 AS `c1`, 6 AS `c2` UNION ALL SELECT 2, 2, 3 UNION ALL SELECT 3, 3, 4 UNION ALL SELECT 4, 4, 5) AS r SET m.`c1` = r.`c1`, m.`c2` = r.`c2` WHERE m.`_hidden_pk_` = r.`_hidden_pk_`
2021-11-30T02:38:17.283957Z 60 Query commit
Statement warnings
If the abnormal statement produces warnings, run show warnings to print all of them.
error log
Check whether the error log produced new entries while the statement was executing; if so, paste them.
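If the error log location is not known, it can be found with the standard MySQL variable:
show variables like 'log_error';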
DDL Anomalies
Information to Collect
Most of the information to collect for DDL execution is the same as for DML, but backend task information must be collected in addition. The reason is that some alter table statements are executed as backend tasks. There are two kinds of DDL that go through backend tasks:
1. add key, add primary key
The backend tasks generated by these alter statements use the inplace flow: the task on the backend datanode connects to its own node and executes the ddl statement directly.
2. modify column, drop column
The backend tasks generated by these alter statements use the copy flow. The datanode task first creates a shadow table with the modified structure, then opens two connections: conn1 operates on the source table and conn2 on the shadow table. conn1 first obtains a consistency point on the source table and then exports the data to csv (the files are split automatically and the export runs continuously). conn2 imports the exported files with load data. Once the data import is finished, the binlog of changes made during this period is applied; when it is almost caught up the source table is locked, and once fully caught up the source table and the shadow table are renamed. After everything succeeds, the renamed source table is dropped.
Therefore, the status of the backend task must be collected as well, and the general log content must cover the statements issued by conn1 and conn2 described above. The items to collect involve three physical tables and several information_schema views:
-- physical tables
mysql.greatdb_sqlnode_tasks -- data stored on the sqlnode
mysql.greatdb_sqlnode_subtasks -- data stored on the sqlnode
mysql.greatdb_datanode_tasks -- stored on the datanode
-- information_schema views
information_schema.greatdb_alter_table_task -- sqlnode
information_schema.greatdb_alter_table_subtask -- sqlnode
For other DDL, the sqlnode executes by sending the statements to the corresponding backend datanodes, with one connection per sharded table; multiple shards on the same datanode no longer share a connection.
Information Collection Methods
Taking modify column as an example, run alter table t1 modify column c2 bigint on the sqlnode:
mysql.greatdb_sqlnode_tasks:
This is the main task table. task_owner is the server uuid of the sqlnode that executed the alter table. Besides alter table tasks, the table also contains background daemon tasks and table migration tasks; filtering by task_type yields the alter table tasks (see the sketch after the output below). create_time is the time the alter statement was executed.
mysql> select * from mysql.greatdb_sqlnode_tasks;
+---------+---------------+-------------+--------------------------------------+-------------+----------------------------------------+---------------------+---------+
| task_id | task_type | task_status | task_owner | object_name | info | create_time | version |
+---------+---------------+-------------+--------------------------------------+-------------+----------------------------------------+---------------------+---------+
| 1 | SHARD_RECOVER | RUNNING | 8ad17d08-4d0d-11ec-bc47-2cf05daaf22a | N/A | shard recover daemon task | 2021-11-24 18:02:56 | 4 |
| 11 | ALTER_TABLE | COMPLETE | 8ad17d08-4d0d-11ec-bc47-2cf05daaf22a | `test`.`t1` | alter table t1 modify column c2 bigint | 2021-11-30 11:59:28 | 0 |
+---------+---------------+-------------+--------------------------------------+-------------+----------------------------------------+---------------------+---------+
2 rows in set (0.00 sec)
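To list only the alter table tasks, filter on task_type (the value is the one shown in the output above):
select * from mysql.greatdb_sqlnode_tasks where task_type='ALTER_TABLE';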
mysql.greatdb_sqlnode_subtasks:
If the altered table is a partitioned table, each partition has one subtask. shard_name is the shard where the partition resides. task_id links to the main task table; id is the subtask's sequence number, starting from 1 within each task_id group (a filter example follows the output below).
mysql> select * from mysql.greatdb_sqlnode_subtasks;
+----+---------+------------+
| id | task_id | shard_name |
+----+---------+------------+
| 1 | 11 | sd1 |
+----+---------+------------+
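To list the subtasks of one specific alter task, filter on the task_id taken from the main task table (11 in this example):
select * from mysql.greatdb_sqlnode_subtasks where task_id=11;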
mysql.greatdb_datanode_tasks:
This table only has data on the datanode and contains the details of the datanode-side alter task; it is linked to the subtask table via task_id and id. task_info contains the information needed to execute the alter; "inplace_strategy" indicates whether the inplace flow is used (false for modify column, meaning the copy flow is used). A lookup example follows the output below.
mysql> select * from mysql.greatdb_datanode_tasks \G
*************************** 1. row ***************************
task_id: 11
id: 1
task_type: ALTER_TABLE
task_status: COMPLETE
task_stage: COMPLETE
task_info: {"host": "127.0.0.1", "port": 3337, "user": "root", "table": "t1", "database": "test", "password": "", "sql_mode": "ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION", "outfile_name": "test.t1", "alter_statement": "MODIFY COLUMN `c2` bigint DEFAULT NULL", "timezone_offset": "+08:00", "inplace_strategy": false, "revert_statement": "", "import_fields_str": "*", "select_fields_str": "*", "droped_columns_set": [], "process_fields_str": "", "outfile_wait_transport": false}
create_time: 2021-11-30 11:59:28
err_code: 0
err_msg:
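On the datanode, the matching task record can be fetched with the same task_id and subtask id; the error fields are the ones worth checking when a task fails (values taken from the example above):
select task_id, id, task_status, task_stage, err_code, err_msg
from mysql.greatdb_datanode_tasks
where task_id=11 and id=1;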
information_schema.greatdb_alter_table_task:
Visualizes each stage of the alter table task; it uses the three tables above as data sources and collects information from the datanode and sqlnode for display. The copy flow and the inplace flow have different stages (a filtered query is sketched after the output below).
mysql> select * from information_schema.greatdb_alter_table_task;
+---------+-------------+----------------------------------------+--------------+----------------+--------------+-----------------------+---------------------+-------------+
| TASK_ID | TABLE_NAME | INFO | ALTER_STATUS | STAGE_SEQUENCE | STAGE | COMPLETED_SUBTASK_NUM | UNKNOWN_SUBTASK_NUM | SUBTASK_NUM |
+---------+-------------+----------------------------------------+--------------+----------------+--------------+-----------------------+---------------------+-------------+
| 11 | `test`.`t1` | alter table t1 modify column c2 bigint | COMPLETE | 1 | CREATE TABLE | 1 | 0 | 1 |
| 11 | `test`.`t1` | alter table t1 modify column c2 bigint | COMPLETE | 2 | COPY DATA | 1 | 0 | 1 |
| 11 | `test`.`t1` | alter table t1 modify column c2 bigint | COMPLETE | 3 | APPLY LOG | 1 | 0 | 1 |
| 11 | `test`.`t1` | alter table t1 modify column c2 bigint | COMPLETE | 4 | CATCH LOG | 1 | 0 | 1 |
| 11 | `test`.`t1` | alter table t1 modify column c2 bigint | COMPLETE | 5 | RENAME | 1 | 0 | 1 |
+---------+-------------+----------------------------------------+--------------+----------------+--------------+-----------------------+---------------------+-------------+
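To follow the progress of a single task, the view can be filtered by TASK_ID and ordered by stage; a sketch using the example task:
select TASK_ID, ALTER_STATUS, STAGE_SEQUENCE, STAGE, COMPLETED_SUBTASK_NUM, SUBTASK_NUM
from information_schema.greatdb_alter_table_task
where TASK_ID=11
order by STAGE_SEQUENCE;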
information_schema.greatdb_alter_table_subtask:
Shows the status of each subtask and can be used to determine which shard caused the table's task to fail (a filter for failed subtasks is sketched after the output below).
mysql> select * from information_schema.greatdb_alter_table_subtask;
+---------+-------------+------------+-------------+--------------------+----------+----------+-------+
| TASK_ID | SUB_TASK_ID | SHARD_NAME | TABLE_NAME | BACKEND_TABLE_NAME | STAGE | PROGRESS | ERROR |
+---------+-------------+------------+-------------+--------------------+----------+----------+-------+
| 11 | 1 | sd1 | `test`.`t1` | `test`.`t1` | COMPLETE | 100% | |
+---------+-------------+------------+-------------+--------------------+----------+----------+-------+
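When a task fails, the offending shard can usually be found by looking for subtasks whose ERROR column is populated; this is a sketch that assumes ERROR is empty when no error occurred, so it returns nothing for the successful example above:
select TASK_ID, SUB_TASK_ID, SHARD_NAME, STAGE, PROGRESS, ERROR
from information_schema.greatdb_alter_table_subtask
where ERROR is not null and ERROR <> '';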