Preface
I was woken up early by a phone call. I had stayed up watching the World Cup opening match the night before and my head was still foggy. A colleague needed urgent support: a MySQL database had a SQL performance problem that had resisted analysis for a while, and, oddly, the statement ran fast on the primary but slow on the replica. Most of the MySQL SQL performance problems I handle come down to indexes, statistics, or parameters; anything deeper is rare. A symptom this strange woke me right up, so I opened my laptop and got to work.
Problem Analysis
1. Examining the Execution Plans
I logged into both the primary and the replica and pinpointed the slow SQL via show processlist and the slow query log. It joins three tables.
select
t1.ind_id id,
t1.ind_cname name,
t1.ind_frequency frequency,
t1.ind_unit unit,
t1.data_begin_date dataBeginDate,
t1.data_end_date dataEndDate,
t1.data_update_date dataUpdateDate,
t1.ind_grade_tag gradeTag,
t1.ind_ra_name analystName,
t1.ind_ra_id analystId,
t1.create_user createUser,
t1.data_source dataSourceId,
t1.ind_scope scope,
t3.source_cname dataSourceName,
t1.IND_DESCRIPTION description
from IND_DER_BASE t1
join IND_MAPPING_TREE_DER_IND t2 on t1.ind_id = t2.ind_id
left join ZJTB_FILE_SOURCE t3 on t1.data_source=t3.source_id
where t1.remove_flag=0
and t2.remove_flag=0
and t2.node_id=110035
and t1.IND_STATUS=1
order by order_number
Now compare the execution plans on the two nodes. On the primary, the optimizer first scans t2, filtering on node_id and remove_flag; the filtered result set is then joined to t1 on ind_id via t1's primary key, and that result is in turn joined to t3, also via an index.
+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+-------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+-------+----------+-----------------------------+
| 1 | SIMPLE | t2 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 72150 | 1.00 | Using where; Using filesort |
| 1 | SIMPLE | t1 | NULL | eq_ref | PRIMARY | PRIMARY | 154 | RSIND.t2.IND_ID | 1 | 5.00 | Using where |
| 1 | SIMPLE | t3 | NULL | eq_ref | id_idx | id_idx | 8 | RSIND.t1.DATA_SOURCE | 1 | 100.00 | Using where |
+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+-------+----------+-----------------------------+
3 rows in set, 1 warning (0.00 sec)
On the replica, the plan starts with a scan of t1 filtered on remove_flag and ind_status, joins the result to t3 via a secondary index, and then joins to t2 via t2's primary key. At first glance both plans use indexes, so neither looks like it should be slow.
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 86644 | 1.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | t3 | NULL | eq_ref | id_idx | id_idx | 8 | RSIND.t1.DATA_SOURCE | 1 | 100.00 | Using where |
| 1 | SIMPLE | t2 | NULL | eq_ref | PRIMARY | PRIMARY | 162 | RSIND.t1.IND_ID,const | 1 | 10.00 | Using where |
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.01 sec)
So the primary and replica were producing different execution plans. The first thing that came to mind: are the table statistics inconsistent between the two nodes?
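Before diving into statistics, one way to see *why* the two optimizers choose differently is the optimizer trace, which records the cost of every join order considered. This was not part of the original troubleshooting; a sketch of such a session, run on each node (the trimmed-down query here stands in for the real one):

```sql
-- Hypothetical diagnostic session; run on primary and replica separately.
SET optimizer_trace = 'enabled=on';
SET optimizer_trace_max_mem_size = 1048576;  -- raise from the 16 KB default so the trace is not truncated

EXPLAIN SELECT t1.ind_id
FROM IND_DER_BASE t1
JOIN IND_MAPPING_TREE_DER_IND t2 ON t1.ind_id = t2.ind_id
WHERE t1.remove_flag = 0 AND t2.node_id = 110035;

-- The TRACE column is JSON; "considered_execution_plans" lists each
-- candidate join order with the cost the optimizer assigned to it.
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;

SET optimizer_trace = 'enabled=off';
```

Diffing the two traces shows exactly where the cost estimates diverge between the nodes.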
2. Checking Table Statistics
Querying the tables view in information_schema showed that the statistics on the two nodes did indeed differ. For a moment I was secretly pleased: isn't that the answer? Just re-collect the statistics.
Primary:
mysql> select table_name,table_rows,data_length/1024/1024/1024,index_length/1024/1024/1024,data_free from information_schema.tables where table_name in ('ZJTB_FILE_SOURCE','IND_MAPPING_TREE_DER_IND','IND_DER_BASE');
+--------------------------+------------+----------------------------+-----------------------------+-----------+
| table_name | table_rows | data_length/1024/1024/1024 | index_length/1024/1024/1024 | data_free |
+--------------------------+------------+----------------------------+-----------------------------+-----------+
| IND_DER_BASE | 80947 | 0.024002075195 | 0.000000000000 | 7340032 |
| IND_MAPPING_TREE_DER_IND | 72150 | 0.009323120117 | 0.000000000000 | 4194304 |
| ZJTB_FILE_SOURCE | 700 | 0.000091552734 | 0.000076293945 | 0 |
+--------------------------+------------+----------------------------+-----------------------------+-----------+
Replica:
mysql> select table_name,table_rows,data_length/1024/1024/1024,index_length/1024/1024/1024,data_free from information_schema.tables where table_name in ('ZJTB_FILE_SOURCE','IND_MAPPING_TREE_DER_IND','IND_DER_BASE');
+--------------------------+------------+----------------------------+-----------------------------+-----------+
| table_name | table_rows | data_length/1024/1024/1024 | index_length/1024/1024/1024 | data_free |
+--------------------------+------------+----------------------------+-----------------------------+-----------+
| IND_DER_BASE | 86644 | 0.019119262695 | 0.000000000000 | 3145728 |
| IND_MAPPING_TREE_DER_IND | 103057 | 0.007369995117 | 0.000000000000 | 3145728 |
| ZJTB_FILE_SOURCE | 703 | 0.000076293945 | 0.000045776367 | 0 |
+--------------------------+------------+----------------------------+-----------------------------+-----------+
3 rows in set (0.01 sec)
Before re-collecting, I double-checked the actual row counts to see how far off the statistics were. Well, look at that: it was the primary's statistics that were inaccurate, while the replica's were accurate, and the primary even had more fragmentation (larger data_free). By that logic the primary had no business being the faster node.
mysql> select count(*) from IND_MAPPING_TREE_DER_IND;
+----------+
| count(*) |
+----------+
| 103324 |
+----------+
1 row in set (0.05 sec)
mysql> select count(*) from IND_DER_BASE;
+----------+
| count(*) |
+----------+
| 88294 |
+----------+
1 row in set (0.03 sec)
mysql> select count(*) from ZJTB_FILE_SOURCE;
+----------+
| count(*) |
+----------+
| 703 |
+----------+
1 row in set (0.00 sec)
With no better idea at hand, I re-collected the statistics on the primary anyway. Sure enough, afterwards the statistics closely matched the real counts, and the execution plan did not change.
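The statistics refresh itself is a one-liner; a sketch of what was presumably run on the primary:

```sql
-- Re-collect InnoDB statistics (random index dives per table);
-- cheap, and it only takes a brief metadata lock.
ANALYZE TABLE IND_DER_BASE, IND_MAPPING_TREE_DER_IND, ZJTB_FILE_SOURCE;
```

The refreshed numbers then show up in information_schema.tables, as in the output below.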
mysql> select * from information_schema.tables where table_name in ('ZJTB_FILE_SOURCE','IND_MAPPING_TREE_DER_IND','IND_DER_BASE')\G
*************************** 1. row ***************************
TABLE_CATALOG: def
TABLE_SCHEMA: RSIND
TABLE_NAME: IND_DER_BASE
TABLE_TYPE: BASE TABLE
ENGINE: InnoDB
VERSION: 10
ROW_FORMAT: Dynamic
TABLE_ROWS: 85587
AVG_ROW_LENGTH: 276
DATA_LENGTH: 23658496
MAX_DATA_LENGTH: 0
INDEX_LENGTH: 0
DATA_FREE: 2097152
AUTO_INCREMENT: NULL
CREATE_TIME: 2022-11-21 09:28:51
UPDATE_TIME: 2022-11-21 10:35:21
*************************** 2. row ***************************
TABLE_CATALOG: def
TABLE_SCHEMA: RSIND
TABLE_NAME: IND_MAPPING_TREE_DER_IND
TABLE_TYPE: BASE TABLE
ENGINE: InnoDB
VERSION: 10
ROW_FORMAT: Dynamic
TABLE_ROWS: 103057
AVG_ROW_LENGTH: 86
DATA_LENGTH: 8962048
MAX_DATA_LENGTH: 0
INDEX_LENGTH: 0
DATA_FREE: 2097152
AUTO_INCREMENT: NULL
CREATE_TIME: 2022-11-21 09:29:20
UPDATE_TIME: NULL
CHECK_TIME: NULL
*************************** 3. row ***************************
TABLE_CATALOG: def
TABLE_SCHEMA: RSIND
TABLE_NAME: ZJTB_FILE_SOURCE
TABLE_TYPE: BASE TABLE
ENGINE: InnoDB
VERSION: 10
ROW_FORMAT: Dynamic
TABLE_ROWS: 703
AVG_ROW_LENGTH: 116
DATA_LENGTH: 81920
MAX_DATA_LENGTH: 0
INDEX_LENGTH: 49152
DATA_FREE: 0
AUTO_INCREMENT: 1399
CREATE_TIME: 2022-11-21 09:29:39
UPDATE_TIME: NULL
CHECK_TIME: NULL
3 rows in set (0.00 sec)
3. Optimizer Parameters
Next I compared the optimizer parameters, suspecting a configuration difference was behind the inconsistent plans. No luck: primary and replica were identical. Now it was really strange: with the same optimizer settings, the node with the more accurate statistics was producing the worse plan. I also wondered whether the primary and replica data had diverged, but replication was confirmed to be in sync. So, back to the execution plan itself.
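The comparison below came from pulling the optimizer-related variables on both nodes, presumably with something like:

```sql
-- Run on both primary and replica, then diff the two result sets.
SHOW GLOBAL VARIABLES LIKE '%optimi%';
```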
+-----------------------------------------------+-------------------------------------------------+
| Variable_name                                 | Value                                           |
+-----------------------------------------------+-------------------------------------------------+
| innodb_ft_num_word_optimize | 2000 |
| innodb_optimize_fulltext_only | OFF |
| optimizer_prune_level | 1 |
| optimizer_search_depth | 62 |
| optimizer_switch | index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,prefer_ordering_index=on |
| optimizer_trace | enabled=off,one_line=off |
| optimizer_trace_features | greedy_search=on,range_optimizer=on,dynamic_range=on,repeated_subselect=on |
| optimizer_trace_limit | 1 |
| optimizer_trace_max_mem_size | 16384 |
| optimizer_trace_offset | -1 |
| range_optimizer_max_mem_size | 8388608 |
| replication_optimize_for_static_plugin_config | OFF |
+-----------------------------------------------+-------------------------------------------------+
4. Re-reading the Execution Plan
A closer read of the replica's plan: the first step scans t1, an estimated 86644 rows, with filtered = 1.00, where 1.00 means 1%. In other words the optimizer expects the where conditions to keep about 86644 × 1% ≈ 866 rows, i.e. it believes the filter is quite selective. But t1's two filter columns are remove_flag and ind_status, which at a glance look like low-cardinality columns.
Join and where conditions of the statement: from IND_DER_BASE t1
join IND_MAPPING_TREE_DER_IND t2 on t1.ind_id = t2.ind_id
left join ZJTB_FILE_SOURCE t3 on t1.data_source=t3.source_id
where t1.remove_flag=0
and t2.remove_flag=0
and t2.node_id=110035
and t1.IND_STATUS=1
Replica execution plan:
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 86644 | 1.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | t3 | NULL | eq_ref | id_idx | id_idx | 8 | RSIND.t1.DATA_SOURCE | 1 | 100.00 | Using where |
| 1 | SIMPLE | t2 | NULL | eq_ref | PRIMARY | PRIMARY | 162 | RSIND.t1.IND_ID,const | 1 | 10.00 | Using where |
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
Now let's check how much data remove_flag=0 and ind_status=1 actually filter out. Sure enough, these two columns have essentially no filtering power, yet the optimizer estimated 1%. On the primary, by contrast, the first step drives from t2 with filters on node_id and remove_flag, which is far more selective, mainly thanks to node_id.
mysql> select count(*) from IND_DER_BASE t1 where t1.remove_flag=0 and t1.IND_STATUS=1 ;
+----------+
| count(*) |
+----------+
| 82488 |
+----------+
1 row in set (0.03 sec)
mysql> select count(*) from IND_MAPPING_TREE_DER_IND t2 where t2.remove_flag=0 and t2.node_id=110035;
+----------+
| count(*) |
+----------+
| 6 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from IND_MAPPING_TREE_DER_IND t2 where t2.node_id=110035;
+----------+
| count(*) |
+----------+
| 6 |
+----------+
1 row in set (0.00 sec)
So the root cause is the replica's bad filtered estimate for t1: it should be close to 100, yet the optimizer estimated 1. As to why the estimate is wrong, I found an article online:
The gist of that article is that MySQL keeps no data-distribution statistics for non-indexed columns. But that fails to explain why the primary's filtered estimate was fine. If anyone understands this, I'd appreciate an explanation.
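If this server is MySQL 8.0 or later, one possible remedy for the missing non-indexed-column statistics (not part of the original fix, just a hedged sketch) is a histogram, which gives the optimizer real value distributions for remove_flag and ind_status:

```sql
-- MySQL 8.0+ only: build equi-height histograms on the low-cardinality filter columns.
ANALYZE TABLE IND_DER_BASE UPDATE HISTOGRAM ON remove_flag, IND_STATUS WITH 16 BUCKETS;

-- Inspect the stored histogram.
SELECT COLUMN_NAME, HISTOGRAM
FROM information_schema.COLUMN_STATISTICS
WHERE TABLE_NAME = 'IND_DER_BASE';
```

With a histogram in place, the filtered estimate for these predicates should come from actual distributions rather than a built-in heuristic.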
With the problem identified, the fix is easy: add an index on t2's node_id. The execution plan then becomes:
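The DDL for that index is a one-liner; the name idx_node_id below matches the index visible in the resulting plan:

```sql
-- Online index build on the highly selective filter/join column.
ALTER TABLE IND_MAPPING_TREE_DER_IND ADD INDEX idx_node_id (node_id);
```

After this, t2 becomes the driving table on the replica as well, matching the shape of the primary's plan.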
+----+-------------+-------+------------+--------+---------------------+-------------+---------+----------------------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------------+-------------+---------+----------------------+------+----------+----------------------------------------------------+
| 1 | SIMPLE | t2 | NULL | ref | PRIMARY,idx_node_id | idx_node_id | 8 | const | 6 | 10.00 | Using index condition; Using where; Using filesort |
| 1 | SIMPLE | t1 | NULL | eq_ref | PRIMARY | PRIMARY | 154 | RSIND.t2.IND_ID | 1 | 5.00 | Using where |
| 1 | SIMPLE | t3 | NULL | eq_ref | id_idx | id_idx | 8 | RSIND.t1.DATA_SOURCE | 1 | 100.00 | Using where |
+----+-------------+-------+------------+--------+---------------------+-------------+---------+----------------------+------+----------+----------------------------------------------------+
Summary
This was the first time I'd run into a primary and a replica producing different execution plans; this post summarizes the directions worth checking. The immediate problem is solved, but the deeper cause still awaits further analysis. If any expert can shed light on it, I'd be truly grateful.




