暂无图片
暂无图片
2
暂无图片
暂无图片
暂无图片

【First Time! 记一次MySQL主从节点执行计划不一致的诡异问题】

原创 张sir 2022-11-22
778

前言

      一大早就被电话打醒了,昨晚看了世界杯揭幕战,脑子还有点糊涂,同事电话说需要紧急support了,有个mysql数据库sql性能有问题,已经分析了好一会儿了还没解决,而且主节点执行快,备节点执行慢。其实我处理mysql的sql性能问题大多是因为或是索引、或是统计信息、或是参数导致的,再深入的问题较少碰到。听到这种诡异的问题,精神抖擞,赶紧打开电脑,撸起来。

问题分析

一、查看执行计划

登录主从两个节点,通过show processlist和慢日志确定了慢SQL,三个表做关联。

select
        t1.ind_id id,
        t1.ind_cname name,
        t1.ind_frequency frequency,
        t1.ind_unit unit,
        t1.data_begin_date dataBeginDate,
        t1.data_end_date dataEndDate,
        t1.data_update_date dataUpdateDate,
        t1.ind_grade_tag gradeTag,
        t1.ind_ra_name analystName,
        t1.ind_ra_id analystId,
        t1.create_user createUser,
        t1.data_source dataSourceId,
        t1.ind_scope scope,
        t3.source_cname dataSourceName,
        t1.IND_DESCRIPTION description
        from IND_DER_BASE t1
        join IND_MAPPING_TREE_DER_IND t2 on t1.ind_id = t2.ind_id
        left join ZJTB_FILE_SOURCE t3 on t1.data_source=t3.source_id
        where t1.remove_flag=0
        and t2.remove_flag=0
        and t2.node_id=110035
        and t1.IND_STATUS=1
        order by order_number

看下主从节点的执行计划,主库的执行顺序是扫描t2表,过滤条件是node_id和remove_flag,过滤后的结果集再与t1表通过ind_id关联,走的是主键索引,t2和t1关联后的结果集再与t3表关联,也走了索引。

+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+-------+----------+-----------------------------+
| id | select_type | table | partitions | type   | possible_keys | key     | key_len | ref                  | rows  | filtered | Extra                       |
+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+-------+----------+-----------------------------+
|  1 | SIMPLE      | t2    | NULL       | ALL    | PRIMARY       | NULL    | NULL    | NULL                 | 72150 |     1.00 | Using where; Using filesort |
|  1 | SIMPLE      | t1    | NULL       | eq_ref | PRIMARY       | PRIMARY | 154     | RSIND.t2.IND_ID      |     1 |     5.00 | Using where                 |
|  1 | SIMPLE      | t3    | NULL       | eq_ref | id_idx        | id_idx  | 8       | RSIND.t1.DATA_SOURCE |     1 |   100.00 | Using where                 |
+----+-------------+-------+------------+--------+---------------+---------+---------+----------------------+-------+----------+-----------------------------+
3 rows in set, 1 warning (0.00 sec)

从节点的执行顺序是先扫描t1表,通过remove_flag和ind_status过滤后,与t3表关联,走的普通索引,得出的结果集再与t2表关联,走的t2表的主键索引。单从以上执行计划上看都走了索引,感觉应该不会慢。

+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type   | possible_keys | key     | key_len | ref                   | rows  | filtered | Extra                                        |
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ALL    | PRIMARY       | NULL    | NULL    | NULL                  | 86644 |     1.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | t3    | NULL       | eq_ref | id_idx        | id_idx  | 8       | RSIND.t1.DATA_SOURCE  |     1 |   100.00 | Using where                                  |
|  1 | SIMPLE      | t2    | NULL       | eq_ref | PRIMARY       | PRIMARY | 162     | RSIND.t1.IND_ID,const |     1 |    10.00 | Using where                                  |
+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.01 sec)

====》看起来主从节点的执行计划是不一样的。这时候第一个想到的是不是统计信息不一致。

二、查看统计信息

通过information_schema里的tables表看看表的统计信息,发现主从节点的统计信息是不一样的,这个时候突然有一阵窃喜,哈哈哈,这不解决了吗,下一步就是收集统计信息啊。

主库:
mysql> select table_name,table_rows,data_length/1024/1024/1024,index_length/1024/1024/1024,data_free from information_schema.tables  where table_name in ('ZJTB_FILE_SOURCE','IND_MAPPING_TREE_DER_IND','IND_DER_BASE')
    -> ;
+--------------------------+------------+----------------------------+-----------------------------+-----------+
| table_name               | table_rows | data_length/1024/1024/1024 | index_length/1024/1024/1024 | data_free |
+--------------------------+------------+----------------------------+-----------------------------+-----------+
| IND_DER_BASE             |      80947 |             0.024002075195 |              0.000000000000 |   7340032 |
| IND_MAPPING_TREE_DER_IND |      72150 |             0.009323120117 |              0.000000000000 |   4194304 |
| ZJTB_FILE_SOURCE         |        700 |             0.000091552734 |              0.000076293945 |         0 |
+--------------------------+------------+----------------------------+-----------------------------+-----------+从库:mysql> select table_name,table_rows,data_length/1024/1024/1024,index_length/1024/1024/1024,data_free from information_schema.tables  where table_name in ('ZJTB_FILE_SOURCE','IND_MAPPING_TREE_DER_IND','IND_DER_BASE');
+--------------------------+------------+----------------------------+-----------------------------+-----------+
| table_name               | table_rows | data_length/1024/1024/1024 | index_length/1024/1024/1024 | data_free |
+--------------------------+------------+----------------------------+-----------------------------+-----------+
| IND_DER_BASE             |      86644 |             0.019119262695 |              0.000000000000 |   3145728 |
| IND_MAPPING_TREE_DER_IND |     103057 |             0.007369995117 |              0.000000000000 |   3145728 |
| ZJTB_FILE_SOURCE         |        703 |             0.000076293945 |              0.000045776367 |         0 |
+--------------------------+------------+----------------------------+-----------------------------+-----------+
3 rows in set (0.01 sec)

收集前,再确认下表的实际条数,看看差距多大。哎呀,我去,不看不要紧,合着是主库的统计信息不准,从库还是准的啊,而且看起来主库的碎片也比较淡,怎么着也不能比从库的性能好啊。

mysql> select count(*) from IND_MAPPING_TREE_DER_IND;
+----------+
| count(*) |
+----------+
|   103324 |
+----------+
1 row in set (0.05 sec)

mysql> select count(*) from IND_DER_BASE;
+----------+
| count(*) |
+----------+
|    88294 |
+----------+
1 row in set (0.03 sec)

mysql> select count(*) from ZJTB_FILE_SOURCE;
+----------+
| count(*) |
+----------+
|      703 |
+----------+
1 row in set (0.00 sec)

临时没有别的辙,收集一个试试吧,果然,收集完以后,统计信息差距不大,自然执行计划也没变。

mysql> select * from information_schema.tables  where table_name in ('ZJTB_FILE_SOURCE','IND_MAPPING_TREE_DER_IND','IND_DER_BASE')\G
*************************** 1. row ***************************
  TABLE_CATALOG: def
   TABLE_SCHEMA: RSIND
     TABLE_NAME: IND_DER_BASE
     TABLE_TYPE: BASE TABLE
         ENGINE: InnoDB
        VERSION: 10
     ROW_FORMAT: Dynamic
     TABLE_ROWS: 85587
 AVG_ROW_LENGTH: 276
    DATA_LENGTH: 23658496
MAX_DATA_LENGTH: 0
   INDEX_LENGTH: 0
      DATA_FREE: 2097152
 AUTO_INCREMENT: NULL
    CREATE_TIME: 2022-11-21 09:28:51
    UPDATE_TIME: 2022-11-21 10:35:21
*************************** 2. row ***************************
  TABLE_CATALOG: def
   TABLE_SCHEMA: RSIND
     TABLE_NAME: IND_MAPPING_TREE_DER_IND
     TABLE_TYPE: BASE TABLE
         ENGINE: InnoDB
        VERSION: 10
     ROW_FORMAT: Dynamic
     TABLE_ROWS: 103057
 AVG_ROW_LENGTH: 86
    DATA_LENGTH: 8962048
MAX_DATA_LENGTH: 0
   INDEX_LENGTH: 0
      DATA_FREE: 2097152
 AUTO_INCREMENT: NULL
    CREATE_TIME: 2022-11-21 09:29:20
    UPDATE_TIME: NULL
     CHECK_TIME: NULL
*************************** 3. row ***************************
  TABLE_CATALOG: def
   TABLE_SCHEMA: RSIND
     TABLE_NAME: ZJTB_FILE_SOURCE
     TABLE_TYPE: BASE TABLE
         ENGINE: InnoDB
        VERSION: 10
     ROW_FORMAT: Dynamic
     TABLE_ROWS: 703
 AVG_ROW_LENGTH: 116
    DATA_LENGTH: 81920
MAX_DATA_LENGTH: 0
   INDEX_LENGTH: 49152
      DATA_FREE: 0
 AUTO_INCREMENT: 1399
    CREATE_TIME: 2022-11-21 09:29:39
    UPDATE_TIME: NULL
     CHECK_TIME: NULL
3 rows in set (0.00 sec)

三、优化器参数

接下来我想到了比较下优化器参数,怕是由于优化器参数导致的不一致,得咧,主备库都一样,那就奇怪了,既然优化器参数都一样,在统计信息更准的备库上执行计划反而更差了呢,我还还以过主从数据是不是不一致的问题,最终也确认了,主从同步是没问题的。。我们还是回到执行计划本身。

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| innodb_ft_num_word_optimize                   | 2000                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| innodb_optimize_fulltext_only                 | OFF                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| optimizer_prune_level                         | 1                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| optimizer_search_depth                        | 62                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| optimizer_switch                              | index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,prefer_ordering_index=on |
| optimizer_trace                               | enabled=off,one_line=off                                                                                                                                                                                                                                                                                                                                                                                                                  |
| optimizer_trace_features                      | greedy_search=on,range_optimizer=on,dynamic_range=on,repeated_subselect=on                                                                                                                                                                                                                                                                                                                                                                |
| optimizer_trace_limit                         | 1                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| optimizer_trace_max_mem_size                  | 16384                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| optimizer_trace_offset                        | -1                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| range_optimizer_max_mem_size                  | 8388608                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| replication_optimize_for_static_plugin_config | OFF                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+-----------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

四、再读执行计划

    仔细分析从库的执行计划,第一步,对t1表进行扫描86644行,这是一个评估值,fitered是1,这个1表示的是1%,也就是说通过where条件能筛选出86644*1%=866行,这说明调价的筛选率还是不错的,但是这个t1表的两个筛选字段是remove_flag和ind_status,这两个字段一眼看去基数不高的样子。

SQL语句的关联条件和where条件:      from IND_DER_BASE t1
        join IND_MAPPING_TREE_DER_IND t2 on t1.ind_id = t2.ind_id
        left join ZJTB_FILE_SOURCE t3 on t1.data_source=t3.source_id
        where t1.remove_flag=0
        and t2.remove_flag=0
        and t2.node_id=110035
        and t1.IND_STATUS=1
从库执行计划:+----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+ | 1 | SIMPLE | t1 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 86644 | 1.00 | Using where; Using temporary; Using filesort | | 1 | SIMPLE | t3 | NULL | eq_ref | id_idx | id_idx | 8 | RSIND.t1.DATA_SOURCE | 1 | 100.00 | Using where | | 1 | SIMPLE | t2 | NULL | eq_ref | PRIMARY | PRIMARY | 162 | RSIND.t1.IND_ID,const | 1 | 10.00 | Using where | +----+-------------+-------+------------+--------+---------------+---------+---------+-----------------------+-------+----------+----------------------------------------------+

那我们看看通过remove_flag=0和ind_status=1筛选的数据的情况,果然,这两个字段压根没啥筛选的能力,但是优化器评估出来却是1%。而主库的执行计划的第一步驱动表是t2表,筛选条件是node_id和remove_flag,这个筛选条件就比较好了,其实主要还是这个node_id的筛选性比较好,

mysql> select  count(*) from IND_DER_BASE t1  where t1.remove_flag=0 and  t1.IND_STATUS=1 ;
+----------+
| count(*) |
+----------+
|    82488 |
+----------+
1 row in set (0.03 sec)mysql> select  count(*) from IND_MAPPING_TREE_DER_IND t2 where   t2.remove_flag=0 and t2.node_id=110035;
+----------+
| count(*) |
+----------+
|        6 |
+----------+
1 row in set (0.00 sec)mysql> select  count(*) from IND_MAPPING_TREE_DER_IND t2 where   t2.node_id=110035;
+----------+
| count(*) |
+----------+
|        6 |
+----------+
1 row in set (0.00 sec)

那么现在看,主要还是由于备库上这个t1表的filtered评估有问题导致的,这个filterd应该是100,结果评估出来却是1,至于说这个评估为什么不准,我在网上找到一篇文章:

https://blog.csdn.net/weixin_32712183/article/details/113200152?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-113200152-blog-122202911.pc_relevant_recovery_v2&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-113200152-blog-122202911.pc_relevant_recovery_v2&utm_relevant_index=2

这里大概意思是说mysql没有非索引列的数据分布的统计信息,但是这又没办法解释为什么主库的filter是没问题的。所以这个问题如果有明白的大佬,可以帮忙解解惑。


那么找到了问题,就好解决了,我们可以在t2表的node_id上新建一个索引,这样执行计划就变成了:

+----+-------------+-------+------------+--------+---------------------+-------------+---------+----------------------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type   | possible_keys       | key         | key_len | ref                  | rows | filtered | Extra                                              |
+----+-------------+-------+------------+--------+---------------------+-------------+---------+----------------------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | ref    | PRIMARY,idx_node_id | idx_node_id | 8       | const                |    6 |    10.00 | Using index condition; Using where; Using filesort |
|  1 | SIMPLE      | t1    | NULL       | eq_ref | PRIMARY             | PRIMARY     | 154     | RSIND.t2.IND_ID      |    1 |     5.00 | Using where                                        |
|  1 | SIMPLE      | t3    | NULL       | eq_ref | id_idx              | id_idx      | 8       | RSIND.t1.DATA_SOURCE |    1 |   100.00 | Using where                                        |
+----+-------------+-------+------------+--------+---------------------+-------------+---------+----------------------+------+----------+----------------------------------------------------+

总结

      关于主备节点执行计划不一致的问题,还是第一次碰到,本文总结了几个需要排查的方向。针对这个问题虽然解决了,但是更深层的原因,还有待进一步的分析。如果有大佬能帮忙解释下,真是感激不尽。

最后修改时间:2022-11-24 12:10:42
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论