暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

记录一则OB4.x扩缩容问题

原创 张瑞远 2024-10-29
485

背景:

有一套1-1-1集群在同步数据期间发现存储不够了,打算通过扩容节点到2-2-2的方式来解决该问题。

知识点:

在介绍问题以及处理问题之前,我想再复习下官网上关于4.x架构的一些知识点。

从 V4.0.0 版本开始,OceanBase 数据库要求租户内每个 Zone 的 Unit 个数必须保持一致。为了方便统一管理各个 Zone 的 Unit,系统引入了 Unit Group 机制,即不同 Zone 之间相同编号(UNIT_GROUP_ID)的 Unit 属于同一个 Unit Group。调大或调小 Unit Number 本质上是以 Unit Group 为单位创建和删除 Unit。日志流是跟unit group有关的,如果设置租户primary_zone为单个zone的情况下,正常一个zone有几个unit,就有几个unit group,也就有几个日志流,如果是random的情况下,如果有三个zone,每个zone有2个unit,还是2个unit group,那么就是3*2=6个日志流。

所以我们通过上面这句话知道了两个注意点:

1.每个租户的unit在不同zone里是对等的,不能像3.x那样自定义配置了

2.不同unit的unit group是不同的,扩容unit、均衡副本、切换leader的完成,不再是3.x的分区单位了(3.x是以分区为单位,在obcp的ppt里有比较明显的演示,先在要扩容的unit中创建副本,切换leader,再删除老的副本),4.x引入了transfer的概念,单位是tablet,其实也跟3.x类似,只是以整个日志流为单位了(整个日志流transfer完成才进行切主和老副本的删除)。

还有个重点就是“每个租户的unit在不同zone里是对等的”引出来的,我们一般通过ocp白屏创建的租户每个zone都有一个单独的resource pool,3.x是可以单独调整每个zone里的unit num的,但是4.x的resource pool与租户绑定后就不可以单独操作了,只能整体操作,操作命令如下。

ALTER RESOURCE TENANT mysql001 UNIT_NUM = 2;
ALTER RESOURCE TENANT mysql001 UNIT_NUM = 1 DELETE UNIT_GROUP =(1006);

过程:

简单介绍下这次操作的过程以及遇到的几个情况吧

1.第一次添加机器后修改unit_num,ocp任务报错了

1729154217

因为检查到新加机器配置有些遗漏,怀疑跟机器的环境有关系,回退了该任务,发现前台租户的状态还是运维中,unit_num前台显示还是2,后台检查dba_ob_jobs;

--状态是执行中
MySQL [oceanbase]> select  * from  dba_ob_tenant_jobs where JOB_STATUS<>'SUCCESS' limit 1 \G
*************************** 1. row ***************************
     JOB_ID: 7
   JOB_TYPE: ALTER_RESOURCE_TENANT_UNIT_NUM
 JOB_STATUS: INPROGRESS
RESULT_CODE: NULL
   PROGRESS: 0
 START_TIME: 2024-10-15 10:25:06.476864
MODIFY_TIME: 2024-10-15 10:25:06.476864
  TENANT_ID: 1002
   SQL_TEXT: ALTER RESOURCE TENANT `ccc2` UNIT_NUM = 2
 EXTRA_INFO: FROM: '1', TO: '2'
  RS_SVR_IP: 111.11.11.2  
RS_SVR_PORT: 2882
1 row in set (0.004 sec)

这时候前台页面调整unit_num因为租户状态不对,无法操作,只能后台缩容unit_num,后台操作完,前台也正常了。

MySQL [oceanbase]> select *  from  dba_ob_units where tenant_id=1002 ;
+---------+-----------+--------+------------------+---------------+----------------------------+----------------------------+-------+---------------+----------+---------------------+-----------------------+----------------+----------------+---------+---------+--------------+---------------+---------------------+---------------------+-------------+
| UNIT_ID | TENANT_ID | STATUS | RESOURCE_POOL_ID | UNIT_GROUP_ID | CREATE_TIME                | MODIFY_TIME                | ZONE  | SVR_IP        | SVR_PORT | MIGRATE_FROM_SVR_IP | MIGRATE_FROM_SVR_PORT | MANUAL_MIGRATE | UNIT_CONFIG_ID | MAX_CPU | MIN_CPU | MEMORY_SIZE  | LOG_DISK_SIZE | MAX_IOPS            | MIN_IOPS            | IOPS_WEIGHT |
+---------+-----------+--------+------------------+---------------+----------------------------+----------------------------+-------+---------------+----------+---------------------+-----------------------+----------------+----------------+---------+---------+--------------+---------------+---------------------+---------------------+-------------+
|    1001 |      1002 | ACTIVE |             1001 |          1001 | 2024-09-29 15:31:43.686630 | 2024-09-29 15:31:43.762276 | zone1 | 111.11.11.1   |     2882 | NULL                |                  NULL | NULL           |           1005 |      50 |      50 | 214748364800 |  805306368000 | 9223372036854775807 | 9223372036854775807 |          50 |
|    1002 |      1002 | ACTIVE |             1002 |          1001 | 2024-09-29 15:31:43.691645 | 2024-09-29 15:31:44.454323 | zone3 | 111.11.11.3   |     2882 | NULL                |                  NULL | NULL           |           1006 |      50 |      50 | 214748364800 |  805306368000 | 9223372036854775807 | 9223372036854775807 |          50 |
|    1003 |      1002 | ACTIVE |             1003 |          1001 | 2024-09-29 15:31:43.695862 | 2024-09-29 15:31:44.453268 | zone2 | 111.11.11.2   |     2882 | NULL                |                  NULL | NULL           |           1004 |      50 |      50 | 214748364800 |  805306368000 | 9223372036854775807 | 9223372036854775807 |          50 |
|    1004 |      1002 | ACTIVE |             1001 |          1002 | 2024-10-15 10:25:06.524487 | 2024-10-15 10:25:06.524487 | zone1 | 111.11.11.113 |     2882 | NULL                |                  NULL | NULL           |           1005 |      50 |      50 | 214748364800 |  805306368000 | 9223372036854775807 | 9223372036854775807 |          50 |
|    1005 |      1002 | ACTIVE |             1002 |          1002 | 2024-10-15 10:25:07.243192 | 2024-10-15 10:25:07.243192 | zone3 | 111.11.11.112 |     2882 | NULL                |                  NULL | NULL           |           1006 |      50 |      50 | 214748364800 |  805306368000 | 9223372036854775807 | 9223372036854775807 |          50 |
|    1006 |      1002 | ACTIVE |             1003 |          1002 | 2024-10-15 10:25:07.991457 | 2024-10-15 10:25:07.991457 | zone2 | 111.11.11.111 |     2882 | NULL                |                  NULL | NULL           |           1004 |      50 |      50 | 214748364800 |  805306368000 | 9223372036854775807 | 9223372036854775807 |          50 |
+---------+-----------+--------+------------------+---------------+----------------------------+----------------------------+-------+---------------+----------+---------------------+-----------------------+----------------+----------------+---------+---------+--------------+---------------+---------------------+---------------------+-------------+
6 rows in set (0.007 sec)
MySQL [oceanbase]> alter resource tenant ccc2  unit_num=1  delete unit_group=(1002);   --对应的是 unit_group_id
Query OK, 0 rows affected (0.026 sec)
-- dba_ob_tenant_jobs : 展示租户即便的任务信息
MySQL [oceanbase]> select  * from  dba_ob_tenant_jobs where JOB_STATUS<>'SUCCESS' limit 1 \G   --查看租户任务信息
*************************** 1. row ***************************
     JOB_ID: 7
   JOB_TYPE: ALTER_RESOURCE_TENANT_UNIT_NUM
 JOB_STATUS: FAILED
RESULT_CODE: -4072
   PROGRESS: 0
 START_TIME: 2024-10-15 10:25:06.476864
MODIFY_TIME: 2024-10-15 14:12:11.560211
  TENANT_ID: 1002
   SQL_TEXT: ALTER RESOURCE TENANT `ccc2` UNIT_NUM = 2
 EXTRA_INFO: FROM: '1', TO: '2'
  RS_SVR_IP: 111.11.11.2  
RS_SVR_PORT: 2882
1 row in set (0.000 sec)

2.调整机器环境后重新添加机器,调整unit_num,结果还是卡在那一步,后台检查任务状态

MySQL [oceanbase]>  select * from  dba_ob_tenant_jobs where JOB_STATUS ='INPROGRESS' \G
*************************** 1. row ***************************
     JOB_ID: 12
   JOB_TYPE: ALTER_RESOURCE_TENANT_UNIT_NUM
 JOB_STATUS: INPROGRESS
RESULT_CODE: NULL
   PROGRESS: 0
 START_TIME: 2024-10-15 14:40:24.420958
MODIFY_TIME: 2024-10-15 14:40:24.420958
  TENANT_ID: 1002
   SQL_TEXT: ALTER RESOURCE TENANT `ccc2` UNIT_NUM = 2
 EXTRA_INFO: FROM: '1', TO: '2'
  RS_SVR_IP: 111.11.11.2  
RS_SVR_PORT: 2882

MySQL [oceanbase]> select * from __all_virtual_balance_job\G;  
*************************** 1. row ***************************
              tenant_id: 1002
                 job_id: 946840423
             gmt_create: 2024-10-15 14:40:31.623727
           gmt_modified: 2024-10-15 14:40:31.623727
  balance_strategy_name: LS balance by expand
               job_type: LS_BALANCE
        target_unit_num: 2
target_primary_zone_num: 1
                 status: DOING
                comment: NULL
1 row in set (0.002 sec)

查询cdb_ob_transfer_tasks:展示所有租户当前正在执行的transfer任务。

发现任务状态一直是 INIT 状态 没有发起  

init:任务创建  

start:开始执行transfer  

doing:正在执行transfer  

aborted:transfer任务执行失败或终止

completed:任务执行成功  

failed:任务执行失败

canceled:任务取消

MySQL [oceanbase]> select * from  cdb_ob_transfer_tasks where tenant_id=1002\G	  
*************************** 1. row ***************************
              TENANT_ID: 1002
                TASK_ID: 513
            CREATE_TIME: 2024-10-15 16:44:09.778632
            MODIFY_TIME: 2024-10-15 16:44:09.778632
                 SRC_LS: 1001
                DEST_LS: 1003
              PART_LIST: 874579:875311,874579:875312,874579:875313,874579:875314,874579:875315,874579:875316,874579:875317,874579:875318,874579:875319,882323:882701,882323:882702,882323:882703,882323:882704,882323:882705,882323:882706,882323:882707,882323:882708,882323:882709,882323:882710,882323:882711,882323:882712,882323:882713,882323:882714,882323:882715,882323:882716,882323:882717,882323:882718,882323:882719,882323:882720,882323:882721,882323:882722,882323:882723,882323:882724,882323:882725,882323:882726,882323:882727,882323:882728,882323:882729,882323:882730,882323:882731,882323:882732,882323:882733,882323:882734,882323:882735,882323:882736,882323:882737,882323:882738,882323:882739,882323:882740,882323:882741,882323:882742,882323:882743,882323:882744,882323:882745,882323:882746,882323:882747,882323:882748,882323:882749,882323:882750,882323:882751,882323:882752,882323:882753,882323:882754,882323:882755,882323:882756,882323:882757,882323:882758,882323:882759,882323:882760,882323:882761,882323:882762,882323:882763,882323:882764,882323:882765,882323:882766,882323:882767,882323:882768,882323:882769,882323:882770,882323:882771,882323:882772,882323:882773,882323:882774,882323:882775,882323:882776,882323:882777,882323:882778,882323:882779,882323:882780,882323:882781,882323:882782,882323:882783,882323:882784,882323:882785,882323:882786,882323:882787,882323:882788,882323:882789,882323:882790,882323:882791
             PART_COUNT: 100
    NOT_EXIST_PART_LIST: NULL
LOCK_CONFLICT_PART_LIST: NULL
 TABLE_LOCK_TABLET_LIST: NULL
            TABLET_LIST: NULL
           TABLET_COUNT: 0
              START_SCN: 0
             FINISH_SCN: 0
                 STATUS: INIT
               TRACE_ID: YB42865514E0-000623F9F429DBE4-0-0
                 RESULT: -1
        BALANCE_TASK_ID: 946840638
    TABLE_LOCK_OWNER_ID: -1
                COMMENT:
-- cdb_ob_transfer_task_history :展示所有租户一致性的transfer任务历史				
select * from cdb_ob_transfer_task_history where tenant_id=1002 and balance_task_id =	946840638 order by FINISH_TIME desc limit 5;		

确保rebalance和transfer参数是开启的

ALTER SYSTEM SET enable_rebalance = true;
ALTER SYSTEM SET enable_transfer = true;

按照工单指导检查observer日志有6003报错

1729156277

但是检查该obj_id:290520和table_id_:18446744073709551615没有在库里找到对象

在检查发现还有4023的报错

1729156451

根据这个table_id从业务租户查到信息

1729156652

检查租户ddl

select * from __all_virtual_ddl_task_status where tenant_id = 1002;

1729156742

再检查表锁信息,执行很多次才能抓到一次信息,都是瞬时的信息。

 -- __all_virtual_obj_lock :查看 对象锁的持有情况和类型信息
select * from __all_virtual_obj_lock where op_type = 'OUT_TRANS_LOCK' and (tenant_id, owner_id) not in (select tenant_id,table_lock_owner_id as owner_id from __all_virtual_transfer_task union (select tenant_id, task_id as owner_id from __all_virtual_ddl_task_status)) order by create_timestamp asc; 
 +---------------+----------+-----------+-------+------------------------------+-----------+------------+-----------------+----------------+-----------+----------------------+------------------+-----------------------+--------------------------+-------------------+-----------------+---------+
| svr_ip        | svr_port | tenant_id | ls_id | lock_id                      | lock_mode | owner_id   | create_trans_id | op_type        | op_status | trans_version        | create_timestamp | create_schema_version | extra_info               | time_after_create | obj_type        | obj_id  |
+---------------+----------+-----------+-------+------------------------------+-----------+------------+-----------------+----------------+-----------+----------------------+------------------+-----------------------+--------------------------+-------------------+-----------------+---------+
| 111.11.11.2   |     2882 |      1002 |     1 | {obj_type:7, obj_id:1412970} | RX        | 1043743623 |      1043743598 | OUT_TRANS_LOCK | DOING     | 18446744073709551615 | 1729064913346564 |                    -1 | count:0, position:tx_ctx |              2341 | ONLINE_DDL_TABL | 1412970 |
+---------------+----------+-----------+-------+------------------------------+-----------+------------+-----------------+----------------+-----------+----------------------+------------------+-----------------------+--------------------------+-------------------+-----------------+---------+
--检查ddl报错的message的试图
select *  from  __all_virtual_ddl_error_message where tenant_id= 1002 and task_id= 1043743623 ;
+-----------+------------+---------------------+-----------+------------------+---------------+----------+----------------------------+----------------------------+----------+----------+---------------+----------------+----------------+
| tenant_id | task_id    | target_object_id    | object_id | schema_version   | svr_ip        | svr_port | gmt_create                 | gmt_modified               | ret_code | ddl_type | affected_rows | user_message   | dba_message    |
+-----------+------------+---------------------+-----------+------------------+---------------+----------+----------------------------+----------------------------+----------+----------+---------------+----------------+----------------+
|      1002 | 1043743623 |                  -1 |   1412977 | 1729064913315080 | 111.11.11.2   |     2882 | 2024-10-16 15:48:34.304039 | 2024-10-16 15:48:34.304039 |        0 |        5 |             0 | Successful ddl | Successful ddl |
|      1002 | 1043743623 |              443084 |   1412970 | 1729064913315080 | 111.11.11.2   |     2882 | 2024-10-16 15:48:33.786022 | 2024-10-16 15:48:33.786022 |        0 |        5 |             0 | Successful ddl | Successful ddl |
|      1002 | 1043743623 | 1152921504607502375 |   1412970 | 1729064913315080 | 111.11.11.2   |     2882 | 2024-10-16 15:48:33.783206 | 2024-10-16 15:48:33.783206 |        0 |        5 |             0 | Successful ddl | Successful ddl |
+-----------+------------+---------------------+-----------+------------------+---------------+----------+----------------------------+----------------------------+----------+----------+---------------+----------------+----------------+

工单判断是这张表上有DDL导致加锁失败,但看了当前DDL任务并没有这个表上,怀疑是bug导致,后来再查select * from  cdb_ob_transfer_tasks where tenant_id=1002。

1729157281

create_time已经是新生成的了,通过trace_id过滤也没有6003的报错了

1729157443

但是transfer任务还是一直是init的状态,工单怀疑是一直有活跃会话导致,建议把业务停了或等到低峰期,应该会自动transfer完。(确实有oms迁移任务一直在进行)

暂停oms的迁移任务,transfer任务变成doing了。

  select * from  cdb_ob_transfer_tasks where tenant_id=1002\G
*************************** 1. row ***************************
              TENANT_ID: 1002
                TASK_ID: 2123
            CREATE_TIME: 2024-10-17 09:27:19.765950
            MODIFY_TIME: 2024-10-17 09:27:21.037063
                 SRC_LS: 1001
                DEST_LS: 1003
              PART_LIST: 882323:883011,882323:883012,882323:883013,882323:883014,882323:883015,882323:883016,882323:883017,882323:883018,882323:883019,882323:883020,882323:883021,882323:883022,882323:883023,882323:883024,882323:883025,882323:883026,882323:883027,882323:883028,882323:883029,882323:883030,882323:883031,882323:883032,882323:883033,882323:883034,882323:883035,882323:883036,882323:883037,882323:883038,882323:883039,882323:883040,882323:883041,882323:883042,882323:883043,882323:883044
             PART_COUNT: 34
    NOT_EXIST_PART_LIST: NULL
LOCK_CONFLICT_PART_LIST: NULL
 TABLE_LOCK_TABLET_LIST: 290830,290831,290832,290833,290834,290835,290836,290837,290838,290839,290840,290841,290842,290843,290844,290845,290846,290847,290848,290849,290850,290851,290852,290853,290854,290855,290856,290857,290858,290859,290860,290861,290862,290863
            TABLET_LIST: 290830:0,290831:0,290832:0,290833:0,290834:0,290835:0,290836:0,290837:0,290838:0,290839:0,290840:0,290841:0,290842:0,290843:0,290844:0,290845:0,290846:0,290847:0,290848:0,290849:0,290850:0,290851:0,290852:0,290853:0,290854:0,290855:0,290856:0,290857:0,290858:0,290859:0,290860:0,290861:0,290862:0,290863:0,1152921504607139342:0,1152921504607139343:0,1152921504607139344:0,1152921504607139345:0,1152921504607139346:0,1152921504607139347:0,1152921504607139348:0,1152921504607139349:0,1152921504607139350:0,1152921504607139351:0,1152921504607139352:0,1152921504607139353:0,1152921504607139354:0,1152921504607139355:0,1152921504607139356:0,1152921504607139357:0,1152921504607139358:0,1152921504607139359:0,1152921504607139360:0,1152921504607139361:0,1152921504607139362:0,1152921504607139363:0,1152921504607139364:0,1152921504607139365:0,1152921504607139366:0,1152921504607139367:0,1152921504607139368:0,1152921504607139369:0,1152921504607139370:0,1152921504607139371:0,1152921504607139372:0,1152921504607139373:0,1152921504607139374:0,1152921504607139375:0,1152921504607153887:0,1152921504607153888:0,1152921504607153889:0,1152921504607153890:0,1152921504607153891:0,1152921504607153892:0,1152921504607153893:0,1152921504607153894:0,1152921504607153895:0,1152921504607153896:0,1152921504607153897:0,1152921504607153898:0,1152921504607153899:0,1152921504607153900:0,1152921504607153901:0,1152921504607153902:0,1152921504607153903:0,1152921504607153904:0,1152921504607153905:0,1152921504607153906:0,1152921504607153907:0,1152921504607153908:0,1152921504607153909:0,1152921504607153910:0,1152921504607153911:0,1152921504607153912:0,1152921504607153913:0,1152921504607153914:0,1152921504607153915:0,1152921504607153916:0,1152921504607153917:0,1152921504607153918:0,1152921504607153919:0,1152921504607153920:0
           TABLET_COUNT: 102
              START_SCN: 1729128440782661910
             FINISH_SCN: 0
                 STATUS: DOING
               TRACE_ID: YB42865514E0-000623F9F42BD5DA-0-0
                 RESULT: 0
        BALANCE_TASK_ID: 946840638
    TABLE_LOCK_OWNER_ID: 1109135717
                COMMENT:

等 cdb_ob_transfer_tasks查询没有结果之后,重试ocp任务执行成功。

查看job状态变成 SUCCESS。

select  * from  dba_ob_tenant_jobs \G 
 *************************** 9. row ***************************
     JOB_ID: 12
   JOB_TYPE: ALTER_RESOURCE_TENANT_UNIT_NUM
 JOB_STATUS: SUCCESS
RESULT_CODE: 0
   PROGRESS: 100
 START_TIME: 2024-10-15 14:40:24.420958
MODIFY_TIME: 2024-10-17 09:36:18.014531
  TENANT_ID: 1002
   SQL_TEXT: ALTER RESOURCE TENANT `ccc2` UNIT_NUM = 2
 EXTRA_INFO: FROM: '1', TO: '2'
  RS_SVR_IP: 111.11.11.2  
RS_SVR_PORT: 2882

接下来就是等待数据迁移完成,负载均衡,等完成自动负载后leader切换完成。

结论:

扩缩容的操作,尽量还是在业务低峰或者窗口期间操作。避免一些ddl和活跃会话占用资源,引入额外的变更风险。

行之所向,莫问远方。

「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论