背景:
有一套1-1-1集群在同步数据期间发现存储不够了,打算通过扩容节点到2-2-2的方式来解决该问题。
知识点:
在介绍问题以及处理问题之前,我想再复习下官网上关于4.x架构的一些知识点。
从 V4.0.0 版本开始,OceanBase 数据库要求租户内每个 Zone 的 Unit 个数必须保持一致。为了方便统一管理各个 Zone 的 Unit,系统引入了 Unit Group 机制,即不同 Zone 之间相同编号(UNIT_GROUP_ID)的 Unit 属于同一个 Unit Group。调大或调小 Unit Number 本质上是以 Unit Group 为单位创建和删除 Unit。日志流是跟unit group有关的,如果设置租户primary_zone为单个zone的情况下,正常一个zone有几个unit,就有几个unit group,也就有几个日志流,如果是random的情况下,如果有三个zone,每个zone有2个unit,还是2个unit group,那么就是3*2=6个日志流。
所以我们通过上面这句话知道了两个注意点:
1.每个租户的unit在不同zone里是对等的,不能像3.x那样自定义配置了
2.不同unit的unit group是不同的,扩容unit、均衡副本、切换leader的完成,不再是3.x的分区单位了(3.x是以分区为单位,在obcp的ppt里有比较明显的演示,先在要扩容的unit中创建副本,切换leader,再删除老的副本),4.x引入了transfer的概念,单位是tablet,其实也跟3.x类似,只是以整个日志流为单位了(整个日志流transfer完成才进行切主和老副本的删除)。
还有个重点就是“每个租户的unit在不同zone里是对等的”引出来的,我们一般通过ocp白屏创建的租户每个zone都有一个单独的resource pool,3.x是可以单独调整每个zone里的unit num的,但是4.x的resource pool与租户绑定后就不可以单独操作了,只能整体操作,操作命令如下。
ALTER RESOURCE TENANT mysql001 UNIT_NUM = 2;
ALTER RESOURCE TENANT mysql001 UNIT_NUM = 1 DELETE UNIT_GROUP =(1006);过程:
简单介绍下这次操作的过程以及遇到的几个情况吧
1.第一次添加机器后修改unit_num,ocp任务报错了

因为检查到新加机器配置有些遗漏,怀疑跟机器的环境有关系,回退了该任务,发现前台租户的状态还是运维中,unit_num前台显示还是2,后台检查dba_ob_jobs;
--状态是执行中 MySQL [oceanbase]> select * from dba_ob_tenant_jobs where JOB_STATUS<>'SUCCESS' limit 1 \G *************************** 1. row *************************** JOB_ID: 7 JOB_TYPE: ALTER_RESOURCE_TENANT_UNIT_NUM JOB_STATUS: INPROGRESS RESULT_CODE: NULL PROGRESS: 0 START_TIME: 2024-10-15 10:25:06.476864 MODIFY_TIME: 2024-10-15 10:25:06.476864 TENANT_ID: 1002 SQL_TEXT: ALTER RESOURCE TENANT `ccc2` UNIT_NUM = 2 EXTRA_INFO: FROM: '1', TO: '2' RS_SVR_IP: 111.11.11.2 RS_SVR_PORT: 2882 1 row in set (0.004 sec)
这时候前台页面调整unit_num因为租户状态不对,无法操作,只能后台缩容unit_num,后台操作完,前台也正常了。
MySQL [oceanbase]> select * from dba_ob_units where tenant_id=1002 ; +---------+-----------+--------+------------------+---------------+----------------------------+----------------------------+-------+---------------+----------+---------------------+-----------------------+----------------+----------------+---------+---------+--------------+---------------+---------------------+---------------------+-------------+ | UNIT_ID | TENANT_ID | STATUS | RESOURCE_POOL_ID | UNIT_GROUP_ID | CREATE_TIME | MODIFY_TIME | ZONE | SVR_IP | SVR_PORT | MIGRATE_FROM_SVR_IP | MIGRATE_FROM_SVR_PORT | MANUAL_MIGRATE | UNIT_CONFIG_ID | MAX_CPU | MIN_CPU | MEMORY_SIZE | LOG_DISK_SIZE | MAX_IOPS | MIN_IOPS | IOPS_WEIGHT | +---------+-----------+--------+------------------+---------------+----------------------------+----------------------------+-------+---------------+----------+---------------------+-----------------------+----------------+----------------+---------+---------+--------------+---------------+---------------------+---------------------+-------------+ | 1001 | 1002 | ACTIVE | 1001 | 1001 | 2024-09-29 15:31:43.686630 | 2024-09-29 15:31:43.762276 | zone1 | 111.11.11.1 | 2882 | NULL | NULL | NULL | 1005 | 50 | 50 | 214748364800 | 805306368000 | 9223372036854775807 | 9223372036854775807 | 50 | | 1002 | 1002 | ACTIVE | 1002 | 1001 | 2024-09-29 15:31:43.691645 | 2024-09-29 15:31:44.454323 | zone3 | 111.11.11.3 | 2882 | NULL | NULL | NULL | 1006 | 50 | 50 | 214748364800 | 805306368000 | 9223372036854775807 | 9223372036854775807 | 50 | | 1003 | 1002 | ACTIVE | 1003 | 1001 | 2024-09-29 15:31:43.695862 | 2024-09-29 15:31:44.453268 | zone2 | 111.11.11.2 | 2882 | NULL | NULL | NULL | 1004 | 50 | 50 | 214748364800 | 805306368000 | 9223372036854775807 | 9223372036854775807 | 50 | | 1004 | 1002 | ACTIVE | 1001 | 1002 | 2024-10-15 10:25:06.524487 | 2024-10-15 10:25:06.524487 | zone1 | 111.11.11.113 | 2882 | NULL | NULL | NULL | 1005 | 50 | 50 | 214748364800 | 805306368000 | 9223372036854775807 | 9223372036854775807 | 50 | | 1005 | 1002 | ACTIVE | 1002 | 1002 | 2024-10-15 10:25:07.243192 | 2024-10-15 10:25:07.243192 | zone3 | 111.11.11.112 | 2882 | NULL | NULL | NULL | 1006 | 50 | 50 | 214748364800 | 805306368000 | 9223372036854775807 | 9223372036854775807 | 50 | | 1006 | 1002 | ACTIVE | 1003 | 1002 | 2024-10-15 10:25:07.991457 | 2024-10-15 10:25:07.991457 | zone2 | 111.11.11.111 | 2882 | NULL | NULL | NULL | 1004 | 50 | 50 | 214748364800 | 805306368000 | 9223372036854775807 | 9223372036854775807 | 50 | +---------+-----------+--------+------------------+---------------+----------------------------+----------------------------+-------+---------------+----------+---------------------+-----------------------+----------------+----------------+---------+---------+--------------+---------------+---------------------+---------------------+-------------+ 6 rows in set (0.007 sec) MySQL [oceanbase]> alter resource tenant ccc2 unit_num=1 delete unit_group=(1002); --对应的是 unit_group_id Query OK, 0 rows affected (0.026 sec) -- dba_ob_tenant_jobs : 展示租户即便的任务信息 MySQL [oceanbase]> select * from dba_ob_tenant_jobs where JOB_STATUS<>'SUCCESS' limit 1 \G --查看租户任务信息 *************************** 1. row *************************** JOB_ID: 7 JOB_TYPE: ALTER_RESOURCE_TENANT_UNIT_NUM JOB_STATUS: FAILED RESULT_CODE: -4072 PROGRESS: 0 START_TIME: 2024-10-15 10:25:06.476864 MODIFY_TIME: 2024-10-15 14:12:11.560211 TENANT_ID: 1002 SQL_TEXT: ALTER RESOURCE TENANT `ccc2` UNIT_NUM = 2 EXTRA_INFO: FROM: '1', TO: '2' RS_SVR_IP: 111.11.11.2 RS_SVR_PORT: 2882 1 row in set (0.000 sec)
2.调整机器环境后重新添加机器,调整unit_num,结果还是卡在那一步,后台检查任务状态
MySQL [oceanbase]> select * from dba_ob_tenant_jobs where JOB_STATUS ='INPROGRESS' \G *************************** 1. row *************************** JOB_ID: 12 JOB_TYPE: ALTER_RESOURCE_TENANT_UNIT_NUM JOB_STATUS: INPROGRESS RESULT_CODE: NULL PROGRESS: 0 START_TIME: 2024-10-15 14:40:24.420958 MODIFY_TIME: 2024-10-15 14:40:24.420958 TENANT_ID: 1002 SQL_TEXT: ALTER RESOURCE TENANT `ccc2` UNIT_NUM = 2 EXTRA_INFO: FROM: '1', TO: '2' RS_SVR_IP: 111.11.11.2 RS_SVR_PORT: 2882 MySQL [oceanbase]> select * from __all_virtual_balance_job\G; *************************** 1. row *************************** tenant_id: 1002 job_id: 946840423 gmt_create: 2024-10-15 14:40:31.623727 gmt_modified: 2024-10-15 14:40:31.623727 balance_strategy_name: LS balance by expand job_type: LS_BALANCE target_unit_num: 2 target_primary_zone_num: 1 status: DOING comment: NULL 1 row in set (0.002 sec)
查询cdb_ob_transfer_tasks:展示所有租户当前正在执行的transfer任务。
发现任务状态一直是 INIT 状态 没有发起
init:任务创建
start:开始执行transfer
doing:正在执行transfer
aborted:transfer任务执行失败或终止
completed:任务执行成功
failed:任务执行失败
canceled:任务取消
MySQL [oceanbase]> select * from cdb_ob_transfer_tasks where tenant_id=1002\G *************************** 1. row *************************** TENANT_ID: 1002 TASK_ID: 513 CREATE_TIME: 2024-10-15 16:44:09.778632 MODIFY_TIME: 2024-10-15 16:44:09.778632 SRC_LS: 1001 DEST_LS: 1003 PART_LIST: 874579:875311,874579:875312,874579:875313,874579:875314,874579:875315,874579:875316,874579:875317,874579:875318,874579:875319,882323:882701,882323:882702,882323:882703,882323:882704,882323:882705,882323:882706,882323:882707,882323:882708,882323:882709,882323:882710,882323:882711,882323:882712,882323:882713,882323:882714,882323:882715,882323:882716,882323:882717,882323:882718,882323:882719,882323:882720,882323:882721,882323:882722,882323:882723,882323:882724,882323:882725,882323:882726,882323:882727,882323:882728,882323:882729,882323:882730,882323:882731,882323:882732,882323:882733,882323:882734,882323:882735,882323:882736,882323:882737,882323:882738,882323:882739,882323:882740,882323:882741,882323:882742,882323:882743,882323:882744,882323:882745,882323:882746,882323:882747,882323:882748,882323:882749,882323:882750,882323:882751,882323:882752,882323:882753,882323:882754,882323:882755,882323:882756,882323:882757,882323:882758,882323:882759,882323:882760,882323:882761,882323:882762,882323:882763,882323:882764,882323:882765,882323:882766,882323:882767,882323:882768,882323:882769,882323:882770,882323:882771,882323:882772,882323:882773,882323:882774,882323:882775,882323:882776,882323:882777,882323:882778,882323:882779,882323:882780,882323:882781,882323:882782,882323:882783,882323:882784,882323:882785,882323:882786,882323:882787,882323:882788,882323:882789,882323:882790,882323:882791 PART_COUNT: 100 NOT_EXIST_PART_LIST: NULL LOCK_CONFLICT_PART_LIST: NULL TABLE_LOCK_TABLET_LIST: NULL TABLET_LIST: NULL TABLET_COUNT: 0 START_SCN: 0 FINISH_SCN: 0 STATUS: INIT TRACE_ID: YB42865514E0-000623F9F429DBE4-0-0 RESULT: -1 BALANCE_TASK_ID: 946840638 TABLE_LOCK_OWNER_ID: -1 COMMENT: -- cdb_ob_transfer_task_history :展示所有租户一致性的transfer任务历史 select * from cdb_ob_transfer_task_history where tenant_id=1002 and balance_task_id = 946840638 order by FINISH_TIME desc limit 5;
确保rebalance和transfer参数是开启的
ALTER SYSTEM SET enable_rebalance = true;
ALTER SYSTEM SET enable_transfer = true;按照工单指导检查observer日志有6003报错

但是检查该obj_id:290520和table_id_:18446744073709551615没有在库里找到对象
在检查发现还有4023的报错

根据这个table_id从业务租户查到信息

检查租户ddl
select * from __all_virtual_ddl_task_status where tenant_id = 1002;
再检查表锁信息,执行很多次才能抓到一次信息,都是瞬时的信息。
-- __all_virtual_obj_lock :查看 对象锁的持有情况和类型信息 select * from __all_virtual_obj_lock where op_type = 'OUT_TRANS_LOCK' and (tenant_id, owner_id) not in (select tenant_id,table_lock_owner_id as owner_id from __all_virtual_transfer_task union (select tenant_id, task_id as owner_id from __all_virtual_ddl_task_status)) order by create_timestamp asc; +---------------+----------+-----------+-------+------------------------------+-----------+------------+-----------------+----------------+-----------+----------------------+------------------+-----------------------+--------------------------+-------------------+-----------------+---------+ | svr_ip | svr_port | tenant_id | ls_id | lock_id | lock_mode | owner_id | create_trans_id | op_type | op_status | trans_version | create_timestamp | create_schema_version | extra_info | time_after_create | obj_type | obj_id | +---------------+----------+-----------+-------+------------------------------+-----------+------------+-----------------+----------------+-----------+----------------------+------------------+-----------------------+--------------------------+-------------------+-----------------+---------+ | 111.11.11.2 | 2882 | 1002 | 1 | {obj_type:7, obj_id:1412970} | RX | 1043743623 | 1043743598 | OUT_TRANS_LOCK | DOING | 18446744073709551615 | 1729064913346564 | -1 | count:0, position:tx_ctx | 2341 | ONLINE_DDL_TABL | 1412970 | +---------------+----------+-----------+-------+------------------------------+-----------+------------+-----------------+----------------+-----------+----------------------+------------------+-----------------------+--------------------------+-------------------+-----------------+---------+
--检查ddl报错的message的试图 select * from __all_virtual_ddl_error_message where tenant_id= 1002 and task_id= 1043743623 ; +-----------+------------+---------------------+-----------+------------------+---------------+----------+----------------------------+----------------------------+----------+----------+---------------+----------------+----------------+ | tenant_id | task_id | target_object_id | object_id | schema_version | svr_ip | svr_port | gmt_create | gmt_modified | ret_code | ddl_type | affected_rows | user_message | dba_message | +-----------+------------+---------------------+-----------+------------------+---------------+----------+----------------------------+----------------------------+----------+----------+---------------+----------------+----------------+ | 1002 | 1043743623 | -1 | 1412977 | 1729064913315080 | 111.11.11.2 | 2882 | 2024-10-16 15:48:34.304039 | 2024-10-16 15:48:34.304039 | 0 | 5 | 0 | Successful ddl | Successful ddl | | 1002 | 1043743623 | 443084 | 1412970 | 1729064913315080 | 111.11.11.2 | 2882 | 2024-10-16 15:48:33.786022 | 2024-10-16 15:48:33.786022 | 0 | 5 | 0 | Successful ddl | Successful ddl | | 1002 | 1043743623 | 1152921504607502375 | 1412970 | 1729064913315080 | 111.11.11.2 | 2882 | 2024-10-16 15:48:33.783206 | 2024-10-16 15:48:33.783206 | 0 | 5 | 0 | Successful ddl | Successful ddl | +-----------+------------+---------------------+-----------+------------------+---------------+----------+----------------------------+----------------------------+----------+----------+---------------+----------------+----------------+
工单判断是这张表上有DDL导致加锁失败,但看了当前DDL任务并没有这个表上,怀疑是bug导致,后来再查select * from cdb_ob_transfer_tasks where tenant_id=1002。

create_time已经是新生成的了,通过trace_id过滤也没有6003的报错了

但是transfer任务还是一直是init的状态,工单怀疑是一直有活跃会话导致,建议把业务停了或等到低峰期,应该会自动transfer完。(确实有oms迁移任务一直在进行)
暂停oms的迁移任务,transfer任务变成doing了。
select * from cdb_ob_transfer_tasks where tenant_id=1002\G *************************** 1. row *************************** TENANT_ID: 1002 TASK_ID: 2123 CREATE_TIME: 2024-10-17 09:27:19.765950 MODIFY_TIME: 2024-10-17 09:27:21.037063 SRC_LS: 1001 DEST_LS: 1003 PART_LIST: 882323:883011,882323:883012,882323:883013,882323:883014,882323:883015,882323:883016,882323:883017,882323:883018,882323:883019,882323:883020,882323:883021,882323:883022,882323:883023,882323:883024,882323:883025,882323:883026,882323:883027,882323:883028,882323:883029,882323:883030,882323:883031,882323:883032,882323:883033,882323:883034,882323:883035,882323:883036,882323:883037,882323:883038,882323:883039,882323:883040,882323:883041,882323:883042,882323:883043,882323:883044 PART_COUNT: 34 NOT_EXIST_PART_LIST: NULL LOCK_CONFLICT_PART_LIST: NULL TABLE_LOCK_TABLET_LIST: 290830,290831,290832,290833,290834,290835,290836,290837,290838,290839,290840,290841,290842,290843,290844,290845,290846,290847,290848,290849,290850,290851,290852,290853,290854,290855,290856,290857,290858,290859,290860,290861,290862,290863 TABLET_LIST: 290830:0,290831:0,290832:0,290833:0,290834:0,290835:0,290836:0,290837:0,290838:0,290839:0,290840:0,290841:0,290842:0,290843:0,290844:0,290845:0,290846:0,290847:0,290848:0,290849:0,290850:0,290851:0,290852:0,290853:0,290854:0,290855:0,290856:0,290857:0,290858:0,290859:0,290860:0,290861:0,290862:0,290863:0,1152921504607139342:0,1152921504607139343:0,1152921504607139344:0,1152921504607139345:0,1152921504607139346:0,1152921504607139347:0,1152921504607139348:0,1152921504607139349:0,1152921504607139350:0,1152921504607139351:0,1152921504607139352:0,1152921504607139353:0,1152921504607139354:0,1152921504607139355:0,1152921504607139356:0,1152921504607139357:0,1152921504607139358:0,1152921504607139359:0,1152921504607139360:0,1152921504607139361:0,1152921504607139362:0,1152921504607139363:0,1152921504607139364:0,1152921504607139365:0,1152921504607139366:0,1152921504607139367:0,1152921504607139368:0,1152921504607139369:0,1152921504607139370:0,1152921504607139371:0,1152921504607139372:0,1152921504607139373:0,1152921504607139374:0,1152921504607139375:0,1152921504607153887:0,1152921504607153888:0,1152921504607153889:0,1152921504607153890:0,1152921504607153891:0,1152921504607153892:0,1152921504607153893:0,1152921504607153894:0,1152921504607153895:0,1152921504607153896:0,1152921504607153897:0,1152921504607153898:0,1152921504607153899:0,1152921504607153900:0,1152921504607153901:0,1152921504607153902:0,1152921504607153903:0,1152921504607153904:0,1152921504607153905:0,1152921504607153906:0,1152921504607153907:0,1152921504607153908:0,1152921504607153909:0,1152921504607153910:0,1152921504607153911:0,1152921504607153912:0,1152921504607153913:0,1152921504607153914:0,1152921504607153915:0,1152921504607153916:0,1152921504607153917:0,1152921504607153918:0,1152921504607153919:0,1152921504607153920:0 TABLET_COUNT: 102 START_SCN: 1729128440782661910 FINISH_SCN: 0 STATUS: DOING TRACE_ID: YB42865514E0-000623F9F42BD5DA-0-0 RESULT: 0 BALANCE_TASK_ID: 946840638 TABLE_LOCK_OWNER_ID: 1109135717 COMMENT:
等 cdb_ob_transfer_tasks查询没有结果之后,重试ocp任务执行成功。
查看job状态变成 SUCCESS。
select * from dba_ob_tenant_jobs \G *************************** 9. row *************************** JOB_ID: 12 JOB_TYPE: ALTER_RESOURCE_TENANT_UNIT_NUM JOB_STATUS: SUCCESS RESULT_CODE: 0 PROGRESS: 100 START_TIME: 2024-10-15 14:40:24.420958 MODIFY_TIME: 2024-10-17 09:36:18.014531 TENANT_ID: 1002 SQL_TEXT: ALTER RESOURCE TENANT `ccc2` UNIT_NUM = 2 EXTRA_INFO: FROM: '1', TO: '2' RS_SVR_IP: 111.11.11.2 RS_SVR_PORT: 2882
接下来就是等待数据迁移完成,负载均衡,等完成自动负载后leader切换完成。
结论:
扩缩容的操作,尽量还是在业务低峰或者窗口期间操作。避免一些ddl和活跃会话占用资源,引入额外的变更风险。
行之所向,莫问远方。




