暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

Klustron Shard故障隔离测试

点击上方蓝字关注我们





导读
注意
如无特别说明,文中的版本号可以使用任何已发布版本的版本号代替。所有已发布版本详见:http://doc.klustron.com/zh/Release_notes.html

本文内容:
Klustron的金融级高可靠性,由一系列能力和技术组成了一个完整的技术体系,这些能力和机制坚如磐石,组成了Klustron坚不可摧的金融级更高可靠性技术体系,这其中当然也包括了故障隔离能力。本文主要测试当集群中的某一个shard主备完全故障之后,Klustron集群仍旧能够部分正常工作。故障shard上面存储的数据无法操作,但其他shard的数据则可以进行正常的读写和提交。
关键词:Klustron、sharding、故障隔离


1

测试环境


测试集群相关机器的配置信息如下:


2

环境准备


创建测试数据库和用户
    psql -h 192.168.40.152 -p 47001 postgres
    create user kunlun_test with password 'kunlun';
    create database testdb owner kunlun_test;
    grant all privileges on database testdb to kunlun_test;
    准备测试表sales_order,其按月分区。2023年1月分区的数据放在了shard_3上,2月数据放在了shard_1上,3月的数据则放在了shard_2上。
      psql -h 192.168.40.152 -p 47001 -U kunlun_test testdb


      CREATE TABLE sales_order
      (
      order_number INT NOT NULL,
      customer_number INT NOT NULL,
      product_code INT NOT NULL,
      order_date DATETIME NOT NULL,
      entry_date DATETIME NOT NULL,
      order_amount DECIMAL(18,2) NOT NULL,
      PRIMARY KEY (order_number,order_date)
      ) partition by range(order_date);
      CREATE TABLE sales_order_202301 PARTITION OF sales_order
      FOR VALUES FROM ('2023-01-01') TO ('2023-02-01') with (Shard=5);
      CREATE TABLE sales_order_202302 PARTITION OF sales_order
      FOR VALUES FROM ('2023-02-01') TO ('2023-03-01') with (Shard=6);
      CREATE TABLE sales_order_202303 PARTITION OF sales_order
      FOR VALUES FROM ('2023-03-01'TO ('2023-04-01'with (Shard=7);

      并往其中灌入10000条数据
        create or replace procedure generate_order_data()
        AS $$
        DECLARE
        v_customer_number integer;
        v_product_code integer;
        v_order_date date;
        v_amount integer;
        start_date date := to_date('2023-01-01','yyyy-mm-dd');
        i integer :=1;
        BEGIN
        while i<=10000 loop
        v_customer_number := FLOOR(1+RANDOM()*6);
        v_product_code := FLOOR(1+RANDOM()*1000);
        v_order_date := start_date + CAST(FLOOR(RANDOM()*90) AS INT);
        v_amount := FLOOR(1000+RANDOM()*9000);
        INSERT INTO sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
        commit;
        i := i+1;
        end loop;
        END; $$
        LANGUAGE plpgsql;
        call generate_order_data();
        analyze sales_order;
        testdb=> select count(*) from sales_order;
        count
        -------
        10000
        (1 row)


        testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
        count
        -------
        3417
        (1 row)


        testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
        count
        -------
        3086
        (1 row)


        testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
        count
        -------
        3497
        (1 row)


        3

        模拟整个Shard故障


        此时需要模拟shard_1所有的主备节点都发生故障。由于Klustron有自动拉起失败服务的功能,所以先需要将shard_1的数据文件目录移动位置,在shard_1的主备节点都执行
          cd /kunlun/storage_datadir
          mv 57003 57003_bak
          然后再分别在shard_1的主备节点使用kill -9命令杀掉在57003端口上的mysqld和mysqld_safe进程。


          4

          故障隔离测试


          回到计算节点重新查询2023年2月的数据,则会抛出异常:
            testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
            ERROR:  Kunlun-db: Failed to connect to mysql storage node at (192.168.40.15157003): 2002, Can't connect to server on '192.168.40.151' (111)

            查询1月和3月的数据则能正常返回结果
              testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
              count
              -------
              3417
              (1 row)


              testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
              count
              -------
              3497
              (1 row)

              同样的,1月和3月数据对应的shard还能正常插入。
                testdb=> begin;
                BEGIN
                testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
                count
                -------
                3417
                (1 row)


                testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
                count
                -------
                3497
                (1 row)


                testdb=> insert into sales_order values(10001,1,1000,'2023-01-31','2023-01-31',1800);
                INSERT 0 1
                testdb=> insert into sales_order values(10002,1,1000,'2023-03-31','2023-03-31',2000);
                INSERT 0 1
                testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
                count
                -------
                3418
                (1 row)


                testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
                count
                -------
                3498
                (1 row)


                testdb=> commit;



                但是如果事务中涉及操作故障shard的数据,则会报错。
                  testdb=> begin;
                  BEGIN
                  testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
                  count
                  -------
                  3498
                  (1 row)


                  testdb=> update sales_order set order_amount=1000 where order_date='2023-03-31';
                  UPDATE 113
                  testdb=> insert into sales_order values(10003,1,1000,'2023-02-14','2023-02-14',1800);
                  ERROR: Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
                  testdb=> insert into sales_order values(10005,2,1000,'2023-03-30','2023-03-30',2000);
                  ERROR: current transaction is aborted, commands ignored until end of transaction block
                  testdb=> rollback;
                  ROLLBACK

                  通过上面的测试可以看到,故障shard上面存储的数据无法提供任何服务了,但其他正常shard的数据还可以进行正常的读写和事务。

                  完全故障的shard不会影响正常shard的运行,也不好影响集群整体运行,仅有故障shard上的数据无法读写。

                  END

                  产品文档

                  Klustron 快速入门:
                  https://doc.kunlunbase.com/zh/Klustron_Instruction_Manual.html

                  Klustron 快速体验指南:
                  https://doc.kunlunbase.com/zh/Klustron_Quickly_Guide.html

                  Klustron 功能体验范例:
                  https://doc.kunlunbase.com/zh/Klustron-function-experience-example.html

                  Klustron 产品使用和测评指南:
                  https://doc.kunlunbase.com/zh/product-usage-and-evaluation-guidelines.html


                  欢迎大家下载和安装KunlunBase数据库集群,并免费使用(无需注册码)。

                  KunlunBase 完整软件包下载:
                  http://downloads.kunlunbase.com

                  如需购买请邮箱联系sales_vip@kunlunbase.com,有相关问题欢迎添加下方小助手微信联系🌹

                  欢迎大家扫码👇添加小助手(备注:加入Klustron技术交流群)欢迎大家在交流群共同探讨更多问题及主题。 

                   点击👆上方,关注获取源代码及技术信息~

                  文章转载自KunlunBase 昆仑数据库,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                  评论