Building a More Stable, Highly Available TiDB Database Architecture with Chaos Mesh (Part 1)

Original · 严少安 · 2023-03-11

Posted by lqbyz on 2023-03-06
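I. Introduction

This article introduces Chaos Mesh: what the chaos engineering platform is, its core strengths, its architecture, and the experiments it supports. It lays the groundwork for building a containerized TiDB database in later parts.
1. About Chaos Mesh

Chaos Mesh is an open-source, cloud-native chaos engineering platform. It offers a rich set of fault types and strong fault-scenario orchestration, so that users can reproduce real-world anomalies in development, testing, and production environments and uncover latent problems in their systems. Chaos Mesh also provides a complete visual interface intended to lower the barrier to chaos engineering: users can design chaos scenarios and monitor running experiments from the Web UI.

2. Core strengths of Chaos Mesh

As a leading chaos testing platform, Chaos Mesh has the following core strengths:

    Solid foundation: Chaos Mesh grew out of TiDB's core testing platform and inherited much of TiDB's testing experience from its first release.

    Well proven: Chaos Mesh is used by many companies and organizations, such as Tencent and Meituan, and is part of the test suites of well-known distributed systems such as Apache APISIX and RabbitMQ.

    Easy to use: graphical operation and Kubernetes-based usage make full use of automation.

    Cloud native: Chaos Mesh natively supports Kubernetes environments and provides strong automation.

    Rich fault scenarios: Chaos Mesh covers most of the basic fault-injection scenarios used in distributed-system testing.

    Flexible orchestration: users can design their own chaos scenarios on the platform; a scenario can combine several chaos experiments with application status checks.

    Secure: Chaos Mesh is designed with multiple layers of security controls.

    Active community: Chaos Mesh is a globally known open-source chaos testing platform and a CNCF incubating project.

    Extensible: Chaos Mesh leaves ample room to add new fault types and new features.

3. Architecture overview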

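Chaos Mesh is built on Kubernetes CRDs (Custom Resource Definitions). It defines a separate CRD type for each kind of fault and implements a dedicated Controller for each CRD object to manage the corresponding chaos experiments. Chaos Mesh consists of three main components:

    Chaos Dashboard: the visual component of Chaos Mesh. It provides a user-friendly web interface for operating and observing chaos experiments, and it also provides an RBAC permission management mechanism.

    Chaos Controller Manager: the core logic component of Chaos Mesh, responsible for scheduling and managing chaos experiments. It contains several CRD Controllers, such as the Workflow Controller, the Scheduler Controller, and the Controllers for the individual fault types.

    Chaos Daemon: the main execution component of Chaos Mesh. Chaos Daemon runs as a DaemonSet and has Privileged permission by default (this can be turned off). It interferes with specific network devices, file systems, the kernel, and so on by entering the namespaces of the target Pod.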

II. Installation and deployment
1. Preparing the environment

1. Before installing, make sure Helm is installed in the environment.
[root@k8s-master chaos-mesh]# helm version
version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"clean", GoVersion:"go1.14.11"}

2. Add the Chaos Mesh repository
helm repo add chaos-mesh https://charts.chaos-mesh.org

3. List the Chaos Mesh versions available for installation
helm search repo chaos-mesh
or: helm search repo chaos-mesh -l

4. Create the namespace
kubectl create ns chaos-testing

5. Install Chaos Mesh (Docker container runtime)
helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing

6. Verify the installation
[root@k8s-master chaos-mesh]# kubectl get po -n chaos-testing
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-856bc96c68-6mppc 1/1 Running 0 6h49m
chaos-controller-manager-856bc96c68-hk6nl 1/1 Running 0 6h50m
chaos-controller-manager-856bc96c68-q99vm 1/1 Running 0 6h50m
chaos-daemon-ng4vx 1/1 Running 0 6h49m
chaos-daemon-w2w7h 1/1 Running 0 6h50m
chaos-dashboard-5fdf8b8bb-nnnhz 1/1 Running 0 6h50m
Note: to ensure high availability, Chaos Mesh enables leader election by default. If you do not need it, disable it manually with --set controllerManager.leaderElection.enabled=false.
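
The same setting can be kept in a Helm values file instead of a --set flag. The fragment below is a minimal sketch: the file name values.yaml is an assumption, and the only key used is the one named in the note above.

```yaml
# values.yaml (hypothetical file name)
# Equivalent to: --set controllerManager.leaderElection.enabled=false
controllerManager:
  leaderElection:
    enabled: false
```

It would be passed to the install command with Helm's -f option, for example: helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing -f values.yaml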

7. Upgrade Chaos Mesh
helm upgrade chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing

8. Uninstall Chaos Mesh
helm uninstall chaos-mesh -n chaos-testing

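2. Managing user permissions
2.1 Logging in with a token

1. Create a user and bind permissions. Open the Dashboard and click the "click here to generate" link.
2. Use the token generator helper:
2.1: choose the permission scope
2.2: choose the role
2.3: generate the RBAC configuration
2.4: click Copy
3. Create the user and bind the permissions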
[root@k8s-master chaos-mesh]# cat /chaosMesh/rbac.yml
kind: ServiceAccount
apiVersion: v1
metadata:
  namespace: tidb
  name: account-tidb-manager-aypth

---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: tidb
  name: role-tidb-manager-aypth
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "watch", "list"]
- apiGroups:
  - chaos-mesh.org
  resources: [ "*" ]
  verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: bind-tidb-manager-aypth
  namespace: tidb
subjects:
- kind: ServiceAccount
  name: account-tidb-manager-aypth
  namespace: tidb
roleRef:
  kind: Role
  name: role-tidb-manager-aypth
  apiGroup: rbac.authorization.k8s.io

kubectl apply -f rbac.yml

4. View the generated token
kubectl describe -n tidb secrets account-tidb-manager-aypth
Name: account-tidb-manager-aypth-token-z4kvc
Namespace: tidb
Labels:
Annotations: kubernetes.io/service-account.name: account-tidb-manager-aypth
kubernetes.io/service-account.uid: 98910f01-64b1-489c-be76-ab9241c6514a

Type: kubernetes.io/service-account-token

Data
====
token: eyJhbGciOiJSUzI1NiIsImtpZCI6IlYxc2pxT1hRQkdZNGFaLUtPOWpEYVZLM1FIeFJPVzFvOXA2aGp6RS0xSjQifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJ0aWRiIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImFjY291bnQtdGlkYi1tYW5hZ2VyLWF5cHRoLXRva2VuLXo0a3ZjIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6ImFjY291bnQtdGlkYi1tYW5hZ2VyLWF5cHRoIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiOTg5MTBmMDEtNjRiMS00ODljLWJlNzYtYWI5MjQxYzY1MTRhIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50OnRpZGI6YWNjb3VudC10aWRiLW1hbmFnZXItYXlwdGgifQ.qZoomZT5ncAxCuRZ6R5hspa5tmqWMUHaNjjnM_Psa3HeShSYlcM-0ruVjtVj1-g-I2vCyLKYAUCuu4MHCEaULBBonwDwUHM1kqGH6EhrfBBKeLJ1H8EedsDA65RDoiBoYlJqnUi0NGrSbWHYVOEuPcoHTpRAS0gLvwtT77qkc4favMkwB0cX-wxgeBlgLqCq-i98PlOTs4-jQel6gO0j6kE38_sB1o8Bqk4my4NNv95SNZCIuiiwzipYTz7b9bmK3lF4A2s9BK6R6_7kBT5SPZ_YnIIb-C2rHZy0zUvZUsLBjPG32Wi0TDD1LF9A1lQz5lXwTZlyzrWeq082NmnMzw
ca.crt: 1066 bytes
namespace: 4 bytes

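5. Log in with the token

2.2 Disabling token login (insecure)

When Chaos Mesh is installed with Helm, permission authentication is enabled by default. For production and other security-sensitive environments, keep it enabled. If you only want to try Chaos Mesh and create experiments quickly without authentication, set --set dashboard.securityMode=false in the Helm command: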

helm upgrade chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --version 2.1.4 --set dashboard.securityMode=false

Note: to turn permission authentication back on, set --set dashboard.securityMode=true again.
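
For a throwaway trial cluster, the switch can also be recorded in a values file so that it is visible in version control. This is a sketch only: the file name trial-values.yaml is an assumption, and the single key is the one used in the command above.

```yaml
# trial-values.yaml (hypothetical file name)
# Equivalent to: --set dashboard.securityMode=false
# Intended for trial environments only; keep authentication on elsewhere.
dashboard:
  securityMode: false
```

It would be applied with helm upgrade chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing -f trial-values.yaml; setting the value back to true restores authentication.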

III. Chaos experiment types
(1) Preparing the test environment

Create a Deployment for the target Pod.

1. Define the Pod service through a Deployment
# cat web-show.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webshow-deployment
  labels:
    app: webshow-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webshow-deployment
  template:
    metadata:
      labels:
        app: webshow-deployment
    spec:
      containers:
        - name: webshow-deployment
          image: pingcap/web-show
          imagePullPolicy: Always
          command:
            - /usr/local/bin/web-show
            # ${TARGET_IP} is a placeholder; kubectl does not expand it,
            # so substitute a real IP address before applying.
            - --target-ip=${TARGET_IP}
          ports:
            - name: web-port
              containerPort: 8081
              hostPort: 8081

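2. Create the service
# kubectl apply -f web-show.yml -n chaosmesh-test

3. Forward the service port from the master node
# nohup kubectl port-forward --address 0.0.0.0 deployment.apps/webshow-deployment 8081:8081 -n chaosmesh-test &

4. If the port is stuck, kill the process holding it and repeat step 3
kill $(lsof -t -i:8081) > /dev/null 2>&1 || true

5. Once the port forward is running, the page can be reached on port 8081 of the master node.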

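(2) Experiments
3.2.1 Creating a POD FAILURE experiment
1. Click Experiments, then New experiment
2. Select the experiment type: KUBERNETES, then POD FAULT
3. Fill in the experiment information tab

Note on mode:

mode sets how targets are chosen. The options are: one (pick one matching Pod at random), all (pick every matching Pod), fixed (pick a specified number of matching Pods), fixed-percent (pick a specified percentage of the matching Pods), and random-max-percent (pick at most a specified percentage of the matching Pods).

4. Submit the experiment.
5. Watch the Pod from the Kubernetes master node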

#watch kubectl get pod,PodChaos,StressChaos,NetworkChaos -n chaosmesh-test
NAME READY STATUS RESTARTS AGE
pod/webshow-deployment-6cbdcc4cd4-ljbtk 1/1 Running 7 6h43m

NAME AGE
podchaos.chaos-mesh.org/pod-containers-kill 7h13m
podchaos.chaos-mesh.org/pod-failure-01 20m
podchaos.chaos-mesh.org/pod-kill 8h
podchaos.chaos-mesh.org/pod-kill-all 6h43m
podchaos.chaos-mesh.org/pod-kill03 8h

NAME DURATION
stresschaos.chaos-mesh.org/pod-cpu 5m

NAME ACTION DURATION
networkchaos.chaos-mesh.org/network-delay loss 5m
networkchaos.chaos-mesh.org/network-delay-02 delay 5m
networkchaos.chaos-mesh.org/pod-network-delay delay 70s
networkchaos.chaos-mesh.org/pod-network-loss loss 120s
networkchaos.chaos-mesh.org/pod-network-loss-01 loss 2m

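6. If a problem occurs while the experiment is running, it is reported as shown in the screenshot
7. Check the experiment result with kubectl

Use kubectl describe to view the Status and Events of the chaos experiment object and determine the result: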

kubectl describe networkchaos.chaos-mesh.org/network-delay -nchaosmesh-test

Name: network-delay
Namespace: chaosmesh-test
Labels:
Annotations: experiment.chaos-mesh.org/pause: false
API Version: chaos-mesh.org/v1alpha1
Kind: NetworkChaos
Metadata:
Creation Timestamp: 2022-04-01T08:06:54Z
Finalizers:
chaos-mesh/records
Generation: 24
Managed Fields:
API Version: chaos-mesh.org/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.:
v:"chaos-mesh/records":
f:status:
f:conditions:
f:experiment:
f:containerRecords:
f:desiredPhase:
f:instances:
.:
f:chaosmesh-test/webshow-deployment-6cbdcc4cd4-ljbtk:
Manager: chaos-controller-manager
Operation: Update
Time: 2022-04-01T08:06:54Z
API Version: chaos-mesh.org/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:experiment.chaos-mesh.org/pause:
f:spec:
.:
f:action:
f:direction:
f:duration:
f:loss:
.:
f:loss:
f:mode:
f:selector:
.:
f:labelSelectors:
.:
f:app:
f:namespaces:
f:status:
.:
f:experiment:
Manager: chaos-dashboard
Operation: Update
Time: 2022-04-01T08:11:11Z
Resource Version: 1305926
UID: ee609703-aa48-4b55-9ff2-88b4aab967b5
Spec:
Action: loss
Direction: to
Duration: 5m
Loss:
Correlation: 0
Loss: 80
Mode: all
Selector:
Label Selectors:
App: webshow-deployment
Namespaces:
chaosmesh-test
Status:
Conditions:
Reason:
Status: False
Type: AllInjected
Reason:
Status: True
Type: AllRecovered
Reason:
Status: False
Type: Paused
Reason:
Status: True
Type: Selected
Experiment:
Container Records:
Id: chaosmesh-test/webshow-deployment-6cbdcc4cd4-ljbtk
Phase: Not Injected
Selector Key: .
Desired Phase: Stop
Instances:
chaosmesh-test/webshow-deployment-6cbdcc4cd4-ljbtk: 11
Events:

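The output above has two main parts:

Status

Following the execution flow of a chaos experiment, Status records four kinds of conditions:

    Paused: the experiment is currently paused.

    Selected: the experiment has correctly selected its targets.

    AllInjected: the fault has been injected into every target.

    AllRecovered: the fault has been recovered on every target.

The actual state of an experiment can be inferred from these four conditions. For example:

    When Paused, Selected, and AllRecovered are True and AllInjected is False, the experiment is paused.

    When Paused is True the experiment is paused; if Selected is False at the same time, the experiment was also unable to select any target.

Events

The event list records the operations performed over the whole life cycle of the experiment and helps determine its state and troubleshoot problems.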

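8. View the Dashboard

9. When the experiment ends, check that the Pod's service is back to normal
10. Archive the experiment

To delete an experiment from the Dashboard and move it to the archive history, click the archive button of that experiment.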
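
The Dashboard steps above end up creating a PodChaos object, so the same experiment can be written as a manifest and applied with kubectl apply -f. The sketch below reuses the object name, namespace, and Pod label that appear in the kubectl output above; the mode and duration values are assumptions chosen for illustration, not the settings actually used in the article.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-01          # name seen in the kubectl output above
  namespace: chaosmesh-test
spec:
  action: pod-failure           # make the selected Pod unavailable for the duration
  mode: one                     # assumption: pick one matching Pod at random
  duration: '60s'               # assumption: illustrative value
  selector:
    namespaces:
      - chaosmesh-test
    labelSelectors:
      app: webshow-deployment   # label from web-show.yml
```

Keeping experiments as manifests makes them repeatable and reviewable, whereas the Dashboard is convenient for exploring options.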
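3.2.2 Simulating network faults

    During network fault injection, keep the connection between the Controller Manager and Chaos Daemon healthy; otherwise the fault cannot be recovered.

    To use Net Emulation, make sure the Linux kernel has the NET_SCH_NETEM module. On CentOS it can be installed through the kernel-modules-extra package; most other distributions ship it by default.

(1) Simulating packet loss
1. Choose New experiment, then Network attack, then LOSS

loss: the probability of packet loss. Range: [0, 100]

correlation: the correlation between the current packet-loss probability and the previous one. Range: [0, 100]

direction: from, to, or both. Selects packets coming from target, packets going to target, or both

externalTargets: network targets outside Kubernetes, given as IPv4 addresses or domain names. Works only with direction: to. Examples: 8.8.8.8, baidu.com

2. Fill in the experiment information and submit.
3. Enter the container and run ping; packets are dropped.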

[root@k8s-master ~]# kubectl exec -it pod/webshow-deployment-6cbdcc4cd4-ljbtk -nchaosmesh-test /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
sh-4.2# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=54 ttl=108 time=53.5 ms

^C
--- 8.8.8.8 ping statistics ---
67 packets transmitted, 1 received, 98% packet loss, time 67604ms
rtt min/avg/max/mdev = 53.598/53.598/53.598/0.000 ms
sh-4.2# ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
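
The loss experiment can also be written as a NetworkChaos manifest. The sketch below takes its spec values (action, direction, duration, loss, correlation, mode, selector) from the kubectl describe output shown earlier in this section; the object name network-loss-example is hypothetical.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss-example    # hypothetical name
  namespace: chaosmesh-test
spec:
  action: loss
  mode: all                     # every matching Pod
  direction: to
  duration: '5m'
  loss:
    loss: '80'                  # drop probability in percent, given as a string
    correlation: '0'
  selector:
    namespaces:
      - chaosmesh-test
    labelSelectors:
      app: webshow-deployment
```

With a loss of 80 most ping replies are expected to go missing, which is in line with the heavy packet loss in the ping statistics above.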

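(2) Simulating delay
1. Create the configuration
2. Review the experiment information and click Start
3. Verify the result by pinging from inside the Pod, from the master node, and from the worker node hosting the Pod

1. Enter the container and ping an external address

kubectl exec -it pod/webshow-deployment-6cbdcc4cd4-ljbtk -n chaosmesh-test -- /bin/sh

# ping 8.8.8.8
2. From the master node, ping the Pod's IP address
3. From the worker node hosting the Pod, ping the Pod's IP address

Summary:
In every case, ping shows added latency or packet loss.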

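Notes:

Field reference

| Parameter | Type | Description | Default | Required | Example |
| --- | --- | --- | --- | --- | --- |
| action | string | The specific fault type. netem, delay, loss, duplicate, and corrupt are net emulation types; partition means network partition; bandwidth means bandwidth limiting | none | yes | partition |
| target | Selector | Used together with direction so that the chaos affects only some packets | none | no | |
| direction | enum | from, to, or both. Selects packets coming from target, packets going to target, or both | to | no | both |
| mode | string | How targets are chosen: one (one matching Pod at random), all (every matching Pod), fixed (a specified number of matching Pods), fixed-percent (a specified percentage of matching Pods), random-max-percent (at most a specified percentage of matching Pods) | none | yes | one |
| value | string | The argument for mode, depending on its setting. For example, with mode set to fixed-percent, value is the percentage of Pods | none | no | 1 |
| containerNames | []string | Names of the containers to inject into | none | no | ["nginx"] |
| selector | struct | The target Pods for fault injection; see "Define the experiment scope" in the Chaos Mesh documentation | none | yes | |
| externalTargets | []string | Network targets outside Kubernetes, given as IPv4 addresses or domain names. Works only with direction: to | none | no | 1.1.1.1, www.google.com |
| device | string | The network device to affect | none | no | "eth0" |

| Parameter | Type | Description | Default | Required | Example |
| --- | --- | --- | --- | --- | --- |
| latency | string | Length of the delay | 0 | no | 2ms |
| correlation | string | Correlation between the current delay and the previous one. Range: [0, 100] | 0 | no | 50 |
| jitter | string | Range of variation of the delay | 0 | no | 1ms |
| reorder | Reorder | Packet reordering settings | | no | |

For details, see https://chaos-mesh.org/zh/docs/simulate-network-chaos-on-kubernetes/#Loss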
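
To show how the delay fields in the tables above fit together, here is a sketch of a delay experiment as a manifest. The latency, correlation, and jitter values are simply the example values from the table, the object name is hypothetical, and the duration is an assumption; none of them are the article's actual settings.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example   # hypothetical name
  namespace: chaosmesh-test
spec:
  action: delay
  mode: one
  direction: to
  duration: '5m'                # assumption: illustrative value
  delay:
    latency: '2ms'              # example value from the table
    correlation: '50'           # example value from the table
    jitter: '1ms'               # example value from the table
  selector:
    namespaces:
      - chaosmesh-test
    labelSelectors:
      app: webshow-deployment
```

A latency this small is hard to see in ping output; a larger value such as 100ms makes the effect obvious during verification.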
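3.2.3 Simulating stress
1. In the Dashboard, choose Experiments, then New experiment, then Stress test
2. Check the CPU load from inside the Pod and on the compute node hosting it

1. Check the load inside the container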
[root@k8s-master ~]# kubectl exec -it pod/webshow-deployment-6cbdcc4cd4-ljbtk -nchaosmesh-test /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
sh-4.2# top
top - 03:17:58 up 7 days, 23:04, 0 users, load average: 6.33, 1.99, 0.75
Tasks: 16 total, 5 running, 11 sleeping, 0 stopped, 0 zombie
%Cpu(s): 93.8 us, 4.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 1.6 si, 0.0 st
KiB Mem : 8154912 total, 240500 free, 2111328 used, 5803084 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 5718908 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
46 root 20 0 291064 253740 1344 R 100.0 3.1 1:02.55 stress-ng-vm
44 root 20 0 291064 253740 1344 R 100.0 3.1 1:03.47 stress-ng-vm
43 root 20 0 291064 253740 1344 R 93.3 3.1 1:04.10 stress-ng-vm
45 root 20 0 291064 253740 1344 R 60.0 3.1 1:02.40 stress-ng-vm
1 root 20 0 112976 15316 6820 S 0.0 0.2 0:25.18 web-show
34 root 20 0 41060 7840 5452 S 0.0 0.1 0:00.00 stress-ng
35 root 20 0 41064 2448 52 S 0.0 0.0 0:00.00 stress-ng-vm
36 root 20 0 41704 9252 3540 S 0.0 0.1 0:01.97 stress-ng-cpu
37 root 20 0 41064 2448 52 S 0.0 0.0 0:00.00 stress-ng-vm
38 root 20 0 41704 9192 3476 S 0.0 0.1 0:01.62 stress-ng-cpu
39 root 20 0 41064 2448 52 S 0.0 0.0 0:00.00 stress-ng-vm
40 root 20 0 41704 9252 3540 S 0.0 0.1 0:01.64 stress-ng-cpu
41 root 20 0 41064 2448 52 S 0.0 0.0 0:00.00 stress-ng-vm
42 root 20 0 41704 9192 3476 S 0.0 0.1 0:01.66 stress-ng-cpu
47 root 20 0 11832 2684 2456 S 0.0 0.0 0:00.01 sh
sh-4.2# top
top - 03:19:13 up 7 days, 23:05, 0 users, load average: 8.91, 3.79, 1.48
Tasks: 16 total, 5 running, 11 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98.6 us, 1.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 8154912 total, 239884 free, 2111728 used, 5803300 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 5718512 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
46 root 20 0 291064 253740 1344 R 94.3 3.1 2:11.74 stress-ng-vm
45 root 20 0 291064 253740 1344 R 93.7 3.1 2:13.74 stress-ng-vm
44 root 20 0 291064 253740 1344 R 93.3 3.1 2:13.47 stress-ng-vm
43 root 20 0 291064 253740 1344 R 91.3 3.1 2:13.76 stress-ng-vm
38 root 20 0 41704 9192 3476 S 11.7 0.1 0:03.40 stress-ng-cpu
1 root 20 0 112976 15316 6820 S 0.0 0.2 0:25.21 web-show
34 root 20 0 41060 7840 5452 S 0.0 0.1 0:00.00 stress-ng
35 root 20 0 41064 2448 52 S 0.0 0.0 0:00.00 stress-ng-vm
36 root 20 0 41704 9252 3540 S 0.0 0.1 0:03.71 stress-ng-cpu
37 root 20 0 41064 2448 52 S 0.0 0.0 0:00.00 stress-ng-vm
39 root 20 0 41064 2448 52 S 0.0 0.0 0:00.00 stress-ng-vm
40 root 20 0 41704 9252 3540 S 0.0 0.1 0:02.83 stress-ng-cpu
41 root 20 0 41064 2448 52 S 0.0 0.0 0:00.00 stress-ng-vm
42 root 20 0 41704 9192 3476 S 0.0 0.1 0:02.85 stress-ng-cpu
47 root 20 0 11832 2804 2440 S 0.0 0.0 0:00.01 sh
[1]+ Stopped(SIGSTOP) top
sh-4.2# uptime
03:19:22 up 7 days, 23:05, 0 users, load average: 8.95, 3.97, 1.56

2. Check the load on the compute node
[root@k8s-node1 ~]# top
top - 11:19:55 up 7 days, 23:06, 1 user, load average: 8.36, 4.30, 1.75
Tasks: 189 total, 6 running, 115 sleeping, 1 stopped, 0 zombie
%Cpu(s): 98.5 us, 1.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 8154912 total, 144968 free, 2029056 used, 5980888 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 5623656 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5484 root 20 0 291064 253740 1344 R 97.3 3.1 2:52.08 stress-ng-vm
5485 root 20 0 322328 284904 1344 R 93.7 3.5 2:52.45 stress-ng-vm
5486 root 20 0 322328 284860 1344 R 91.0 3.5 2:50.38 stress-ng-vm
5483 root 20 0 322328 284792 1344 R 89.4 3.5 2:52.82 stress-ng-vm
5476 root 20 0 41704 9252 3540 S 9.6 0.1 0:05.00 stress-ng-cpu
13676 tidb 20 0 10.5g 209764 58128 S 9.3 2.6 432:52.32 pd-server
22361 root 20 0 1986632 125576 70520 S 3.0 1.5 307:20.30 kubelet
30096 root 20 0 752296 57752 35756 S 1.3 0.7 22:41.88 kube-scheduler
31181 root 20 0 753256 62176 35336 S 1.0 0.8 5:01.58 chaos-controlle
2340 root 20 0 1695896 108680 53292 S 0.7 1.3 107:31.90 dockerd
873 root 20 0 21544 2704 2456 S 0.3 0.0 0:20.71 irqbalance
25407 root 20 0 711016 14412 6096 S 0.3 0.2 0:54.73 containerd-shim
1 root 20 0 191568 5648 3700 S 0.0 0.1 0:46.20 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.38 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
[root@k8s-node1 ~]# uptime
11:19:58 up 7 days, 23:06, 1 user, load average: 8.49, 4.40, 1.79

3. Pausing the experiment midway brings the CPU load down, and resuming it brings the load back up, until the experiment ends.
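
A stress experiment of this kind corresponds to a StressChaos object. The sketch below is a guess at a comparable configuration, not the article's actual one: the name and 5m duration come from the pod-cpu entry in the earlier kubectl output, the worker counts are inferred from the four busy stress-ng-vm processes and four stress-ng-cpu processes in the top output, and the memory size is an assumption.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: pod-cpu                 # name seen in the kubectl output earlier
  namespace: chaosmesh-test
spec:
  mode: all
  duration: '5m'
  selector:
    namespaces:
      - chaosmesh-test
    labelSelectors:
      app: webshow-deployment
  stressors:
    cpu:
      workers: 4                # inferred from the stress-ng-cpu processes in top
    memory:
      workers: 4                # inferred from the stress-ng-vm processes in top
      size: '256MB'             # assumption: illustrative value
```

Because the stress processes run inside the target container's cgroup, the load shows up both inside the Pod and on the node, as the two top outputs illustrate.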
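(3) Workflow

Simulating a realistic failure usually takes more than a single isolated fault injection. To meet that need, Chaos Mesh provides Chaos Mesh Workflow, a built-in workflow engine. With it you can run different chaos experiments serially or in parallel to simulate production-grade failures.

Chaos Mesh Workflow currently supports:

    Serial orchestration

    Parallel orchestration

    Custom tasks

    Conditional branches

Example use cases:

    Inject several NetworkChaos experiments at once with parallel orchestration to simulate a complex network environment

    Run a health check within a serial orchestration and use a conditional branch to decide whether to run the remaining steps

The design of Chaos Mesh Workflow borrows to some extent from Argo Workflow. If you know Argo Workflow, you will pick up Chaos Mesh Workflow quickly.

For details, see https://chaos-mesh.org/zh/docs/create-chaos-mesh-workflow/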
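
As an illustration of parallel orchestration, the sketch below runs a CPU stress experiment and a network delay experiment against the web-show Pod at the same time. All names, deadlines, and fault parameters are assumptions made for this example; only the namespace and Pod label come from the article.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: parallel-example        # hypothetical name
  namespace: chaosmesh-test
spec:
  entry: entry                  # template the workflow starts from
  templates:
    - name: entry
      templateType: Parallel    # children run at the same time
      deadline: 240s
      children:
        - stress-cpu
        - delay-network
    - name: stress-cpu
      templateType: StressChaos
      deadline: 120s
      stressChaos:
        mode: one
        selector:
          namespaces:
            - chaosmesh-test
          labelSelectors:
            app: webshow-deployment
        stressors:
          cpu:
            workers: 1
            load: 20
    - name: delay-network
      templateType: NetworkChaos
      deadline: 120s
      networkChaos:
        action: delay
        mode: all
        direction: to
        selector:
          namespaces:
            - chaosmesh-test
          labelSelectors:
            app: webshow-deployment
        delay:
          latency: '90ms'
          correlation: '25'
          jitter: '10ms'
```

Changing templateType on the entry template from Parallel to Serial would run the same children one after another instead.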
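(4) Schedule

In Kubernetes, Chaos Mesh uses the Schedule object to describe scheduled tasks.

The name of a Schedule object should not exceed 57 characters, because the experiments it creates get 6 random characters appended to the name. The name of a Schedule that contains a Workflow should not exceed 51 characters, because the Workflow also appends 6 random characters to the names it creates.

The schedule field
The schedule field specifies when the experiment runs.

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday; on some systems 7 is also Sunday)
│ │ │ │ │
* * * * *

| Entry | Description | Equivalent to |
| --- | --- | --- |
| @yearly (or @annually) | Run once a year at midnight on January 1 | 0 0 1 1 * |
| @monthly | Run once a month at midnight on the first day of the month | 0 0 1 * * |
| @weekly | Run once a week at midnight on Sunday | 0 0 * * 0 |
| @daily (or @midnight) | Run once a day at midnight | 0 0 * * * |
| @hourly | Run once an hour at the start of the hour | 0 * * * * |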
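
A Schedule manifest combines such a cron expression with an embedded experiment spec. The sketch below reconstructs one from the Spec section of the kubectl describe output shown later in this section (schedule-01); it is a reading of that output, not the author's original file.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: schedule-01
  namespace: chaosmesh-test
spec:
  schedule: '*/2 * * * *'        # every two minutes
  concurrencyPolicy: Forbid      # skip a run while the previous one is still active
  historyLimit: 1
  startingDeadlineSeconds: 600
  type: StressChaos              # selects which embedded spec below is used
  stressChaos:
    mode: all
    duration: '10m'
    selector:
      namespaces:
        - chaosmesh-test
    stressors:
      cpu:
        workers: 3
      memory:
        workers: 3
        size: '1024m'            # value as it appears in the describe output
```

Since each run lasts 10 minutes while the schedule fires every 2 minutes, the Forbid policy explains the "Forbid spawning new job" warnings in the Events list.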
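1. Create the schedule
2. Fill in the schedule period, concurrency policy, and other information
3. Submit the experiment
4. The schedule fires every two minutes. Check the CPU load of the Pod and of the worker node hosting it, and view the Schedule information on the master node

1. View the information on the master node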
kubectl get pod,PodChaos,StressChaos,schedule -n chaosmesh-test -owide Sat Apr 2 12:03:35 2022

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/webshow-deployment-6cbdcc4cd4-ljbtk 1/1 Running 7 21h 10.244.3.39 k8s-node1

NAME AGE
podchaos.chaos-mesh.org/pod-containers-kill 21h
podchaos.chaos-mesh.org/pod-failure-01 15h
podchaos.chaos-mesh.org/pod-kill 23h
podchaos.chaos-mesh.org/pod-kill-all 21h
podchaos.chaos-mesh.org/pod-kill03 22h

NAME DURATION
stresschaos.chaos-mesh.org/cpu-test-01 10m
stresschaos.chaos-mesh.org/pod-cpu 5m
stresschaos.chaos-mesh.org/schedule-01-j9n5f 10m

NAME AGE
schedule.chaos-mesh.org/schedule-01 33m

# View the Schedule details
[root@k8s-master ~]# kubectl describe schedule.chaos-mesh.org/schedule-01 -nchaosmesh-test
Name: schedule-01
Namespace: chaosmesh-test
Labels:
Annotations: experiment.chaos-mesh.org/pause: false
API Version: chaos-mesh.org/v1alpha1
Kind: Schedule
Metadata:
Creation Timestamp: 2022-04-02T03:30:07Z
Generation: 23
Managed Fields:
API Version: chaos-mesh.org/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:status:
f:active:
f:time:
Manager: chaos-controller-manager
Operation: Update
Time: 2022-04-02T03:32:00Z
API Version: chaos-mesh.org/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:experiment.chaos-mesh.org/pause:
f:spec:
.:
f:concurrencyPolicy:
f:historyLimit:
f:schedule:
f:startingDeadlineSeconds:
f:stressChaos:
.:
f:duration:
f:mode:
f:selector:
.:
f:namespaces:
f:stressors:
.:
f:cpu:
.:
f:workers:
f:memory:
.:
f:size:
f:workers:
f:type:
f:status:
Manager: chaos-dashboard
Operation: Update
Time: 2022-04-02T03:36:49Z
Resource Version: 1513442
UID: 7a198cb5-feb6-4403-ab37-b3ceab1e954e
Spec:
Concurrency Policy: Forbid
History Limit: 1
Schedule: */2 * * * *
Starting Deadline Seconds: 600
Stress Chaos:
Duration: 10m
Mode: all
Selector:
Namespaces:
chaosmesh-test
Stressors:
Cpu:
Workers: 3
Memory:
Size: 1024m
Workers: 3
Type: StressChaos
Status:
Active:
API Version: chaos-mesh.org/v1alpha1
Kind: StressChaos
Name: schedule-01-98lvp
Namespace: chaosmesh-test
Resource Version: 1513440
UID: abcedc4b-1cb4-48ef-923e-f3c2c9cb6934
Time: 2022-04-02T04:04:28Z
Events:
Type Reason Age From Message

Normal Spawned 35m schedule-cron Create new object: schedule-01-j9n5f
Normal Updated 35m schedule-cron Successfully update lastScheduleTime of resource
Warning Forbid 33m schedule-cron Forbid spawning new job because: schedule-01-j9n5f is still running
Normal Spawned 3m5s schedule-cron Create new object: schedule-01-98lvp
Normal Updated 3m5s schedule-cron Successfully update lastScheduleTime of resource
Warning Forbid 93s schedule-cron Forbid spawning new job because: schedule-01-98lvp is still running

2. Check the Pod load
top - 04:06:14 up 7 days, 23:52, 0 users, load average: 7.65, 2.91, 1.67
Tasks: 14 total, 7 running, 6 sleeping, 1 stopped, 0 zombie
%Cpu(s): 98.4 us, 1.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem : 8154912 total, 276480 free, 2087400 used, 5791032 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 5742844 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
76 root 20 0 374392 335024 1336 R 73.4 4.1 1:04.01 stress-ng-vm
74 root 20 0 41704 8012 3544 R 68.1 0.1 1:07.17 stress-ng-cpu
71 root 20 0 41704 8012 3544 R 65.8 0.1 1:07.89 stress-ng-cpu
75 root 20 0 374392 335024 1336 R 63.5 4.1 1:08.54 stress-ng-vm
72 root 20 0 374392 335024 1336 R 57.5 4.1 1:08.74 stress-ng-vm
69 root 20 0 41704 8012 3544 R 51.5 0.1 1:05.62 stress-ng-cpu
1 root 20 0 112976 15316 6820 S 0.0 0.2 0:26.62 web-show
47 root 20 0 11832 2804 2440 S 0.0 0.0 0:00.01 sh
54 root 20 0 56192 3776 3264 T 0.0 0.0 0:00.00 top
56 root 20 0 56196 3720 3208 R 0.0 0.0 0:00.63 top
67 root 20 0 41056 5628 5280 S 0.0 0.1 0:00.00 stress-ng
68 root 20 0 41060 420 64 S 0.0 0.0 0:00.00 stress-ng-vm
70 root 20 0 41060 420 64 S 0.0 0.0 0:00.00 stress-ng-vm
73 root 20 0 41060 420 64 S 0.0 0.0 0:00.00 stress-ng-vm

3. Check the worker node load
[root@k8s-node1 ~]# uptime
12:06:51 up 7 days, 23:53, 1 user, load average: 7.82, 3.46, 1.90

5. Pausing the schedule

Unlike a CronJob, pausing a Schedule not only stops it from creating new experiments but also pauses the experiments it has already created.

1. To stop creating experiments through the schedule for a while, add the annotation experiment.chaos-mesh.org/pause=true to the Schedule object with kubectl:
kubectl annotate -n $NAMESPACE schedule $NAME experiment.chaos-mesh.org/pause=true
Output: schedule/$NAME annotated

2. To resume, remove the annotation:
kubectl annotate -n $NAMESPACE schedule $NAME experiment.chaos-mesh.org/pause-
Output: schedule/$NAME annotated
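
The pause annotation can also be set through a small merge patch kept in a file. This is a sketch under assumptions: the file name pause.yaml is hypothetical, and it relies on a kubectl version that supports the --patch-file option of kubectl patch.

```yaml
# pause.yaml (hypothetical file name): merge patch that sets the pause annotation
metadata:
  annotations:
    experiment.chaos-mesh.org/pause: 'true'   # annotation values must be strings
```

It would be applied with kubectl patch schedule $NAME -n $NAMESPACE --type merge --patch-file pause.yaml; changing the value to 'false' and applying again resumes the schedule.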

Note: where the mode values are defined

https://github.com/chaos-mesh/chaos-mesh/blob/master/api/v1alpha1/selector.go

const (
	// OneMode represents that the system will do the chaos action on one object selected randomly.
	OneMode SelectorMode = "one"
	// AllMode represents that the system will do the chaos action on all objects
	// regardless of status (not ready or not running pods includes).
	// Use this label carefully.
	AllMode SelectorMode = "all"
	// FixedMode represents that the system will do the chaos action on a specific number of running objects.
	FixedMode SelectorMode = "fixed"
	// FixedPercentMode to specify a fixed % that can be inject chaos action.
	FixedPercentMode SelectorMode = "fixed-percent"
	// RandomMaxPercentMode to specify a maximum % that can be inject chaos action.
	RandomMaxPercentMode SelectorMode = "random-max-percent"
)
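
In a manifest these constants appear as the string value of spec.mode, and the modes that take an argument read it from spec.value. A minimal sketch, with a hypothetical object name and an illustrative percentage:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-half            # hypothetical name
  namespace: chaosmesh-test
spec:
  action: pod-kill
  mode: fixed-percent            # FixedPercentMode
  value: '50'                    # assumption: affect 50% of matching Pods; given as a string
  selector:
    namespaces:
      - chaosmesh-test
    labelSelectors:
      app: webshow-deployment
```

The one and all modes need no value; fixed expects a Pod count, while fixed-percent and random-max-percent expect a percentage.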

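Copyright notice: this is an original article by a TiDB community user, released under the CC BY-NC-SA 4.0 license. When reposting, include a link to the original source and this notice.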
https://tidb.net/blog/eff7080c
