
Disaster Recovery | A hands-on record of backing up and restoring etcd data snapshots in a K8S cluster

全栈工程师修炼指南 2023-03-09




Author homepage: https://www.weiyigeek.top

Author blog: https://blog.weiyigeek.top



Table of contents:

  • 0x00 Introduction

  • 0x01 Environment Preparation

    • 1. Install the etcdctl binary

    • 2. Pull a Docker image that ships etcdctl

    • 3. Create a Pod from the etcdctl image in Kubernetes

  • 0x02 Backup in Practice

    • 1. Back up with the etcdctl binary client

    • 2. Back up with the etcdctl Docker image

    • 3. Quickly create a Pod in the cluster for a manual backup

    • 4. Scheduled backups with a CronJob controller

  • 0x03 Restore in Practice

    • 1. Single-master restore

    • 2. Multi-master restore

    • 3. Fixing etcd data inconsistency across nodes in a K8S cluster


Friendly note: Given the author's limited skill, errors and omissions in this chapter are unavoidable, and readers are welcome to point them out. If you have questions or suggestions, please leave your valuable feedback at the end of the article, email master@weiyigeek.top, or leave a message on the [全栈工程师修炼指南] official account.



0x00 Introduction

Description: Every resource a Kubernetes cluster operates on is stored in the etcd database. To survive node failures, to support cluster migration, and to be able to restore cluster data quickly when something goes wrong, we need to run disaster-recovery backups of the etcd cluster on a regular schedule.

In a K8S cluster or a Docker environment it is easy to back up etcd data: taking a snapshot on one node is usually enough. The snapshot file contains the full Kubernetes state and critical information, and with an etcd backup in hand the Kubernetes cluster can be restored quickly even in a disaster scenario such as losing all control-plane nodes. The boss no longer has to worry about the system failing to come back up.


0x01 Environment Preparation

1. Install the etcdctl binary

Description: etcdctl binaries for each release can be downloaded from github.com/coreos/etcd/releases . For example, run the install_etcdctl.sh script below, adjusting the version variable as needed.

install_etcdctl.sh

    #!/bin/bash
    # Author: WeiyiGeek
    # Description: download and install etcd and etcdctl
    ETCD_VER=v3.5.5
    ETCD_DIR=etcd-download
    DOWNLOAD_URL=https://github.com/coreos/etcd/releases/download

    # Download
    mkdir -p ${ETCD_DIR}
    cd ${ETCD_DIR}
    wget ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz
    tar -xzvf etcd-${ETCD_VER}-linux-amd64.tar.gz

    # Install
    cd etcd-${ETCD_VER}-linux-amd64
    cp etcdctl /usr/local/bin/

Verify the installation:

    $ etcdctl version
    etcdctl version: 3.5.5
    API version: 3.5


2. Pull a Docker image that ships etcdctl

Procedure:

    # Pull the image and create a throwaway container.
    docker run --rm \
    -v /data/backup:/backup \
    -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
    --env ETCDCTL_API=3 \
    registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.1-0 \
    /bin/sh -c "etcdctl version"

Verify:

    3.5.1-0: Pulling from google_containers/etcd
    e8614d09b7be: Pull complete
    45b6afb4a92f: Pull complete
    .......
    Digest: sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263
    Status: Downloaded newer image for registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.1-0
    etcdctl version: 3.5.1
    API version: 3.5


3. Create a Pod from the etcdctl image in Kubernetes

    # Pull the image
    crictl pull registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0

    # Create the Pod and verify
    $ kubectl run etcdctl --image=registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0 --command -- /usr/local/bin/etcdctl version
    $ kubectl logs -f etcdctl
    etcdctl version: 3.5.5
    API version: 3.5
    $ kubectl delete pod etcdctl


0x02 Backup in Practice

1. Back up with the etcdctl binary client

Friendly note: On a single-node Kubernetes cluster we only need to snapshot its one etcd database; on a multi-master cluster we should back up the etcd of each master node in turn, in case the etcd data changes while a backup is running!

The environment used here is a highly available multi-master cluster: three master nodes and four worker nodes. If you are new to K8s clusters or want to build a highly available cluster yourself, learning materials are available at:
https://www.weiyigeek.top/wechat.html?key=Kubernetes学习之路汇总

    $ kubectl get node
    NAME            STATUS   ROLES                  AGE    VERSION
    weiyigeek-107   Ready    control-plane,master   279d   v1.23.1
    weiyigeek-108   Ready    control-plane,master   278d   v1.23.1
    weiyigeek-109   Ready    control-plane,master   278d   v1.23.1
    weiyigeek-223   Ready    work                   278d   v1.23.1
    weiyigeek-224   Ready    work                   278d   v1.23.1
    weiyigeek-225   Ready    work                   279d   v1.23.1
    weiyigeek-226   Ready    work                   118d   v1.23.1


Procedure:

    # 1. Create a directory for etcd snapshot backups
    $ mkdir -pv /backup

    # 2. Check the etcd certificates
    $ ls /etc/kubernetes/pki/etcd/
    ca.crt ca.key healthcheck-client.crt healthcheck-client.key peer.crt peer.key server.crt server.key

    # 3. Check the etcd addresses and Pods
    $ kubectl get pod -n kube-system -o wide | grep "etcd"
    etcd-weiyigeek-107   1/1   Running   1   279d   192.168.12.107   weiyigeek-107   <none>   <none>
    etcd-weiyigeek-108   1/1   Running   0   278d   192.168.12.108   weiyigeek-108   <none>   <none>
    etcd-weiyigeek-109   1/1   Running   0   278d   192.168.12.109   weiyigeek-109   <none>   <none>

    # 4. Check the listening sockets on masters 107, 108 and 109
    $ netstat -ano | grep "107:2379"
    tcp 0 0 192.168.12.107:2379 0.0.0.0:* LISTEN off (0.00/0/0)
    $ netstat -ano | grep "108:2379"
    tcp 0 0 192.168.12.108:2379 0.0.0.0:* LISTEN off (0.00/0/0)
    $ netstat -ano | grep "109:2379"
    tcp 0 0 192.168.12.109:2379 0.0.0.0:* LISTEN off (0.00/0/0)

    # 5. Use the etcdctl client to back up each node's data in turn
    $ etcdctl --endpoints=https://10.20.176.212:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    snapshot save /backup/etcd-snapshot.db
    {"level":"info","ts":"2022-10-23T16:32:26.020+0800","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/backup/etcd-snapshot.db.part"}
    {"level":"info","ts":"2022-10-23T16:32:26.034+0800","logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
    {"level":"info","ts":"2022-10-23T16:32:26.034+0800","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.20.176.212:2379"}
    {"level":"info","ts":"2022-10-23T16:32:26.871+0800","logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
    {"level":"info","ts":"2022-10-23T16:32:26.946+0800","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.20.176.212:2379","size":"112 MB","took":"now"}
    {"level":"info","ts":"2022-10-23T16:32:26.946+0800","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/backup/etcd-snapshot.db"}
    Snapshot saved at /backup/etcd-snapshot.db
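Since this cluster has three masters, the same snapshot command can be wrapped in a small loop. A minimal sketch, assuming the certificate paths above and a /backup directory on the node running it (adjust the endpoints to your own masters):

    # Back up every etcd member in turn, one date-stamped snapshot file each.
    STAMP=$(date +%Y%m%d-%H%M%S)
    for ep in https://192.168.12.107:2379 https://192.168.12.108:2379 https://192.168.12.109:2379; do
      node=${ep#https://}; node=${node%%:*}   # extract the IP for the file name
      etcdctl --endpoints=${ep} \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
        snapshot save /backup/etcd-${node}-${STAMP}-snapshot.db
    done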

To query Kubernetes data in etcd with etcdctl, note that Kubernetes uses the etcd v3 API and the etcd cluster uses TLS authentication by default, so first set a few environment variables.

    # 1. Environment variables
    export ETCDCTL_API=3
    export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
    export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/peer.crt
    export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/peer.key

    # 2. List all keys in the cluster
    # --prefix : match every key that starts with /registry
    # --keys-only=true : print only keys, not values
    etcdctl --endpoints=https://10.20.176.212:2379 get /registry --prefix --keys-only=true | head -n 1
    /registry/apiextensions.k8s.io/customresourcedefinitions/bgpconfigurations.crd.projectcalico.org


    # 3. Query the value of a single key
    # --keys-only=false : also print the value (false is the default)
    # -w=json : output in JSON format
    etcdctl --endpoints=https://10.20.176.212:2379 get /registry/namespaces/default --prefix --keys-only=false -w=json | python3 -m json.tool
    {
        "header": {
            "cluster_id": 11404821125176160774,
            "member_id": 7099450421952911102,
            "revision": 30240109,
            "raft_term": 3
        },
        "kvs": [
            {
                "key": "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA==",
                "create_revision": 192,
                "mod_revision": 192,
                "version": 1,
                "value": "azhzAAoPCg.......XNwYWNlEGgAiAA=="
            }
        ],
        "count": 1
    }


    # 4. Both key and value are base64-encoded
    echo -e "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA==" | base64 -d
    echo -e "azhzAAoPCgJ2MRIJTmFtZXNwYWNlEogCCu0BCgdkZWZhdWx0EgAaACIAKiQ5ZDQyYmYxMy03OGM0LTQ4NzQtOThiYy05NjNlMDg1MDYyZjYyADgAQggIo8yllQYQAFomChtrdWJlcm5ldGVzLmlvL21ldGFkYXRhLm5hbWUSB2RlZmF1bHR6AIoBfQoOa3ViZS1hcGlzZXJ2ZXISBlVwZGF0ZRoCdjEiCAijzKWVBhAAMghGaWVsZHNWMTpJCkd7ImY6bWV0YWRhdGEiOnsiZjpsYWJlbHMiOnsiLiI6e30sImY6a3ViZXJuZXRlcy5pby9tZXRhZGF0YS5uYW1lIjp7fX19fUIAEgwKCmt1YmVybmV0ZXMaCAoGQWN0aXZlGgAiAA==" | base64 -d


    # 5. The decoded value above corresponds to the following resource:
    $ kubectl get ns default -o yaml
    apiVersion: v1
    kind: Namespace
    metadata:
      creationTimestamp: "2022-06-15T04:54:59Z"
      labels:
        kubernetes.io/metadata.name: default
      name: default
      resourceVersion: "192"
      uid: 9d42bf13-78c4-4874-98bc-963e085062f6
    spec:
      finalizers:
      - kubernetes
    status:
      phase: Active


    # 6. Inspect Pod resource entries
    etcdctl --endpoints=https://10.20.176.212:2379 get /registry/pods/default --prefix --keys-only=true
    # /registry/pods/default/nfs-dev-nfs-subdir-external-provisioner-cf7684f8b-fzl9h
    # /registry/pods/default/nfs-local-nfs-subdir-external-provisioner-6f97d44bb8-424tk

    etcdctl --endpoints=https://10.20.176.212:2379 get /registry/pods/default --prefix --keys-only=true -w=json | python3 -m json.tool
    {
        "header": {
            "cluster_id": 11404821125176160774,
            "member_id": 7099450421952911102,
            "revision": 30442415,
            "raft_term": 3
        },
        "kvs": [
            {   # this key decodes to /registry/pods/default/nfs-dev-nfs-subdir-external-provisioner-cf7684f8b-fzl9h
                "key": "L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9uZnMtZGV2LW5mcy1zdWJkaXItZXh0ZXJuYWwtcHJvdmlzaW9uZXItY2Y3Njg0ZjhiLWZ6bDlo",
                "create_revision": 5510865,
                "mod_revision": 5510883,
                "version": 5
            },
            {
                "key": "L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9uZnMtbG9jYWwtbmZzLXN1YmRpci1leHRlcm5hbC1wcm92aXNpb25lci02Zjk3ZDQ0YmI4LTQyNHRr",
                "create_revision": 5510967,
                "mod_revision": 5510987,
                "version": 5
            }
        ],
        "count": 2
    }
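Hand-decoding those base64 keys and values gets tedious. If jq is available on the host (an assumption; it is not part of etcdctl), the decoding can be piped, reusing the ETCDCTL_* environment variables exported above:

    # Decode the key and value of one etcd entry in a single pipeline.
    # The value is the Kubernetes protobuf encoding, so only the embedded
    # strings are human-readable; `strings` filters those out.
    etcdctl --endpoints=https://10.20.176.212:2379 get /registry/namespaces/default -w=json \
      | jq -r '.kvs[0].key' | base64 -d; echo
    etcdctl --endpoints=https://10.20.176.212:2379 get /registry/namespaces/default -w=json \
      | jq -r '.kvs[0].value' | base64 -d | strings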


2. Back up with the etcdctl Docker image

Description: On a machine with Docker installed we can back up the etcd database of a K8s cluster very conveniently. Docker is already installed here; if you are unfamiliar with Docker or still need to set up a Docker environment, see the author's earlier posts.

    # 1. Docker environment used in this practice
    $ docker version
    Client: Docker Engine - Community
    Version 20.10.3
    .......
    Server: Docker Engine - Community
    Engine:
    Version: 20.10.3

    # 2. Directory to store the etcd backup files
    $ mkdir -vp /backup


    # 3. Run a container with docker; it is removed automatically once the backup completes.
    $ docker run --rm \
    -v /backup:/backup \
    -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
    --env ETCDCTL_API=3 \
    registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.1-0 \
    /bin/sh -c "etcdctl --endpoints=https://192.168.12.107:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    snapshot save /backup/etcd-snapshot.db"
    # {"level":"info","ts":1666515535.63076,"caller":"snapshot/v3_snapshot.go:68","msg":"created temporary db file","path":"/backup/etcd-snapshot.db.part"}
    # {"level":"info","ts":1666515535.6411893,"logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
    # {"level":"info","ts":1666515535.6419039,"caller":"snapshot/v3_snapshot.go:76","msg":"fetching snapshot","endpoint":"https://192.168.12.107:2379"}
    # {"level":"info","ts":1666515535.9170482,"logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
    # {"level":"info","ts":1666515535.931862,"caller":"snapshot/v3_snapshot.go:91","msg":"fetched snapshot","endpoint":"https://192.168.12.107:2379","size":"9.0 MB","took":"now"}
    # {"level":"info","ts":1666515535.9322069,"caller":"snapshot/v3_snapshot.go:100","msg":"saved","path":"/backup/etcd-snapshot.db"}
    Snapshot saved at /backup/etcd-snapshot.db

    # 4. Inspect the etcd snapshot file that was written
    ls -alh /backup/etcd-snapshot.db
    -rw------- 1 root root 8.6M Oct 23 16:58 /backup/etcd-snapshot.db
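Before relying on a snapshot it is worth checking that the file is actually readable. A quick sketch using etcdutl, which ships alongside etcdctl in the etcd 3.5 release tarball (etcdctl snapshot status also still works in 3.5 but is marked deprecated):

    # Print the hash, revision, total key count and size of the snapshot file.
    etcdutl snapshot status /backup/etcd-snapshot.db --write-out=table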

Use a Docker container to query the data in the k8s cluster's etcd database:

    $ docker run --rm \
    -v /backup:/backup \
    -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
    --env ETCDCTL_API=3 \
    --env ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
    --env ETCDCTL_CERT=/etc/kubernetes/pki/etcd/peer.crt \
    --env ETCDCTL_KEY=/etc/kubernetes/pki/etcd/peer.key \
    registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.1-0 \
    /bin/sh -c "etcdctl --endpoints=https://192.168.12.107:2379 get /registry/namespaces/default -w=json"

3. Quickly create a Pod in the cluster for a manual backup

Prepare a Pod resource manifest and deploy it:

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: etcd-backup
      labels:
        tool: backup
    spec:
      containers:
      - name: etcdctl
        image: registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0
        imagePullPolicy: IfNotPresent
        command:
        - sh
        - -c
        - "etcd"
        env:
        - name: ETCDCTL_API
          value: "3"
        - name: ETCDCTL_CACERT
          value: "/etc/kubernetes/pki/etcd/ca.crt"
        - name: ETCDCTL_CERT
          value: "/etc/kubernetes/pki/etcd/healthcheck-client.crt"
        - name: ETCDCTL_KEY
          value: "/etc/kubernetes/pki/etcd/healthcheck-client.key"
        volumeMounts:
        - name: "pki"
          mountPath: "/etc/kubernetes"
        - name: "backup"
          mountPath: "/backup"
      volumes:
      - name: pki
        hostPath:
          path: "/etc/kubernetes" # certificate directory
          type: "DirectoryOrCreate"
      - name: "backup"
        hostPath:
          path: "/storage/dev/backup" # backup data directory
          type: "DirectoryOrCreate"
      restartPolicy: Never
      nodeSelector:
        node-role.kubernetes.io/master: "" # pin the Pod to a master node
    EOF
    pod/etcd-backup created


Open a shell inside the Pod and run the backup commands:

    ~$ kubectl exec -it etcd-backup -- sh
    # Snapshot backups
    sh-5.1# export RAND=$RANDOM
    sh-5.1# etcdctl --endpoints=https://192.168.12.107:2379 snapshot save /backup/etcd-107-${RAND}-snapshot.db
    sh-5.1# etcdctl --endpoints=https://192.168.12.108:2379 snapshot save /backup/etcd-108-${RAND}-snapshot.db
    Snapshot saved at /backup/etcd-108-32616-snapshot.db
    sh-5.1# etcdctl --endpoints=https://192.168.12.109:2379 snapshot save /backup/etcd-109-${RAND}-snapshot.db


    # etcd cluster members
    sh-5.1# etcdctl member list --endpoints=https://192.168.12.107:2379 --endpoints=https://192.168.12.108:2379 --endpoints=https://192.168.12.109:2379
    2db31a5d67ec1034, started, weiyigeek-108, https://192.168.12.108:2380, https://192.168.12.108:2379, false
    42efe7cca897d765, started, weiyigeek-109, https://192.168.12.109:2380, https://192.168.12.109:2379, false
    471323846709334f, started, weiyigeek-107, https://192.168.12.107:2380, https://192.168.12.107:2379, false

    # etcd endpoint health
    sh-5.1# etcdctl endpoint health --endpoints=https://192.168.12.107:2379 --endpoints=https://192.168.12.108:2379 --endpoints=https://192.168.12.109:2379
    https://192.168.12.107:2379 is healthy: successfully committed proposal: took = 11.930331ms
    https://192.168.12.109:2379 is healthy: successfully committed proposal: took = 11.930993ms
    https://192.168.12.108:2379 is healthy: successfully committed proposal: took = 109.515933ms

    # etcd endpoint status and space usage
    sh-5.1# etcdctl endpoint status --endpoints=https://192.168.12.107:2379 --endpoints=https://192.168.12.108:2379 --endpoints=https://192.168.12.109:2379
    https://192.168.12.107:2379, 471323846709334f, 3.5.1, 9.2 MB, false, false, 4, 71464830, 71464830,
    https://192.168.12.108:2379, 2db31a5d67ec1034, 3.5.1, 9.2 MB, false, false, 4, 71464830, 71464830,
    https://192.168.12.109:2379, 42efe7cca897d765, 3.5.1, 9.2 MB, true, false, 4, 71464830, 71464830, # this one is the leader
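The comma-separated status output above is hard to scan. etcdctl can also render it as a table, and --endpoints accepts a single comma-separated list (the same -w table flag is used in the recovery section below):

    sh-5.1# etcdctl endpoint status \
      --endpoints=https://192.168.12.107:2379,https://192.168.12.108:2379,https://192.168.12.109:2379 \
      -w table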

At this point the manual backup of the etcd cluster snapshot data is complete!



4. Scheduled backups with a CronJob controller

First prepare a CronJob resource manifest:

    cat > etcd-database-backup.yaml <<'EOF'
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: etcd-database-backup
      annotations:
        descript: "scheduled backup of the etcd database"
    spec:
      schedule: "*/5 * * * *" # run every 5 minutes
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: etcdctl
                image: registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0
                env:
                - name: ETCDCTL_API
                  value: "3"
                - name: ETCDCTL_CACERT
                  value: "/etc/kubernetes/pki/etcd/ca.crt"
                - name: ETCDCTL_CERT
                  value: "/etc/kubernetes/pki/etcd/healthcheck-client.crt"
                - name: ETCDCTL_KEY
                  value: "/etc/kubernetes/pki/etcd/healthcheck-client.key"
                command:
                - /bin/sh
                - -c
                - |
                  export RAND=$RANDOM
                  etcdctl --endpoints=https://192.168.12.107:2379 snapshot save /backup/etcd-107-${RAND}-snapshot.db
                  etcdctl --endpoints=https://192.168.12.108:2379 snapshot save /backup/etcd-108-${RAND}-snapshot.db
                  etcdctl --endpoints=https://192.168.12.109:2379 snapshot save /backup/etcd-109-${RAND}-snapshot.db
                volumeMounts:
                - name: "pki"
                  mountPath: "/etc/kubernetes"
                - name: "backup"
                  mountPath: "/backup"
                imagePullPolicy: IfNotPresent
              volumes:
              - name: "pki"
                hostPath:
                  path: "/etc/kubernetes"
                  type: "DirectoryOrCreate"
              - name: "backup"
                hostPath:
                  path: "/storage/dev/backup" # backup data directory
                  type: "DirectoryOrCreate"
              nodeSelector: # pin the Pod to a master node; otherwise the certificates would have to live on NFS shared storage reachable from every node
                node-role.kubernetes.io/master: ""
              restartPolicy: Never
    EOF

Create the CronJob resource:

    kubectl apply -f etcd-database-backup.yaml
    # cronjob.batch/etcd-database-backup created

Check the CronJob resource and the backup Jobs it spawns:

    $ kubectl get cronjob,pod
    NAME                                   SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
    cronjob.batch/etcd-database-backup     */5 * * * *   False     0        21s             14m

    NAME                                      READY   STATUS      RESTARTS   AGE
    pod/etcd-database-backup-27776740-rhzkk   0/1     Completed   0          21s

Check the scheduled Pod's logs and the backup files:

    kubectl logs -f pod/etcd-database-backup-27776740-rhzkk
    Snapshot saved at /backup/etcd-107-25615-snapshot.db
    Snapshot saved at /backup/etcd-108-25615-snapshot.db
    Snapshot saved at /backup/etcd-109-25615-snapshot.db

    $ ls -lt # newest backup files first, sorted by time
    total 25M
    -rw------- 1 root root 8.6M Oct 24 21:12 etcd-107-25615-snapshot.db
    -rw------- 1 root root 7.1M Oct 24 21:12 etcd-108-25615-snapshot.db
    -rw------- 1 root root 8.8M Oct 24 21:12 etcd-109-25615-snapshot.db
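Three snapshots every five minutes add up quickly, so pair the CronJob with a retention policy. A minimal sketch, assuming the backups live under /storage/dev/backup on the master and that keeping seven days is acceptable (both are assumptions to adjust):

    # Delete snapshot files older than 7 days; run periodically (e.g. from cron)
    # on the master node that holds the hostPath backup directory.
    find /storage/dev/backup -name 'etcd-*-snapshot.db' -type f -mtime +7 -delete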

At this point the scheduled backup of the cluster's etcd snapshot data is complete!


0x03 Restore in Practice

1. Single-master restore

Description: When resource manifest data is lost on a single-master cluster, we can restore it quickly as follows.

Procedure:

Friendly note: on a single-node K8S cluster, restore with the commands below.

    # Move the static Pod manifests away (kubelet stops the control-plane Pods) and set the old data aside
    mv /etc/kubernetes/manifests/ /etc/kubernetes/manifests-backup/
    mv /var/lib/etcd /var/lib/etcd.bak

    # Rebuild the data directory from the snapshot
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-212-32616-snapshot.db --data-dir=/var/lib/etcd/ --endpoints=https://10.20.176.212:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
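After the restore writes the new data directory, the static Pod manifests still need to be moved back so that kubelet recreates etcd and kube-apiserver on top of the restored data. Roughly:

    # kubelet watches this directory and will restart the control-plane static Pods.
    mv /etc/kubernetes/manifests-backup/ /etc/kubernetes/manifests/
    systemctl restart kubelet
    # Confirm the control plane comes back
    kubectl get pod -n kube-system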


2. Multi-master restore

1. Friendly note: the etcd cluster here runs inside the Kubernetes cluster itself; it is not an externally deployed standalone etcd.

    $ kubectl get pod -n kube-system etcd-devtest-master-212 etcd-devtest-master-213 etcd-devtest-master-214
    NAME                      READY   STATUS    RESTARTS   AGE
    etcd-devtest-master-212   1/1     Running   3          134d
    etcd-devtest-master-213   1/1     Running   0          134d
    etcd-devtest-master-214   1/1     Running   0          134d

2. As mentioned earlier, before restoring we should check the current etcd cluster members and their health.

    # etcd cluster member list
    ETCDCTL_API=3 etcdctl member list --endpoints=https://10.20.176.212:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
    6286508b550016fe, started, devtest-master-212, https://10.20.176.212:2380, https://10.20.176.212:2379, false
    9dd15f852caf8e05, started, devtest-master-214, https://10.20.176.214:2380, https://10.20.176.214:2379, false
    e0f23bd90b7c7c0d, started, devtest-master-213, https://10.20.176.213:2380, https://10.20.176.213:2379, false

    # etcd node status: identify the leader and the followers
    ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://10.20.176.212:2379 --endpoints=https://10.20.176.213:2379 --endpoints=https://10.20.176.214:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out table

    # etcd endpoint health: filter out any unhealthy node
    ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://10.20.176.212:2379 --endpoints=https://10.20.176.213:2379 --endpoints=https://10.20.176.214:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
    https://10.20.176.212:2379 is healthy: successfully committed proposal: took = 14.686875ms
    https://10.20.176.214:2379 is healthy: successfully committed proposal: took = 16.201187ms
    https://10.20.176.213:2379 is healthy: successfully committed proposal: took = 18.962462ms

3. Stop kube-apiserver and etcd on all master machines, then restore the node's etcd data from the backup.

    # Move the static Pod manifests away so kubelet stops the control-plane Pods
    mv /etc/kubernetes/manifests/ /etc/kubernetes/manifests-backup/

    # Set aside /var/lib/etcd on this node
    mv /var/lib/etcd /var/lib/etcd.bak

    # Restore from the snapshot; note that the same snapshot can be used to restore every member of the etcd cluster.
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-212-32616-snapshot.db --data-dir=/var/lib/etcd --name=devtest-master-212 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --initial-cluster-token=etcd-cluster-0 --initial-cluster=devtest-master-212=https://10.20.176.212:2380,devtest-master-213=https://10.20.176.213:2380,devtest-master-214=https://10.20.176.214:2380 --initial-advertise-peer-urls=https://10.20.176.212:2380
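The command above only rebuilds the data directory on devtest-master-212; the same snapshot must also be restored on the other two masters, each with its own --name and --initial-advertise-peer-urls. A sketch following the same flag pattern (the restore is an offline operation, so the TLS flags are omitted here):

    # On devtest-master-213:
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-212-32616-snapshot.db \
      --data-dir=/var/lib/etcd --name=devtest-master-213 \
      --initial-cluster-token=etcd-cluster-0 \
      --initial-cluster=devtest-master-212=https://10.20.176.212:2380,devtest-master-213=https://10.20.176.213:2380,devtest-master-214=https://10.20.176.214:2380 \
      --initial-advertise-peer-urls=https://10.20.176.213:2380
    # On devtest-master-214, repeat with --name=devtest-master-214 and
    # --initial-advertise-peer-urls=https://10.20.176.214:2380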


Friendly note: when a node joins the control plane, static Pod manifests are generated for the API Server, Controller Manager and Scheduler. The kubelet service on the host watches the /etc/kubernetes/manifests directory for manifest creation, change and deletion events, and creates, updates or deletes the corresponding Pods accordingly. Restoring the manifests in these steps is therefore what brings the master components back up as Pods.

4. Then start etcd and the apiserver, and check whether the Pods come back to normal.

    # Check the etcd Pods
    $ kubectl get pod -n kube-system -l component=etcd

    # Check component health
    $ kubectl get componentstatuses
    Warning: v1 ComponentStatus is deprecated in v1.19+
    NAME                 STATUS    MESSAGE                         ERROR
    scheduler            Healthy   ok
    controller-manager   Healthy   ok
    etcd-0               Healthy   {"health":"true","reason":""}




3. Fixing etcd data inconsistency across nodes in a K8S cluster

Description: During a routine inspection of business workloads we found the etcd data was inconsistent: kubectl get pod -n xxx returned different resources from one query to the next. Querying the etcd cluster state and data directly with etcdctl showed all 3 nodes as healthy with identical RaftIndex values, and the etcd logs contained no errors; the only suspicious sign was that the db sizes of the 3 nodes differed noticeably, which pointed to etcd data inconsistency.

    $ ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://192.168.12.108:2379 --endpoints=https://192.168.12.107:2379 --endpoints=https://192.168.12.109:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out table
    +-----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | ENDPOINT                    | ID               | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +-----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | https://192.168.12.108:2379 | 2db31a5d67ec1034 | 3.5.1   | 7.4 MB  | false     | 4         | 72291872   |
    | https://192.168.12.107:2379 | 471323846709334f | 3.5.1   | 9.0 MB  | false     | 4         | 72291872   |
    | https://192.168.12.109:2379 | 42efe7cca897d765 | 3.5.1   | 9.2 MB  | true      | 4         | 72291872   |
    +-----------------------------+------------------+---------+---------+-----------+-----------+------------+

Since the kube-apiserver logs likewise yielded nothing useful, our first guess was a cache-update problem in kube-apiserver. Just as we were about to dig in from that angle, a colleague reported something even stranger: a Pod they had just created kept vanishing from the kubectl Pod listing! After repeated tests we found that listing Pods with kubectl sometimes showed the Pod and sometimes did not. Now, the K8s API list operation is not cached; kube-apiserver fetches the data straight from etcd and returns it to the client. Could etcd itself be at fault?

As everyone knows, etcd is a strongly consistent KV store: after a successful write, two read requests should never return different data. Refusing to believe it, we queried the etcd cluster state and data directly with etcdctl. The results showed all 3 nodes healthy with identical RaftIndex values, and no errors in the etcd logs; again the only suspicious sign was the noticeably different db sizes. We then pointed the client endpoint at each node in turn and counted the keys on each: the 3 nodes returned different key counts, with the largest gap between two nodes reaching several thousand keys! And when we queried the freshly created Pod directly through etcdctl, some endpoints returned it while others did not.

At this point it was essentially confirmed: the nodes of the etcd cluster really did hold inconsistent data.

Resolution in practice: here the etcd data backed up from weiyigeek-109 is restored onto the weiyigeek-107 and weiyigeek-108 nodes.

    # 1. Export the data of the etcd node with the largest DB SIZE (always back up every etcd node before operating).
    etcdctl --endpoints=https://192.168.12.107:2379 snapshot save /backup/etcd-107-snapshot.db
    etcdctl --endpoints=https://192.168.12.108:2379 snapshot save /backup/etcd-108-snapshot.db
    etcdctl --endpoints=https://192.168.12.109:2379 snapshot save /backup/etcd-109-snapshot.db # the most complete etcd data (this node is the leader)

    # Note: if the leader needs to be moved, run the following to hand leadership to weiyigeek-109
    etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --endpoints=192.168.12.107:2379 move-leader 42efe7cca897d765

    # 2. Set aside the etcd directory on the abnormal node
    mv /var/lib/etcd /var/lib/etcd.bak
    # or keep a compressed backup
    tar -czvf etcd-20230308bak.tar.gz /var/lib/etcd

    # 3. Remove the abnormal member from the etcd cluster; note that the endpoint specified here is the healthy leader node.
    $ etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --endpoints=192.168.12.109:2379 member list
    # remove the abnormal member by its ID (the same steps are repeated for each inconsistent node, here weiyigeek-107 and weiyigeek-108)
    $ etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --endpoints=192.168.12.109:2379 member remove 2db31a5d67ec1034

    # 4. Update the etcd manifest on the weiyigeek-107 node, then re-join the node to the etcd cluster.
    $ mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/manifests-backup/etcd.yaml

    $ vim /etc/kubernetes/manifests-backup/etcd.yaml
    - --initial-cluster=weiyigeek-109=https://192.168.12.109:2380,weiyigeek-108=https://192.168.12.108:2380,weiyigeek-107=https://192.168.12.107:2380
    - --initial-cluster-state=existing

    $ kubectl exec -it -n kube-system etcd-weiyigeek-107 -- sh
    $ export ETCDCTL_API=3
    $ ETCD_ENDPOINTS=192.168.12.109:2379
    $ etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --endpoints=$ETCD_ENDPOINTS member add weiyigeek-107 --peer-urls=https://192.168.12.107:2380
    Member 8cecf1f50b91e502 added to cluster 71d4ff56f6fd6e70

    # 5. After the member is added back, start the etcd service on the weiyigeek-107 node
    mv /etc/kubernetes/manifests-backup/etcd.yaml /etc/kubernetes/manifests/etcd.yaml

Verify the result: after the fix the nodes start normally and the data is consistent again; every etcd node reports the same number of keys (1976), and the db sizes match as well.

    $ for count in {107..109}; do ETCDCTL_API=3 etcdctl get / --prefix --keys-only --endpoints=https://192.168.12.${count}:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key | wc -l; done
    1976

    $ for count in {107..109}; do ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://192.168.12.${count}:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key; done
    https://192.168.12.107:2379, 8cecf1f50b91e502, 3.5.1, 9.2 MB, false, 9, 72291872
    https://192.168.12.108:2379, 35672f075108e36a, 3.5.1, 9.2 MB, false, 9, 72291872
    https://192.168.12.109:2379, 42efe7cca897d765, 3.5.1, 9.2 MB, true, 9, 72291872

At this point the etcd data inconsistency in the K8S cluster is resolved and the data restored. If this article helped you, please give it your support!


Original post: https://blog.weiyigeek.top/2022/10-15-690.html

That is all for this article; more technical posts to come!





