A longhorn component restart that left a PV unable to mount

运维开发故事 2021-09-29


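After the longhorn components in our cluster restarted abnormally, a PV that we had created with longhorn could no longer be mounted. The pod events showed the following error: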

Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    99s   default-scheduler  Successfully assigned devops/nexus3-84c8b98cb-rshlv to node-02
  Warning  FailedMount  78s   kubelet            MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 contains a file system with errors, check forced.
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Inodes that were part of a corrupted orphan linked list found.
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
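The device path that fsck needs is embedded in the FailedMount message. As a small convenience, the sketch below pulls every /dev/longhorn/pvc-... path out of text read from stdin. The name `extract_longhorn_devices` is a hypothetical helper (not part of kubectl or longhorn), and the pattern assumes volume names consist only of lowercase hex digits and hyphens, as in the message above.

```shell
# Hypothetical helper: print each distinct /dev/longhorn/pvc-... path found on stdin.
# Assumption: volume names contain only [0-9a-f-], as in the event shown above.
extract_longhorn_devices() {
    grep -o '/dev/longhorn/pvc-[0-9a-f-]*' | sort -u
}
```

On a machine with cluster access, the output of kubectl describe for the failing pod could be piped into this function to get the path to pass to fsck.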

The describe output tells us to run fsck manually, so we ran it on the node that hosts the PV:

[root@node-02 e2fsprogs-1.45.6]# fsck.ext4 -cvf /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90
e2fsck 1.42.9 (28-Dec-2013)
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 has unsupported feature(s): metadata_csum
e2fsck: Get a newer version of e2fsck!
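Whether a node's e2fsck is new enough can be checked before touching the volume. The sketch below parses the version out of the e2fsck banner line and compares it against a minimum. Both function names are hypothetical, `sort -V` assumes GNU coreutils, and the 1.43 threshold used in the usage comment is an assumption about when metadata_csum support arrived in e2fsprogs; verify it against the e2fsprogs release notes.

```shell
# Hypothetical helper: read e2fsck banner text on stdin, print the version number.
# Example banner line: "e2fsck 1.42.9 (28-Dec-2013)"
parse_e2fsck_version() {
    awk '$1 == "e2fsck" && $2 ~ /^[0-9]/ { print $2; exit }'
}

# Hypothetical helper: succeed if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort), which is a GNU coreutils extension.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible usage on a node (not executed here; 1.43 is an assumed threshold):
#   v=$(e2fsck -V 2>&1 | parse_e2fsck_version)
#   version_ge "$v" 1.43 || echo "e2fsck $v is probably too old for metadata_csum"
```

With the versions from this article, 1.42.9 fails the comparison against 1.43 and 1.45.6 passes it.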

The output says the installed e2fsck is too old and must be upgraded, so we upgraded e2fsck first:

[root@node-02 replicas]# wget https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
--2021-09-16 11:51:48--  https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
Resolving distfiles.macports.org (distfiles.macports.org)... 151.101.230.132
Connecting to distfiles.macports.org (distfiles.macports.org)|151.101.230.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7938544 (7.6M) [application/x-gzip]
Saving to: 'e2fsprogs-1.45.6.tar.gz'
100%[==========================================================>] 7,938,544    747KB/s   in 10s
2021-09-16 11:52:04 (747 KB/s) - 'e2fsprogs-1.45.6.tar.gz' saved [7938544/7938544]
[root@node-02 replicas]# tar -zxvf e2fsprogs-1.45.6.tar.gz
e2fsprogs-1.45.6/
e2fsprogs-1.45.6/.gitignore
e2fsprogs-1.45.6/.missing-copyright
e2fsprogs-1.45.6/.release-checklist
.......
[root@node-02 replicas]# cd e2fsprogs-1.45.6/
[root@node-02 e2fsprogs-1.45.6]# ./configure
Generating configuration file for e2fsprogs version 1.45.6
Release date is March, 2020
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking for gcc... gcc
checking whether the C compiler works... yes
.......
[root@node-02 e2fsprogs-1.45.6]# make
cd ./util ; make subst
make[1]: Entering directory '/var/lib/longhorn/replicas/e2fsprogs-1.45.6/util'
        CREATE dirpaths.h
        CC subst.c
        LD subst
make[1]: Leaving directory '/var/lib/longhorn/replicas/e2fsprogs-1.45.6/util'
make[1]: Entering directory '/var/lib/longhorn/replicas/e2fsprogs-1.45.6'
make[1]: 'util/subst.conf' is up to date.
.......
[root@node-02 e2fsprogs-1.45.6]# ls
ABOUT-NLS     asm_types.h   config.status  debian      e2fsck          include         intl         MCONFIG     parse-types.log  RELEASE-NOTES  SUBMITTING-PATCHES  wordwrap.pl
acinclude.m4  CleanSpec.mk  configure      debugfs     e2fsprogs.lsm   INSTALL         lib          MCONFIG.in  po               resize         tests
aclocal.m4    config        configure.ac   depfix.sed  e2fsprogs.spec  INSTALL.elfbin  Makefile     misc        public_config.h  scrub          util
Android.bp    config.log    contrib        doc         ext2ed          install-utils   Makefile.in  NOTICE      README           SHLIBS         version.h
[root@node-02 e2fsprogs-1.45.6]# cd e2fsck/
[root@node-02 e2fsck]# ls
Android.bp   dx_dirinfo.c  e2fsck.conf.5     ehandler.c  flushb.c    logfile.o    mtrace.c  pass2.c  pass5.c     quota.c      region.c  scantest.c    unix.o
badblocks.c  dx_dirinfo.o  e2fsck.conf.5.in  ehandler.o  iscan.c     Makefile     mtrace.h  pass2.o  pass5.o     quota.o      region.o  sigcatcher.c  util.c
badblocks.o  e2fsck        e2fsck.h          emptydir.c  jfs_user.h  Makefile.in  pass1b.c  pass3.c  problem.c   readahead.c  rehash.c  sigcatcher.o  util.o
CHANGES      e2fsck.8      e2fsck.o          extend.c    journal.c   message.c    pass1b.o  pass3.o  problem.h   readahead.o  rehash.o  super.c
dirinfo.c    e2fsck.8.in   ea_refcount.c     extents.c   journal.o   message.o    pass1.c   pass4.c  problem.o   recovery.c   revoke.c  super.o
dirinfo.o    e2fsck.c      ea_refcount.o     extents.o   logfile.c   mtrace.awk   pass1.o   pass4.o  problemP.h  recovery.o   revoke.o  unix.c
[root@node-02 e2fsck]# e2fsck   # inspect the freshly built e2fsck
[root@node-02 e2fsck]# cp e2fsck /sbin   # copy the new e2fsck over the system's original one
cp: overwrite '/sbin/e2fsck'? y
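Overwriting /sbin/e2fsck directly leaves no way back if the new build misbehaves. A minimal sketch of a more cautious copy follows; `install_with_backup` and the `.orig` suffix are hypothetical choices, not something the original steps used.

```shell
# Hypothetical helper: copy $1 over $2, keeping the original as "$2.orig".
# Only the first backup is kept, so repeated runs never clobber the
# distribution's binary with an already-replaced one.
install_with_backup() {
    src=$1
    dst=$2
    if [ -e "$dst" ] && [ ! -e "$dst.orig" ]; then
        cp -p "$dst" "$dst.orig" || return 1
    fi
    cp -f "$src" "$dst"
}

# Possible usage from the e2fsck build directory (not executed here):
#   install_with_backup ./e2fsck /sbin/e2fsck
```

Restoring the old binary is then a matter of copying /sbin/e2fsck.orig back into place.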

Then we ran the fsck repair again:

[root@node-02 e2fsck]# fsck.ext4 -cvf /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90
e2fsck 1.45.6 (20-Mar-2020)
Checking for bad blocks (read-only test): done
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes
Inode 131102 was part of the orphaned inode list.  FIXED.
Inode 131103 was part of the orphaned inode list.  FIXED.
Inode 131104 was part of the orphaned inode list.  FIXED.
Inode 131105 was part of the orphaned inode list.  FIXED.
Inode 131106 was part of the orphaned inode list.  FIXED.
Inode 131107 was part of the orphaned inode list.  FIXED.
Inode 131117 was part of the orphaned inode list.  FIXED.
Inode 131402 was part of the orphaned inode list.  FIXED.
Inode 131412 was part of the orphaned inode list.  FIXED.
Inode 131630 was part of the orphaned inode list.  FIXED.
Inode 131638 was part of the orphaned inode list.  FIXED.
Inode 131644 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(688640--690326)
Fix<y>? yes
Free blocks count wrong for group #21 (31069, counted=32756).
Fix<y>? yes
Free blocks count wrong (1227977, counted=1229664).
Fix<y>? yes
Inode bitmap differences:  -(131101--131107) -131117 -131402 -131412 -131630 -131638 -131644
Fix<y>? yes
Free inodes count wrong for group #16 (7567, counted=7580).
Fix<y>? yes
Free inodes count wrong (325295, counted=325308).
Fix<y>? yes
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: ***** FILE SYSTEM WAS MODIFIED *****
        2372 inodes used (0.72%, out of 327680)
         182 non-contiguous files (7.7%)
           1 non-contiguous directory (0.0%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 2361/3
       81056 blocks used (6.18%, out of 1310720)
           0 bad blocks
           1 large file
        1600 regular files
         763 directories
           0 character device files
           0 block device files
           0 fifos
           0 links
           0 symbolic links (0 fast symbolic links)
           0 sockets
------------
        2363 files
[root@node-02 e2fsck]#
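The transcript above does not show the exit status, but when the repair is scripted the status is what tells you whether anything was left unfixed. The e2fsck man page defines the status as a sum of flag values; the sketch below decodes it. `describe_e2fsck_status` is a hypothetical helper name, and the descriptions paraphrase the man page.

```shell
# Hypothetical helper: decode an e2fsck exit status (a sum of flag bits)
# into one line per condition, following the table in the e2fsck man page.
describe_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "0: no errors"
        return 0
    fi
    while IFS='|' read -r bit text; do
        if [ $((status & bit)) -ne 0 ]; then
            echo "$bit: $text"
        fi
    done <<'EOF'
1|file system errors corrected
2|file system errors corrected, system should be rebooted
4|file system errors left uncorrected
8|operational error
16|usage or syntax error
32|checking canceled by user request
128|shared-library error
EOF
}

# Possible usage right after a run (not executed here):
#   fsck.ext4 -f /dev/longhorn/<volume>; describe_e2fsck_status $?
```

Any status with the 4 bit set means errors remain and the volume should not be handed back to the pod yet.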

After the check finished, we ran describe on the failing pod again and saw the following:

  Normal   Scheduled               12m                   default-scheduler        Successfully assigned devops/nexus3-5c9c5545d9-nmfjg to node-02
  Normal   SuccessfulAttachVolume  12m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90"
  Warning  FailedMount             3m46s (x12 over 12m)  kubelet                  MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 contains a file system with errors, check forced.
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Inodes that were part of a corrupted orphan linked list found.
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
           (i.e., without -a or -p options)
  Warning  FailedMount  3m33s  kubelet  Unable to attach or mount volumes: unmounted volumes=[nexus-data], unattached volumes=[default-token-dv7nx nexus-data]: timed out waiting for the condition
  Warning  FailedMount  104s   kubelet  MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 /var/lib/kubelet/pods/8268934a-f1d9-4c14-ad4a-276d6986cee8/volumes/kubernetes.io~csi/pvc-9784831a-3130-4377-9d44-7e7129473b90/mount
Output: mount: /var/lib/kubelet/pods/8268934a-f1d9-4c14-ad4a-276d6986cee8/volumes/kubernetes.io~csi/pvc-9784831a-3130-4377-9d44-7e7129473b90/mount: /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 already mounted or mount point busy.
  Warning  FailedMount  78s (x4 over 10m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[nexus-data], unattached volumes=[nexus-data default-token-dv7nx]: timed out waiting for the condition
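The later mount attempt above failed with "already mounted or mount point busy". The article does not establish the cause, but one thing that can be checked on the node in that situation is whether the device already appears in the mount table. A minimal sketch, assuming a /proc/mounts-style table with the device in the first column; `device_in_mounts` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if device $1 is listed in the mount table $2.
# $2 defaults to /proc/mounts; each line is expected to start with the device.
device_in_mounts() {
    dev=$1
    table=${2:-/proc/mounts}
    awk -v dev="$dev" '$1 == dev { found = 1 } END { exit !found }' "$table"
}

# Possible usage on the node (not executed here):
#   device_in_mounts /dev/longhorn/<volume> && echo "still mounted"
```

This only covers the "already mounted" half of the message; a device held open by another process would not show up in the mount table.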

At this point, deleting the pod so that it is recreated shows that the PV can be mounted normally again:

[root@master-01 nexus]# kubectl -n devops  delete pod nexus3-5c9c5545d9-nmfjg
pod "nexus3-5c9c5545d9-nmfjg" deleted
[root@master-01 nexus]# kubectl describe  pod -n devops  nexus3-5c9c5545d9-bm9dk
Name:         nexus3-5c9c5545d9-bm9dk
Namespace:    devops
Priority:     0
Node:         node-02/172.26.204.144
Start Time:   Thu, 16 Sep 2021 12:00:11 +0800
Labels:       k8s-app=nexus3
              pod-template-hash=5c9c5545d9
Annotations:  cni.projectcalico.org/podIP: 100.114.252.214/32
Status:       Running
IP:           100.114.252.214
IPs:
  IP:           100.114.252.214
Controlled By:  ReplicaSet/nexus3-5c9c5545d9
Containers:
  nexus3:
    Container ID:   docker://a729451dbf3482c0847397b355a204f4e2fa0681392d28a478276b6efeb7c0a2
    Image:          sonatype/nexus3:3.32.0
    Image ID:       docker-pullable://sonatype/nexus3@sha256:4b73d33797727349adb7dff50da9c8eb17298706b481a00b330c589b8a893f36
    Ports:          8083/TCP, 8081/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 16 Sep 2021 12:00:20 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  2Gi
    Requests:
      cpu:        100m
      memory:     200Mi
    Environment:  <none>
    Mounts:
      /nexus-data from nexus-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dv7nx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  nexus-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  nexus-data
    ReadOnly:   false
  default-token-dv7nx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-dv7nx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  12s   default-scheduler  Successfully assigned devops/nexus3-5c9c5545d9-bm9dk to node-02
  Normal  Pulled     3s    kubelet            Container image "sonatype/nexus3:3.32.0" already present on machine
  Normal  Created    3s    kubelet            Created container nexus3
  Normal  Started    3s    kubelet            Started container nexus3


