Notes on a Longhorn component restart that left a PV unable to mount

k8s实战 2021-09-27


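After the Longhorn components in our cluster restarted abnormally, we found that a PV created through Longhorn could no longer be mounted. The pod events showed the following error: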

Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    99s   default-scheduler  Successfully assigned devops/nexus3-84c8b98cb-rshlv to node-02
  Warning  FailedMount  78s   kubelet            MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 contains a file system with errors, check forced.
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Inodes that were part of a corrupted orphan linked list found.  

/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
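
The device path that fsck complains about is buried in a long event message. As a small sketch (the helper name `longhorn_devices_in` is made up for this note, and the pattern assumes the `/dev/longhorn/pvc-<uuid>` naming seen above), the distinct device paths can be pulled out of any describe output like this:

```shell
# Hypothetical helper: read text on stdin and print each distinct
# /dev/longhorn/pvc-... device path it mentions, one per line.
longhorn_devices_in() {
  grep -oE '/dev/longhorn/pvc-[0-9a-f-]+' | sort -u
}

# Usage on a live cluster (pod name taken from the events above):
#   kubectl -n devops describe pod nexus3-84c8b98cb-rshlv | longhorn_devices_in
```

The printed path is the device to pass to fsck on the node that the pod was scheduled to.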

The describe output tells us to run fsck manually, so we ran it on the node that hosts the PV (node-02):

[root@node-02 e2fsprogs-1.45.6]# fsck.ext4 -cvf /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 
e2fsck 1.42.9 (28-Dec-2013)
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 has unsupported feature(s): metadata_csum
e2fsck: Get a newer version of e2fsck!
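
Before building anything, it can help to confirm which e2fsck version a node actually has. The sketch below parses the banner line shown above and compares versions with GNU `sort -V`. The helper names are made up for this note, and the 1.43 threshold is our assumption about when e2fsprogs gained `metadata_csum` support, so check the e2fsprogs release notes before relying on it.

```shell
# Hypothetical helper: extract the version from an e2fsck banner on stdin,
# e.g. "e2fsck 1.42.9 (28-Dec-2013)" -> "1.42.9".
e2fsck_version_from_banner() {
  sed -n 's/^e2fsck \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed when version $1 >= version $2.
# Relies on GNU sort's -V (version) ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (2>&1 because the banner may be written to stderr):
#   v=$(e2fsck -V 2>&1 | e2fsck_version_from_banner)
#   version_ge "$v" 1.43 || echo "e2fsck $v is probably too old for metadata_csum"
```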

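The file system uses the metadata_csum feature, which e2fsck 1.42.9 does not support, so the tool asks for a newer version. We upgrade e2fsck first by building e2fsprogs 1.45.6 from source: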

[root@node-02 replicas]# wget https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
--2021-09-16 11:51:48--  https://distfiles.macports.org/e2fsprogs/e2fsprogs-1.45.6.tar.gz
正在解析主机 distfiles.macports.org (distfiles.macports.org)... 151.101.230.132
正在连接 distfiles.macports.org (distfiles.macports.org)|151.101.230.132|:443... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：7938544 (7.6M) [application/x-gzip]
正在保存至: “e2fsprogs-1.45.6.tar.gz”

100%[=======================================================================================================================================>] 7,938,544    747KB/s 用时 10s    

2021-09-16 11:52:04 (747 KB/s) - 已保存 “e2fsprogs-1.45.6.tar.gz” [7938544/7938544])
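
Since the tarball comes from a third-party mirror and the resulting binary will run as root against a damaged volume, checking its checksum before unpacking is a reasonable extra step. This is only a sketch: `verify_sha256` is a hypothetical helper, and the expected digest has to come from a source you trust (we do not reproduce one here).

```shell
# Hypothetical helper: succeed only if the SHA-256 digest of file $1
# equals the expected hex digest $2.
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{ print $1 }') || return 1
  [ "$actual" = "$2" ]
}

# Usage (replace the placeholder with a digest from a trusted source):
#   verify_sha256 e2fsprogs-1.45.6.tar.gz "<expected-sha256>" || echo "checksum mismatch"
```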

[root@node-02 replicas]# tar -zxvf e2fsprogs-1.45.6.tar.gz
e2fsprogs-1.45.6/
e2fsprogs-1.45.6/.gitignore
e2fsprogs-1.45.6/.missing-copyright
e2fsprogs-1.45.6/.release-checklist
.......
[root@node-02 replicas]# cd e2fsprogs-1.45.6/
[root@node-02 e2fsprogs-1.45.6]# ./configure 
Generating configuration file for e2fsprogs version 1.45.6
Release date is March, 2020
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking for gcc... gcc
checking whether the C compiler works... yes
.......
[root@node-02 e2fsprogs-1.45.6]# make
cd ./util ; make subst
make[1]: 进入目录“/var/lib/longhorn/replicas/e2fsprogs-1.45.6/util”
        CREATE dirpaths.h
        CC subst.c
        LD subst
make[1]: 离开目录“/var/lib/longhorn/replicas/e2fsprogs-1.45.6/util”
make[1]: 进入目录“/var/lib/longhorn/replicas/e2fsprogs-1.45.6”
make[1]: “util/subst.conf”是最新的。
.......
[root@node-02 e2fsprogs-1.45.6]# ls
ABOUT-NLS     asm_types.h   config.status  debian      e2fsck          include         intl         MCONFIG     parse-types.log  RELEASE-NOTES  SUBMITTING-PATCHES  wordwrap.pl
acinclude.m4  CleanSpec.mk  configure      debugfs     e2fsprogs.lsm   INSTALL         lib          MCONFIG.in  po               resize         tests
aclocal.m4    config        configure.ac   depfix.sed  e2fsprogs.spec  INSTALL.elfbin  Makefile     misc        public_config.h  scrub          util
Android.bp    config.log    contrib        doc         ext2ed          install-utils   Makefile.in  NOTICE      README           SHLIBS         version.h
[root@node-02 e2fsprogs-1.45.6]# cd e2fsck/
[root@node-02 e2fsck]# ls
Android.bp   dx_dirinfo.c  e2fsck.conf.5     ehandler.c  flushb.c    logfile.o    mtrace.c  pass2.c  pass5.c     quota.c      region.c  scantest.c    unix.o
badblocks.c  dx_dirinfo.o  e2fsck.conf.5.in  ehandler.o  iscan.c     Makefile     mtrace.h  pass2.o  pass5.o     quota.o      region.o  sigcatcher.c  util.c
badblocks.o  e2fsck        e2fsck.h          emptydir.c  jfs_user.h  Makefile.in  pass1b.c  pass3.c  problem.c   readahead.c  rehash.c  sigcatcher.o  util.o
CHANGES      e2fsck.8      e2fsck.o          extend.c    journal.c   message.c    pass1b.o  pass3.o  problem.h   readahead.o  rehash.o  super.c
dirinfo.c    e2fsck.8.in   ea_refcount.c     extents.c   journal.o   message.o    pass1.c   pass4.c  problem.o   recovery.c   revoke.c  super.o
dirinfo.o    e2fsck.c      ea_refcount.o     extents.o   logfile.c   mtrace.awk   pass1.o   pass4.o  problemP.h  recovery.o   revoke.o  unix.c
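[root@node-02 e2fsck]# ./e2fsck -V   # show the version of the freshly built e2fsck (a bare "e2fsck" would run the old one from PATH)
[root@node-02 e2fsck]# cp e2fsck  /sbin   # overwrite the system e2fsck with the new build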
cp：是否覆盖"/sbin/e2fsck"？y
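
Overwriting /sbin/e2fsck directly leaves no way back if the new build misbehaves. A slightly more careful variant, sketched below, keeps a copy of the original first. The helper name and the `.orig` suffix are our own choices, not part of e2fsprogs.

```shell
# Hypothetical helper: keep a one-time backup of $2 as "$2.orig",
# then overwrite $2 with $1. Plain cp rewrites the existing file in place,
# which is what the original command above does as well.
backup_and_install() {
  src=$1
  dest=$2
  if [ -e "$dest" ] && [ ! -e "$dest.orig" ]; then
    cp -p "$dest" "$dest.orig" || return 1
  fi
  cp "$src" "$dest"
}

# Usage, from the e2fsprogs-1.45.6 build directory:
#   backup_and_install e2fsck/e2fsck /sbin/e2fsck
```

An alternative that avoids touching /sbin at all is to run the freshly built binary by its path (for example `./e2fsck -f <device>` from the e2fsck build directory) and leave the system copy alone.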

With the new e2fsck in place, we run the repair again:

[root@node-02 e2fsck]# fsck.ext4 -cvf /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 
e2fsck 1.45.6 (20-Mar-2020)
Checking for bad blocks (read-only test): done                                                 
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Updating bad block inode.
第一步: 检查inode,块,和大小
Inodes that were part of a corrupted orphan linked list found.  处理<y>? 是
Inode 131102 was part of the 孤立的 inode list.  已处理.
Inode 131103 was part of the 孤立的 inode list.  已处理.
Inode 131104 was part of the 孤立的 inode list.  已处理.
Inode 131105 was part of the 孤立的 inode list.  已处理.
Inode 131106 was part of the 孤立的 inode list.  已处理.
Inode 131107 was part of the 孤立的 inode list.  已处理.
Inode 131117 was part of the 孤立的 inode list.  已处理.
Inode 131402 was part of the 孤立的 inode list.  已处理.
Inode 131412 was part of the 孤立的 inode list.  已处理.
Inode 131630 was part of the 孤立的 inode list.  已处理.
Inode 131638 was part of the 孤立的 inode list.  已处理.
Inode 131644 was part of the 孤立的 inode list.  已处理.
第二步: 检查目录结构
第3步: 检查目录连接性
Pass 4: Checking reference counts
第5步: 检查簇概要信息
块位图差异:  -(688640--690326)
处理<y>? 是
Free 块s count wrong for 簇 #21 (31069, counted=32756).
处理<y>? 是
Free 块s count wrong (1227977, counted=1229664).
处理<y>? 是
Inode位图差异:  -(131101--131107) -131117 -131402 -131412 -131630 -131638 -131644
处理<y>? 是
Free inodes count wrong for 簇 #16 (7567, counted=7580).
处理<y>? 是
Free inodes count wrong (325295, counted=325308).
处理<y>? 是

/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: ***** 文件系统已修改 *****

        2372 inodes used (0.72%, out of 327680)
         182 non-contiguous files (7.7%)
           1 non-contiguous directory (0.0%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 2361/3
       81056 blocks used (6.18%, out of 1310720)
           0 bad blocks
           1 large file

        1600 regular files
         763 directories
           0 character device files
           0 block device files
           0 fifos
           0 links
           0 symbolic links (0 fast symbolic links)
           0 sockets
------------
        2363 files
[root@node-02 e2fsck]#
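
In the command above, -f forces a check even if the file system looks clean, -v prints verbose output, and -c additionally runs a read-only bad-block scan, which can take a long time on large volumes. When the same repair is scripted there is no transcript to read, so the exit status is the main signal. According to the e2fsck man page the status is a bitmask; the sketch below decodes it (the function name is ours).

```shell
# Hypothetical helper: print one line per condition encoded in an e2fsck
# exit status. Bit meanings follow the e2fsck(8) man page.
describe_e2fsck_status() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "no errors"
    return 0
  fi
  [ $((status & 1)) -ne 0 ] && echo "file system errors corrected"
  [ $((status & 2)) -ne 0 ] && echo "file system errors corrected, system should be rebooted"
  [ $((status & 4)) -ne 0 ] && echo "file system errors left uncorrected"
  [ $((status & 8)) -ne 0 ] && echo "operational error"
  [ $((status & 16)) -ne 0 ] && echo "usage or syntax error"
  [ $((status & 32)) -ne 0 ] && echo "canceled by user request"
  [ $((status & 128)) -ne 0 ] && echo "shared library error"
  return 0
}

# Usage:
#   fsck.ext4 -f /dev/longhorn/<volume>; describe_e2fsck_status $?
```

A run like the one above, which ends with the file system modified, would normally be expected to report bit 1 (errors corrected).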

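After the check finished, we ran describe on the pod that had been failing and saw the following. The fsck errors are older events; the most recent failure is a plain mount error (exit status 32):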

  Normal   Scheduled               12m                   default-scheduler        Successfully assigned devops/nexus3-5c9c5545d9-nmfjg to node-02
  Normal   SuccessfulAttachVolume  12m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90"
  Warning  FailedMount             3m46s (x12 over 12m)  kubelet                  MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 but could not correct them: fsck from util-linux 2.31.1
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 contains a file system with errors, check forced.
/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: Inodes that were part of a corrupted orphan linked list found.  

/dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
           (i.e., without -a or -p options)
  Warning  FailedMount  3m33s  kubelet  Unable to attach or mount volumes: unmounted volumes=[nexus-data], unattached volumes=[default-token-dv7nx nexus-data]: timed out waiting for the condition
  Warning  FailedMount  104s   kubelet  MountVolume.SetUp failed for volume "pvc-9784831a-3130-4377-9d44-7e7129473b90" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 /var/lib/kubelet/pods/8268934a-f1d9-4c14-ad4a-276d6986cee8/volumes/kubernetes.io~csi/pvc-9784831a-3130-4377-9d44-7e7129473b90/mount
Output: mount: /var/lib/kubelet/pods/8268934a-f1d9-4c14-ad4a-276d6986cee8/volumes/kubernetes.io~csi/pvc-9784831a-3130-4377-9d44-7e7129473b90/mount: /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 already mounted or mount point busy.
  Warning  FailedMount  78s (x4 over 10m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[nexus-data], unattached volumes=[nexus-data default-token-dv7nx]: timed out waiting for the condition
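
The last mount error says the device is "already mounted or mount point busy". We did not dig into the cause; one guess is that something on the node still held the device at that moment, for example the manual fsck or a stale mount. A quick way to test the stale-mount half of that guess is to look for the device in the mounts table. The helper below is a sketch with a made-up name, and it assumes the device appears in /proc/mounts under its /dev/longhorn/... path.

```shell
# Hypothetical helper: succeed if device $1 is listed as a mount source in
# the mounts table $2 (defaults to /proc/mounts).
device_is_mounted() {
  awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage on the node:
#   device_is_mounted /dev/longhorn/pvc-9784831a-3130-4377-9d44-7e7129473b90 \
#     && echo "still mounted" || echo "not mounted"
```

The same check is also worth running before any manual e2fsck, since repairing a mounted file system risks further damage.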

At this point, deleting the pod so that its ReplicaSet recreates it is enough: the PV then mounts normally.

[root@master-01 nexus]# kubectl -n devops  delete pod nexus3-5c9c5545d9-nmfjg 
pod "nexus3-5c9c5545d9-nmfjg" deleted
[root@master-01 nexus]# kubectl describe  pod -n devops  nexus3-5c9c5545d9-bm9dk 
Name:         nexus3-5c9c5545d9-bm9dk
Namespace:    devops
Priority:     0
Node:         node-02/172.26.204.144
Start Time:   Thu, 16 Sep 2021 12:00:11 +0800
Labels:       k8s-app=nexus3
              pod-template-hash=5c9c5545d9
Annotations:  cni.projectcalico.org/podIP: 100.114.252.214/32
Status:       Running
IP:           100.114.252.214
IPs:
  IP:           100.114.252.214
Controlled By:  ReplicaSet/nexus3-5c9c5545d9
Containers:
  nexus3:
    Container ID:   docker://a729451dbf3482c0847397b355a204f4e2fa0681392d28a478276b6efeb7c0a2
    Image:          sonatype/nexus3:3.32.0
    Image ID:       docker-pullable://sonatype/nexus3@sha256:4b73d33797727349adb7dff50da9c8eb17298706b481a00b330c589b8a893f36
    Ports:          8083/TCP, 8081/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 16 Sep 2021 12:00:20 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  2Gi
    Requests:
      cpu:        100m
      memory:     200Mi
    Environment:  <none>
    Mounts:
      /nexus-data from nexus-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dv7nx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  nexus-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  nexus-data
    ReadOnly:   false
  default-token-dv7nx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-dv7nx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  12s   default-scheduler  Successfully assigned devops/nexus3-5c9c5545d9-bm9dk to node-02
  Normal  Pulled     3s    kubelet            Container image "sonatype/nexus3:3.32.0" already present on machine
  Normal  Created    3s    kubelet            Created container nexus3
  Normal  Started    3s    kubelet            Started container nexus3
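
The delete-and-check step can be scripted rather than done by hand. The following is a sketch, not a polished tool: `recreate_and_wait` and the `KUBECTL` override are our own inventions, the 120s timeout is arbitrary, and the label `k8s-app=nexus3` is taken from the describe output above. It relies on `kubectl wait`, which can fail if the replacement pod does not exist yet at the moment it is called.

```shell
# Allow the kubectl binary to be overridden, e.g. for dry runs.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: delete pod $2 in namespace $1, then wait until the
# pods matching label selector $3 report the Ready condition.
recreate_and_wait() {
  ns=$1 pod=$2 selector=$3
  "$KUBECTL" -n "$ns" delete pod "$pod" &&
    "$KUBECTL" -n "$ns" wait --for=condition=Ready pod -l "$selector" --timeout=120s
}

# Usage:
#   recreate_and_wait devops nexus3-5c9c5545d9-nmfjg k8s-app=nexus3
```

If the wait times out, the pod events are again the first place to look.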

Reposted from k8s实战.
