
How Does Gluster Self-Heal Work?



Author: perrynzhou@gmail.com | Date: 2020/12/01 | QQ tech group: 672152841


Introduction to the Gluster Shd Process

  • The main job of the glusterfs shd process is to heal data on replica volumes and EC volumes. If healing is enabled, shd runs on every node of the gluster cluster; it is per-node, not per-volume. For example, even if a node hosts both a replica brick and an EC brick, it still runs just one shd process (a quick check is shown below).
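
    A quick way to check this, using the replica-vol and ec-vol volumes from the test environment later in this article (note that the "Self-heal Daemon on localhost" lines in both volumes' status output report the same PID):

    // one glustershd process per node, shared by every volume that has heal enabled
    [root@node1 ~]# ps -ef | grep -v grep | grep glustershd
    // the Self-heal Daemon PID is identical for both volumes
    [root@node1 ~]# gluster volume status replica-vol | grep "Self-heal Daemon on localhost"
    [root@node1 ~]# gluster volume status ec-vol | grep "Self-heal Daemon on localhost"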

  • The glusterfs shd process loads the io-stats xlator; the children of this xlator can be the replica volume's xlator or the EC volume's xlator. The xlators loaded by glustershd are shown below:

    // protocol/client xlator for the first glusterfsd (brick) of the EC volume
    volume ec-vol-client-0
    type protocol/client
    option ping-timeout 42
    option remote-host 172.16.84.37
    option remote-subvolume /data/ec/brick
    end-volume
    // protocol/client xlator for the second glusterfsd (brick) of the EC volume
    volume ec-vol-client-1
    type protocol/client
    option ping-timeout 42
    option remote-host 172.16.84.41
    option remote-subvolume /data/ec/brick
    end-volume

    // cluster/disperse xlator that aggregates the EC volume's client xlators (the third one, ec-vol-client-2, is omitted above)
    volume ec-vol-disperse-0
    type cluster/disperse
    option redundancy 1
    option self-heal-daemon enable
    option iam-self-heal-daemon yes
    subvolumes ec-vol-client-0 ec-vol-client-1 ec-vol-client-2
    end-volume

    // the io-stats xlator at the top of the shd graph
    volume ec-vol
    type debug/io-stats
    option log-level INFO
    option threads 16
    subvolumes ec-vol-disperse-0
    end-volume

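    // For reference, the gluster volume info output for ec-vol: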
    Volume Name: ec-vol
    Type: Disperse
    Volume ID: 8ab95850-a90f-4e1d-b7f4-5932c52f7dce
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 1 x (2 + 1) = 3
    Transport-type: tcp
    Bricks:
    Brick1: 172.16.84.37:/data/ec/brick
    Brick2: 172.16.84.41:/data/ec/brick
    Brick3: 172.16.84.42:/data/ec/brick
    Options Reconfigured:
    cluster.disperse-self-heal-daemon: enable
    storage.fips-mode-rchecksum: on
    transport.address-family: inet
    nfs.disable: on

  • The shd process performs two kinds of self-heal crawls: index heal and full heal. For each file visited during a crawl, metadata, data, and entry heals are performed. Metadata heal repairs the file's attributes, permissions, and mode; data heal repairs the file's contents; entry heal repairs the entries inside a directory.
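
    On disk these pending heals show up as AFR changelog xattrs on each brick copy: every trusted.afr.{volume}-client-N value packs three 4-byte counters for pending data, metadata, and entry operations. A small sketch, reusing the replica-vol brick path and the cmd_history.log file from the index-heal example later in this article:

    // run on a brick host; path taken from the replica-vol example below
    [root@node1 ~]# getfattr -d -m . -e hex /data/replica/brick/cmd_history.log
    // trusted.afr.replica-vol-client-N is a 12-byte value:
    // <4B pending data ops><4B pending metadata ops><4B pending entry ops>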


Index Heal


Trigger Conditions
  • In glusterfs there are generally three situations that trigger an index heal:

    1. At a fixed interval (every 60 s here); the interval is controlled by cluster.heal-timeout and can be changed with gluster volume set {volume} cluster.heal-timeout {value} (see the example after this list)
    2. When gluster volume heal {volume} is run from the CLI
    3. When the node hosting a replica brick, or its glusterfsd process, goes down and then comes back up
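
    A minimal CLI sketch for the first two triggers, using the replica-vol test volume described later in this article:

    // change / inspect the periodic crawl interval (in seconds)
    [root@node1 ~]# gluster volume set replica-vol cluster.heal-timeout 300
    [root@node1 ~]# gluster volume get replica-vol cluster.heal-timeout
    // manually trigger an index heal
    [root@node1 ~]# gluster volume heal replica-vol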

    At any given time only one heal run is in progress in the shd process. The next heal can start only after the current one finishes; requests that arrive in the meantime are queued. For example, if gluster volume heal {volume} has been run from the CLI and that heal has not yet finished, running gluster volume heal {volume} full will start the full heal only after the previous run completes.
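
    Before queueing another heal it is usually worth checking what is still pending; a sketch against the same replica-vol volume:

    // list the entries on each brick that still need healing
    [root@node1 ~]# gluster volume heal replica-vol info
    // if an index heal is still in flight, this full heal is queued until it finishes
    [root@node1 ~]# gluster volume heal replica-vol full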


How Is Index Heal Done?
  • In the index heal path, the gluster shd process reads the entries under the .glusterfs/indices/xattrop directory on each brick, takes locks, and triggers heals. The .glusterfs/indices/xattrop directory contains a base entry whose name starts with "xattrop-"; the other entries in this directory are named by gfid, and these are the files that need to be healed.

    // Kill one glusterfsd process on node2
    [root@node2 ~]# ps -ef|grep glusterfsd
    root 1401177 1 1 10:54 ? 00:03:15 /usr/local/sbin/glusterfsd -s 172.16.84.41 --volfile-id replica-vol.172.16.84.41.data-replica-brick -p /var/run/gluster/vols/replica-vol/172.16.84.41-data-replica-brick.pid
    [root@node2 ~]# kill -9 1401177

    // On node1, mount the volume and copy a file into it
    [root@node1 glusterfs]# mount -t glusterfs 172.16.84.37:replica-vol /mnt/rep/
    [root@node1 glusterfs]# cp cmd_history.log /mnt/rep/

    // The xattrop directory now contains two entries of interest: the base entry xattrop-22a5abc0-b16c-497e-b882-18c2c3486c5e and the gfid entry 300f383b-db21-412b-9a10-6d70fd80fa7d
    [root@node1 xattrop]# pwd
    /data/replica/brick/.glusterfs/indices/xattrop
    [root@node1 xattrop]# ls -l
    total 0
    ---------- 3 root root 0 Mar 11 10:57 00000000-0000-0000-0000-000000000001
    ---------- 3 root root 0 Mar 11 10:57 300f383b-db21-412b-9a10-6d70fd80fa7d
    ---------- 3 root root 0 Mar 11 10:57 xattrop-22a5abc0-b16c-497e-b882-18c2c3486c5e
  • When a client operation on cmd_history.log is sent to glusterfsd, the features/index xlator adds a hard link during the pre-op phase of the FOP transaction and, if the processing succeeds, removes it again during the post-op phase. Therefore, once no IO is in flight, the entries left under .glusterfs/indices/xattrop are exactly the files that need healing, and they are healed with the following logic:

    // Heal procedure
    In shd process of each node {
        opendir + readdir (.glusterfs/indices/xattrop)
        for each entry inside it {
            self_heal_entry()   // Explained below.
        }
    }

    self_heal_entry() {
        Call syncop_lookup(replica subvolume) which eventually does {
            take appropriate locks
            determine source and sinks from AFR changelog xattrs
            perform whatever heal is needed (any of metadata, data and entry heal in that order)
            clear changelog xattrs and hardlink inside .glusterfs/indices/xattrop
        }
    }

    // The features/index xlator in the brick graph
    volume replica-vol-index
    type features/index
    option index-base /data/replica/brick/.glusterfs/indices
    option xattrop-dirty-watchlist trusted.afr.dirty
    option xattrop-pending-watchlist trusted.afr.replica-vol-
    subvolumes replica-vol-barrier
    end-volume

  • When a gfid hard link exists under .glusterfs/indices/xattrop on two of the replica volume's bricks, the entries are healed one by one under a domain lock. If the parent directory of a file or directory has not been healed yet, that entry is skipped and the crawl moves on to the entries that can be healed.

  • If a file is in data or metadata split-brain, it will not be healed by shd (see the example below for how to list such files).
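
    Split-brain files have to be listed and resolved explicitly from the CLI; a sketch against the replica-vol test volume (latest-mtime is just one of the available resolution policies):

    // list files that are in split-brain
    [root@node1 ~]# gluster volume heal replica-vol info split-brain
    // resolve one file by keeping the copy with the latest mtime (path is relative to the volume root)
    [root@node1 ~]# gluster volume heal replica-vol split-brain latest-mtime /cmd_history.log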


Full Heal

  • Full heal is triggered by running gluster volume heal {volume} full from the CLI. This command is fairly brute-force: it simply copies data from the healthy bricks to the bricks being healed.

  • This is different from the index heal mode. Every node in the gluster cluster has a UUID; within each replica, the shd on the node with the highest UUID drives the full crawl, picking the healthy brick as the source and copying its data over to the bricks that need healing. The basic heal logic is as follows (a usage sketch follows the pseudocode):


    In shd process of highest UUID node per replica {
        opendir + readdir ("/")
        for each entry inside it {
            self_heal_entry()
            if (entry == directory) {
                /* Recurse */
                again opendir + readdir (directory) followed by self_heal_entry() of each entry.
            }
        }
    }
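
    A usage sketch against the replica-vol test volume; the "highest UUID node" referenced in the pseudocode can be identified from glusterd's local state file and the peer list:

    // trigger a crawl of the whole volume starting from the root directory
    [root@node1 ~]# gluster volume heal replica-vol full
    // UUID of the local node
    [root@node1 ~]# cat /var/lib/glusterd/glusterd.info
    // UUIDs of the other peers
    [root@node1 ~]# gluster peer status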


Core Function Calls in Gluster Heal

Test Environment

  • Replica volume

    [root@node1 ~]# gluster volume info
    Volume Name: replica-vol
    Type: Replicate
    Volume ID: dc294dff-7adb-46dd-a081-92b3a54d476e
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 1 x 3 = 3
    Transport-type: tcp
    Bricks:
    Brick1: 172.16.84.37:/data/replica/brick
    Brick2: 172.16.84.41:/data/replica/brick
    Brick3: 172.16.84.42:/data/replica/brick
    Options Reconfigured:
    cluster.self-heal-daemon: enable
    storage.reserve: 99
    cluster.granular-entry-heal: on
    storage.fips-mode-rchecksum: on
    transport.address-family: inet
    nfs.disable: on
    performance.client-io-threads: off

    // gluster volume status shows the Self-heal Daemon running for this volume on each node
    [root@node1 ~]# gluster volume status replica-vol
    Status of volume: replica-vol
    Gluster process TCP Port RDMA Port Online Pid
    ------------------------------------------------------------------------------
    Brick 172.16.84.37:/data/replica/brick 49152 0 Y 1511816
    Brick 172.16.84.41:/data/replica/brick 49152 0 Y 1401177
    Brick 172.16.84.42:/data/replica/brick 49152 0 Y 2731653
    Self-heal Daemon on localhost N/A N/A Y 1511833
    Self-heal Daemon on 172.16.84.41 N/A N/A Y 1401194
    Self-heal Daemon on 172.16.84.42 N/A N/A Y 2731670

    Task Status of Volume replica-vol
    ------------------------------------------------------------------------------
    There are no active volume tasks

  • EC volume

    [root@node1 ~]# gluster volume info 
    Volume Name: ec-vol
    Type: Disperse
    Volume ID: 8ab95850-a90f-4e1d-b7f4-5932c52f7dce
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 1 x (2 + 1) = 3
    Transport-type: tcp
    Bricks:
    Brick1: 172.16.84.37:/data/ec/brick
    Brick2: 172.16.84.41:/data/ec/brick
    Brick3: 172.16.84.42:/data/ec/brick
    Options Reconfigured:
    cluster.disperse-self-heal-daemon: enable
    storage.fips-mode-rchecksum: on
    transport.address-family: inet
    nfs.disable: on

    [root@node1 ~]# gluster volume status ec-vol
    Status of volume: ec-vol
    Gluster process TCP Port RDMA Port Online Pid
    ------------------------------------------------------------------------------
    Brick 172.16.84.37:/data/ec/brick 49153 0 Y 1569574
    Brick 172.16.84.41:/data/ec/brick 49153 0 Y 1456710
    Brick 172.16.84.42:/data/ec/brick 49153 0 Y 2833998
    Self-heal Daemon on localhost N/A N/A Y 1511833
    Self-heal Daemon on 172.16.84.41 N/A N/A Y 1401194
    Self-heal Daemon on 172.16.84.42 N/A N/A Y 2731670

    Task Status of Volume ec-vol
    ------------------------------------------------------------------------------
    There are no active volume tasks


  • Process information on the node

    // Each node runs a single shd process
    [root@node1 ~]# ps -ef|grep -v grep |grep glustershd
    root 1578547 1 0 11:46 ? 00:00:00 /usr/local/sbin/glusterfs -s localhost --volfile-id shd/ec-vol -p /var/run/gluster/shd/ec-vol/ec-vol-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6cf91972718c43a8.socket --xlator-option *replicate*.node-uuid=a37ce536-32de-4829-9d7d-3f3a5ca83ef7 --process-name glustershd --client-pid=-6

