
[Oracle] The Love and Hate of AFD: Converting AFD Disks Back to Ordinary Disks

DBA的自我修养 (The Self-Cultivation of a DBA) 2021-04-08
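Recently I ran into a fault: after a 19c cluster was upgraded to 19.10, the performance of the whole database cluster dropped and IO calls behaved abnormally. The engineer rolled the software back to the source version from the old software backup (a tar package), but that did not solve the problem. The operating system logged errors mentioning "multipathed"; together with the slow IO, this led us to suspect the storage side for a while.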

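The log also mentioned ofsctl and asm!.., which pointed to the ACFS/AFD drivers. Combined with what the engineer had done during the rollback (only the software had been rolled back; neither prepatch nor postpatch had been run), I started to worry. A MOS search confirmed a known bug: 'blk_cloned_rq_check_limits: over max segments limit' Kernel Messages After Grid Infrastructure Patching to 19.10 DB RU (Doc ID 2754950.1), and bug 32437246: HIGH PERFORMANCE ISSUE AFTER APPLYING 19.10 ACFSRU.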

19.10, AFD, IO performance problems: everything matched. AFD therefore had to be disabled.

Now for the main content.

I. Converting AFD disks to ordinary disks

1. Stop CRS

/u01/app/19.0.0/grid/bin/crsctl stop crs -f 

2. Check the AFD state

[root@ora19c02 ~]# export ORACLE_BASE=/u01/app/grid
[root@ora19c02 ~]# /u01/app/19.0.0/grid/bin/asmcmd afd_state
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'ENABLED' on host 'ora19c02'
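When this check is scripted rather than read by eye, the two states can be pulled out of the ASMCMD-9526 message. A minimal sketch, assuming the message wording shown above stays stable; `afd_state_parse` and `afd_filtering_disabled` are hypothetical helper names, not part of asmcmd:

```shell
# Hypothetical helpers (not part of asmcmd). They read the ASMCMD-9526 message
# printed by `asmcmd afd_state` on stdin, assuming the wording shown above.

# Print "<driver state> <filtering state>", e.g. "LOADED ENABLED".
afd_state_parse() {
  sed -n "s/.*AFD state is '\([^']*\)' and filtering is '\([^']*\)'.*/\1 \2/p"
}

# Succeed only when filtering is reported as DISABLED.
afd_filtering_disabled() {
  case "$(afd_state_parse)" in
    *" DISABLED") return 0 ;;
    *) return 1 ;;
  esac
}

# Example, after step 3:
#   /u01/app/19.0.0/grid/bin/asmcmd afd_state | afd_filtering_disabled && echo "filtering is off"
```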

3. Disable AFD filtering

[root@ora19c02 ~]# /u01/app/19.0.0/grid/bin/asmcmd afd_filter -d 
[root@ora19c02 ~]# /u01/app/19.0.0/grid/bin/asmcmd afd_state
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'DISABLED' on host 'ora19c02' 

4. Check the disk status

[root@ora19c02 ~]# /u01/app/19.0.0/grid/bin/asmcmd afd_lsdsk
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
DATA00                     DISABLED   /dev/asm-data0
DATA2DG00                  DISABLED   /dev/asm-data2dg0
DATA2DG01                  DISABLED   /dev/asm-data2dg1
DATA2DG02                  DISABLED   /dev/asm-data2dg2
DATA2DG03                  DISABLED   /dev/asm-data2dg3
FRA00                      DISABLED   /dev/asm-fra0
MGMT00                     DISABLED   /dev/asm-mgmt0
OCR00                      DISABLED   /dev/asm-ocr0
OCR01                      DISABLED   /dev/asm-ocr1
OCR02                      DISABLED   /dev/asm-ocr2
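Before moving on, it can help to confirm programmatically that no disk still has filtering enabled. The sketch below parses `afd_lsdsk` output from stdin (the three-column layout is assumed from the listing above), prints each row only once, and fails if any row still shows ENABLED or if no rows were found at all; both function names are hypothetical.

```shell
# Hypothetical helpers; the "LABEL FILTERING PATH" column layout is assumed
# from the afd_lsdsk listing above.

# Print each data row once as "LABEL FILTERING PATH", skipping header/separator lines.
afd_lsdsk_rows() {
  awk 'NF == 3 && ($2 == "ENABLED" || $2 == "DISABLED") && !seen[$1 FS $3]++ { print $1, $2, $3 }'
}

# Succeed only if at least one disk is listed and none still has filtering ENABLED.
afd_all_disabled() {
  afd_lsdsk_rows | awk '{ n++ } $2 == "ENABLED" { bad++ } END { exit (n == 0 || bad > 0) }'
}

# Example:
#   /u01/app/19.0.0/grid/bin/asmcmd afd_lsdsk | afd_all_disabled && echo "all disks unfiltered"
```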

5. On a node that is still running normally, add the ordinary-disk path to the ASM disk discovery string

[grid@ora19c01 ~]$ asmcmd dsget
parameter:AFD:*
profile:AFD:*
[grid@ora19c01 ~]$ asmcmd dsset '/dev/asm-*','AFD:*'
[grid@ora19c01 ~]$ asmcmd dsget
parameter:/dev/asm-*, AFD:*
profile:/dev/asm-*,AFD:*
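To confirm that the new path really made it into the discovery string, the `parameter:` line of `dsget` can be split into its individual patterns. A sketch that assumes the output format matches the one above; `dsget_patterns` and `dsget_has` are illustrative names only:

```shell
# Hypothetical helpers; they read `asmcmd dsget` output on stdin.

# Print each pattern of the "parameter:" line on its own line, blanks trimmed.
dsget_patterns() {
  sed -n 's/^parameter://p' | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Succeed if the pattern given as $1 appears (compared literally) in the discovery string.
dsget_has() {
  dsget_patterns | grep -qxF -- "$1"
}

# Example, as the grid user:
#   asmcmd dsget | dsget_has '/dev/asm-*' && echo "ordinary-disk path present"
```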

6. Stop the ACFS and AFD drivers, then deconfigure AFD

[root@ora19c02 ~]# /u01/app/19.0.0/grid/bin/acfsload stop
[root@ora19c02 ~]# /u01/app/19.0.0/grid/bin/afdload stop

[root@ora19c02 ~]# /u01/app/19.0.0/grid/bin/asmcmd afd_deconfigure
AFD-632: Existing AFD installation detected.
AFD-634: Removing previous AFD installation.
AFD-635: Previous AFD components successfully removed.
Modifying resource dependencies - this may take some time.

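Note that AFD cannot be deconfigured unless the ACFS driver has been stopped first. The driver is only stopped here, not disabled, so ACFS will still be loaded when the cluster starts; only AFD, having been deconfigured, will no longer be used.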
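Because afd_deconfigure depends on the ACFS driver being stopped, a guard like the following can refuse to continue while Oracle kernel modules are still loaded. It is a minimal sketch, not an Oracle tool: it reads `lsmod` output on stdin and only looks for the four module names that appear in the `lsmod` listing of step 10; the function names are hypothetical.

```shell
# Hypothetical pre-check (not an Oracle tool). Reads `lsmod` output on stdin.

# Print the names of the Oracle ACFS/ADVM/OKS/AFD modules that are still loaded.
oracle_modules_loaded() {
  awk '$1 ~ /^oracle(acfs|advm|oks|afd)$/ { print $1 }'
}

# Succeed only when none of those modules is loaded.
ready_for_afd_deconfigure() {
  loaded=$(oracle_modules_loaded)
  if [ -n "$loaded" ]; then
    echo "still loaded:" $loaded >&2
    return 1
  fi
  return 0
}

# Example, as root:
#   lsmod | ready_for_afd_deconfigure && /u01/app/19.0.0/grid/bin/asmcmd afd_deconfigure
```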

Let's also look at what afd_deconfigure actually does:

01-Apr-21 00:09 ASMCMD (PID = 1362) Given command - afd_deconfigure
01-Apr-21 00:09 NOTE: asmcmdafd_is_exadata /etc/oracle/cell/network-config/cellip.ora not found
01-Apr-21 00:09 NOTE: Verifying AFD driver state : supported
01-Apr-21 00:09 NOTE: Verifying AFD driver state : installed
01-Apr-21 00:09 NOTE: Status of clusterware stack : Not online
01-Apr-21 00:09 NOTE: command execution (/u01/app/19.0.0/grid/bin/acfsdriverstate loaded) returned : 1
01-Apr-21 00:09 NOTE: Verifying ACFS driver state : Not loaded
01-Apr-21 00:09 NOTE: Uninstalling AFD... 
01-Apr-21 00:10 NOTE: Successfully uninstalled AFD
01-Apr-21 00:10 NOTE: command execution (/bin/rpm -q sles-release 2>&1) returned : 1
01-Apr-21 00:10 NOTE: asmcmdafd_is_siha Opening /etc/oracle/ocr.loc
01-Apr-21 00:10 NOTE: asmcmdafd_is_siha Value (FALSE) set for local_only
01-Apr-21 00:10 NOTE: Starting OHASD... 
01-Apr-21 00:10 NOTE: asmcmdafd_is_siha Opening /etc/oracle/ocr.loc
01-Apr-21 00:10 NOTE: asmcmdafd_is_siha Value (FALSE) set for local_only
The lines above run the environment checks and uninstall AFD.

01-Apr-21 00:10 cssd res modify command : /u01/app/19.0.0/grid/bin/crsctl modify res ora.cssd -attr "STOP_DEPENDENCIES='hard(shutdown:ora.gipcd,shutdown:ora.diskmon,intermediate:ora.cssdmonitor, shutdown:ora.gpnpd)'" -init
01-Apr-21 00:10 cssdmonitor res modify command : /u01/app/19.0.0/grid/bin/crsctl modify res ora.cssdmonitor -attr "START_DEPENDENCIES='hard(ora.gpnpd)'" -init
These two commands modify the dependencies of the OLR (-init) resources ora.cssd and ora.cssdmonitor.
01-Apr-21 00:10 NOTE: Modified ora.cssd and ora.cssdmonitor resource attributes
01-Apr-21 00:10 NOTE: asmcmdafd_is_siha Opening /etc/oracle/ocr.loc
01-Apr-21 00:10 NOTE: asmcmdafd_is_siha Value (FALSE) set for local_only
01-Apr-21 00:10 NOTE: Checking existence of AFD resource
01-Apr-21 00:10 NOTE: command execution (/u01/app/19.0.0/grid/bin/crsctl stop resource ora.driver.afd -init) returned : 1
01-Apr-21 00:10 NOTE: AFD resource stopped or not running
01-Apr-21 00:10 NOTE: Deleted AFD resource
The lines above delete the AFD resources from the cluster: ora.driver.afd is removed from the -init resources.

01-Apr-21 00:10 NOTE: asmcmdafd_is_siha Opening /etc/oracle/ocr.loc
01-Apr-21 00:10 NOTE: asmcmdafd_is_siha Value (FALSE) set for local_only
01-Apr-21 00:10 NOTE: Stopping OHASD... 
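When the trace is long, the milestones discussed above can be filtered out of it. A rough sketch that reads the log lines on stdin (the article does not give the log file's location, so none is assumed here) and keeps only the uninstall, resource-change and OHASD messages; `afd_deconfig_summary` is a hypothetical name.

```shell
# Hypothetical filter: keep the milestone lines of an afd_deconfigure trace read on
# stdin and strip the leading "DD-Mon-YY HH:MM " timestamp.
afd_deconfig_summary() {
  grep -E 'NOTE: (Uninstalling AFD|Successfully uninstalled AFD|Modified ora\.|Deleted AFD resource|Starting OHASD|Stopping OHASD)|res modify command' \
    | sed 's/^[0-9]*-[A-Za-z]*-[0-9]* [0-9]*:[0-9]* //'
}
```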

7. Start CRS

[root@ora19c02 ~]# 
[root@ora19c02 ~]# /u01/app/19.0.0/grid/bin/crsctl start crs 
CRS-4123: Oracle High Availability Services has been started.

We are not done yet: only one node has been converted so far.

8. On node 1, repeat steps 1-4 above.

[root@ora19c01 ~]# /u01/app/19.0.0/grid/bin/crsctl stop crs -f 
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ora19c01'
…………
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ora19c01' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@ora19c01 ~]# export ORACLE_BASE=/u01/app/grid
[root@ora19c01 ~]# /u01/app/19.0.0/grid/bin/asmcmd afd_state
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'ENABLED' on host 'ora19c01'
[root@ora19c01 ~]#  /u01/app/19.0.0/grid/bin/asmcmd afd_filter -d
[root@ora19c01 ~]# /u01/app/19.0.0/grid/bin/asmcmd afd_lsdsk
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
DATA00                     DISABLED   /dev/asm-data0
DATA2DG00                  DISABLED   /dev/asm-data2dg0
DATA2DG01                  DISABLED   /dev/asm-data2dg1
DATA2DG02                  DISABLED   /dev/asm-data2dg2
DATA2DG03                  DISABLED   /dev/asm-data2dg3
FRA00                      DISABLED   /dev/asm-fra0
MGMT00                     DISABLED   /dev/asm-mgmt0
OCR00                      DISABLED   /dev/asm-ocr0
OCR01                      DISABLED   /dev/asm-ocr1
OCR02                      DISABLED   /dev/asm-ocr2

9. The key step: remove the AFD labels

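(According to the official documentation this step is optional: "As a root user unlabel the ASMFD disks. (it is an optional step, if you skip this step you can ignore step 7)". The step numbers in that quote refer to the MOS note, not to this article.)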

/u01/app/19.0.0/grid/bin/asmcmd afd_lsdsk |grep DISABLED |awk '{print "/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel "$1" -f"}'
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel DATA00 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel DATA2DG00 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel DATA2DG01 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel DATA2DG02 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel DATA2DG03 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel FRA00 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel MGMT00 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel OCR00 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel OCR01 -f
/u01/app/19.0.0/grid/bin/asmcmd afd_unlabel OCR02 -f
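Writing the generated commands to a file, reviewing it, and only then running it is safer than piping them straight into a shell. The sketch below is a variant of the one-liner above that also skips a label if it is listed more than once. The grid home default is taken from this article; `gen_afd_unlabel_cmds` is a hypothetical name.

```shell
# Hypothetical generator: read `asmcmd afd_lsdsk` output on stdin and print one
# "asmcmd afd_unlabel <LABEL> -f" command per unique DISABLED label.
# $1 is the grid home (default taken from this article). Nothing is executed.
gen_afd_unlabel_cmds() {
  gh=${1:-/u01/app/19.0.0/grid}
  awk -v asmcmd="$gh/bin/asmcmd" \
    'NF == 3 && $2 == "DISABLED" && !seen[$1]++ { print asmcmd " afd_unlabel " $1 " -f" }'
}

# Example, as root:
#   /u01/app/19.0.0/grid/bin/asmcmd afd_lsdsk | gen_afd_unlabel_cmds > /tmp/afd_unlabel.sh
#   cat /tmp/afd_unlabel.sh        # review first
#   sh /tmp/afd_unlabel.sh
```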


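Afterwards the disk header is updated: the kfdhdb.driver.provstr field of /dev/asm-data0 changes from ORCLDISKDATA00 (before unlabeling) to ORCLDISK (after):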
[grid@ora19c02 alert]$ kfed read /dev/asm-data0 |grep provstr
kfdhdb.driver.provstr:   ORCLDISKDATA00 ; 0x000: length=14
[grid@ora19c02 alert]$ kfed read /dev/asm-data0 |grep provstr
kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8
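With many disks, the same header check can be looped. A minimal sketch that reads `kfed read` output on stdin and treats the bare ORCLDISK value, as seen above after afd_unlabel, as "label removed"; the function names are hypothetical and the `/dev/asm-*` glob in the example comes from this article's device naming.

```shell
# Hypothetical helpers; they read `kfed read <device>` output on stdin.

# Print the value of the kfdhdb.driver.provstr field.
kfed_provstr() {
  awk '$1 == "kfdhdb.driver.provstr:" { print $2 }'
}

# Succeed when provstr is the bare "ORCLDISK" seen above after afd_unlabel.
kfed_is_unlabeled() {
  [ "$(kfed_provstr)" = "ORCLDISK" ]
}

# Example, as the grid user:
#   for d in /dev/asm-*; do
#     kfed read "$d" | kfed_is_unlabeled || echo "$d: AFD label still present"
#   done
```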

10. Stop the ACFS and AFD drivers, then deconfigure AFD

[root@ora19c01 ~]# lsmod |grep oracle 
oracleacfs           5581810  0 
oracleadvm           1231385  0 
oracleoks             721311  2 oracleacfs,oracleadvm
oracleafd             214072  0 
[root@ora19c01 ~]# /u01/app/19.0.0/grid/bin/asmcmd afd_state
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'DISABLED' on host 'ora19c01'
[root@ora19c01 ~]# /u01/app/19.0.0/grid/bin/acfsload stop
[root@ora19c01 ~]# /u01/app/19.0.0/grid/bin/afdload stop
[root@ora19c01 ~]# lsmod |grep oracle

11. Start CRS

[root@ora19c01 ~]#  /u01/app/19.0.0/grid/bin/crsctl start crs 
CRS-4123: Oracle High Availability Services has been started.

12. Update the ASM disk discovery string (drop AFD:*)

[grid@ora19c02 ~]$ asmcmd dsget
parameter:/dev/asm-*, AFD:*
profile:/dev/asm-*,AFD:*
[grid@ora19c02 ~]$ asmcmd dsset '/dev/asm-*'
[grid@ora19c02 ~]$ asmcmd dsget
parameter:/dev/asm-*
profile:/dev/asm-*
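The new value for `dsset` can also be derived from the current one instead of being typed by hand. A sketch that removes one pattern from a comma-separated discovery string; `ds_without` is a hypothetical helper, and it assumes patterns never contain commas themselves.

```shell
# Hypothetical helper: print the comma-separated discovery string $1 with the
# pattern $2 removed (compared literally), trimming blanks around each element.
ds_without() {
  printf '%s\n' "$1" | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -vxF -- "$2" \
    | paste -sd, -
}

# Example, as the grid user:
#   asmcmd dsset "$(ds_without "$(asmcmd dsget | sed -n 's/^parameter://p')" 'AFD:*')"
```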

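This completes disabling AFD and converting the disks back to ordinary disks. The test shows that the procedure can be carried out in a rolling fashion; there is no need to stop the whole cluster.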

II. Resolving the fault

Back to the original fault.

Option 1: disable AFD and convert to ordinary disks, one node at a time

1. Stop CRS:
   /u01/app/19.0.0/grid/bin/crsctl stop crs
2. Disable AFD filtering:
   /u01/app/19.0.0/grid/bin/asmcmd afd_filter -d
3. Stop the ACFS and AFD drivers, then deconfigure AFD:
   /u01/app/19.0.0/grid/bin/acfsload stop
   /u01/app/19.0.0/grid/bin/afdload stop
   /u01/app/19.0.0/grid/bin/asmcmd afd_deconfigure
4. Start CRS:
   /u01/app/19.0.0/grid/bin/crsctl start crs
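The four steps can be wrapped so that each node runs the same sequence and stops at the first failing command. This is a sketch under stated assumptions, not a tested production script: GRID_HOME, ORACLE_BASE and the DRY_RUN switch are illustrative defaults taken from this article, `run` and `afd_to_plain_disk_node` are hypothetical names, and, like the list above, it does not include the discovery-string changes from steps 5 and 12 of section I.

```shell
#!/usr/bin/env bash
# Hypothetical per-node wrapper for option 1, run as root on one node at a time.
# GRID_HOME, ORACLE_BASE and DRY_RUN defaults are assumptions taken from this
# article; with DRY_RUN=1 (the default) the commands are only printed.
GRID_HOME=${GRID_HOME:-/u01/app/19.0.0/grid}
export ORACLE_BASE=${ORACLE_BASE:-/u01/app/grid}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Steps 1-4 of option 1; the && chain stops at the first failing command.
afd_to_plain_disk_node() {
  run "$GRID_HOME/bin/crsctl" stop crs &&
    run "$GRID_HOME/bin/asmcmd" afd_filter -d &&
    run "$GRID_HOME/bin/acfsload" stop &&
    run "$GRID_HOME/bin/afdload" stop &&
    run "$GRID_HOME/bin/asmcmd" afd_deconfigure &&
    run "$GRID_HOME/bin/crsctl" start crs
}

# Usage (review the printed plan first, then run for real):
#   afd_to_plain_disk_node
#   DRY_RUN=0 afd_to_plain_disk_node
```

Printing the plan by default keeps an accidental invocation harmless; the real run needs an explicit DRY_RUN=0.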

Option 2: start the cluster with the old ACFS and AFD drivers:

1. Stop CRS:
/opt/oracle/app/grid_home/bin/crsctl stop crs
2. Move the 19.10 drivers aside:
mv /opt/oracle/app/grid_home/usm/install/Oracle/EL7/x86_64/3.10.0-862/3.10.0-862-x86_64 /opt/oracle/app/grid_home/usm/install/Oracle/EL7/x86_64/3.10.0-862/3.10.0-862-x86_64_19.10

3. Restore the drivers of the source version:
cp -r /opt/oracle/app/grid_home/.patch_storage/32218663_*/files/usm/install/Oracle/EL7/x86_64/3.10.0-862/3.10.0-862-x86_64 /opt/oracle/app/grid_home/usm/install/Oracle/EL7/x86_64/3.10.0-862/ 
chmod 755 /opt/oracle/app/grid_home/usm/install/Oracle/EL7/x86_64/3.10.0-862/3.10.0-862-x86_64
chmod 755 /opt/oracle/app/grid_home/usm/install/Oracle/EL7/x86_64/3.10.0-862/3.10.0-862-x86_64/bin

cd  /opt/oracle/app/grid_home/usm/install/Oracle/EL7/x86_64/3.10.0-862/3.10.0-862-x86_64/bin
unzip oracka_mod_kga.zip
unzip oracka.zip
unzip oracleacfs.zip
unzip oracleadvm.zip
unzip oracleafd.zip
unzip oracleoks.zip
unzip oracleolfs.zip

4. Delete the driver files already installed for the running kernel:
rm -f /lib/modules/3.10.0-957.el7.x86_64/weak-updates/oracle/oracleafd.ko
rm -f /lib/modules/3.10.0-957.el7.x86_64/weak-updates/usm/oracleacfs.ko
rm -f /lib/modules/3.10.0-957.el7.x86_64/weak-updates/usm/oracleadvm.ko
rm -f /lib/modules/3.10.0-957.el7.x86_64/weak-updates/usm/oracleoks.ko

5. Start CRS:
/opt/oracle/app/grid_home/bin/crsctl start crs
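The same five steps can be collected into one function with a dry-run default, which makes it easier to check the long paths before touching the grid home. A sketch only: every path default is copied from the commands above (GRID_HOME, the 3.10.0-862 driver directory, patch 32218663 and kernel 3.10.0-957.el7.x86_64) and must be adapted to the actual environment; `run` and `restore_old_usm_drivers` are hypothetical names, and the unzip loop replaces the seven individual unzip commands with the same archive names.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper for option 2 (restore the pre-19.10 ACFS/AFD drivers).
# All defaults are copied from the steps above; adjust them to your environment.
# DRY_RUN=1 (the default) only prints the commands.
GRID_HOME=${GRID_HOME:-/opt/oracle/app/grid_home}
DRV_DIR=${DRV_DIR:-usm/install/Oracle/EL7/x86_64/3.10.0-862/3.10.0-862-x86_64}
PATCH_ID=${PATCH_ID:-32218663}
KVER=${KVER:-3.10.0-957.el7.x86_64}
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then printf '+ %s\n' "$*"; else "$@"; fi
}

restore_old_usm_drivers() {
  drv="$GRID_HOME/$DRV_DIR"
  run "$GRID_HOME/bin/crsctl" stop crs || return 1
  # Keep the 19.10 drivers aside instead of deleting them.
  run mv "$drv" "${drv}_19.10" || return 1
  # Copy the previous drivers back from the patch backup area (glob left unquoted).
  run cp -r "$GRID_HOME"/.patch_storage/"$PATCH_ID"_*/files/"$DRV_DIR" "$(dirname "$drv")/" || return 1
  run chmod 755 "$drv" "$drv/bin" || return 1
  for z in oracka_mod_kga oracka oracleacfs oracleadvm oracleafd oracleoks oracleolfs; do
    run unzip "$drv/bin/$z.zip" -d "$drv/bin" || return 1
  done
  # Remove the driver files already installed under the running kernel.
  run rm -f "/lib/modules/$KVER/weak-updates/oracle/oracleafd.ko" \
    "/lib/modules/$KVER/weak-updates/usm/oracleacfs.ko" \
    "/lib/modules/$KVER/weak-updates/usm/oracleadvm.ko" \
    "/lib/modules/$KVER/weak-updates/usm/oracleoks.ko" || return 1
  run "$GRID_HOME/bin/crsctl" start crs
}

# Usage, as root:
#   restore_old_usm_drivers              # print the plan
#   DRY_RUN=0 restore_old_usm_drivers    # execute it
```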

III. Summary

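AFD is a mixed blessing. Put simply, it is a disk-binding solution that tightens disk protection: an OS-level driver prevents the disks from being accidentally overwritten or formatted on the local system, rather than putting a special format on the disks. That is why an AFD disk and its original block-device (raw-device) path can coexist. Knowing how to convert between AFD and ordinary disks also helps in handling particular situations like this one.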


References:

ASMFD: How to Migrate ASM Diskgroups from ASMFD (ASM Filter Driver) to ASMLIB on Oracle Grid Infrastructure (RAC) or Oracle Restart (Doc ID 2172803.1)

'blk_cloned_rq_check_limits: over max segments limit' Kernel Messages After Grid Infrastructure Patching to 19.10 DB RU (Doc ID 2754950.1)

Bug 32437246: HIGH PERFORMANCE ISSUE AFTER APPLYING 19.10 ACFSRU


This article is reposted from DBA的自我修养.
