暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

【应知应会】使用kfed运维兵器修复ASM磁盘和磁盘组

799
欢迎关注“数据库运维之道”公众号,一起学习数据库技术! 本期将为大家分享“使用kfed运维兵器修复ASM磁盘和磁盘组” 的运维技能。

关键词:ORA-15053、ORA-15027、ORA-15040、ORA-01187、kfed repair、kfed merge、kfed read、strace

数据库的ASM磁盘或ASM磁盘组不小心被人为误删除从而导致磁盘组无法挂载,DBA需要将磁盘组挂载起来。通过实验再次模拟故障环境,详细分解每个操作步骤,避免下次继续踩坑从而提升运维处置效率。

场景描述:假设ASM磁盘组被删除,我们将如何进行快速恢复呢?

模拟环境:两个节点的RAC集群、数据库名YWZD、ASM磁盘组名称YWZD、测试ASM磁盘名称YWZD1

实验步骤:

1、查看ASM磁盘的名称、路径和状态信息;

    col path for a50
    set pagesize 100
    set linesize 1000
    select name,path,header_status from v$asm_disk
    where group_number = (select group_number from v$asm_diskgroup
    where name='YWZD');-->输入ASM磁盘组名称
    NAME PATH HEADER_STATU
    ------------------------------ -------------------------------------------------- ------------
    YWZD_0000 dev/oracleasm/disks/YWZD1 MEMBER
    MEMBER表示ASM磁盘是磁盘组的成员。
    2、创建测试表空间和测试表;
      create tablespace ywzd datafile '+YWZD' size 100m;
      create table ywzd as select * from dba_tables;
      select count(*) from ywzd;
      COUNT(*)
      ----------
      1884
      3、删除ASM磁盘组,会报ORA-15053错误,即磁盘组存在文件;
        --删除ASM磁盘组
        su - grid
        $sqlplus as sysasm


        SQL> drop diskgroup ywzd;
        drop diskgroup ywzd
        *
        ERROR at line 1:
        ORA-15039: diskgroup not dropped
        ORA-15053: diskgroup "YWZD" contains existing files


        --错误码具体说明
        $oerr ora 15053
        15053, 00000, "diskgroup \"%s\" contains existing files"
        // *Cause: An attempt was made to drop a diskgroup that still contains
        // existing files.
        // *Action: Specify the INCLUDING CONTENTS option to drop the diskgroup
        // and all of its existing files.
        //
        4、根据上述方案,再次删除ASM磁盘组并加INCLUDING CONTENTS参数。但是出现ORA-15027错误,即磁盘组被实例占用无法dismount。
          --删除ASM磁盘组
          SQL> drop diskgroup ywzd including contents;
          drop diskgroup ywzd including contents
          *
          ERROR at line 1:
          ORA-15039: diskgroup not dropped
          ORA-15027: active use of diskgroup "YWZD" precludes its dismount


          --错误码具体说明
          $oerr ora 15027
          15027, 00000, "active use of diskgroup \"%s\" precludes its dismount"
          // *Cause: An ALTER DISKGROUP ... DISMOUNT command specified a diskgroup
          // which had clients with open files in the diskgroup. Diskgroups
          // cannot be dismounted until all open files in the diskgroup
          // are closed.
          // *Action: Stop all clients that are using this diskgroup and retry
          // the ALTER DISKGROUP ... DISMOUNT command. Query the V$ASM_CLIENT
          // fixed view in an ASM instance or use ASMCMD LSOF to list active
          // clients.
          //


          --查看占用磁盘组的实例名
          select instance_name,db_name,status from GV$ASM_CLIENT
          where group_number = (select group_number from v$asm_diskgroup
          where name='YWZD'); -->>输入磁盘组名称
          INSTANCE_NAME DB_NAME STATUS
          ------------------------------ -------- ------------
          +ASM1 ywzd CONNECTED
          +ASM2 ywzd CONNECTED


          --或者使用lsof命令查看
          $asmcmd
          ASMCMD> lsof
          DB_Name Instance_Name Path


          ywzd ywzd1 +YWZD/ywzd/DATAFILE/ywzd.257.1164789351
          ywzd ywzd1 +YWZD/ywzd_block_track.log
          5、备份ASM磁盘头数据后,再执行删除,方便后面对比分析及恢复。
            --备份磁盘头信息
            kfed read dev/oracleasm/disks/YWZD1 >/tmp/YWZD1-0-0-B 磁盘头信息备份
            kfed read dev/oracleasm/disks/YWZD1 blkn=1 >/tmp/YWZD1-0-1-B 对块号为1备份
            kfed read dev/oracleasm/disks/YWZD1 aun=1 >/tmp/YWZD1-1-0-B 对aun为1备份


            --关闭数据库实例
            $srvctl stop database -d ywzd


            --切换到grid用户,先把其中一个节点的磁盘组DISMOUNT,否则会出现以下错误。
            SQL> drop diskgroup ywzd including contents;
            drop diskgroup ywzd including contents
            *
            ERROR at line 1:
            ORA-15039: diskgroup not dropped
            ORA-15073: diskgroup YWZD is mounted by another ASM instance


            --成功删除
            SQL> drop diskgroup ywzd including contents;
            Diskgroup dropped.
            6、手动再拉起磁盘组,出现ORA-15040错误,即磁盘组不完整错误。
              --Mount磁盘组,提示磁盘组不完整
              SQL> alter diskgroup ywzd mount;
              alter diskgroup ywzd mount
              *
              ERROR at line 1:
              ORA-15032: not all alterations performed
              ORA-15017: diskgroup "YWZD" cannot be mounted
              ORA-15040: diskgroup is incomplete


              --ASM实例日志,显示挂载失败
              2024-03-28T12:44:58.456141+08:00
              SQL> alter diskgroup ywzd mount
              2024-03-28T12:44:58.461584+08:00
              NOTE: cache registered group YWZD 8/0x4C149EB9
              NOTE: cache began mount (first) of group YWZD 8/0x4C149EB9
              2024-03-28T12:44:58.518360+08:00
              ERROR: no read quorum in group: required 2, found 0 disks
              2024-03-28T12:44:58.518657+08:00
              NOTE: cache dismounting (clean) group 8/0x4C149EB9 (YWZD)
              NOTE: messaging CKPT to quiesce pins Unix process pid: 29396, image: oracle@hlw3-db11 (TNS V1-V3)
              NOTE: dbwr not being msg'd to dismount
              NOTE: LGWR not being messaged to dismount
              NOTE: cache dismounted group 8/0x4C149EB9 (YWZD)
              NOTE: cache ending mount (fail) of group YWZD number=8 incarn=0x4c149eb9
              NOTE: cache deleting context for group YWZD 8/0x4c149eb9
              2024-03-28T12:44:58.534982+08:00
              GMON dismounting group 8 at 26996 for pid 34, osid 29396
              2024-03-28T12:44:58.536694+08:00
              ERROR: diskgroup YWZD was not mounted
              ORA-15032: not all alterations performed
              ORA-15017: diskgroup "YWZD" cannot be mounted
              ORA-15040: diskgroup is incomplete


              2024-03-28T12:44:58.543094+08:00
              ERROR: alter diskgroup ywzd mount


              --错误码具体信息
              $oerr ora 15040
              15040, 00000, "diskgroup is incomplete"
              // *Cause: Some of the disks comprising a diskgroup were not present.
              // *Action: Check the hardware to ensure that all disks are functional. Also
              // check that the setting of the ASM_DISKSTRING initialization
              // parameter has not changed. Alternatively, for normal or high
              // redundancy diskgroups, use MOUNT FORCE to offline missing disks
              // as part of mounting the diskgroup.
              //
              7、使用kfed再次备份下磁盘头数据,然后再进行删除前后对比。
                --备份磁盘头数据
                kfed read dev/oracleasm/disks/YWZD1 >/tmp/YWZD1-0-0-A
                kfed read dev/oracleasm/disks/YWZD1 blkn=1 >/tmp/YWZD1-0-1-A
                kfed read dev/oracleasm/disks/YWZD1 aun=1 >/tmp/YWZD1-1-0-A


                --通过diff命令进行对比
                diff tmp/YWZD1-0-0-B tmp/YWZD1-0-0-A
                8、比较两个文件的差异信息,其中带有“!”符号的记录表示前后有差异。可以看到磁盘头状态从KFDHDR_MEMBER变成KFDHDR_FORMER。
                  命令一:$ diff -C 1 tmp/YWZD1-0-0-B tmp/YWZD1-0-0-A|grep "kfdhdb.hdrsts"
                  *** tmp/YWZD1-0-0-B 2024-03-28 12:23:29.022209374 +0800
                  --- tmp/YWZD1-0-0-A 2024-03-28 12:56:31.268409902 +0800
                  ***************
                  *** 21,23 ****
                  kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL
                  ! kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER
                  kfdhdb.dskname: YWZD_0000 ; 0x028: length=9
                  --- 21,23 ----
                  kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL
                  ! kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORMER
                  kfdhdb.dskname: YWZD_0000 ; 0x028: length=9
                  ***************

                  命令二:$diff -y tmp/YWZD1-0-0-B tmp/YWZD1-0-0-A|grep "|"
                  kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBE | kfdhdb.hdrsts: 4 ; 0x027: KFDHDR_FORME
                  9、核实上述的对比信息后,修改备份文件中的磁盘头状态参数kfdhdb.hdrsts,并将修改的内容合并到磁盘头。
                    --修改/tmp/YWZD1-0-0-A文件
                    kfdhdb.hdrsts:即状态值从4改为3,状态名称从KFDHDR_FORMER改为KFDHDR_MEMBER


                    --合并修改结果
                    kfed merge dev/oracleasm/disks/YWZD1 text=/tmp/YWZD1-0-0-A
                    10、最后把磁盘组挂载起来,并进行数据验证。
                      --核实磁盘状态是否为MEMBER
                      select name,path,header_status from v$asm_disk
                      where group_number = (select group_number from v$asm_diskgroup
                      where name='YWZD');-->输入ASM磁盘组名称


                      --挂载磁盘组
                      SQL> alter diskgroup ywzd mount;
                      Diskgroup altered.


                      --启动数据库,并验证数据
                      select count(*) from ywzd;
                      ASM磁盘组挂载成功!
                      实验证明:当磁盘组被误删除后,可以通过kfed进行完美恢复。
                      场景描述:假设执行oracleasm deletedisks 命令删除ASM磁盘,我们将如何进行快速恢复呢?
                      模拟环境:两个节点的RAC集群、数据库名YWZD、ASM磁盘组名称YWZD、测试ASM磁盘名称YWZD1
                      实验步骤:
                      1、确认数据文件状态信息;
                        --数据文件状态可用
                        col file_name for a60
                        set linesize 1000
                        select file_name ,status from dba_data_files where file_name like '+YWZD%';
                        FILE_NAME STATUS
                        ------------------------------------------------------------ ---------
                        +YWZD/ywzd/DATAFILE/ywzd.257.1164789351 AVAILABLE


                        --ASM磁盘可识别
                        $ oracleasm listdisks | grep YWZD
                        YWZD1
                        2、先备份ASM磁盘头数据,再执行删除ASM盘,方便后面对比分析或恢复使用。
                          --备份磁盘头信息
                          kfed read dev/oracleasm/disks/YWZD1 >/tmp/YWZD11-0-0-B
                          kfed read dev/oracleasm/disks/YWZD1 blkn=1 >/tmp/YWZD11-0-1-B
                          kfed read dev/oracleasm/disks/YWZD1 aun=1 >/tmp/YWZD11-1-0-B
                          3、在节点1的root用户下执行deletedisk删除ASM磁盘
                            --删除磁盘
                            # oracleasm deletedisk YWZD1
                            Clearing disk header: done
                            Dropping disk: done


                            # oracleasm listdisks | grep YWZD-->无记录


                            # oracleasm scandisks
                            Reloading disk partitions: done
                            Cleaning any stale ASM disks...
                            Scanning system for ASM disks...
                            4、在节点2检查ASM的状态,磁盘状态从MEMBER变成为CANDIDATE。
                              --检查磁盘状态为CANDIDATE
                              col path for a40
                              col name for a20
                              set linesize 1000
                              select path,name,state ,header_status from v$asm_disk where path like '%YWZD%';
                              PATH NAME STATE HEADER_STATU
                              ---------------------------------------- -------------------- -------- ------------
                              /dev/oracleasm/disks/YWZD1 NORMAL CANDIDATE


                              --ASM磁盘扫描
                              # oracleasm scandisks
                              Reloading disk partitions: done
                              Cleaning any stale ASM disks...
                              Cleaning disk "YWZD1"
                              Scanning system for ASM disks...


                              # oracleasm listdisks | grep YWZD-->无记录


                              select path,name,state ,header_status from v$asm_disk where path like '%YWZD%';
                              no rows selected
                              5、检查数据库文件的状态,已经无法识别到数据文件。
                                --查看数据文件状态为AVAILABLE
                                col file_name for a60
                                select file_name ,status from dba_data_files where file_name like '+YWZD%';
                                FILE_NAME STATUS
                                ------------------------------------------------------------ ---------
                                +YWZD/ywzd/DATAFILE/ywzd.257.1164789351 AVAILABLE


                                --手动触发检查点
                                alter system checkpoint;
                                alter system checkpoint;


                                --物理磁盘检测异常,触发磁盘组dismounted
                                select file_name ,status from dba_data_files where file_name like '+YWZD%'
                                *
                                ERROR at line 1:
                                ORA-01187: cannot read from file because it failed verification tests
                                ORA-01110: data file 10: '+YWZD/ywzd/DATAFILE/ywzd.257.1164789351'


                                --ASM日志
                                2024-03-28T22:21:01.941213+08:00
                                Errors in file /u01/app/oracle/diag/rdbms/ywzd/ywzd1/trace/ywzd1_ckpt_16646.trc:
                                ORA-01110: data file 10: '+YWZD/ywzd/DATAFILE/ywzd.257.1164789351'
                                ORA-01115: IO error reading block from file 10 (block # 1)
                                ORA-15078: ASM diskgroup was forcibly dismounted
                                ORA-15078: ASM diskgroup was forcibly dismounted
                                6、asm磁盘头的元数据在0号数据块,其备份的元数据在510号数据块,因此可以通过kfed repair来修复。
                                  -- 执行修复命令触发KFED-00303
                                  # su - grid
                                  $ kfed repair /dev/oracleasm/disks/YWZD1
                                  KFED-00303: unable to open file '/dev/oracleasm/disks/YWZD1'


                                  --通过命令把原先的ASM磁盘状态写入磁盘头(renamedisk -f)
                                  $oracleasm createdisk YWZD1 /dev/mapper/YW-YWZD
                                  Unable to open device "/dev/mapper/YW-YWZD": Permission denied


                                  --root用户下创建
                                  # oracleasm createdisk YWZD1 /dev/mapper/YW-YWZD
                                  Writing disk header: done
                                  Instantiating disk: done


                                  --检查ASM磁盘状态为PROVISIONED
                                  PATH NAME STATE HEADER_STATU
                                  ---------------------------------------- -------------------- -------- ------------
                                  /dev/oracleasm/disks/YWZD1 NORMAL PROVISIONED


                                  --再次执行kfed repair进行修复,并通过strace把日志记录下来。
                                  strace -o asm_learn.log kfed repair /dev/oracleasm/disks/YWZD1


                                  --再次检查磁盘状态,已变成MEMBER
                                  PATH NAME STATE HEADER_STATU
                                  ---------------------------------------- -------------------- -------- ------------
                                  /dev/oracleasm/disks/YWZD1 NORMAL MEMBER


                                  --另一个节点扫描磁盘
                                  # oracleasm scandisks
                                  Reloading disk partitions: done
                                  Cleaning any stale ASM disks...
                                  Scanning system for ASM disks...
                                  Instantiating disk "YWZD1"


                                  --手动挂载两个节点的磁盘组
                                  SQL> select name,state from v$asm_diskgroup;
                                  NAME STATE
                                  ------------------------------ -----------
                                  YWZD DISMOUNTED


                                  SQL> alter diskgroup ywzd mount;
                                  Diskgroup altered.
                                  7、检查数据文件状态及验证数据。
                                    --数据文件读取异常
                                    select file_name ,status from dba_data_files where file_name like '+YWZD%'
                                    *
                                    ERROR at line 1:
                                    ORA-01187: cannot read from file because it failed verification tests
                                    ORA-01110: data file 10: '+YWZD/ywzd/DATAFILE/ywzd.257.1164789351'


                                    --错误码信息
                                    !oerr ora 01187
                                    01187, 00000, "cannot read from file %s because it failed verification tests"
                                    // *Cause: The data file did not pass the checks to insure it is part of the
                                    // database. Reads are not allowed until it is verified.
                                    // *Action: Make the correct file available to the database. Then, either open
                                    // the database, or execute ALTER SYSTEM CHECK DATAFILES.


                                    --手动进行数据文件校验
                                    alter database datafile 10 online;
                                    Database altered.


                                    ALTER SYSTEM CHECK DATAFILES;
                                    Database altered.


                                    --数据文件状态恢复正常
                                    select file_name ,status from dba_data_files where file_name like '+YWZD%';
                                    FILE_NAME STATUS
                                    ------------------------------------------------------------ ---------
                                    +YWZD/ywzd/DATAFILE/ywzd.257.1164789351 AVAILABLE


                                    --数据成功恢复
                                    SYS@ywzd1> select count(*) from ywzd;


                                    COUNT(*)
                                    ----------
                                    1884
                                    根据以上步骤成功将ASM磁盘、ASM磁盘组和数据文件成功恢复出来。
                                    虽然整个恢复过程步骤简单,但是还是要理解其中的原理。数据库底层未实际删除数据,而修改ASM磁盘头状态信息。可使用kfed工具手动调整磁盘头状态或将元数据备份直接恢复。
                                    1、ASM磁盘头HEADER_STATUS的状态,常见有以下4种。
                                      CANDIDATE\PROVISIONED:磁盘不是磁盘组的一部分,可以通过以下ALTER DISKGROUP语句添加到磁盘组中
                                      MEMBER:磁盘是现有磁盘组的成员。请勿尝试将磁盘添加到其他磁盘组
                                      FORMER:磁盘曾经是磁盘组的一部分,但已从该组中彻底删除。可以使用以下ALTER DISKGROUP语句将其添加到新的磁盘组中
                                      2、正常情况下,可通过kfed read检查以下关键指标
                                        kfbh.type 值为 KFBTYP_DISKHEAD
                                        kfdhdb.hdrsts 值为 KFDHDR_MEMBER
                                        • ASM tools used by Support : KFOD, KFED, AMDU (Doc ID 1485597.1)
                                        • https://blog.csdn.net/allway2/article/details/107250526/
                                        以上就是本期关于“使用kfed运维兵器修复ASM磁盘和磁盘组”。希望能给大家带来帮助!

                                        【性能优化】数据泵impdp导入时间特别久及导入中断后继续导入案例分享

                                        【问题诊断】安全等保加固重命名SYS用户触发ORA-600[kokasgi1]错误案例分享

                                        【问题诊断】机房断电后Oracle数据库启动异常ORA-00600[kcratr_nab_less_than_odr]案例分享

                                        【问题诊断】Data Guard Gap日志间隙SCN增量备份恢复案例分享

                                        欢迎关注“数据库运维之道”公众号,一起学习数据库技术!
                                        欢迎扫码进“数据库运维之道”,此群用于数据库技术交流,禁止发广告!
                                        可以加我的微信,交个朋友或讨论数据库解决方案,请备注”姓名单位“,谢谢!


                                        文章转载自数据库运维之道,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                                        评论