
Testing ASM Disk Failure Scenario and disk_repair_time

Original, by 许玉冲, 2023-08-22

#ORA-15084: ASM disk "" is offline and cannot be dropped.



When an ASM disk fails, ASM behaves differently depending on the redundancy of the diskgroup. With EXTERNAL REDUNDANCY, the diskgroup keeps working as long as redundancy is provided at the external RAID level. Without external RAID, the diskgroup is dismounted immediately; the disk has to be repaired or replaced, the diskgroup may then need to be dropped and re-created, and the data in it has to be recovered.


For NORMAL and HIGH redundancy diskgroups, the behavior is a little different. When a disk becomes corrupted or goes missing in such a diskgroup, an error is reported in the alert log and the disk goes OFFLINE, as the output of the query below shows. To test an ASM disk failure, I simply unplugged from the storage a disk that belonged to an ASM diskgroup with NORMAL redundancy.

col name format a8
col header_status format a7
set lines 2000
col path format a10
select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN         1200 OFFLINE MISSING

  

Here we see the value 1200 in the REPAIR_TIMER column; this is the time, in seconds, after which the disk will be dropped automatically. It is derived from a diskgroup attribute called DISK_REPAIR_TIME, which is discussed below.
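As a small illustration of reading that countdown, the following query is a sketch (not part of the original test) that lists only the disks that are not ONLINE and converts REPAIR_TIMER from seconds to minutes. The column aliases seconds_left and minutes_left are my own. For the DATA4 row above, 1200 seconds works out to 20 minutes.

```sql
-- List disks that are not ONLINE, with the remaining repair window
-- shown in both seconds and minutes (aliases are illustrative).
select name,
       mode_status,
       repair_timer                as seconds_left,
       round(repair_timer / 60, 1) as minutes_left
from   v$asm_disk
where  mode_status <> 'ONLINE';
```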

In 10g, a disk that went missing was dropped immediately, and a REBALANCE operation kicked in right away, with ASM redistributing the extents across the remaining disks in the diskgroup to restore redundancy.


DISK_REPAIR_TIME

Starting with 11g, Oracle provides a diskgroup attribute called DISK_REPAIR_TIME, with a default value of 3.6 hours. It means that when a disk goes missing, ASM does not drop it immediately but waits for the disk to come back online or be replaced. This helps in scenarios where a disk is unplugged accidentally, or a storage server/SAN gets disconnected or rebooted and leaves an ASM diskgroup without one or more disks. While the disk(s) remain unavailable, ASM keeps track of the extents that should have been written to them, and writes those extents as soon as the missing disk(s) come back online (this feature is called fast mirror resync). If the disk(s) do not come back online within DISK_REPAIR_TIME, they are dropped and a rebalance starts.
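The repair window can also be exercised without physically unplugging anything. The following is a minimal sketch for a test system only, assuming a diskgroup named data with a disk named data4 and an illustrative 20-minute window. The DROP AFTER clause of OFFLINE DISK overrides DISK_REPAIR_TIME for that one operation. I am also assuming the diskgroup compatibility attributes are at 11.1 or higher, which fast mirror resync requires.

```sql
-- Take the disk offline on purpose. DROP AFTER overrides disk_repair_time
-- for this operation only (20m is an illustrative value, i.e. 1200 seconds).
alter diskgroup data offline disk data4 drop after 20m;

-- Bring the disk back before the timer expires, so that ASM resyncs it
-- instead of dropping it.
alter diskgroup data online disk data4;
```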


FAILGROUP_REPAIR_TIME

Starting with 12c, another attribute can be set for the diskgroup: FAILGROUP_REPAIR_TIME, with a default value of 24 hours. It is similar to DISK_REPAIR_TIME but applies to a whole failgroup. In Exadata, all disks of a storage server can belong to one failgroup (so that the mirror copy of an extent is never written to a disk in the same storage server), and this attribute is quite handy when a complete storage server is taken down for maintenance or for some other reason.

The following shows how to check and set the diskgroup attributes explained above.

SQL> col name format a30
SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               3.6h
failgroup_repair_time          24.0h

SQL> alter diskgroup data set attribute 'disk_repair_time'='1h';

Diskgroup altered.

SQL> alter diskgroup data set attribute 'failgroup_repair_time'='10h';

Diskgroup altered.

SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               1h
failgroup_repair_time          10h
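The queries above filter on a hard-coded group_number. A sketch of an alternative, which joins V$ASM_ATTRIBUTE to V$ASM_DISKGROUP so that both attributes are listed by diskgroup name for every mounted diskgroup, could look like this (the column aliases are my own):

```sql
-- Show both repair-time attributes per diskgroup, by name.
-- V$ASM_ATTRIBUTE only returns rows for mounted diskgroups.
select g.name  as diskgroup,
       a.name  as attribute,
       a.value
from   v$asm_attribute a
       join v$asm_diskgroup g
         on g.group_number = a.group_number
where  a.name in ('disk_repair_time', 'failgroup_repair_time')
order  by g.name, a.name;
```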


ORA-15042

If a disk is offline or missing from an ASM diskgroup, ASM may not mount the diskgroup automatically when the instance restarts. In that case, the diskgroup may need to be mounted manually with the FORCE option.

SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "3" is missing from group number "2"

SQL> alter diskgroup data mount force;

Diskgroup altered.


Monitoring the REPAIR_TIMER

After a disk goes offline, the clock starts ticking, and the value of REPAIR_TIMER can be monitored to see how much time remains to make the disk available again before it is dropped automatically.

SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          649 OFFLINE MISSING


-- We can confirm that no rebalance has started yet by using the following query.

SQL> select * from v$asm_operation;


no rows selected


If the disk can be made available again, or replaced, before DISK_REPAIR_TIME lapses, we can bring it back online. Note that it has to be brought ONLINE manually.

SQL> alter diskgroup data online disk data4;


Diskgroup altered.


SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          465 SYNCING CACHED


--Syncing is in progress, and hence no rebalance would occur.


SQL> select * from v$asm_operation;


no rows selected

-- After some time, everything would become normal.


SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status  from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4    ORCL:DATA4 NORMAL   MEMBER             0 ONLINE  CACHED
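When a storage outage takes several disks offline at once, bringing them back one by one is tedious. A minimal sketch, assuming the diskgroup is named data as in this test, uses the ONLINE ALL clause and then checks the disks of that diskgroup only:

```sql
-- Bring every offline disk in the diskgroup back online in one statement.
alter diskgroup data online all;

-- Watch MODE_STATUS move from SYNCING to ONLINE for the disks of DATA.
select name, mode_status, repair_timer
from   v$asm_disk
where  group_number = (select group_number
                       from   v$asm_diskgroup
                       where  name = 'DATA');
```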



If the same disk cannot be made available or replaced, either ASM drops it automatically once DISK_REPAIR_TIME has lapsed, or we drop it manually. A rebalance follows the drop.
Since the disk status is OFFLINE, the FORCE option is needed to drop it. After the drop, the rebalance starts and can be monitored through the V$ASM_OPERATION view.

SQL> alter diskgroup data drop disk data4;
alter diskgroup data drop disk data4
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "DATA4" is offline and cannot be dropped.

SQL> alter diskgroup data drop disk data4 force;

Diskgroup altered.

SQL> select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE DONE           9         42         42
           2 REBAL COMPACT   RUN            9          1          0
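For a rough idea of how long a rebalance will take, V$ASM_OPERATION also exposes the EST_RATE and EST_MINUTES columns. The query below is only a sketch. The figures are ASM's own estimates and tend to fluctuate while the operation runs.

```sql
-- Add ASM's estimated rate and remaining minutes to the progress query.
select group_number,
       operation,
       state,
       power,
       sofar,
       est_work,
       est_rate,
       est_minutes
from   v$asm_operation;
```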


Later we can replace the faulty disk and add the new disk back into the diskgroup. Adding the disk back initiates a rebalance once again.

SQL> alter diskgroup data add disk 'ORCL:DATA4';

Diskgroup altered.

SQL> select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE RUN            9         37       2787
           2 REBAL COMPACT   WAIT           9          1          0
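If the rebalance triggered by the add turns out to be too slow or too intrusive, its power can be changed while it is running. This is a minimal sketch, assuming the diskgroup is named data as above. The value 4 is purely illustrative, and the permitted range depends on the ASM version and compatibility settings.

```sql
-- Change the power of the rebalance for diskgroup DATA (4 is an example value).
alter diskgroup data rebalance power 4;

-- Confirm the new power and keep watching progress.
select group_number, operation, state, power, sofar, est_work
from   v$asm_operation;
```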



Last modified: 2023-08-22 13:27:17