【经典案例】备份失败RMAN-06169及ORA-30013错误处理

欢迎关注“数据库运维之道”公众号，一起学习数据库技术! 本期将为大家分享“备份失败RMAN-06169及ORA-30013错误处理”的案例。

关键词：RMAN-06169、ORA-30013、ORA-01157

客户巡检发现数据库备份日志出现RMAN-06169报错，即：could not read file header for datafile 65 error reason 4。根据数据文件编号进一步定位是UNDOTBS2表空间下的一个数据文件存放目录不规范。数据库备份脚本配置在节点1，数据文件存放到节点2的磁盘上，导致节点1上的数据库实例无法读到数据文件触发这个错误。

DBA看到6169错误信息后，基本可以确认是数据文件头访问有异常。根据reason code信息可以初步判断大致原因，reason 4表示dbwr进程无法找到数据文件。

6169, 3, "could not read file header for datafile %(1)s error reason %(2)s"
// *Cause: The specified data file could not be accessed.  The reason codes are:
//         1  - file name is MISSINGxx in the control file
//         2  - file is offline
//         3  - file is not verified
//         4  - DBWR could not find the file
//         5  - unable to open file
//         6  - I/O error during read
//         7  - file header is corrupt
//         8  - file is not a data file
//         9  - file does not belong to this database
//         10 - file number is incorrect
//         12 - wrong file version
//         15 - control file is not current

接着要核实下数据文件65是归属于哪个表空间、数据库运行状态和备份日志。

set linesize 1000
col name for a70
select TS#,file#,name,status from v$datafile order by 1;


col tablespace_name for a20
col file_name for a60
select tablespace_name,file_id,file_name,status,online_status 
from dba_data_files order by 1;

根据上述的查询结果，可以确认UNDOTBS2表空间增加的数据文件存放目录不规范，即数据文件未存放到ASM共享磁盘组。然后进一步确认两个实例对应的默认UNDO表空间信息。

set linesize 1000
col name for a20
col value for a20
select name,value from v$parameter
where name in ('undo_management','undo_tablespace');
NAME             VALUE
-------------------- --------------------
undo_management      AUTO
undo_tablespace      UNDOTBS2


set linesize 1000
col tablespace_name for a20
select TABLESPACE_NAME, CONTENTS,
            EXTENT_MANAGEMENT, ALLOCATION_TYPE,
            SEGMENT_SPACE_MANAGEMENT
from dba_tablespaces where contents='UNDO';


TABLESPACE_NAME      CONTENTS  EXTENT_MAN ALLOCATIO SEGMEN
-------------------- --------- ---------- --------- ------
UNDOTBS1         UNDO      LOCAL      SYSTEM    MANUAL
UNDOTBS2         UNDO      LOCAL      SYSTEM    MANUAL

根据以上的初步排查，在不重启数据库的情况下可采取两种处置方案：1、新建UNDO表空间，删除旧的UNDO表空间；2、将UNDO数据文件调整到ASM磁盘组；

解决方案一：新建UNDO表空间，删除旧的UNDO表空间

1、查看undo表空间默认参数值，查看数据文件目录信息。undo表空间是用来存储数据被改之前的前镜像。

show parameter undo_tablespace
NAME                     TYPE     VALUE
------------------------------------ ----------- ------------------------------
undo_tablespace              string     UNDOTBS2


col file_name for a60
set linesize 100
select tablespace_name,file_name from dba_data_files where tablespace_name='UNDOTBS2';

2、创建新的undo表空间，并设置自动扩展参数，调整undo_tablespace参数

CREATE SMALLFILE UNDO TABLESPACE "UNDOTBS02" DATAFILE '+DATA' SIZE 200M AUTOEXTEND ON;
alter system set undo_tablespace='UNDOTBS02' SCOPE=both sid='yzdw2';

3、检查待删除的回滚段有没有正在执行的事务，如果没有会相对简单，可以直接删除掉该回滚段即可，并且没有数据丢失。

--检查回滚段状态
select tablespace_name,segment_id,segment_name,file_id,block_id,status,instance_num 
from dba_rollback_segs where tablespace_name='UNDOTBS2' and status = 'ONLINE'; 


TABLESPACE_NAME            SEGMENT_ID SEGMENT_NAME                FILE_ID   BLOCK_ID STATUS        INSTANCE_NUM
------------------------------ ---------- ------------------------------ ---------- ---------- ---------------- ----------------------------------------
UNDOTBS2                   12 _SYSSMU12_3316011814$           6       144 ONLINE
UNDOTBS2                   14 _SYSSMU14_3105766383$           6       176 ONLINE


--调整undo_tablespace参数后，活动会话会切换到新的undo表空间。
--alert日志文件会出现相关的提示信息
[6114] Successfully onlined Undo Tablespace 2.
[6114] **** active transactions found in undo Tablespace 5 - moved to Pending Switch-Out state.
[6114] active transactions found/affinity dissolution incompletein undo tablespace 5 during switch-out.

4、进一步定位具体的会话信息

select s.SID, s.serial#,s.machine,s.program,s.sql_id,s.status,substr(s.username, 1, 10) username, g.tablespace_name,g.segment_name 
from v$transaction, dba_rollback_segs g, v$session s where saddr = ses_addr and xidusn = segment_id;

5、跟客户确认待清理的会话信息。

alter system kill session '829,31003' immediate;

6、删除原有的UNDO表空间。

drop tablespace UNDOTBS2 including contents and datafiles;

正常情况下，上述步骤是可以解决问题的。可是实际情况比这个复杂，删除undo表空间出现ORA-30013错误，即后台进程还在使用这个表空间。

--删除命令
SQL> drop tablespace UNDOTBS2 including contents;
drop tablespace UNDOTBS2 including contents
*
ERROR at line 1:
ORA-30013: undo tablespace 'UNDOTBS2' is currently in use


--alert日志信息
Sat Mar 30 23:30:08 2024
drop tablespace UNDOTBS2 including contents
ORA-30013 signalled during: drop tablespace UNDOTBS2 including contents...
Sat Mar 30 23:30:15 2024
[26385] **** active transactions found in undo Tablespace 5 - moved to Pending Switch-Out state.
Sat Mar 30 23:32:50 2024
[26385] **** active transactions found in undo Tablespace 5 - moved to Pending Switch-Out state.

查看数据库正在运行的事务，包括对应的会话信息。

--活动事务信息select ADDR, XIDUSN,STATUS, START_TIME from v$transaction;--PENDING OFFLINE状态的会话信息set linesize 1000col name for a30col username for a15col machine for a20col osuser for a10SELECT   a.usn,   a.name,   b.status,   c.tablespace_name,   d.addr,   e.sid,   e.serial#,   e.username,   e.program,   e.machine,   e.osuserFROM   v$rollname a,   v$rollstat b,   dba_rollback_segs c,   v$transaction d,   v$session eWHERE   a.usn=b.usn AND   a.name=c.segment_name AND   a.usn=d.xidusn AND   d.addr=e.taddr AND   b.status='PENDING OFFLINE';--基于上面查出来的sid，serial#信息，清理掉相关的会话ALTER SYSTEM KILL SESSION '<sid#>, <serial#>' immediate;--最后再执行表空间删除命令(如果这些步骤还不能解决问题，可以查看推荐二的解决方案)drop tablespace UNDOTBS2 including contents and datafiles;

解决方案二：使用ASMCMD CP 将UNDO数据文件调整到ASM磁盘组

使用ASMCMDCP命令将数据文件从文件系统迁移到ASM

1、为了保证数据的一致性，先将数据文件设置为OFFLINIE状态。

--切换日志文件
SQL> alter system switch logfile;


--查看数据文件信息
select tablespace_name,file_name,file_id,bytes/(1024*1024*1024)g,autoextensible 
from dba_data_files where tablespace_name like '%UNDO%';  


--将数据文件离线
SQL> alter database datafile 6 offline;

2、拷贝数据文件到ASM共享磁盘

--查看数据文件信息是否处于RECOVER状态。
SQL> select file_name, file_id,online_status from dba_data_files where file_id=6;


--ASMCMD复制文件，使用ASMCMD将数据文件从文件系统复制到磁盘组。
ASMCMD> cp/u01/app/oracle/product/12.1.0/dbhome_1/dbs/SYSAUX02.DBF +YWZD/YWZD/SYSAUX02.dbf
Copying/u01/app/oracle/product/12.1.0/dbhome_1/dbs/SYSAUX02.DBF->+YWZD/YWZD/SYSAUX02.dbf
ASMCMD> ls –lt
ASMCMD> pwd

3、修改控制文件并恢复数据文件

--rename数据文件；数据文件复制之后，通过rename进行数据文件重命名
SQL> alter database rename file '/u01/app/oracle/product/12.1.0/dbhome_1/dbs/SYSAUX02.DBF'to '+YWZD/YWZD/SYSAUX02.dbf';


--恢复数据文件：对数据文件进行recover恢复
12c版本
SQL> alter database recover datafile 6;
11g版本
RMAN>  recover datafile 6;


--将数据文件设置为在线
SQL> alter database datafile 6 online;


--验证数据文件路径
SQL> select file_name, file_id,online_status from dba_data_files where file_id=6;
FILE_NAME           FILE_ID   ONLINE_STATUS
----------------          -------     ------------------
+YWZD/YWZD/SYSAUX02.dbf    6       ONLINE

推荐阅读一：12c在线迁移数据文件

1、若集群数据库的数据文件没有放到ASM共享磁盘组，查询dba_data_files视图可能会遇到ORA-01157错误。

--在集群环境下，多节点实例共享数据文件。如果数据文件存放在文件系统，会导致其中一个节点无法访问数据文件。--假如：数据文件创建在节点1的文件系统，节点2访问数据文件报错。SQL> select file_name from dba_data_files;
select file_name from dba_data_files
                      *
ERROR at line 1:
ORA-01157: cannot identify/lock data file 6- see DBWR trace file
ORA-01110: data file 6:
'/u01/app/oracle/product/12.1.0/dbhome_1/dbs/D:appOracleoradatahbgmsSYSAUX02.DBF
'


--官方建议：$oerr ora 01157
01157, 00000, "cannot identify/lock data file %s - see DBWR trace file"
// *Cause: The background process was either unable to find one of the data
//        files or failed to lock it because the file was already in use.
//        The database will prohibit access to this file but other files will
//        be unaffected. However the first instance to open the database will
//        need to access all online data files. Accompanying error from the
//        operating system describes why the file could not be identified.
// *Action: Have operating system make file available to database. Then either
//        open the database or do ALTER SYSTEM CHECK DATAFILES.

2、12c数据库支持在线数据文件迁移，不需要对数据文件进行离线处理。

--查看当前数据文件路径和状态
col file_name for a70
set linesize 200
select file_id,file_name,status from dba_data_files 
order by file_id;
--将数据文件调整至指定的路径目录，然后运行上面的SQL验证文件路径与状态
alter database move datafile '/u01/app/oracle/oradata/ywzd/system01.dbf' 
to '/u01/app/oracle/oradata/system01.dbf';

推荐阅读二：分布式事务排查与处理

1、如果上述的解决方案一无法释放UNDO表空间里的活动事务，那么数据库可能存在分布式事务。

--问题依旧存在
SQL> drop tablespace undotbs2 including contents and datafiles;
drop tablespace undotbs2 including contents and datafiles
*
ERROR at line 1:
ORA-30013: undo tablespace 'UNDOTBS2' is currently in use


--进一步收集分布式事务信息
Select segment_name,status,tablespace_name from dba_rollback_segs where status not in ('ONLINE','OFFLINE') ;
select KTUXEUSN, KTUXESLT, KTUXESQN, KTUXESTA, KTUXECFL, KTUXESIZ from x$ktuxe where KTUXESTA='ACTIVE' and KTUXECFL='DEAD';
SELECT LOCAL_TRAN_ID,GLOBAL_TRAN_ID,STATE,MIXED,COMMIT# FROM DBA_2PC_PENDING;
--local_tran_id is a trio combination of USN , Slot , Seq
SELECT LOCAL_TRAN_ID,IN_OUT,DATABASE,INTERFACE FROM dba_2pc_neighbors;

2、通过以下脚本清理掉分布式事务会话信息。

declare
v_txn VARCHAR2(22);
CURSOR trans_cursor is
SELECT LOCAL_TRAN_ID FROM DBA_2PC_PENDING WHERE STATE='forced rollback';
begin
open trans_cursor ;
LOOP
fetch trans_cursor into v_txn;
EXIT WHEN trans_cursor%NOTFOUND ;
Commit;
DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY(v_txn) ;
END LOOP;
COMMIT;
end;
/

推荐阅读三：X$KTUXE回滚段事务表

1、X$KTUXE是数据库内部表，可以通过desc x$ktuxe查看结构信息。KTUXE代表的就是：[K]ernel [T]ransaction [U]ndo Transa[x]tion [E]ntry (table)。在这个表中，数据库展示了回滚段头的事务表信息。通过这个结构体可以获得数据库事务相关的重要信息。X$KTUXE表的另外一个重要功能是可以获得无法通过v$transaction来观察的死事务信息。当数据库发生异常中断，或者进行延迟事务恢复时，数据库启动后，无法通过V$TRANSACTION来观察事务信息，但是X$KTUXE可以帮助我们获得这些信息。

--通过V$TYPE_SIZE视图可以找到Oracle的自解释信息
select * from v$type_size where component='KTU';


COMPONEN TYPE      DESCRIPTION                TYPE_SIZE
-------- -------- -------------------------------- ----------
KTU     KTUBH      UNDO HEADER                   16
KTU     KTUXE      UNDO TRANSACTION ENTRY           40
KTU     KTUXC      UNDO TRANSACTION CONTROL          104


--表中的KTUXECFL代表了事务的Flag标记，通过这个标记可以找到那些Dead事务
select distinct KTUXECFL,count(*) from x$ktuxe group by KTUXECFL;


KTUXECFL           COUNT(*)
------------------------ ----------
NONE                778


--KTUXESIZ用来记录事务使用的回滚段块数，可以通过观察这个字段来评估恢复进度：
select ADDR,KTUXEUSN,KTUXESLT,KTUXESQN,KTUXESIZ from x$ktuxe 
where  KTUXECFL ='DEAD';

2、错误信息为大量的DML不仅会产生大量的REDO记录还会导致UNDO空间增长，进而占用大量的磁盘空间。11g恢复数据文件需要使用RMAN工具，否则会出现以下错误。

SQL > alter database recover datafile 65;
alter database recover datafile 65
*ERROR at line 1:
ORA-00279: change 12597056539218 generated at 03/30/2024 21:09:53 needed for thread 2
ORA-00289: suggestion : +FRA/urpdb/archivelog/2024_03_30/thread_2_seq_146434.1195.1165007427
ORA-00280: change 12597056539218 for thread 2 is in sequence #146434

RMAN-06169: could not read file header for datafile XX error reason 5 (Doc ID 2868000.1)
Backup Fails RMAN-06169: could not read file header for datafile % error reason % (Doc ID 2991317.1)
Troubleshooting ORA-30013 Error (Doc ID 1578717.1)
Unable to Drop Undo Tablespace ORA-30013 (Doc ID 835944.1)
Undo Tablespace Moved To Pending Switch-Out State (Doc ID 341372.1)
How to Shrink the datafile of Undo Tablespace (Doc ID 268870.1)
Managing Rollback/Undo Segments in AUM (Automatic Undo Management) (Doc ID 135090.1)
Unable to Drop Undo Tablespace ORA-30013 (Doc ID 835944.1)
How to clear huge number of In-doubt transactions in PREPARED state (Doc ID 2489254.1)
https://blog.csdn.net/Hehuyi_In/article/details/107866180

以上就是本期关于“备份失败RMAN-06169及ORA-30013错误处理”的处置案例。希望能给大家带来帮助。

欢迎关注“数据库运维之道”公众号，一起学习数据库技术!

【经典案例】备份失败RMAN-06169及ORA-30013错误处理

评论