这是一则来自客户现场的真实案例,客户现场人员在通过PL/SQL Developer客户端工具创建索引时出现ORA-04030错误,导致索引创建失败。但是通过SQL*Plus重新执行创建索引语句却能够成功。为什么会出现这种错误和执行差异呢?请跟随我们的思路来一步步地了解这个案例的解析过程。
11.1 聚集数据的信息采集和分析
首先,了解ORA-04030错误的原因。ORA-04030错误引起的原因大概有以下几种情况。
(1)是否有足够的可用内存?
查看操作系统内存使用情况。
+-topas_nmon--r=Resources--------Host=gisdata2-------Refresh=2 secs---10:03.50------------+
| Memory ---------------------------------------------------------------------------|
| Physical PageSpace | pages/sec In Out | FileSystemCache |
|% Used 87.7% 2.6% | to Paging Space 0.5 0.0 | (numperm) 4.6% |
**|% Free 12.3% 97.4%** | to File System 0.0 0.0 | Process 66.1% |
|MB Used 55450.2MB 523.7MB | Page Scans 0.0 | System 17.0% |
**|MB Free 7781.6MB 19956.3MB** | Page Cycles 0.0 | Free 12.3% |
|Total(MB) 63231.9MB 20480.0MB | Page Steals 0.0 | ------ |
| | Page Faults 2110.3 | Total 100.0% |
|--------------------------------------------------------- | numclient 4.6% |
|Min/Maxperm 3065MB( 5%) 9195MB( 15%) <--% of RAM | maxclient 15.0% |
|Min/Maxfree 3000 4000 Total Virtual 81.7GB | User 67.7% |
|Min/Maxpgahead 2 8 Accessed Virtual 50.8GB 62.2%| Pinned 19.8% |
| | lruable pages 15692064.0|
|-----------------------------------------------------------------------------------|
查看当时出故障的服务器资源,可以看到剩余内存还有7.7GB左右,说明在操作系统层面还有足够多的可用内存。
(2)是否设置了Oracle的限制?
查看数据库中与PGA相关的设置。
SQL> show parameter PGA
NAME TYPE VALUE
------------------------------------ ----------------- ------------------------------
**pga_aggregate_target big integer 8400M**
SQL> select * from v$pgastat;
NAME VALUE UNIT
------------------------------------------------------------- ---------- ------------
aggregate PGA target parameter 8808038400 bytes
aggregate PGA auto target 7288095744 bytes
global memory bound 880803840 bytes
total PGA inuse 720201728 bytes
total PGA allocated 985562112 bytes
**maximum PGA allocated 2692780032 bytes ---**实际分配最大内存 **2568.03516MB**
total freeable PGA memory 163905536 bytes
process count 414
max processes count 529
PGA memory freed back to OS 2.4854E+11 bytes
total PGA used for auto workareas 10177536 bytes
maximum PGA used for auto workareas 642154496 bytes
total PGA used for manual workareas 0 bytes
maximum PGA used for manual workareas 537600 bytes
over allocation count 0
bytes processed 5.7321E+11 bytes
extra bytes read/written 2125827072 bytes
cache hit percentage 99.63 percent
recompute count (total) 61401
SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
2 FROM SYS.x$ksppi x, SYS.x$ksppcv y
3 WHERE x.indx = y.indx
4 AND x.ksppinm LIKE '%_pga_max_size%';
NAME VALUE DESCRIB
------------------------------ -----------------------------------------------------
**_pga_max_size 1761607680 Maximum size of the PGA memory for one process**
SQL> SELECT x.ksppinm NAME, y.ksppstvl VALUE, x.ksppdesc describ
2 FROM SYS.x$ksppi x, SYS.x$ksppcv y
3 WHERE x.indx = y.indx
4 AND x.ksppinm LIKE '%_smm_max_size%';
NAME VALUE DESCRIB
------------------------------ ------ ------------------------------------------------
**_smm_max_size 860160 maximum work area size in auto mode (serial)**
从上面的内容可以看到PGA设置的大小为8400M,根据单个会话使用PGA的期望尺寸(也可以认为是实际分配的最大尺寸)计算公式是:min(5%pga_aggregate_target,50%_pga_max_size, _smm_max_size),可以简单计算下min(5%*8400M,50%1680M,840M)=420M(其中_pga_max_size的单位为bytes,_smm_max_size的单位为kb),即单个会话能使用PGA的期望尺寸为420M,那报错的会话是否超过了该限制呢?
查看ORA-04030报错的trace文件如下所示。
TOP 10 MEMORY USES FOR THIS PROCESS
\---------------------------------------
98% 109 MB, 1797 chunks: "permanent memory " SQL
sort subheap ds=1108b67f0 dsprt=11048a508
1% 1493 KB, 1779 chunks: "free memory "
session heap ds=11047a818 dsprt=11019eae0
0% 193 KB, 26 chunks: "permanent memory "
pga heap ds=110072eb0 dsprt=0
0% 82 KB, 6 chunks: "frame segment " SQL
kxs-heap-f ds=11048a8c8 dsprt=11047a818
0% 45 KB, 4 chunks: "permanent memory "
session heap ds=11047a818 dsprt=11019eae0
0% 35 KB, 11 chunks: "kzctxhugi2 "
session heap ds=11047a818 dsprt=11019eae0
0% 33 KB, 1 chunk : "free memory "
top call heap ds=11019e8c0 dsprt=0
0% 24 KB, 11 chunks: "koh-kghu session heap "
session heap ds=11047a818 dsprt=11019eae0
0% 22 KB, 1 chunk : "Fixed Uga "
pga heap ds=110072eb0 dsprt=0
0% 20 KB, 5 chunks: "kxsFrame4kPage "
session heap ds=11047a818 dsprt=11019eae0
=======================================
PRIVATE MEMORY SUMMARY FOR THIS PROCESS
\---------------------------------------
******************************************************
**PRIVATE HEAP SUMMARY DUMP**
**111 MB total:**
111 MB commented, 194 KB permanent
45 KB free (0 KB in empty extents),
111 MB, 1 heap: "session heap "
\------------------------------------------------------
Summary of subheaps at depth 1
111 MB total:
109 MB commented, 55 KB permanent
1513 KB free (1 KB in empty extents),
2 KB uncommented freeable with mark, 110 MB, 1 heap: "kxs-heap-w " 1490 KB free held
\------------------------------------------------------
Summary of subheaps at depth 2
109 MB total:
109 MB commented, 10 KB permanent
14 KB free (4 KB in empty extents),
0 KB uncommented freeable with mark, 109 MB, 1 heap: "sort subheap "
\------------------------------------------------------
Summary of subheaps at depth 3
109 MB total:
0 KB commented, 109 MB permanent
2 KB free (0 KB in empty extents),
==========================================
INSTANCE-WIDE PRIVATE MEMORY USAGE SUMMARY
\-------------------------
Top 10 processes:
\-------------------------
(percentage is of 693 MB total allocated memory)
**16% pid 168: 111 MB used of 111 MB allocated <= CURRENT PROC**
6% pid 3: 43 MB used of 43 MB allocated
2% pid 24: 402 KB used of 12 MB allocated (12 MB freeable)
2% pid 33: 532 KB used of 12 MB allocated (11 MB freeable)
2% pid 7: 11 MB used of 11 MB allocated
2% pid 8: 11 MB used of 11 MB allocated
2% pid 9: 11 MB used of 11 MB allocated
2% pid 10: 11 MB used of 11 MB allocated
2% pid 11: 11 MB used of 11 MB allocated
2% pid 12: 11 MB used of 11 MB allocated
从以上的trace文件中可以看到,报错的进程实际分配的进程只有111MB,远远未达到420MB,说明并非是由于Oracle自身的限制引起的ORA-04030报错。
3)哪个进程需要的内存过多?
上一个是否是由于Oracle自身限制引起的解释中,已经可以从trace文件中看到,消耗最多内存的进程就是报ORA-04030的进程,消耗的内存为110MB,并未发现其他更消耗内存的进程。
4)是否设置了操作系统限制?
查看操作系统限制,Oracle用户的限制。
$ ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) 4194304
memory(kbytes) unlimited
coredump(blocks) unlimited
nofiles(descriptors) unlimited
threads(per process) unlimited
processes(per user) unlimited
root用户的限制。
gisdata2@root[/]ulimit -a
time(seconds) unlimited
file(blocks) unlimited
**data(kbytes) 131072**
stack(kbytes) 32768
memory(kbytes) unlimited
coredump(blocks) unlimited
nofiles(descriptors) unlimited
threads(per process) unlimited
processes(per user) unlimited
从上面root和oracle的limit限制来看,root用户的data(kbytes)的限定值得关注,该属性的意义是soft data segment size in blocks(进程数据段大小限制)。
11.2 聚焦疑点的跟踪测试与验证
现在做一个反向测试:利用SQLPlus工具创建索引,创建成功。问题来了,为何SQLPlus会成功,PL/SQL Developer却会失败?通过PL/SQL Developer工具创建索引时报ORA-04030错误,但是通过SQLPus创建却能成功,两者除了使用的工具不同,还有就是连接的方式不同(PL/SQL Developer是通过监听程序建立的进程连接,而SQLPlus是在数据库服务器上直接创建创建的连接,绕过了监听程序建立的进程)。从连接工具和方式的不同得到了不一样的结果,如何来验证到底是连接工具的问题还是连接方式的引起的报错?由于PL/SQL Developer只能通过监听的方式进行连接,但是SQLPlus可以通过监听或则直接连接两种方式进行,所以先对连接方式进行测试。通过SQLPlus以tnsnames.ora标签名的方式通过监听进行连接,并执行创建索引报错的语句,发现错误依然存在,但是如果不通过监听而直接连接是不会报错的,说明跟是否通过监听进行连接有很大的关系。
为何会受监听的影响?
在Oracle RAC环境中,由于crs的启停是通过root用户进行的。
gisdata2@root[/]ps -ef|grep init
root 1 0 0 Aug 31 - 18:24 /etc/init
root 5046686 1 0 Jul 18 - 0:00 /bin/sh /etc/init.crsd run
root 5702082 27066830 0 Jul 18 - 0:00 /bin/sh /etc/init.cssd oclsomon
root 6554098 27066830 0 Jul 18 - 0:00 /bin/sh /etc/init.cssd daemon
root 24969600 1 0 Jul 18 - 0:00 /bin/sh /etc/init.evmd run
root 27066830 1 1 Jul 18 - 8:07 /bin/sh /etc/init.cssd fatal
root 40108312 18940178 0 10:57:45 pts/1 0:00 grep init
所以在crs会继承root用户的limit属性,当通过crs或则srvctl命令启动监听时,也会继承root用户相应的limit属性,即data(kbytes)为131072。如何验证该推断?
现在通过监听的形式进行连接。
**$ SQLplus sys/password@gissc2 as sysdba**
SQL*Plus: Release 10.2.0.5.0 - Production on Mon Jul 20 17:20:02 2015
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
SQL> select * from v$mystat where rownum<2;
SID STATISTIC# VALUE
---------- ---------- ----------
1427 0 1
SQL> select spid from v$process where addr =(select paddr from v$session where sid=1427);
SPID
\------------
**10158446 -----**该连接的操作系统进程号
通过dbx工具查看该进程的limit信息:
**$ dbx -a 10158446**
Waiting to attach to process 10158446 ...
Successfully attached to oracle.
warning: Directory containing oracle could not be determined.
Apply 'use' command to initialize source path.
Type 'help' for help.
reading symbolic information ...
stopped in read at 0x90000000002e294 ($t1)
0x90000000002e294 (read+0x274) e8410028 ld r2,0x28(r1)
(dbx) proc rlimit
rlimit name: rlimit_cur rlimit_max (units)
RLIMIT_CPU: (unlimited) (unlimited) sec
RLIMIT_FSIZE: (unlimited) (unlimited) bytes
**RLIMIT_DATA: 134217728 (unlimited) bytes**
RLIMIT_STACK: 33554432 4294967296 bytes
RLIMIT_CORE: (unlimited) (unlimited) bytes
RLIMIT_RSS: (unlimited) (unlimited) bytes
RLIMIT_AS: (unlimited) (unlimited) bytes
RLIMIT_NOFILE: 65534 65536 descriptors
RLIMIT_THREADS: (unlimited) (unlimited) per process
RLIMIT_NPROC: (unlimited) (unlimited) per user
(dbx) quit
从上面的内容可以看到data属性的limit值为134217728 bytes即131072 kbytes与root的data(kbytes) 131072值完全吻合(stack的33554432 bytes即32768也与root的stack(kbytes) 32768一致),说明是通过监听建立连接进程的limit继承于root用户。
不通过监听进行连接:
$ SQLplus / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Production on Mon Jul 20 17:16:09 2015
Copyright © 1982, 2010, Oracle. All Rights Reserved.
SQL> select * from v$mystat where rownum<2;
SID STATISTIC# VALUE
1375 0 1
SQL> select spid from vprocesswhereaddr=(selectpaddrfromvsession where sid=1375);
SPID
------------
47710432
跟踪这个服务器进程。
**$ dbx -a 47710432**
Waiting to attach to process 47710432 ...
Successfully attached to oracle.
warning: Directory containing oracle could not be determined.
Apply 'use' command to initialize source path.
Type 'help' for help.
reading symbolic information ...
stopped in read at 0x90000000002e294 ($t1)
0x90000000002e294 (read+0x274) e8410028 ld r2,0x28(r1)
(dbx) proc rlimit
rlimit name: rlimit_cur rlimit_max (units)
RLIMIT_CPU: (unlimited) (unlimited) sec
RLIMIT_FSIZE: (unlimited) (unlimited) bytes
**RLIMIT_DATA: (unlimited) (unlimited) bytes**
RLIMIT_STACK: 33554432 4294967296 bytes
RLIMIT_CORE: (unlimited) (unlimited) bytes
RLIMIT_RSS: (unlimited) (unlimited) bytes
RLIMIT_AS: (unlimited) (unlimited) bytes
RLIMIT_NOFILE: 65534 (unlimited) descriptors
RLIMIT_THREADS: (unlimited) (unlimited) per process
RLIMIT_NPROC: (unlimited) (unlimited) per user
(dbx) quit
$
从上面的内容可以看到,如果不通过监听连接数据库创建的进程,它的data限制为unlimited的即无限制。
最后查看crs中监听的启动日志:(/u01/oracle/product/10.2.0/db_1/log/gisdata2/racg中的日志文件ora.gisdata2.LISTENER_GISDATA2.lsnr.log)。
2015-07-18 14:16:54.676: [ RACG][1] [28508196][1][ora.gisdata2.LISTENER_GISDATA2.lsnr]: The command completed successfully
**2015-07-18 14:47:09.078:** [ RACG][1] [29950132][1][ora.gisdata2.LISTENER_GISDATA2.lsnr]:
LSNRCTL for IBM/AIX RISC System/6000: Version 10.2.0.5.0 - Production on 18-JUL-2015 14:47:08
Copyright (c) 1991, 2010, Oracle. All rights reserved.
Starting /u01/oracle/product/10.2.0/db_1/bin/tnslsnr: please wait...
2015-07-18 14:47:09.079: [ RACG][1] [29950132][1][ora.gisdata2.LISTENER_GISDATA2.lsnr]: TNSLSNR for IBM/AIX RISC System/6000: Version 10.2.0.5.0 - Production
System parameter file is /u01/oracle/product/10.2.0/db_1/network/admin/listener.ora
Log messages written to /u01/oracle/product/10.2.0/db_1/network/log/listener_gisdata2.log
2015-07-18 14:47:09.079: [ RACG][1] [29950132][1][ora.gisdata2.LISTENER_GISDATA2.lsnr]: Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.176.30.81)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.176.30.79)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC)))
从以上的内容可以发现监听是由crs(或则是srvctl命令)启动的。
以及监听的运行时间:
$ lsnrctl status
LSNRCTL for IBM/AIX RISC System/6000: Version 10.2.0.5.0 - Production on 20-JUL-2015 16:31:03
Copyright (c) 1991, 2010, Oracle. All rights reserved.
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
\------------------------
Alias LISTENER_GISDATA2
Version TNSLSNR for IBM/AIX RISC System/6000: Version 10.2.0.5.0 - Production
**Start Date 18-JUL-2015 14:56:39**
**Uptime 2 days 1 hr. 34 min. 24 sec**
Trace Level off
Security ON: Local OS Authentication
SNMP ON
Listener Parameter File /u01/oracle/product/10.2.0/db_1/network/admin/listener.ora
Listener Log File /u01/oracle/product/10.2.0/db_1/network/log/listener_gisdata2.log
监听启动的时间也与日志中的时间对应。
11.3 解析原理的问题总结与建议
那为什么CRS会使用ROOT用户的limit限制呢?
首先,得明白在Linux中,每触发任何一个事件时,系统都会为它定义一个进程,并且给予一个ID,即PID,同时会根据触发这个进程的用户与相关属性关系,给这个PID设置有效的权限,如图11-1所示。

图11-1
从上面可以知道,系统启动以后,CRS会自动启动,启动主要由/etc/init.d的几个脚本完成。而这些脚本的执行用户和用户组是root,也就是说当CRS启动时,Linux系统会根据UID/GID来判断资源属性和环境变量,那么这个进程所衍生的进程,也会沿用这些属性。
由此可以得到结论,由于监听是通过crs进行的启动,继承了root用户的limit限制,每个会话所能持有的内存大小最大不能超过128M,当通过监听进行数据库连接时,由监听创建的用户进程也将继承该limit限制,所以当通过P/LSQL Developer连接数据库(包括SQL*Plus等工具需要通过监听建立用户进程的情况),在创建索引过程中,当所请求的内存达到或非常接近该限制时,就会由于无法进一步申请更多的内存资源,抛出ORA-04030错误。
提示,如果是在Linux系统中,可以通过cat /proc/PID/limits命令进行查看单个进程的limit属性值,其中PID为要查看进程的进程号。
以ORA-04030案例进行延伸扩展,为避免该错误可参考以下建议。
(1)在安装Oracle软件创建数据库之前应该对主机层面的内核参数、limit限制等进行规范的修改,以避免类似问题的发生。
(2)配置合理的内存,例如物理内存和交换空间。
(3)使用自动PGA内存管理可降低ORA-04030错误的概率。




