释放内存导致数据库崩溃的案例

原创李真旭 2019-07-24

815

这是我一个运营商客户的案例。其现象大致是某天凌晨某RAC节点实例被重启了，通过如下的告警日志我们可以发现RAC集群的节点2实例被强行终止掉了，以下是详细的日志信息。

Mon Sep 28 02:00:00 2015
Errors in file /oracle/admin/xxx/bdxxx/xxx2_j000_7604.trc:
ORA-12012: error on auto execute of job 171538
ORA-06550: line ORA-06550: line 4, column 3:
PLS-00905: object XSQD.E_SP_DL_CRM_TERMINAL_MANAGER is invalid
ORA-06550: line 4, column 3:
PL/SQL: Statement ignored, column :
Mon Sep 28 02:03:18 2015
Errors in file /oracle/admin/xxx/udxxx/xxx2_ora_6810.trc:
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x679000020], [], [], [], [], [], []
.....
Trace dxxxing is performing id=[cdmp_20150928023925]
Mon Sep 28 02:39:32 2015
Errors in file /oracle/admin/xxx/bdxxx/xxx2_lmd0_24228.trc:
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x679000020], [], [], [], [], [], []
Mon Sep 28 02:39:32 2015
LMD0: terminating instance due to error 482
Mon Sep 28 02:39:33 2015
Shutting down instance (abort)
License high water mark = 145
Mon Sep 28 02:39:37 2015
Instance terminated by LMD0, pid = 24228
Mon Sep 28 02:39:38 2015

从上面的数据库告警日志来看，数据库实例2从2:03就开始报错ORA-00600 [KGHLKREM1]，一直持续到2:39，lmd0进程开始报同样的错误，紧接着LMD0进程强行把数据库实例2终止掉了。如果参照上述ORA-00600错误，直接搜索Oracle MOS，可能会搜到以下结果，Bug 14193240 : LMS SIGNALED ORA-600[KGHLKREM1] DURING BEEHIVE LOAD。但是这个Bug很容易被排除，根据系统监控就可以直接排除。在故障期间系统负载是非常低的。

这里我们需要注意，从告警日志来看，从2:03就开始报错，然而直到lmd0进程报错时，实例才被其终止掉。不难看出，数据库节点2的lmd0报错才是问题的关键。那么我们首先来分析数据库节点2的lmd0 进程的trace文件内容。

*** 2015-09-28 02:39:24.291
***** Internal heap ERROR KGHLKREM1 addr=0x679000020 ds=0x60000058 *****
***** Dxxx of memory around addr 0x679000020:
678FFF020 00000000 00000000 00000000 00000000  [................]
…..
679000030 00000000 00000000 00000000 00000000  [................]
  Repeat 254 times
Recovery state: ds=0x60000058 rtn=(nil) *rtn=(nil) szo=0 u4o=0 hdo=0 off=0
 Szo:
 UB4o:
 Hdo:
 Off:
 Hla: 0
******************************************************
HEAP Dxxx heap name="sga heap"  desc=0x60000058
 extent sz=0x47c0 alt=216 het=32767 rec=9 flg=-126 opc=4
 parent=(nil) owner=(nil) nex=(nil) xsz=0x0
 ds for latch 1: 0x60042f70 0x600447c8 0x60046020 0x60047878
……省略部分内容
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x679000020], [], [], [], [], [], []
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+31          call     ksedst1()            000000000 ? 000000001 ?
                                                   7FFF41D71F90 ? 7FFF41D71FF0 ?
                                                   7FFF41D71F30 ? 000000000 ?
ksedmp()+610         call     ksedst()             000000000 ? 000000001 ?
                                                   7FFF41D71F90 ? 7FFF41D71FF0 ?
                                                   7FFF41D71F30 ? 000000000 ?
ksfdmp()+21          call     ksedmp()             000000003 ? 000000001 ?
                                                   7FFF41D71F90 ? 7FFF41D71FF0 ?
                                                   7FFF41D71F30 ? 000000000 ?
kgerinv()+161        call     ksfdmp()             000000003 ? 000000001 ?
                                                   7FFF41D71F90 ? 7FFF41D71FF0 ?
                                                   7FFF41D71F30 ? 000000000 ?
kgesinv()+33         call     kgerinv()            0068966E0 ? 2AF92650E2C0 ?
                                                   7FFF41D71FF0 ? 7FFF41D71F30 ?
                                                   000000000 ? 000000000 ?
kgesin()+143         call     kgesinv()            0068966E0 ? 2AF92650E2C0 ?
                                                   7FFF41D71FF0 ? 7FFF41D71F30 ?
                                                   000000000 ? 000000000 ?
kghnerror()+342      call     kgesin()             0068966E0 ? 2AF92650E2C0 ?
                                                   7FFF41D71FF0 ? 7FFF41D71F30 ?
                                                   000000002 ? 679000020 ?
kghadd_reserved_ext  call     kghnerror()          0068966E0 ? 060000058 ?
ent()+1039                                         005AE1C14 ? 679000020 ?
                                                   000000002 ? 679000020 ?
kghget_reserved_ext  call     kghadd_reserved_ext  0068966E0 ? 060000058 ?
ent()+250                     ent()                060042F70 ? 060042FB8 ?
                                                   000000000 ? 000000000 ?
kghgex()+1622        call     kghget_reserved_ext  0068966E0 ? 060003B98 ?
                              ent()                060042F70 ? 000000B10 ?
                                                   000000000 ? 000000000 ?
kghfnd()+660         call     kghgex()             0068966E0 ? 060003B98 ?
                                                   060042F70 ? 000002000 ?
                                                   000000AC8 ? 06000D600 ?
kghprmalo()+274      call     kghfnd()             0068966E0 ? 060003B98 ?
                                                   060042F70 ? 000000AB8 ?
                                                   000000AB8 ? 06000D600 ?

从上面的信息来看，确实heap 存在错误的情况。根据这个错误堆栈可以在MOS上再次匹配，这时找到文档号1070812.1的文章：ORA-600 [KGHLKREM1] On Linux Using Parameter drop_cache On hugepages Configuration，此次故障跟描述基本上一致。

***** Internal heap ERROR KGHLKREM1 addr=0x679000020 ds=0x60000058 *****
***** Dxxx of memory around addr 0x679000020:
678FFF020 00000000 00000000 00000000 00000000  [................]
        Repeat 255 times
679000020 60001990 00000000 00000000 00000000  [...`............]
679000030 00000000 00000000 00000000 00000000  [................]
  Repeat 254 times

其中地址[0x679000020] 后面的内容也均为0，跟文档描述一样。其次，文章中提到使用了linux 内存释放机制以及同时启用了hugepage配置。根据文档描述，这应该是Linux Bug。通过检查对比2个节点配置，发现节点2的配置确实不同。

--节点1

[oracle@xxx-DS01 ~]$ cat /proc/sys/vm/drop_caches
0 
--节点2
[oracle@xxx-DS02 ~]$ cat /proc/sys/vm/drop_caches
3

当drop_caches 设置为3时，会触发linux的内存清理回收机制，可能出现内存错误的情况。然而我们检查配置发现并没有修改。

oracle@xxx-DS02 bdxxx]$ cat /etc/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request deBugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dxxxs will append the PID to the core filename
# Useful for deBugging multi-threaded applications
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536
# Controls the default maxmimum size of a mesage queue
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
fs.file-max = 65536
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 262144
net.core.wmem_max = 262144
vm.nr_hugepages=15800

既然没有修改配置文件，那么为什么会出现这种情况呢？

我怀疑是有人手工执行了echo 3 > /proc/sys/vm/drop_caches命令来强制释放内存导致。接下来查看到了最近几分钟的操作记录，发现了如下的蛛丝马迹。

[root@xxx-DS02 ~]# history|grep echo
   22  2015-09-29 16:12:42 root echo 3 > /proc/sys/vm/drop_caches
   71  2015-09-29 16:12:42 root echo 0 > /proc/sys/vm/drop_caches
   73  2015-09-29 16:12:42 root echo 3 > /proc/sys/vm/drop_caches
   79  2015-09-29 16:12:42 root echo 0 > /proc/sys/vm/drop_caches
  311  2015-09-29 16:12:42 root echo 3 > /proc/sys/vm/drop_caches
  329  2015-09-29 16:12:42 root echo 0 > /proc/sys/vm/drop_caches
 1001  2015-09-29 16:12:49 root history|grep echo
 1005  2015-09-29 16:14:55 root history|grep echo

很明显root 用户确实执行了内存释放的操作。然而运维人员却否认执行过类似操作，这说明事情并不是如此简单。我们进一步检查数据库操作系统日志发现如下信息。

Sep 29 00:00:12 xxx-DS02 kernel: Bug: soft lockup - CPU#1 stuck for 10s! [rel_mem.sh:13887]
Sep 29 00:00:12 xxx-DS02 kernel: CPU 1:
……
Sep 29 00:00:12 xxx-DS02 kernel: Pid: 13887, comm: rel_mem.sh Tainted: G      2.6.18-194.el5 #1
Sep 29 00:00:12 xxx-DS02 kernel: RIP: 0010:[<ffffffff800cb229>]  [<ffffffff800cb229>] __invalidate_mapping_pages+0xf3/0x183
Sep 29 00:00:12 xxx-DS02 kernel: RSP: 0018:ffff8112f5f71da8  EFLAGS: 00000207
……
Sep 29 00:00:12 xxx-DS02 kernel:
Sep 29 00:00:12 xxx-DS02 kernel: Call Trace:
Sep 29 00:00:12 xxx-DS02 kernel:  [<ffffffff800cb21a>] __invalidate_mapping_pages+0xe4/0x183
Sep 29 00:00:12 xxx-DS02 kernel:  [<ffffffff800f642e>] drop_pagecache+0xa5/0x13b
Sep 29 00:00:12 xxx-DS02 kernel:  [<ffffffff80096bc1>] do_proc_dointvec_minmax_conv+0x0/0x56
Sep 29 00:00:12 xxx-DS02 kernel:  [<ffffffff800f64de>] drop_caches_sysctl_handler+0x1a/0x2c
Sep 29 00:00:12 xxx-DS02 kernel:  [<ffffffff80096fea>] do_rw_proc+0xcb/0x126
Sep 29 00:00:12 xxx-DS02 kernel:  [<ffffffff80016a49>] vfs_write+0xce/0x174
Sep 29 00:00:12 xxx-DS02 kernel:  [<ffffffff80017316>] sys_write+0x45/0x6e
Sep 29 00:00:12 xxx-DS02 kernel:  [<ffffffff8005e28d>] tracesys+0xd5/0xe0

我们可以看出，原来是由于调用了rel_mem.sh 脚本引发了这个问题，这个调用甚至导致CPU 1 挂起了10秒。对于Oracle RAC而言，当CPU 出现挂起，那么极有可能导致LMS等进程也挂起，进而引发数据库故障。

既然是调用了shell脚本，而客户的运维人员又没有主动做过任何操作，因此我怀疑很可能是之前部署了crontab 脚本，经过检查发现确实存在相关脚本，如下所示。

[root@xxx-DS02 ~]# crontab -l
00 00 * * * /home/oracle/ht/rel_mem.sh
[root@xxx-DS02 ~]# cat /home/oracle/ht//rel_mem.sh
#!/bin/bash
#mkdir /var/log/freemem
time=`date +%Y%m%d`
used=`free -m | awk 'NR==2' | awk '{print $3}'`
free=`free -m | awk 'NR==2' | awk '{print $4}'`
echo "===========================" >> /var/log/freemem/mem$time.log
date >> /var/log/freemem/mem$time.log
echo "Memory usage | [Use:${used}MB][Free:${free}MB]" >> /var/log/freemem/mem$time.log
 
if [ $free -le 71680 ];then
                  sync && echo 3 > /proc/sys/vm/drop_caches
                  echo "OK" >> /var/log/freemem/mem$time.log
                  free >> /var/log/freemem/mem$time.log
else
                  echo "Not required" >> /var/log/freemem/mem$time.log
fi

到这里我们已经清楚了原因，到此这个案例已经告一段落了。最后我们还应该深入思考，为什么客户要部署这样一个脚本呢？只有一种解释，说明这个数据库节点之前可能面临内存使用居高不下的问题。既然如此，那么就进一步检查一下目前系统的内存使用情况。

[root@xxx-DS02 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:        128976     127808       1168          0        245      88552
-/+ buffers/cache:      39010      89966
Swap:        15999        194      15805

我们可以看到操作系统物理内存为128GB，而其中cache 内存就占据了88GB。Linux 文件系统的cache分为2种：page cache和 buffer cache。page cache是用于文件inode等操作的，而buffer cache是用于块设备的操作。上述 free -m 命令中的cached 88552 全是page cache。目前数据库实例的内存分配之后也就40GB。

SQL> show parameter sga 
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 30G
sga_target                           big integer 30G
SQL> show parameter pga 
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target                 big integer 10G

由此可见，操作系统物理内存之所以看上去那么高，并非Oralce本身所消耗，大部分为文件系统cache所消耗。而对于Linux的文件系统缓存，我们是可以通过调整操作系统内核参数来加快回收的，并不需要使用前面提到的强制清理回收内存的暴力解决方式。

根据Oracle metalink相关文档建议，通过设置如下参数来避免文本中的问题。

  sysctl -w vm.min_free_kbytes=1024000 
  sysctl -w vm.vfs_cache_pressure=200
  sysctl -w vm.swappiness=40

其中min_free_kbytes参数表示操作系统至少保留的空闲物理内存大小，单位是KB。vm.vfs_cache_pressure 参数用来控制操作系统对内存的回收，默认值为100，通过增加该值大小，可以加快系统对文件系统cache的回收。

而最后的参数vm.swappiness则是控制swap交换产生的趋势程度，默认值为100。通过将该值调得更低一些，可以降低物理内存和disk之间产生交换的概率。但是Redhat 官方白皮书中明确提出，不建议将该值设置为10或者更低的值。

实际上，通过我们的调整之后，据后续观察，该系统至今未再出现实例宕机的情况。这个案例告诉我们，在一个复杂的系统中，通过单一的手段进行粗暴的问题处理是危险的。正确的途径应该是，找出问题的根源，从源头对症下药，如此才能高枕无忧。

oracle

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

释放内存导致数据库崩溃的案例

评论