Monitoring Tools: Oracle 12c Cluster Health Monitor Explained

Dai Mingming 2016-06-24

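Dai Mingming (Dave)

Oracle ACE-A, core member of ACOUG, database solutions architect at 宝存科技 (Shannon Systems)

Dave is also a CSDN certified expert with more than 7 years of DBA experience. He specializes in Oracle database diagnostics and performance tuning, and is keen on researching and sharing Oracle technology. Since 2014 he has been working on high-availability, high-performance database solutions based on PCIe flash cards.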

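Editor's note: Cluster Health Monitor (CHM) collects operating system statistics through OS APIs, such as memory, swap space usage, processes, I/O usage, and network data. CHM collects this information in real time and stores it in the CHM repository.

As covered in the previous article, the important information stored in Oracle's GIMR repository includes the contents of Cluster Health Monitor (CHM/OS, ora.crf). This section focuses on CHM.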

Previous article for reference:

12c Features Explained: A First Look at the New RAC MGMTDB Repository

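CHM Overview

Cluster Health Monitor collects operating system statistics through OS APIs, such as memory, swap space usage, processes, I/O usage, and network data.

CHM collects data in real time: once per second before 11.2.0.3, and once every 5 seconds from 11.2.0.3 onward. The data is stored in the CHM repository. The collection interval cannot be changed manually.

The purpose of CHM is to provide a basis for analysis when problems occur, such as node reboots, hangs, instance evictions, and performance degradation. All of these can be investigated by analyzing the data CHM has collected.

Monitoring these metrics also makes it possible to know the system's running state ahead of time and to spot abnormal resource usage.

In fact, Oracle integrated CHM into GI as of GI 11.2.0.2, so on the Linux and Solaris platforms CHM does not need to be installed separately in 11.2.0.2.

AIX and Windows were integrated in 11.2.0.3. Versions before 11.2.0.2 that need CHM must download it from OTN and install it manually, and there was no Windows version before 11.2.0.2.

In summary: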

11.2.0.1 and earlier: Linux only (download from OTN)
11.2.0.2: Solaris (Sparc 64 and x86-64 only), and Linux.
11.2.0.3: AIX, Solaris (Sparc 64 and x86-64 only), Linux, and Windows.

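Note that CHM does not support any Itanium platform.

Also note that the CHM downloaded from OTN can only be installed on a single instance, and the OTN download is available only for Linux and Windows. For versions after 11.2, CHM can only run in a GI (RAC) environment.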

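In earlier versions, when a system problem such as a node reboot occurred, we would deploy OSWatcher (OSW) to collect the relevant information. Since CHM and OSW are similar tools, a comparison is in order: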

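(1) CHM calls OS APIs directly to reduce overhead, whereas OSWatcher invokes OS commands. CHM consumes less than 5% of one CPU core, so its impact is negligible.
(2) CHM collects more frequently than OSW: once per second (once every 5 seconds from 11.2.0.3 onward).
(3) Unlike OSW, CHM does not collect top, traceroute, or netstat output.
(4) OSW runs at user priority, so it may be unable to run when CPU load is very high. In other words, CHM can capture data that OSW cannot.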

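So if one tool alone cannot pinpoint the problem, use both. If only one can be chosen, choose CHM.

CHM Basic Management

Every object managed by Oracle GI has a resource name, and CHM is no exception: its resource name is ora.crf. It can be viewed with the following command: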

[root@rac1 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target State       Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
     1        ONLINE ONLINE      rac1                     Started,STABLE
ora.cluster_interconnect.haip
     1        ONLINE ONLINE      rac1                     STABLE
ora.crf
      1       ONLINE ONLINE       rac1                     STABLE
ora.crsd
     1        ONLINE ONLINE      rac1                     STABLE
……

[root@rac2 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target State       Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
     1        ONLINE ONLINE      rac2                     Started,STABLE
ora.cluster_interconnect.haip
     1        ONLINE ONLINE      rac2                     STABLE
ora.crf
      1       ONLINE ONLINE       rac2                     STABLE
ora.crsd
     1        ONLINE ONLINE      rac2                     STABLE
ora.cssd
     1        ONLINE ONLINE      rac2                     STABLE

Once the resource name is known, it can be managed like any ordinary resource.

[root@rac1 ~]# crsctl stop res ora.crf -init
CRS-2673: Attempting to stop 'ora.crf' on 'rac1'
CRS-2677: Stop of 'ora.crf' on 'rac1' succeeded

[root@rac1 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target State       Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
     1        ONLINE ONLINE      rac1                     Started,STABLE
ora.cluster_interconnect.haip
     1        ONLINE ONLINE      rac1                     STABLE
ora.crf
      1        OFFLINE OFFLINE                               STABLE
ora.crsd
     1        ONLINE ONLINE      rac1                     STABLE

[root@rac1 ~]# crsctl start res ora.crf -init
CRS-2672: Attempting to start 'ora.crf' on 'rac1'
CRS-2676: Start of 'ora.crf' on 'rac1' succeeded

[root@rac1 ~]# crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target State       Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
     1        ONLINE ONLINE      rac1                     Started,STABLE
ora.cluster_interconnect.haip
     1        ONLINE ONLINE      rac1                     STABLE
ora.crf
      1        ONLINE ONLINE       rac1                     STABLE
ora.crsd
     1        ONLINE ONLINE      rac1                     STABLE

Starting and stopping the CHM resource only affects whether CHM data is collected; it has no effect on GI or the database.
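
The status check shown in the listings above could be scripted along the following lines. This is only a sketch: the function name crf_state is hypothetical, and the parsing assumes the two-line layout seen in this article's listings (resource name on one line, then a row whose second and third fields are Target and State). The sample input is a shortened copy of the listing taken after ora.crf was stopped.

```shell
#!/bin/sh
# crf_state: hypothetical helper. Reads `crsctl stat res -t -init` output on
# stdin and prints "<target> <state>" for ora.crf.
# Assumption: the resource name is on its own line and the next line holds
# the instance row, as in the listings shown in this article.
crf_state() {
    awk '
        $1 == "ora.crf" {
            if ((getline row) > 0) {
                split(row, f, " ")
                print f[2], f[3]
            }
            exit
        }'
}

# Shortened sample copied from the listing above (ora.crf stopped).
sample='ora.asm
     1        ONLINE ONLINE      rac1                     Started,STABLE
ora.crf
      1        OFFLINE OFFLINE                               STABLE
ora.crsd
     1        ONLINE ONLINE      rac1                     STABLE'

printf '%s\n' "$sample" | crf_state
```

On a live node the same function would be fed directly, for example crsctl stat res -t -init | crf_state, and a monitoring script could alert when the result is not "ONLINE ONLINE".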

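Oracle CHM also has its own dedicated management tools and commands. The graphical tool is CHMOSG (CHM/OS Graphical User Interface), which is not installed by default and must be downloaded separately from OTN.

CHMOSG presents the relevant data in detail in graphical form.

The OCLUMON command-line tool queries information in the CHM repository. If it was installed along with GI, oclumon is located in $GI_HOME/bin by default. For example: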

[grid@rac1 ~]$ which oclumon
/u01/gridsoft/12.1.0.2/bin/oclumon

If CHM was downloaded manually from OTN, the tool is in /usr/lib/oracrf/bin on Linux and in C:\Program Files\oracrf\bin on Windows.

For details on usage, refer to the command help:

[grid@rac1 ~]$ oclumon -h
For help from command line   : oclumon <verb> -h
For help in interactive mode : <verb> -h
Currently supported verbs are :
dumpnodeview, manage, version, debug, analyze, quit, exit, and help

[grid@rac1 ~]$ oclumon dumpnodeview -h

dumpnodeview verb usage
=======================
The dumpnodeview command reports monitored records in the text format. The
collection of metrics for a node at a given point in time (a timestamp) is
called a node view.

* Usage
dumpnodeview [-allnodes | -n <node1> ...] [-last <duration> |
                -s <timestamp> -e <timestamp>][-i <interval>][-v]

* Where
-n <node1> ...   = Dump node views for specified nodes
-allnodes        = Dump node views for all nodes
-s <timestamp>   = Specify start time for range dump of node views
-e <timestamp>   = Specify end time for range dump of node views
                   Absolute timestamp must be in "YYYY-MM-DD HH24:MI:SS"
                   format, for example "2007-11-12 23:05:00"
-last <duration> = Dump the latest node views for a specified duration.
                   Duration must be in "HH24:MI:SS" format, for example
                   "00:45:00"
-v               = Dump verbose node views
-i               = Dump node views separated by the specified
                   interval in seconds. Must be a multiple of 5.

* Requirements and notes
To stop continuous display, use Ctrl-C on Linux or UNIX and Esc on Windows.
-s and -e need to be specified together for range dumps of node views.
The local System Monitor Service (osysmond) must be running to get dumps.
The Cluster Logger Service (ologgerd) must be running to get dumps.

* Defaults :
Mode      : Continuous mode

* Example :
oclumon dumpnodeview -n node1 node2 node3 -last "12:00:00"
oclumon dumpnodeview -last "00:10:00" -i 30
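
The help text above imposes two constraints that are easy to get wrong by hand: the -last duration must be in HH24:MI:SS form, and the -i interval must be a multiple of 5. The sketch below builds a dumpnodeview command line from plain second counts and only prints it. The helper names fmt_duration and dumpnodeview_cmd are hypothetical, not part of oclumon.

```shell
#!/bin/sh
# fmt_duration: hypothetical helper. Converts a number of seconds into the
# HH:MI:SS string expected by `-last`.
fmt_duration() {
    secs=$1
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# dumpnodeview_cmd: hypothetical helper. Prints (does not run) a command line.
#   $1 = look-back window in seconds
#   $2 = interval in seconds; rejected unless it is a positive multiple of 5
dumpnodeview_cmd() {
    if [ "$2" -le 0 ] || [ $(($2 % 5)) -ne 0 ]; then
        echo "interval must be a positive multiple of 5" >&2
        return 1
    fi
    printf 'oclumon dumpnodeview -allnodes -last "%s" -i %s\n' \
        "$(fmt_duration "$1")" "$2"
}

# Last 10 minutes, one node view every 30 seconds.
dumpnodeview_cmd 600 30
```

Printing the command rather than executing it keeps the sketch usable on a machine without Grid Infrastructure; on a cluster node the printed line can be reviewed and then run.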

Analyzing CHM Monitoring Data

Analysis here means extracting the data we need from the CHM repository, which is done with the diagcollection.pl command.

[grid@rac1 ~]$ diagcollection.pl -h
Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
diagcollection
   --collect
            [--crs] For collecting crs diagnostic information
            [--adr] For collecting diagnostic information for ADR; specify ADR location
            [--chmos] For collecting Cluster Health Monitor (OS) data
            [--acfs] Unix only. For collecting ACFS diagnostic information
            [--all] Default. For collecting all diagnostic information.
            [--core] UNIX only. Package core files with CRS data
            [--afterdate] UNIX only. Collects archives from the specified date. Specify in mm/dd/yyyy format
            [--aftertime] Supported with -adr option. Collects archives after the specified time. Specify in YYYYMMDDHHMISS24 format
            [--beforetime] Supported with -adr option. Collects archives before the specified date. Specify in YYYYMMDDHHMISS24 format
            [--crshome] Argument that specifies the CRS Home location
            [--incidenttime] Collects Cluster Health Monitor (OS) data from the specified time. Specify in MM/DD/YYYY HH24:MM:SS format
                 If not specified, Cluster Health Monitor (OS) data generated in the past 24 hours are collected
            [--incidentduration] Collects Cluster Health Monitor (OS) data for the duration after the specified time. Specify in HH:MM format.
                 If not specified, all Cluster Health Monitor (OS) data after incidenttime are collected
            NOTE:
            1. You can also do the following
               diagcollection.pl --collect --crs --crshome <CRS Home>

    --clean        cleans up the diagnosability
                   information gathered by this script

    --coreanalyze  UNIX only. Extracts information from core files
                   and stores it in a text file
[grid@rac1 ~]$

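Run the command as the grid user: diagcollection.pl --collect --chmos

This command outputs all data collected in the CHM repository. If there is a lot of data it takes a long time, so usually only data for a specific time window is queried.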

[root@rac1 ~]# diagcollection.pl --collect --chmos
Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
ORACLE_BASE is /u01/gridbase
Collecting Cluster Health Monitor (OS) data
Version: 12.1.0.2.0
Collecting OS logs
Collecting sysconfig data

[root@rac1 ~]# ls -lrt
-rw-r--r-- 1 root root 9815498 Dec 12 16:28 chmosData_rac1_20141212_1628.tar.gz
-rw-r--r-- 1 root root 267373 Dec 12 16:28 osData_rac1_20141212_1628.tar.gz

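Note: when all data is collected, the results are packaged once collection finishes, which requires the tar command. Check that you have write permission on the current directory; otherwise switch to an appropriate user. Here I am using root with the environment variables configured, which works just as well.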

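-- Collect the last hour of data:

[root@rac1 ~]# oclumon dumpnodeview -allnodes -v -last "1:00:00"

This command analyzes all data from all nodes for the last hour. By default, however, all output is printed to the terminal, which makes analysis impractical, so the output is usually redirected to a file.

For example:

[root@rac1 ~]# oclumon dumpnodeview -allnodes -v -last "1:00:00" > /tmp/zhixin.log
[root@rac1 ~]# cat /tmp/zhixin.log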

-- Collect a specific time window:

[grid@rac1 tmp]$ diagcollection.pl --collect --crshome $ORACLE_HOME --chmos --incidenttime "12/12/2014 14:01:01" --incidentduration "01:00"

Production Copyright 2004, 2010, Oracle. All rights reserved
Cluster Ready Services (CRS) diagnostic collection tool
Warning: Script executed while not logged in as as root
Some diagnostic data may not be collected
Collecting Cluster Health Monitor (OS) data
Version: 12.1.0.2.0
Collecting OS logs
/bin/tar: var/log/messages: Cannot open: Permission denied
/bin/tar: var/log/messages-20141208: Cannot open: Permission denied
/bin/tar: Exiting with failure status due to previous errors
gzip: osData_rac1_20141212_1643.tar.gz already exists; do you wish to overwrite (y or n)? y
Collecting sysconfig data
[grid@rac1 tmp]$

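Note:

(1) The time format must follow the pattern shown here; see the command help for details.

(2) In 11.2.0.2, because of bug 10048487, it is not possible to analyze all CHM data; data can only be collected by time range.

To collect more detailed data, raise the CHM log level. The syntax is:

oclumon debug log all allcomp:<tracelevel from 0 to 3>

The higher the level, the more information is collected. The default level is 1; at level 0 no log data is collected. After raising the level, remember to set it back to 1 once testing is complete.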

Run as the root user:

[root@rac1 ~]# oclumon debug log all allcomp:1
[root@rac1 ~]# oclumon debug log all allcomp:2
[root@rac1 ~]# oclumon debug log all allcomp:1
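
Because it is easy to forget to return the level to 1, the raise, test, and restore sequence above could be wrapped in a small function. The sketch below is one possible shape, not an Oracle-provided tool: with_chm_loglevel and the OCLUMON override variable are hypothetical names, and the demo deliberately substitutes echo for oclumon so that the commands are only printed.

```shell
#!/bin/sh
# OCLUMON: hypothetical override so the wrapper can be dry-run; defaults to
# the real oclumon binary found on PATH.
OCLUMON=${OCLUMON:-oclumon}

# with_chm_loglevel: hypothetical wrapper. Sets the CHM log level, runs the
# given command, then sets the level back to 1 (the default per this article).
#   $1 = level (0..3), remaining arguments = command to run in between
with_chm_loglevel() {
    level=$1
    shift
    case $level in
        0|1|2|3) ;;
        *) echo "level must be between 0 and 3" >&2; return 1 ;;
    esac
    $OCLUMON debug log all "allcomp:$level" || return 1
    "$@"
    status=$?
    $OCLUMON debug log all allcomp:1
    return $status
}

# Dry run: echo stands in for oclumon, so nothing is changed on the system.
OCLUMON=echo
with_chm_loglevel 2 echo "test workload runs here"
```

For real use the OCLUMON=echo line would be dropped and the script run as root, as noted above.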

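CHM Disk Space Requirements

By default, CHM needs 1 GB of space for monitoring data from all nodes, and each node generates about 500 MB of data per day. The CHM repository retains 3 days of data by default, so the space it uses keeps growing. In other words, with CHM enabled, the repository needs at least 1 GB of space.

The retention time of data in the CHM repository can be queried with the following command: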

[grid@rac1 tmp]$ oclumon manage -getrepsize
CHM Repository Size = 136320 seconds
[grid@rac1 tmp]$

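The unit here is seconds, because CHM's sampling interval is measured in seconds (once per second before 11.2.0.3). Assuming 720 MB of data per node per day, the default policy for this two-node cluster works out to 2 * 720 MB * 3 = 4320 MB, or a little over 4 GB.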
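
The estimate above is simple multiplication, and a small helper makes it easy to redo for other cluster sizes. In this sketch the function name chm_repo_mb is hypothetical, and the factor 2 in the article's formula is read as the number of nodes, which is an interpretation rather than something the text states outright.

```shell
#!/bin/sh
# chm_repo_mb: hypothetical helper. Estimated repository size in MB.
#   $1 = number of nodes (assumed meaning of the "2" in the article's formula)
#   $2 = MB generated per node per day
#   $3 = retention in days
chm_repo_mb() {
    echo $(($1 * $2 * $3))
}

mb=$(chm_repo_mb 2 720 3)
echo "$mb MB"

# Same figure in GB, two decimal places.
awk -v mb="$mb" 'BEGIN { printf "%.2f GB\n", mb / 1024 }'
```

With the article's inputs this prints 4320 MB and 4.22 GB.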

The retention time of data in the CHM repository can be adjusted with the following command:

oclumon manage -repos checkretentiontime xx

The unit is seconds. Oracle recommends 259200 seconds, which is 3 days.
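
Since both -getrepsize and checkretentiontime work in seconds, converting between seconds and days by hand is a frequent chore. The two helpers below are a minimal sketch with hypothetical names (days_to_secs, secs_to_dhm); they do the arithmetic only and do not call oclumon.

```shell
#!/bin/sh
# days_to_secs: hypothetical helper. Days -> seconds, for checkretentiontime.
days_to_secs() {
    echo $(($1 * 86400))
}

# secs_to_dhm: hypothetical helper. Seconds -> "Nd Nh Nm", for reading the
# value reported by `oclumon manage -getrepsize`.
secs_to_dhm() {
    printf '%dd %dh %dm\n' $(($1 / 86400)) $(($1 % 86400 / 3600)) $(($1 % 3600 / 60))
}

days_to_secs 3       # the recommended 3-day retention
secs_to_dhm 136320   # the value reported by -getrepsize in this article
```

The first call prints 259200, matching the recommended value, and the second prints 1d 13h 52m, showing that the repository in this example retains a little over a day and a half of data.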

[grid@rac1 tmp]$ oclumon manage -repos checkretentiontime 259200
The Cluster Health Monitor repository is too small for the desired retention. Please first resize the repository to 3896 MB

[grid@rac1 tmp]$ oclumon manage -getrepsize

CHM Repository Size = 136320 seconds
[grid@rac1 tmp]$

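The message indicates there is not enough space, so the CHM repository must be enlarged first. By default the CHM repository is the MGMTDB instance, which is stored in the OCR disk group by default, so here the OCR disk group needs more space before the change can be made.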
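
Once the disk group has enough room, the size suggested in the message can be fed to the resize command. The sketch below extracts that number and prints the follow-up command. It assumes the message wording is exactly as shown in this article (with normal spacing); the function name required_repo_mb is hypothetical, and other releases may phrase the message differently.

```shell
#!/bin/sh
# required_repo_mb: hypothetical helper. Reads checkretentiontime output on
# stdin and prints the repository size in MB that the message asks for.
# Assumption: the message contains "resize the repository to <N> MB".
required_repo_mb() {
    sed -n 's/.*resize the repository to \([0-9][0-9]*\) MB.*/\1/p'
}

# Message text as shown in this article.
msg='The Cluster Health Monitor repository is too small for the desired retention. Please first resize the repository to 3896 MB'

mb=$(printf '%s\n' "$msg" | required_repo_mb)
if [ -n "$mb" ]; then
    # Print the follow-up command instead of running it.
    echo "oclumon manage -repos changerepossize $mb"
fi
```

If the message does not match, nothing is printed, so the sketch does nothing when the retention check succeeds.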

If the CHM repository is taking up too much space, its size can be changed with the following command:

oclumon manage -repos changerepossize <memsize>

Note: the size must be at least 1024 MB; otherwise an error is reported.

[grid@rac1 client]$ oclumon manage -repos changerepossize 2500
The Cluster Health Monitor repository was successfully resized. The new retention is 166380 seconds.
[grid@rac1 client]$

The resize succeeded.
