
A Quick Look at Oracle Trace File Analyzer (OSWatcher and Collecting Diagnostic Logs)

Original article by 张玉龙, 2021-12-10

Oracle Autonomous Health Framework (AHF) includes Oracle ORAchk, Oracle EXAchk, and Oracle Trace File Analyzer (TFA).
AHF can be downloaded from the following My Oracle Support note:
Autonomous Health Framework (AHF) - Including TFA and ORAchk/EXAchk (Doc ID 2550798.1)

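Installing as root gives full functionality; a non-root installation has reduced functionality.
If AHF is already installed, running the installer again upgrades it in its existing location. In my testing, however, commands failed with all kinds of errors after such an upgrade, so I recommend uninstalling the existing installation before installing the new version.
Oracle Clusterware does not manage AHF, because AHF must remain available even if Oracle Clusterware stops working.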

Installation

The installer provides the following options:

[root@rac1 ~]# ./ahf_setup --help
   -ahf_loc    - Install into the directory supplied. (Default /opt/oracle.ahf)    # install into the specified location
   -data_dir   - AHF Data Directory where all the collections, metadata, etc. will be stored    # specify the data directory
   -nodes      - Comma separated Remote Node List    # comma-separated list of the other nodes in the cluster
   -extract    - Extract only files from Installer. (Default for non-root users)    # extract-only install, e.g. to install only ORAchk
   -nosymlink  - Do not create symlinks on Exadata DOM0. (Default for non-root users)
   -notfasetup - Do not Configure TFA when used with -extract    # do not configure TFA
   -local      - Only install on the local node    # install on the local node only
   -silent     - Do not ask any install questions    # silent installation
   -tmp_loc    - Temporary location directory for AHF to extract the install archive to (must exist) (Default /tmp)
   -perlhome   - Custom location of perl binaries
   -force      - Force AHF Install    # force the installation
   -debug      - Debug AHF Install Script
   -level      - AHF Instal Debug Level 1-6 (Default 4 with option -debug)
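The options above can be combined on one command line. The following is a minimal dry-run sketch that only prints a candidate command and never runs the installer. The flag names come from the help output; the function name `build_ahf_setup_cmd` is hypothetical, and whether this exact combination behaves as intended should be checked against your AHF version.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) an ahf_setup command line.
# Flag names are taken from the --help output above; the way they are
# combined here is an assumption, not something the article tested.
build_ahf_setup_cmd() {
    ahf_loc=$1      # value for -ahf_loc
    data_dir=$2     # value for -data_dir
    scope=$3        # "local" adds -local; anything else omits it
    cmd="./ahf_setup -ahf_loc $ahf_loc -data_dir $data_dir -silent"
    if [ "$scope" = "local" ]; then
        cmd="$cmd -local"
    fi
    printf '%s\n' "$cmd"
}

# Prints: ./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir /u01/app/grid -silent -local
build_ahf_setup_cmd /opt/oracle.ahf /u01/app/grid local
```

Printing the command first makes it easy to review the flags before running the installer as root.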

To install only Oracle ORAchk or Oracle EXAchk without running Oracle Trace File Analyzer, use the -extract -notfasetup options:

[root@rac1 ~]# ./ahf_setup -extract -notfasetup

The installation steps are shown below:

[root@rac1 opt]# ./ahf_setup

AHF Installer for Platform Linux Architecture x86_64
AHF Installation Log : /tmp/ahf_install_213000_13685_2021_12_10-16_17_25.log
Starting Autonomous Health Framework (AHF) Installation
AHF Version: 21.3.0 Build Date: 202110290347
Default AHF Location : /opt/oracle.ahf
Do you want to install AHF at [/opt/oracle.ahf] ? [Y]|N :    # the default install location is /opt/oracle.ahf
AHF Location : /opt/oracle.ahf
AHF Data Directory stores diagnostic collections and metadata.
AHF Data Directory requires at least 5GB (Recommended 10GB) of free space.
Choose Data Directory from below options :
1. /u01/app/grid [Free Space : 50585 MB]
2. Enter a different Location
Choose Option [1 - 2] : 1    # choose the data directory
AHF Data Directory : /u01/app/grid/oracle.ahf/data
Do you want to add AHF Notification Email IDs ? [Y]|N : N    # enter N and press Enter to continue
AHF will also be installed/upgraded on these Cluster Nodes :
1. rac2
The AHF Location and AHF Data Directory must exist on the above nodes
AHF Location : /opt/oracle.ahf
AHF Data Directory : /u01/app/grid/oracle.ahf/data
Do you want to install/upgrade AHF on Cluster Nodes ? [Y]|N :    # press Enter to continue
Extracting AHF to /opt/oracle.ahf
Configuring TFA Services
Discovering Nodes and Oracle Resources
Not generating certificates as GI discovered
Starting TFA Services
Created symlink from /etc/systemd/system/multi-user.target.wants/oracle-tfa.service to /etc/systemd/system/oracle-tfa.service.
Created symlink from /etc/systemd/system/graphical.target.wants/oracle-tfa.service to /etc/systemd/system/oracle-tfa.service.

.-------------------------------------------------------------------------.
| Host | Status of TFA | PID   | Port | Version    | Build ID             |
+------+---------------+-------+------+------------+----------------------+
| rac1 | RUNNING       | 15284 | 5000 | 21.3.0.0.0 | 21300020211029034711 |
'------+---------------+-------+------+------------+----------------------'

Running TFA Inventory...
Adding default users to TFA Access list...

.------------------------------------------------------------.
| Summary of AHF Configuration                               |
+-----------------+------------------------------------------+
| Parameter       | Value                                    |
+-----------------+------------------------------------------+
| AHF Location    | /opt/oracle.ahf                          |
| TFA Location    | /opt/oracle.ahf/tfa                      |
| Orachk Location | /opt/oracle.ahf/orachk                   |
| Data Directory  | /u01/app/grid/oracle.ahf/data            |
| Repository      | /u01/app/grid/oracle.ahf/data/repository |
| Diag Directory  | /u01/app/grid/oracle.ahf/data/rac1/diag  |
'-----------------+------------------------------------------'

Starting orachk scheduler from AHF ...
AHF install completed on rac1
Installing AHF on Remote Nodes :
AHF will be installed on rac2, Please wait.
AHF will prompt twice to install/upgrade per Remote Node. So total 2 prompts
Do you want to continue Y|[N] : Y    # install on the other node
AHF will continue with Installing on remote nodes
Installing AHF on rac2 :
[rac2] Copying AHF Installer
root@rac2's password:    # enter the remote node's password
[rac2] Running AHF Installer
root@rac2's password:    # enter the remote node's password
AHF binaries are available in /opt/oracle.ahf/bin
AHF is successfully installed
Do you want AHF to store your My Oracle Support Credentials for Automatic Upload ? Y|[N] :    # press Enter to continue
Moving /tmp/ahf_install_213000_13685_2021_12_10-16_17_25.log to /u01/app/grid/oracle.ahf/data/rac1/diag/ahf/

If the following message appears, AHF has to be installed on each node separately:

Login using root is disabled in sshd config. Installing AHF only on Local Node

The Oracle Autonomous Health Framework installation configures a service by default and enables it to start at boot.
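One way to confirm this on a node is to ask systemd directly. The sketch below assumes the unit name `oracle-tfa.service` seen in the installer output and uses the standard `systemctl is-enabled` and `systemctl is-active` subcommands; the function name `tfa_service_report` is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the oracle-tfa systemd unit is enabled at boot
# and currently active. The unit name is taken from the installer output.
tfa_service_report() {
    unit=oracle-tfa.service
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found"
        return 2
    fi
    # Both subcommands print a one-word state. A non-zero exit status only
    # means "not enabled" or "not active", so it is deliberately ignored.
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null || true)
    active=$(systemctl is-active "$unit" 2>/dev/null || true)
    echo "$unit: boot=${enabled:-unknown} now=${active:-unknown}"
}
```

Calling `tfa_service_report` on a node prints a single line with both states for that node.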

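Upgrade

AHF usually ships with a RAC cluster deployment or an RU installation, so installing a newly downloaded version will prompt for an upgrade. Personally, I do not recommend upgrading in place; uninstall first and then install.
The upgrade steps are shown below: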

[root@rac1 ~]# umask 0022    # 22, 022, or 0022
[root@rac1 ~]# unzip AHF-LINUX_v21.3.0.zip
[root@rac1 ~]# ./ahf_setup

AHF Installer for Platform Linux Architecture x86_64
AHF Installation Log : /tmp/ahf_install_213000_24863_2021_11_26-19_46_09.log
Starting Autonomous Health Framework (AHF) Installation
AHF Version: 21.3.0 Build Date: 202110290347
AHF is already installed at /opt/oracle.ahf    # an existing installation was found on this system
Installed AHF Version: 21.1.2 Build Date: 202105140934
Do you want to upgrade AHF [Y]|N : Y    # asks whether to upgrade
AHF will also be installed/upgraded on these Cluster Nodes :    # the other nodes are upgraded as well
1. rac2
The AHF Location and AHF Data Directory must exist on the above nodes    # make sure the directories below exist on every node being upgraded
AHF Location : /opt/oracle.ahf
AHF Data Directory : /u01/app/grid/oracle.ahf/data
Do you want to install/upgrade AHF on Cluster Nodes ? [Y]|N : Y    # asks whether to install or upgrade AHF on the cluster nodes
Upgrading /opt/oracle.ahf    # upgrades the existing installation
Shutting down AHF Services
TFA-00201 Diagnostic directory not found.
Shutting down TFA
Removed symlink /etc/systemd/system/multi-user.target.wants/oracle-tfa.service.
Removed symlink /etc/systemd/system/graphical.target.wants/oracle-tfa.service.
Successfully shutdown TFA..
Starting AHF Services
Starting TFA..
Created symlink from /etc/systemd/system/multi-user.target.wants/oracle-tfa.service to /etc/systemd/system/oracle-tfa.service.
Created symlink from /etc/systemd/system/graphical.target.wants/oracle-tfa.service to /etc/systemd/system/oracle-tfa.service.
Waiting up to 100 seconds for TFA to be started..
. . . . .
Successfully started TFA Process..
. . . . .
TFA Started and listening for commands
TFA-00201 Diagnostic directory not found.
AHF upgrade completed on rac1    # the local node is done
Upgrading AHF on Remote Nodes :
AHF will be installed on rac2, Please wait.
AHF will prompt twice to install/upgrade per Remote Node. So total 2 prompts
Do you want to continue Y|[N] : Y    # asks whether to continue upgrading the remote node
AHF will continue with Upgrading on remote nodes
Upgrading AHF on rac2 :
[rac2] Copying AHF Installer
root@rac2's password:    # the remote node's root password is required
[rac2] Running AHF Installer
root@rac2's password:
Do you want AHF to store your My Oracle Support Credentials for Automatic Upload ? Y|[N] :
AHF is successfully upgraded to latest version

.------------------------------------------------------------.
| Host | TFA Version | TFA Build ID         | Upgrade Status |
+------+-------------+----------------------+----------------+
| rac1 | 21.3.0.0.0  | 21300020211029034711 | UPGRADED       |
| rac2 | 21.3.0.0.0  | 21300020211029034711 | UPGRADED       |
'------+-------------+----------------------+----------------'

Moving /tmp/ahf_install_213000_24863_2021_11_26-19_46_09.log to /u01/app/grid/oracle.ahf/data/rac1/diag/ahf/
[root@rac1 ~]#

After the upgrade, running commands produced the errors below; I am not sure whether this was a one-off:

[root@rac1 ~]# tfactl toolstatus
TFA-00201 Diagnostic directory not found.
[oracle@rac1 ~]$ tfactl diagcollect -srdc DBASM
TFA-00201 Diagnostic directory not found.

The MOS note Tfactl Failed With TFA-00201 (Doc ID 2659786.1) recommends uninstalling and reinstalling.
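When checking several commands after an upgrade, it can help to scan their captured output for TFA error codes instead of reading each result by eye. The sketch below is plain text processing and does not call tfactl; the function name `tfa_error_codes` is hypothetical, and the `TFA-` plus digits pattern is inferred from the messages shown in this article.

```shell
#!/bin/sh
# Sketch: read captured tfactl output on stdin and print each distinct
# TFA-nnnnn error code once. The code pattern is inferred from the
# messages in this article (TFA-00201, TFA-00104).
tfa_error_codes() {
    grep -o 'TFA-[0-9][0-9]*' | sort -u
}

# Demo with the two identical messages from the failed commands above.
# Prints: TFA-00201
printf '%s\n' \
    'TFA-00201 Diagnostic directory not found.' \
    'TFA-00201 Diagnostic directory not found.' | tfa_error_codes
```

In practice the input would come from something like `tfactl toolstatus 2>&1 | tfa_error_codes`.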

Uninstall

[root@rac1 ~]# tfactl uninstall
Starting AHF Uninstall
NOTE : Uninstalling will delete the repository as well since Install type is GI
AHF will be uninstalled on:
rac1
rac2
# asks whether to continue with the uninstall
Do you want to continue with AHF uninstall ? [Y]|N : Y
... ...
Stopping and removing AHF in rac2...
# the uninstall runs remotely on the other nodes, so their passwords are required
root@rac2's password:

Commands

  • Starting and stopping TFA (a service is configured by default when TFA is installed)
[root@rac1 ~]# systemctl stop oracle-tfa.service
[root@rac1 ~]# systemctl start oracle-tfa.service
  • Check the running status of TFA
[root@rac1 ~]# tfactl print status

.-------------------------------------------------------------------------------------------.
| Host | Status of TFA | PID  | Port | Version    | Build ID             | Inventory Status |
+------+---------------+------+------+------------+----------------------+------------------+
| rac1 | RUNNING       | 2098 | 5000 | 21.3.0.0.0 | 21300020211029034711 | COMPLETE         |
| rac2 | RUNNING       | 2300 | 5000 | 21.3.0.0.0 | 21300020211029034711 | COMPLETE         |
'------+---------------+------+------+------------+----------------------+------------------'
  • Check the status of the TFA tools
[root@rac1 ~]# tfactl toolstatus
Running command tfactltoolstatus on rac2 ...

.------------------------------------------------------------------.
| TOOLS STATUS - HOST : rac2                                       |
+----------------------+--------------+--------------+-------------+
| Tool Type            | Tool         | Version      | Status      |
+----------------------+--------------+--------------+-------------+
| AHF Utilities        | alertsummary | 21.3.0       | DEPLOYED    |
|                      | calog        | 21.3.0       | DEPLOYED    |
|                      | dbglevel     | 21.3.0       | DEPLOYED    |
|                      | grep         | 21.3.0       | DEPLOYED    |
|                      | history      | 21.3.0       | DEPLOYED    |
|                      | ls           | 21.3.0       | DEPLOYED    |
|                      | managelogs   | 21.3.0       | DEPLOYED    |
|                      | menu         | 21.3.0       | DEPLOYED    |
|                      | orachk       | 21.3.0       | DEPLOYED    |
|                      | param        | 21.3.0       | DEPLOYED    |
|                      | ps           | 21.3.0       | DEPLOYED    |
|                      | pstack       | 21.3.0       | DEPLOYED    |
|                      | summary      | 21.3.0       | DEPLOYED    |
|                      | tail         | 21.3.0       | DEPLOYED    |
|                      | triage       | 21.3.0       | DEPLOYED    |
|                      | vi           | 21.3.0       | DEPLOYED    |
+----------------------+--------------+--------------+-------------+
| Development Tools    | oratop       | 14.1.2       | DEPLOYED    |
+----------------------+--------------+--------------+-------------+
| Support Tools Bundle | darda        | 2.10.0.R6036 | DEPLOYED    |
|                      | oswbb        | 8.3.2        | RUNNING     |
|                      | prw          | 12.1.13.11.4 | NOT RUNNING |
'----------------------+--------------+--------------+-------------'

Note :-
  DEPLOYED : Installed and Available - To be configured or run interactively.
  NOT RUNNING : Configured and Available - Currently turned off interactively.
  RUNNING : Configured and Available.

.------------------------------------------------------------------.
| TOOLS STATUS - HOST : rac1                                       |
+----------------------+--------------+--------------+-------------+
| Tool Type            | Tool         | Version      | Status      |
+----------------------+--------------+--------------+-------------+
| AHF Utilities        | alertsummary | 21.3.0       | DEPLOYED    |
|                      | calog        | 21.3.0       | DEPLOYED    |
|                      | dbglevel     | 21.3.0       | DEPLOYED    |
|                      | grep         | 21.3.0       | DEPLOYED    |
|                      | history      | 21.3.0       | DEPLOYED    |
|                      | ls           | 21.3.0       | DEPLOYED    |
|                      | managelogs   | 21.3.0       | DEPLOYED    |
|                      | menu         | 21.3.0       | DEPLOYED    |
|                      | orachk       | 21.3.0       | DEPLOYED    |
|                      | param        | 21.3.0       | DEPLOYED    |
|                      | ps           | 21.3.0       | DEPLOYED    |
|                      | pstack       | 21.3.0       | DEPLOYED    |
|                      | summary      | 21.3.0       | DEPLOYED    |
|                      | tail         | 21.3.0       | DEPLOYED    |
|                      | triage       | 21.3.0       | DEPLOYED    |
|                      | vi           | 21.3.0       | DEPLOYED    |
+----------------------+--------------+--------------+-------------+
| Development Tools    | oratop       | 14.1.2       | DEPLOYED    |
+----------------------+--------------+--------------+-------------+
| Support Tools Bundle | darda        | 2.10.0.R6036 | DEPLOYED    |
|                      | oswbb        | 8.3.2        | RUNNING     |
|                      | prw          | 12.1.13.11.4 | NOT RUNNING |
'----------------------+--------------+--------------+-------------'

Note :-
  DEPLOYED : Installed and Available - To be configured or run interactively.
  NOT RUNNING : Configured and Available - Currently turned off interactively.
  RUNNING : Configured and Available.
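Both status commands print pipe-separated text tables, so a small filter can turn them into something a script can act on. The following is a rough sketch under the assumption that the layout matches the output shown in this article (host and status in the first two columns of `tfactl print status`, tool name and status in the second and fourth columns of `tfactl toolstatus`). The function names are hypothetical and the layout may differ between AHF versions.

```shell
#!/bin/sh
# Sketch only: both functions read tfactl's text tables on stdin and
# assume the pipe-separated layout shown in this article.

# Print every host whose "Status of TFA" column is not RUNNING.
tfa_hosts_not_running() {
    awk -F'|' '
        NF >= 7 {
            host = $2; status = $3
            gsub(/^ +| +$/, "", host)      # trim padding
            gsub(/^ +| +$/, "", status)
            if (host != "Host" && status != "RUNNING") print host
        }'
}

# Print "<host>: <status>" for one tool (for example oswbb).
tool_status_by_host() {
    awk -F'|' -v tool="$1" '
        /TOOLS STATUS - HOST/ {
            host = $2
            sub(/.*HOST *: */, "", host)   # keep only the host name
            gsub(/ +$/, "", host)
            next
        }
        NF >= 6 {
            name = $3; status = $5
            gsub(/^ +| +$/, "", name)
            gsub(/^ +| +$/, "", status)
            if (name == tool) print host ": " status
        }'
}
```

Typical use would be `tfactl print status | tfa_hosts_not_running` and `tfactl toolstatus | tool_status_by_host oswbb`; the first prints nothing when every node reports RUNNING.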

OSWatcher

OSWatcher has been bundled with TFA since some 11g RAC PSU; I do not know exactly which one.
By default, OSWatcher in TFA collects every 30 seconds and keeps data for 48 hours:

[root@rac1 ~]# ps -ef |grep OSW
grid     27395     1  0 16:32 ?        00:00:00 /bin/sh ./OSWatcher.sh 30 48 /bin/gzip /u01/app/grid/oracle.ahf/data/repository/suptools/rac1/oswbb/grid/archive
grid     27537 27395  0 16:32 ?        00:00:00 /bin/sh ./OSWatcherFM.sh 48 /u01/app/grid/oracle.ahf/data/repository/suptools/rac1/oswbb/grid/archive
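The two numbers that follow OSWatcher.sh in the process list are the collection interval in seconds and the retention in hours. As a small sketch, they can be pulled out of `ps -ef` output with awk; the function name `osw_settings` is hypothetical, and the argument order is assumed from the process line shown in this article.

```shell
#!/bin/sh
# Sketch: read `ps -ef` output on stdin and print the interval and
# retention that OSWatcher.sh was started with. Assumes the argument
# order "<interval seconds> <retention hours>" seen in this article.
osw_settings() {
    awk '{
        for (i = 1; i < NF - 1; i++) {
            if ($i ~ /OSWatcher\.sh$/) {
                print "interval_s=" $(i + 1) " retention_h=" $(i + 2)
            }
        }
    }'
}

# Demo with the process line from the article.
# Prints: interval_s=30 retention_h=48
printf '%s\n' 'grid 27395 1 0 16:32 ? 00:00:00 /bin/sh ./OSWatcher.sh 30 48 /bin/gzip /u01/app/grid/oracle.ahf/data/repository/suptools/rac1/oswbb/grid/archive' | osw_settings
```

On a live node the equivalent call would be `ps -ef | osw_settings`. The OSWatcherFM.sh helper process is ignored because its name does not end in OSWatcher.sh.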

Stopping TFA does not stop OSWatcher:

[root@rac1 ~]# systemctl stop oracle-tfa.service
[root@rac1 ~]# ps -ef |grep OSW
root     12964 17294  0 16:59 pts/0    00:00:00 grep --color=auto OSW
grid     27395     1  0 16:32 ?        00:00:00 /bin/sh ./OSWatcher.sh 30 48 /bin/gzip /u01/app/grid/oracle.ahf/data/repository/suptools/rac1/oswbb/grid/archive
grid     27537 27395  0 16:32 ?        00:00:00 /bin/sh ./OSWatcherFM.sh 48 /u01/app/grid/oracle.ahf/data/repository/suptools/rac1/oswbb/grid/archive
[root@rac1 ~]# tfactl toolstatus
TFA-00104 Cannot establish connection with TFA Server. Please check TFA Certificates

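Testing shows that OSWatcher in TFA behaves much like a database in the cluster: if OSWatcher was stopped manually, it does not start automatically after an OS reboot; if it was left running, it starts automatically after the reboot.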

[root@rac1 ~]# tfactl stop oswbb
[root@rac1 ~]# reboot
[root@rac1 ~]# ps -ef |grep OSW
root      8717  4461  0 17:16 pts/0    00:00:00 grep --color=auto OSW

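To change OSWatcher's collection interval and retention period, restart it with new values. Here the interval is set to 15 seconds and the retention to 168 hours (7 days):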

[root@rac1 ~]# tfactl stop oswbb
[root@rac1 ~]# tfactl start oswbb 15 168
# with compression
[root@rac1 ~]# tfactl start oswbb 15 168 gzip
[root@rac1 ~]# ps -ef |grep osw
grid     31224     1  0 17:26 pts/0    00:00:00 /bin/sh ./OSWatcher.sh 15 168 /bin/gzip /u01/app/grid/oracle.ahf/data/repository/suptools/rac1/oswbb/grid/archive
grid     31358 31224  0 17:27 pts/0    00:00:00 /bin/sh ./OSWatcherFM.sh 168 /u01/app/grid/oracle.ahf/data/repository/suptools/rac1/oswbb/grid/archive

# the modified settings are kept across an OS reboot
# they are also kept across a manual restart of OSWatcher
[root@rac1 ~]# tfactl stop oswbb
Stopped OSWatcher
[root@rac1 ~]# tfactl start oswbb
Starting OSWatcher
[root@rac1 ~]# ps -ef |grep osw
grid     13055     1  0 17:36 pts/0    00:00:00 /bin/sh ./OSWatcher.sh 15 168 /bin/gzip /u01/app/grid/oracle.ahf/data/repository/suptools/rac1/oswbb/grid/archive
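Shortening the interval and lengthening the retention both increase how much data is kept, so it is worth a rough calculation before changing them. The sketch below only counts how many collection intervals fit into the retention window. It assumes one snapshot per collector per interval, which the article does not confirm, and it says nothing about the size of each snapshot. The function name `osw_sample_count` is hypothetical.

```shell
#!/bin/sh
# Sketch: number of collection intervals that fit in the retention window.
# Assumption: one snapshot per collector per interval.
osw_sample_count() {
    interval_s=$1    # collection interval in seconds
    retention_h=$2   # retention period in hours
    echo $(( $retention_h * 3600 / $interval_s ))
}

osw_sample_count 30 48     # default settings        -> 5760
osw_sample_count 15 168    # settings used above     -> 40320
```

Under that assumption, moving from 30 s / 48 h to 15 s / 168 h keeps seven times as many snapshots, so the archive directory should have correspondingly more free space.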

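Collecting diagnostic logs

The tfactl diagcollect command controls which diagnostic logs TFA collects. tfactl offers several collection modes, for example collecting only the logs from a given time window to reduce the volume collected.
The available options can be displayed as follows: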

[root@rac1 ~]# tfactl diagcollect -h

Collect logs from across nodes in cluster

Usage : /opt/oracle.ahf/tfa/bin/tfactl diagcollect [ [component_name1] [component_name2] ... [component_nameN] | [-srdc <srdc_profile>] | [-defips]]
        [-sr <SR#>] [-node <all|local|n1,n2,..>] [-tag <tagname>] [-z <filename>] [-acrlevel <system,database,userdata>]
        [-last <n><m|h|d>| -from <time> -to <time> | -for <time>] [-nocopy] [-notrim] [-silent] [-cores][-collectalldirs]
        [-collectdir <dir1,dir2..>][-collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles]][-examples]

components:-ips|-database|-asm|-crsclient|-dbclient|-dbwlm|-tns|-rhp|-procinfo|-cvu|-afd|-crs|-cha|-wls|-emagenti|-emagent|-oms|-omsi|-ocm|-emplugins|-em|-acfs|-install|-cfgtools|-os|-ashhtml|-ashtext|-awrhtml|-awrtext|-sosreport|-qos|-ahf|-dataguard

-srdc             Service Request Data Collection (SRDC).
-database         Specify comma separated list of db unique names for collection
-defips           Include in the default collection the IPS Packages for:
                  ASM, CRS and Databases
-sr               Enter SR number to which the collection will be uploaded
-node             Specify comma separated list of host names for collection
-tag <tagname>    The files will be collected into tagname directory inside repository
-z <zipname>      The collection zip file will be given this name within the TFA collection repository
-last <n><m|h|d>  Files from last 'n' [m]inutes, 'n' [d]ays or 'n' [h]ours
-since            Same as -last. Kept for backward compatibility.
-from "Mon/dd/yyyy hh:mm:ss"    From <time>
      or "yyyy-mm-dd hh:mm:ss"
      or "yyyy-mm-ddThh:mm:ss"
      or "yyyy-mm-dd"
-to   "Mon/dd/yyyy hh:mm:ss"    To <time>
      or "yyyy-mm-dd hh:mm:ss"
      or "yyyy-mm-ddThh:mm:ss"
      or "yyyy-mm-dd"
-for  "Mon/dd/yyyy"             For <date>.
      or "yyyy-mm-dd"
-nocopy           Does not copy back the zip files to initiating node from all nodes
-notrim           Does not trim the files collected
-silent           This option is used to submit the diagcollection as a background process
-cores            Collect Core files when it would normally not have been collected
-collectalldirs   Collect all files from a directory marked "Collect All" flag to true
-collectdir       Specify a comma separated list of directories and the collection will include all files
                  from these irrespective of type and time constraints in addition to the components specified
-collectfiles     Specify a comma separated list of files/directories and the collection will include the files
                  and directories in addition to the components specified.
                  if -onlycollectfiles is also used, then no other components will be collected.
-acrlevel <system,database,userdata>    This option is used to specify the ACR level(s) for redaction
-examples         Show diagcollect usage examples

For detailed help on each component use:
  /opt/oracle.ahf/tfa/bin/tfactl diagcollect [component_name1] [component_name2] ... [component_nameN] -help
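The help text gives the -last value the form `<n><m|h|d>`: a number followed by m, h, or d. When a script builds the command from user input, a quick check of that form can catch mistakes before tfactl is called. This is a minimal sketch of such a check; the function name `valid_last_arg` is hypothetical, and tfactl itself may accept or reject values differently.

```shell
#!/bin/sh
# Sketch: succeed only if the argument has the <n><m|h|d> form that the
# help text documents for -last (digits followed by m, h, or d).
valid_last_arg() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+[mhd]$'
}

# Demo: the first three values pass, the last two are rejected.
for v in 8h 1d 30m 8w h; do
    if valid_last_arg "$v"; then
        echo "$v ok"
    else
        echo "$v rejected"
    fi
done
```

A wrapper script could call `valid_last_arg "$window" || exit 1` before running `tfactl diagcollect -last "$window"`.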

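When troubleshooting GI (RAC) problems, one of the biggest challenges for a DBA is collecting the diagnostic logs related to the problem from every node in time, especially because the logs are spread across nodes. In the past the diagcollection.pl script was used for this, but it has drawbacks: it does not discriminate between log contents and collects every RAC log from beginning to end, so the result is usually very large, and it has to be run as root on each node separately, which is inconvenient.
TFA largely overcomes these problems. It runs a Java virtual machine on each node to decide when to start collecting and compressing logs and which logs are needed to solve the problem. TFA is a product that runs outside GI and RDBMS, so it does not even depend on the version or platform in use.
As a result, when handling Oracle GI and RAC problems, TFA can collect all the required logs in one step and filter out the logs that are not needed.
Some customers worry that using TFA will affect the system. Given the functionality described above, it is only a log collection tool: it does not make changes to the system, and the load it places on the OS is light.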

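[root@rac1 ~]# tfactl diagcollect -examples
Examples:

/opt/oracle.ahf/tfa/bin/tfactl diagcollect
# Collects and compresses all log files updated on all cluster nodes in the last 1 hour, plus chmos/osw data.
# Note: the collection may be fairly large, but this is the simplest way to gather diagnostics for a recent problem.

/opt/oracle.ahf/tfa/bin/tfactl diagcollect -last 8h
# Collects and compresses all log files updated on all cluster nodes in the last 8 hours, plus chmos/osw data.

/opt/oracle.ahf/tfa/bin/tfactl diagcollect -database hrdb,fdb -last 1d -z foo
# Collects and compresses the diagnostic logs of the hrdb and fdb databases on all cluster nodes for the last 1 day.

/opt/oracle.ahf/tfa/bin/tfactl diagcollect -crs -os -node node1,node2 -last 6h
# Collects and compresses the crs logs, os logs, and chmos/osw data from node1 and node2 for the last 6 hours.

/opt/oracle.ahf/tfa/bin/tfactl diagcollect -asm -node node1 -from "Dec/09/2021" -to "Dec/09/2021 21:00:00"
# Collects and compresses the ASM logs from node1 for the period between the -from and -to times.

/opt/oracle.ahf/tfa/bin/tfactl diagcollect -for "Dec/09/2021"
# Collects and compresses all log files for the day given to -for, plus chmos/osw data.

/opt/oracle.ahf/tfa/bin/tfactl diagcollect -for "Dec/09/2021 21:00:00"
# Collects and compresses all log files from 12 hours before to 12 hours after the time given to -for, plus chmos/osw data.

/opt/oracle.ahf/tfa/bin/tfactl diagcollect -crs -collectdir /tmp_dir1,/tmp_dir2
# Collects and compresses all crs files updated in the last 1 hour, and also collects all files from /tmp_dir1 and /tmp_dir2 on the initiating node.

tfactl diagcollect -from "2022-01-14 08:00:00" -to "2022-01-14 19:00:00"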
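For incident work the same few options tend to be combined every time: a time window, a node list, and one or more components. The following is a dry-run sketch that assembles such a command and prints it without running tfactl. The options used (-from, -to, -node, and component flags such as -crs and -os) are taken from the help text; the function name `build_diagcollect_cmd` and the node names in the demo are only illustrative.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a tfactl diagcollect command
# for a time window, a node list, and zero or more components.
build_diagcollect_cmd() {
    from=$1     # start time, e.g. "2022-01-14 08:00:00"
    to=$2       # end time
    nodes=$3    # comma-separated host names, or "all" / "local"
    shift 3
    cmd="tfactl diagcollect"
    for comp in "$@"; do          # remaining arguments are component names
        cmd="$cmd -$comp"
    done
    cmd="$cmd -node $nodes -from \"$from\" -to \"$to\""
    printf '%s\n' "$cmd"
}

# Prints:
# tfactl diagcollect -crs -os -node rac1,rac2 -from "2022-01-14 08:00:00" -to "2022-01-14 19:00:00"
build_diagcollect_cmd "2022-01-14 08:00:00" "2022-01-14 19:00:00" rac1,rac2 crs os
```

Reviewing the printed command before running it is a cheap way to avoid collecting a much larger window or more nodes than intended.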
Last modified: 2022-01-24 09:22:29
[Copyright notice] This article is original content by a 墨天轮 user. Reposts must state the source (墨天轮), the article link, and the author.
