Performance tuning ‘gc cr&current grant 2-way’ event (after adding CPUs to a host)

张维照 2019-05-31


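Problem description

Last week I ran into a case where, after the host resources were expanded, gc cr grant 2-way waits rose sharply. While node 1 was running slowly, someone manually flushed the buffer cache, which only poured oil on the fire, and node 1 hung soon afterwards. Killing all LOCAL=NO processes did not help, and in the end the background processes were killed to force an instance restart. Below is the AWR report for the hour before node 1 was restarted.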

Cache Sizes
                      Begin       End
Buffer Cache:       41,984M   41,984M   Std Block Size:         8K
Shared Pool Size:    8,192M    8,192M   Log Buffer:       264,632K


Load Profile
                     Per Second   Per Transaction   Per Exec   Per Call
DB Time(s):               107.3               0.6       0.03       0.07
DB CPU(s):                  2.1               0.0       0.00       0.00
Redo size:          1,521,947.0           8,505.6
Logical reads:        392,261.2           2,192.2
Block changes:          6,776.0              37.9
Physical reads:         2,567.2              14.4
Physical writes:          575.0               3.2
User calls:             1,540.5               8.6
Parses:                   121.2               0.7
Hard parses:               16.0               0.1
W/A MB processed:           0.5               0.0
Logons:                     2.1               0.0
Executes:               3,139.9              17.6
Rollbacks:                  0.1               0.0
Transactions:             178.9

Top 5 Timed Foreground Events

Event                           Waits   Time(s)   Avg wait (ms)   % DB time   Wait Class
gc cr grant 2-way             957,312   113,999             119       29.50   Cluster
gc current block 2-way        566,707    68,269             120       17.67   Cluster
gc current grant 2-way        385,571    47,160             122       12.20   Cluster
gc cr multi block request     196,918    42,349             215       10.96   Cluster
gc buffer busy acquire        326,623    40,036             123       10.36   Cluster
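
As a sanity check on the table above, the average wait per event can be recomputed from the Waits and Time(s) columns, and the % DB time column can be summed to see how much of DB time these five cluster events account for. This is only a minimal sketch in Python: the figures are copied from the AWR excerpt, and the names `TOP_EVENTS`, `avg_wait_ms` and `cluster_share_pct` are made up for illustration.

```python
# (event, waits, total wait time in seconds, % DB time), copied from the AWR table above.
TOP_EVENTS = [
    ("gc cr grant 2-way",         957_312, 113_999, 29.50),
    ("gc current block 2-way",    566_707,  68_269, 17.67),
    ("gc current grant 2-way",    385_571,  47_160, 12.20),
    ("gc cr multi block request", 196_918,  42_349, 10.96),
    ("gc buffer busy acquire",    326_623,  40_036, 10.36),
]


def avg_wait_ms(waits, time_s):
    """Average wait in milliseconds = total wait time / number of waits."""
    return 1000.0 * time_s / waits


def cluster_share_pct(events):
    """Sum of the '% DB time' column for the listed events."""
    return round(sum(pct for _name, _waits, _time_s, pct in events), 2)


if __name__ == "__main__":
    for name, waits, time_s, _pct in TOP_EVENTS:
        print(f"{name:28s}{avg_wait_ms(waits, time_s):8.1f} ms")
    print(f"Top 5 cluster events, share of DB time: {cluster_share_pct(TOP_EVENTS)}%")
```

Each of the five events averages well over 100 ms per wait, and together they make up roughly 80% of DB time, so the instance is spending most of its time waiting on the global cache rather than working.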

Global Cache Load Profile

                                   Per Second   Per Transaction
Global Cache blocks received:          221.24              1.24
Global Cache blocks served:            162.41              0.91
GCS/GES messages received:           1,639.74              9.16
GCS/GES messages sent:               2,984.85             16.68
DBWR Fusion writes:                     13.10              0.07
Estd Interconnect traffic (KB)       3,972.45

Global Cache Efficiency Percentages (Target local+remote 100%)

Buffer access - local cache %:	 99.39
Buffer access - remote cache %:	 0.06
Buffer access - disk %:	 0.55

Global Cache and Enqueue Services - Workload Characteristics

Avg global enqueue get time (ms):	 0.3
Avg global cache cr block receive time (ms):	 114.8
Avg global cache current block receive time (ms):	 121.8
Avg global cache cr block build time (ms):	 0.0
Avg global cache cr block send time (ms):	 0.0
Global cache log flushes for cr blocks served %:	 3.8
Avg global cache cr block flush time (ms):	 1.1
Avg global cache current block pin time (ms):	 0.1
Avg global cache current block send time (ms):	 0.0
Global cache log flushes for current blocks served %:	 0.0
Avg global cache current block flush time (ms):	 1.2

Global Cache and Enqueue Services - Messaging Statistics

Avg message sent queue time (ms):	 154.4
Avg message sent queue time on ksxp (ms):	 0.3
Avg message received queue time (ms):	 0.0
Avg GCS message process time (ms):	 0.0
Avg GES message process time (ms):	 0.0
% of direct sent messages:	 21.07
% of indirect sent messages:	 60.94
% of flow controlled messages:	 17.99

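Expert answer

gc cr grant 2-way and gc current grant 2-way are waits on the delivery of a grant message. When a session needs a CR or current block, it asks the block's master instance for S or X permission. If the requested block is not found in the buffer cache of any instance, the LMS process tells the foreground (FG) process to read the block from disk into the local buffer cache. If these waits take too long, the possible causes are:
SQL that performs too much I/O, which drives cr grants;
inserts of large volumes of data, which drive current grants;
a buffer cache that is far too small;
flushing the buffer cache, which aggravates gc cr/current grant 2-way;
excessive cross-node access to the same blocks;
very poor network performance;
Oracle bugs…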

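A gc grant is normally a very small grant message sent by the LMS process, so it does not take up much interconnect bandwidth. Taken together with the accompanying “gc buffer busy acquire“ event and the figures in the Global Cache Load Profile, this largely rules out a network problem.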
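
To put the bandwidth argument in numbers, the "Estd Interconnect traffic (KB)" figure from the Global Cache Load Profile can be compared with the capacity of the private interconnect. The article does not say what interconnect hardware was in use, so the 1 Gbit/s and 10 Gbit/s link speeds below are assumptions chosen purely for illustration, as is the function name `link_utilisation_pct`. A rough sketch:

```python
# Per-second figure from the AWR "Global Cache Load Profile" section.
INTERCONNECT_KB_PER_S = 3972.45

# ASSUMPTION: illustrative link speeds only; the real interconnect speed is not given in the article.
ASSUMED_LINK_GBIT = (1, 10)


def link_utilisation_pct(kb_per_s, link_gbit):
    """Percentage of a link of `link_gbit` Gbit/s used by `kb_per_s` KB/s of traffic."""
    bits_per_s = kb_per_s * 1024 * 8          # KB/s -> bits/s
    return 100.0 * bits_per_s / (link_gbit * 1e9)


if __name__ == "__main__":
    for gbit in ASSUMED_LINK_GBIT:
        pct = link_utilisation_pct(INTERCONNECT_KB_PER_S, gbit)
        print(f"{gbit:>2} Gbit/s link: about {pct:.2f}% utilised")
```

Even on an assumed 1 Gbit/s link this traffic is only a few percent of capacity, which fits the conclusion that bandwidth is not the bottleneck. It says nothing about latency or packet loss, which would have to be checked separately.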

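A search on MOS for this event quickly turns up one known scenario: when the two nodes have different CPU counts, they start different numbers of LMS processes, and that can cause this problem. I then asked the customer whether the host expansion had included CPUs. The answer was yes, and this had not been communicated beforehand; we had only been told that memory was being added.
High “gc cr grant 2-way” / “gc current block 2-way” Wait due to Different CPU Count on Cluster Nodes (Doc ID 1911398.1)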

The settings on the two nodes are shown below.

# node 1

SQL> show parameter cpu

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     128
parallel_threads_per_cpu             integer     2
resource_manager_cpu_allocation      integer     128
SQL> show parameter gcs_server_processes

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
gcs_server_processes                 integer     6

# node 2

SQL> show parameter cpu

PARAMETER_NAME                                               TYPE        VALUE
------------------------------------------------------------ ----------- -----------
cpu_count                                                    integer     64
parallel_threads_per_cpu                                     integer     2
resource_manager_cpu_allocation                              integer     64

SQL> show parameter gcs_server

PARAMETER_NAME                                               TYPE        VALUE
------------------------------------------------------------ ----------- ---------------------
gcs_server_processes                                         integer     4

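Cause:

The output above shows that node 1 had its CPU count doubled that day, and that gcs_server_processes is 6 on node 1 but 4 on node 2. By default, gcs_server_processes is derived from the CPU count. With the LMS processes and CPUs unbalanced like this, the node with more LMS processes (node 1 in this case) can issue cache fusion requests at a higher rate and floods the node with fewer LMS processes (node 2). Node 2 becomes overloaded, cannot service the requests symmetrically, and this performance problem appears.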
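
The two observed values are consistent with the default rule given in the Oracle Database 11.2 Reference for GCS_SERVER_PROCESSES: 1 process for 1 to 3 CPUs, 2 for 4 to 15 CPUs, and 2 + CPUs/32 (fraction dropped) for 16 or more. The article does not state the database release, so applying the 11.2 rule here is an assumption, and other releases may compute the default differently. A minimal sketch, with a hypothetical function name:

```python
def default_gcs_server_processes(cpu_count):
    """Default LMS count, ASSUMING the rule documented for Oracle Database 11.2.

    Other releases may use a different formula; always check the
    GCS_SERVER_PROCESSES entry in the Reference for the release in use.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if cpu_count <= 3:
        return 1
    if cpu_count <= 15:
        return 2
    return 2 + cpu_count // 32  # fraction is disregarded


if __name__ == "__main__":
    # cpu_count values taken from the 'show parameter cpu' output above.
    nodes = {"node 1": 128, "node 2": 64}
    defaults = {name: default_gcs_server_processes(cpus) for name, cpus in nodes.items()}
    for name, lms in defaults.items():
        print(f"{name}: cpu_count={nodes[name]}, default gcs_server_processes={lms}")
    if len(set(defaults.values())) > 1:
        print("LMS counts differ across nodes; consider setting gcs_server_processes explicitly.")
```

Under this rule 128 CPUs give 6 LMS processes and 64 CPUs give 4, which matches the `show parameter` output from both nodes.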

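Solution:
Set gcs_server_processes to the same value on all nodes and restart the instances. In this case gcs_server_processes on node 1 was changed to 4. After the change, the performance problem did not recur.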
