PostgreSQL 13 preview - wait event sample - 等待事件统计(插件,hook) - ASH - performance insight

digoal 2020-01-01

2201

作者

digoal

日期

2020-01-01

背景

PostgreSQL 活跃会话表pg_stat_activity 表示数据库会话的当前状态，其中有表示会话当前正在等待的事件的信息，但是只有当前状态，实际上，我们可能还需要更多信息来分析数据库的性能问题：

1、某个事务，有多少种等待事件，每个等待事件分别等了多少次，每个等待事件分别等了多久。

2、某条sql，在一次特定的请求种：有多少种等待事件，每个等待事件分别等了多少次，每个等待事件分别等了多久。

3、某条sql，在一个统计周期内请求了多少次，一共：有多少种等待事件，每个等待事件分别等了多少次，每个等待事件分别等了多久。

4、某个会话，在一个统计周期内请求了多少次，一共：有多少种等待事件，每个等待事件分别等了多少次，每个等待事件分别等了多久。

5、某个DB，在一个统计周期内请求了多少次，一共：有多少种等待事件，每个等待事件分别等了多少次，每个等待事件分别等了多久。

6、用户，在一个统计周期内请求了多少次，一共：有多少种等待事件，每个等待事件分别等了多少次，每个等待事件分别等了多久。

如果有以上统计信息，可以更加精准的了解数据库的瓶颈在哪？

要完成以上统计，必须记录每次等待的耗时，次数。目前并没有这个信息。

目前有什么？

1、pg_wait_sampling插件，记录等待事件的次数，但是不记录等待时间。

《PostgreSQL 等待事件及等待采样统计(pg_wait_sampling)》

powa有用到这个插件来展示等待事件维度的统计数据。

《powa4 PostreSQL Workload Analyzer - PostgreSQL监控工具、带WEB展示 - 索引推荐,等待事件分析,命中率,配置变更跟踪等》

2、除此以外，aws, alibaba在产品中也提供了等待事件的快照（定期采集pg_stat_activity）的信息，同样，只有等待事件的次数，没有耗时。

《PostgreSQL Oracle 兼容性之 - performance insight - AWS performance insight 理念与实现解读 - 珍藏级》

PG 13可能引入等待事件的统计（包括次数、耗时），类似oracle ash

```
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Call Time

CPU time 559 88.80
log file parallel write 2,181 28 4.42
SQL*Net more data from client 516,611 27 4.24
db file parallel write 13,383 13 2.04
db file sequential read 563 2 .27
```

但是同时行业可能引入一个问题，性能损耗，因为gettimeofday引入的问题。

因此可能会以插件形式存在，用户可以自由开关统计。

```
Michael Paquier writes:

This does not need a configure switch.

It probably is there because the OP realizes that most people wouldn't
accept having this code compiled in.

What's the performance penalty? I am pretty sure that this is
measurable as wait events are stored for a backend for each I/O
operation as well, and you are calling a C routine within an inlined
function which is designed to be light-weight, doing only a four-byte
atomic operation.

On machines with slow gettimeofday(), I suspect the cost of this
patch would be staggering. Even with relatively fast gettimeofday,
it doesn't look acceptable for calls in hot code paths (for instance,
lwlock.c).

A bigger problem is that it breaks stuff. There are countless
calls to pgstat_report_wait_start/pgstat_report_wait_end that
assume they have no side-effects (for example, on errno) and
can never fail. I wouldn't trust GetCurrentTimestamp() for either.
If the report_wait calls can't be dropped into code with complete
certainty that they're safe, that's a big cost.

Why exactly is this insisting on logging timestamps and not,
say, just incrementing a counter? I think doing it like this
is almost certain to end in rejection.

        regards, tom lane

```

于此相关的行为包括：auto_explain 插件开启后，也会引入gettimeofday。凡有时间相关的可能都有类似问题，例如 log_duration, log_min_duration_statement, pg_stat_statements 等。

《PostgreSQL pg_stat_statements 开启后引入slock导致高并发下的性能问题》

在这个patch的讨论中也提到了一个想法，采样统计，不统计每次等待，想法类似

《PostgreSQL 13 preview - SQL采样比例设置 log_min_duration_sample log_min_duration_sample》

《PostgreSQL 12 preview - 支持SQL采样比例设置 - Log all statements from a sample of transactions》

patch 原型

```
Hello hackers,

This proposal is about recording additional statisticsofwait events.

PostgreSQL statistics Issue

The pg_stat_activity view is very useful in analysis for performance issues.
But it is difficult to get information of wait events in detail,
when you need to deep dive into analysis of performance.
It is because pg_stat_activity just shows the current wait status of backend.
If PostgreSQL provides additional statistics about wait events,
it will be helpful in pinpointing the exact cause of throughput issue.

Proposal

Record additional statistics items per wait event for every backend.

total elapsed time to wait
max elapsed time to wait
number of times being waited

I suggest storing the above statistics in the pgBackendStatus structure.

typedef struct PgBackendStatus
{
...
/
* proc's wait_event additional information.
* each wait_events elapsed time count.
/

TimestampTz st_wait_event_start_timestamp;
uint64 st_wait_event_total_elapsed[NUM_WAIT_EVENT];
uint64 st_wait_event_max_elapsed[NUM_WAIT_EVENT];
uint32 st_wait_event_counting[NUM_WAIT_EVENT];
}

PoC test

I wrote a prototype patch.

With this patch, you can get additional wait event stats via

the new procedure ‘pg_stat_get_wait_events()’.
You can test by following steps.

apply git patch
patch -p0 wait_event_stat_patchfile.diff
make distclean
configure --with-wait-event-detail
make make install
start postgreSQL and execute psql
using pg_stat_get_wait_events(null) function
input parameter is pid.

display example

postgres=# select * from pg_stat_get_wait_events(null) where counting 0;
pid | wait_event_type | wait_event | total_elapsed | max_elapsed | counting
-------+-----------------+-----------------------+---------------+-------------+----------
25291| LWLock | WALBufMappingLock |1359|376|6
25291| LWLock | WALWriteLock |679733|113803|8
25291| IO | BufFileRead |780|7|171
25291| IO | BufFileWrite |1247|19|171
25291| IO | DataFileExtend |44703|53|3395
25291| IO | DataFileImmediateSync |268798|72286|12
25291| IO | DataFileRead |91763|22149|30
25291| IO | WALSync |441139|60456|28
25291| IO | WALWrite |9567|637|737
24251| LWLock | WALBufMappingLock |1256|350|6
24251| LWLock | WALWriteLock |649140|153994|7
24251| IO | BufFileRead |620|9|172
24251| IO | BufFileWrite |1228|20|171
24251| IO | DataFileExtend |26884|51|3395
24251| IO | DataFileImmediateSync |208630|21067|12
24251| IO | DataFileRead |426278|17327|128
24251| IO | WALSync |307055|70853|24
24251| IO | WALWrite |17935|961|2720
(18rows)

etc. concept proposal

I allocated arrays for additional statistics per wait event.
Normally the backend doesn’t use all wait events.
So the size of memory used for recording statistics can be reduced
by allocating one hash list as memory pool for statistics of wait events.
This feature can be implemented as extension
if some hooks were provided in following functions,
pgstat_report_wait_start
Pgstat_report_wait_end
Feedback and suggestion will be very welcome.
Thanks!
```

如果不统计时间，只有次数统计的话，如下

wait_event_type | wait_event | calls | times | times/calls -----------------+----------------------+--------+-----------+------------ Client | ClientRead | 925732 | 115270950 | 124.52 IO | DataFileRead | 220357 | 3219893 | 14.61 LWLock | WALWriteLock | 195560 | 192684987 | 985.30 IO | DataFileWrite | 156208 | 4440091 | 28.42 IO | WALWrite | 51035 | 2925808 | 57.33 IO | WALSync | 50812 | 49626240 | 976.66 Lock | transactionid | 6179 | 9905727 | 1603.13 LWLock | buffer_content | 5453 | 295123 | 54.12 IO | DataFileExtend | 2367 | 90430 | 38.20 LWLock | wal_insert | 2053 | 33620 | 16.38 LWLock | WALBufMappingLock | 1301 | 23761 | 18.26 LWLock | CLogControlLock | 514 | 3959 | 7.70 LWLock | buffer_mapping | 162 | 1313 | 8.10 LWLock | XidGenLock | 153 | 2399 | 15.68 LWLock | lock_manager | 122 | 5051 | 41.40 Lock | tuple | 105 | 114632 | 1091.73 LWLock | ProcArrayLock | 99 | 1828 | 18.46 LWLock | CheckpointerCommLock | 33 | 854 | 25.88 Lock | extend | 24 | 20493 | 853.88 IO | RelationMapRead | 22 | 71 | 3.23 LWLock | buffer_io | 11 | 7398 | 672.55 IO | SLRUWrite | 4 | 249 | 62.25 IO | SLRUSync | 4 | 0 | 0.00 IPC | ClogGroupUpdate | 3 | 223 | 74.33 LWLock | SyncRepLock | 1 | 1 | 1.00

精彩讨论

```
Hi all,

I think sampling like Oracle ASH should work for the DBA to find probable bottlenecks in many cases (, so I hope PostgreSQL will incorporate it...) On the other hand, it seems to have the following disadvantages, some of which others have already pointed out:

Doesn't provide precise data
Sampling could miss intermittent short waits, e.g., buffer content lock waits during checkpoints. This might make it difficult or impossible to solve transient performance problems, such as infrequent 100 millisecond response times while the normal response time is a few milliseconds.
The proposed wait event collection doesn't miss anything.
Overuses resources
We may be able to shorten the sampling interval to 10 ms or even 1 ms to detect short periods of problems. However, the sampled data of active sessions become voluminous in memory and storage. It would take longer to analyze those samples. Also, the background sampling process prevents the CPU core from becoming idle to save power, which bgwriter and walwriter tries to avoid by hibernation.
The proposed wait event collection just records what actually happened. No waste. Would it use many resources if waits happen frequently? That leads to our motivation to reduce waits.
Cannot determine the impact or illness of waits just by sampling or counting without time
As the following MySQL and Oracle manual articles describe, precise measurement of wait count and time helps to judge the impact and justify the remedy. They can measure the whole SQL execution and its various processing steps (parse, plan, sort, etc.) as well as waits, so that the most significant areas can be determined.
Also, sampling cannot tell if a single wait took long or the same waits occurred repeatedly in succession (at least easily.) Do the sampled waits indicate an abnormal I/O (which took 2 ms while the normal time is 50 us)?

[MySQL]

Chapter 26 MySQL Performance Schema
https://dev.mysql.com/doc/refman/8.0/en/performance-schema.html

The Performance Schema monitors server events. An “event” is anything the server does that takes time and has been instrumented so that timing information can be collected. In general, an event could be a function call, a wait for the operating system, a stage of an SQL statement execution such as parsing or sorting, or an entire statement or group of statements. Event collection provides access to information about synchronization calls (such as for mutexes) file and table I/O, table locks, and so forth for the server and for several storage engines.

Current events are available, as well as event histories and summaries. This enables you to determine how many times instrumented activities were performed and how much time they took. Event information is available to show the activities of specific threads, or activity associated with particular objects such as a mutex or file.

[Oracle]

https://docs.oracle.com/en/database/oracle/oracle-database/18/tdppt/automatic-database-performance-monitoring.html#GUID-32E92AEC-AF1A-4602-B998-3250920CD3BE

The goal of database performance tuning is to reduce the DB time of the system for a given workload. By reducing DB time, the database can support more user requests by using the same or fewer resources. ADDM reports system resources that are using a significant portion of DB time as problem areas and sorts them in descending order by the amount of related DB time spent.

Instance Tuning Using Performance Views
https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/instance-tuning-using-performance-views.html#GUID-07982549-507F-4465-8843-7F753BCF8F99

Wait event statistics include the number of times an event was waited for and the time waited for the event to complete. If the initialization parameter TIMED_STATISTICS is set to true, then you can also see how long each resource was waited for.
To minimize user response time, reduce the time spent by server processes waiting for event completion. Not all wait events have the same wait time. Therefore, it is more important to examine events with the most total time waited rather than wait events with a high number of occurrences. Usually, it is best to set the dynamic parameter TIMED_STATISTICS to true at least while monitoring performance.

https://docs.oracle.com/en/database/oracle/oracle-database/18/tgdba/measuring-database-performance.html#GUID-811E9E65-C64A-4028-A90E-102BBFF6E68F
5.2.3 Using Wait Events without Timed Statistics

If TIMED_STATISTICS is set to FALSE, then the amount of time spent waiting for an event is not available. Therefore, it is only possible to order wait events by the number of times each event was waited for. Although the events with the largest number of waits might indicate a potential bottleneck, they might not be the main bottleneck. This situation can happen when an event is waited for a large number of times, but the total time waited for that event is small. Conversely, an event with fewer waits might be a bigger bottleneck if the wait time accounts for a significant proportion of the total wait time. Without the wait times to use for comparison, it is difficult to determine whether a wait event is worth investigating.

10.2.2 Using Wait Event Statistics to Drill Down to Bottlenecks

The most effective way to use wait event data is to order the events by the wait time. This is only possible if TIMED_STATISTICS is set to true. Otherwise, the wait events can only be ranked by the number of times waited, which is often not the ordering that best represents the problem.

To get an indication of where time is spent, follow these steps:

Examine the data collection for V$SYSTEM_EVENT. The events of interest should be ranked by wait time.
Identify the wait events that have the most significant percentage of wait time. ...
Alternatively, look at the Top 5 Timed Events section at the beginning of the Automatic Workload Repository report. This section automatically orders the wait events (omitting idle events), and calculates the relative percentage:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Call Time

CPU time 559 88.80
log file parallel write 2,181 28 4.42
SQL*Net more data from client 516,611 27 4.24
db file parallel write 13,383 13 2.04
db file sequential read 563 2 .27

Look at the number of waits for these events, and the average wait time. For example, for I/O related events, the average time might help identify whether the I/O system is slow. The following example of this data is taken from the Wait Event section of the AWR report:
```
                                                     Avg  
                                        Total Wait   wait     Waits
```
Event Waits Timeouts Time (s) (ms) /txn

log file parallel write 2,181 0 28 13 41.2
SQL*Net more data from clie 516,611 0 27 0 9,747.4
db file parallel write 13,383 0 13 1 252.5
The top wait events identify the next places to investigate. A table of common wait events is listed in Table 10-1. It is usually a good idea to also have quick look at high-load SQL.
Examine the related data indicated by the wait events to see what other information this data provides. Determine whether this information is consistent with the wait event data. In most situations, there is enough data to begin developing a theory about the potential causes of the performance bottleneck.

5. To determine whether this theory is valid, cross-check data you have examined with other statistics available for consistency. The appropriate statistics vary depending on the problem, but usually include load profile-related data in V$SYSSTAT, operating system statistics, and so on. Perform cross-checks with other data to confirm or refute the developing theory.

So, why don't we have the proposed wait event count/time data? I hope we can nurture this to become a database profiling tool like MySQL and Oracle. This is the first step. I think it would be useful to have both sampling and precise statistics. Oracle has both, and MySQL has the latter (I don't know why MySQL doesn't provide sampling, because the Performance Schema should probably have been developed after Oracle's ASH.)

What would make us conservative about doing this? Skimming the old thread, the remaining concern is the timer overhead. As the following article suggests, some lightweight timers seem to be available. We can turn the timing off by default if they aren't light enough.

Performance Schema Timers
https://dev.mysql.com/doc/refman/8.0/en/performance-schema-timing.html

Regards
Takayuki Tsunakawa
```

参考

https://commitfest.postgresql.org/26/2332/

https://www.postgresql.org/message-id/flat/2022793411.102298.1532329482584.JavaMail.jboss@ep1ml402

https://github.com/postgrespro/pg_wait_sampling

https://github.com/pgsentinel/pgsentinel

https://rjuju.github.io/postgresql/2018/07/09/wait-events-support-for-powa.html

《PostgreSQL Oracle 兼容性之 - performance insight - AWS performance insight 理念与实现解读 - 珍藏级》

《powa4 PostreSQL Workload Analyzer - PostgreSQL监控工具、带WEB展示 - 索引推荐,等待事件分析,命中率,配置变更跟踪等》

《PostgreSQL 等待事件及等待采样统计(pg_wait_sampling)》

《PostgreSQL pg_stat_statements 开启后引入slock导致高并发下的性能问题》

文章转载自digoal，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

PostgreSQL 13 preview - wait event sample - 等待事件统计(插件,hook) - ASH - performance insight

作者

日期

标签

背景

patch 原型

精彩讨论

https://docs.oracle.com/en/database/oracle/oracle-database/18/tdppt/automatic-database-performance-monitoring.html#GUID-32E92AEC-AF1A-4602-B998-3250920CD3BE

10.2.2 Using Wait Event Statistics to Drill Down to Bottlenecks

参考

PostgreSQL 许愿链接

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

评论