暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PostgreSQL shared buffer 的刷脏配置 - bgwriter_delay bgwriter_flush_after bgwriter_lru_maxpages bgwriter_lru_multiplier

digoal 2020-11-07
1135

作者

digoal

日期

2020-11-07

标签

PostgreSQL , bgwriter


背景

https://www.postgresql.org/docs/13/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-BACKGROUND-WRITER

PostgreSQL的shared buffer刷脏行为可以由checkpointer, server process, bgwriter发起, 但是bgwriter是最频繁的.

为什么需要bgwriter? 因为刷脏涉及后端存储的IO操作, 这部分操作比cpu cache, 内存都慢, 如果query大量时间都在等待后端存储IO返回, 显然会导致query的rt升高, 影响性能.

bgwriter的配置参数

bgwriter_delay configuration parameter, Background Writer bgwriter_flush_after configuration parameter, Background Writer bgwriter_lru_maxpages configuration parameter, Background Writer bgwriter_lru_multiplier configuration parameter, Background Writer

bgwriter 负责周期性的将shared buffer中的dirty page刷出shared buffer. 注意bgwriter使用的是buffer io(所以不涉及写盘), 所以实际上什么时候刷到后端存储设备取决于OS的调度策略.

但是为了防止无限制的等OS的调度, bgwriter也支持bgwriter_flush_after参数, 当刷出shared buffer的page超过bgwriter_flush_after指定的个数后, 会强制调用OS将page cache的脏页刷到后端存储设备.

OS层page cache刷脏(写盘)配置参数

```
sysctl -a|grep dirty

vm.dirty_background_bytes = 409600000 # 类似postgresql的bgwriter, 由后台进程而不是用户进程刷
vm.dirty_background_ratio = 0
vm.dirty_bytes = 0 # 类似postgresql 的 server process刷脏, 用户进程参与, 所以会导致用户进程的RT升高
vm.dirty_ratio = 95
vm.dirty_writeback_centisecs = 100 # 后台进程的调度间隔
vm.dirty_expire_centisecs = 3000 # 在page cache中存活时间超过这个值的脏页才会被刷盘
```

bgwriter参与刷脏的好处是避免server process, checkpoint的刷脏, 增加用户等待. 但是也可能引入问题, 例如一个checkpoint周期内多次变脏的page, 可能会产生多次的后端IO.

There is a separate server process called the background writer, whose function is to issue writes of “dirty” (new or modified) shared buffers. It writes shared buffers so server processes handling user queries seldom or never need to wait for a write to occur. However, the background writer does cause a net overall increase in I/O load, because while a repeatedly-dirtied page might otherwise be written only once per checkpoint interval, the background writer might write it several times as it is dirtied in the same interval. The parameters discussed in this subsection can be used to tune the behavior for local needs.

参数控制

bgwriter每次工作的间隔, 例如工作一次后, 休息10ms, 避免长时间产生IO, 影响业务.
bgwriter_delay (integer)

Specifies the delay between activity rounds for the background writer. In each round the writer issues writes for some number of dirty buffers (controllable by the following parameters). It then sleeps for the length of bgwriter_delay, and repeats. When there are no dirty buffers in the buffer pool, though, it goes into a longer sleep regardless of bgwriter_delay. If this value is specified without units, it is taken as milliseconds. The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting bgwriter_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.

一个bgwriter工作周期内, 最多刷出多少个dirty page
bgwriter_lru_maxpages (integer)

In each round, no more than this many buffers will be written by the background writer. Setting this to zero disables background writing. (Note that checkpoints, which are managed by a separate, dedicated auxiliary process, are unaffected.) The default value is 100 buffers. This parameter can only be set in the postgresql.conf file or on the server command line.

一个bgwriter工作周期内, 应该刷出多少个dirty page?
让shared buffer中保持有至少多少个clean, reusable buffer pages = 上一个(或多个周期的平均数)周期, 用户申请了多少个new buffer page * bgwriter_lru_multiplier
例如上一个(或多个周期的平均数)周期, 用户申请了100个new buffer page.
bgwriter_lru_multiplier=2.0
那么要求shared buffer中保持有至少100*2.0 也就是200个clean, reusable buffer pages.
如果shared buffer中保持已经有大于或等于200个clean, reusable buffer pages, 那么bgwriter这个周期内就不需要刷脏.
如果要满足shared buffer中保持有至少100*2.0 也就是200个clean, reusable buffer pages. bgwriter需要刷出90个dirty page, 但是bgwriter_lru_maxpages设置的值是50, 那么这个周期bgwriter也只能刷50个dirty page.

bgwriter_lru_multiplier (floating point)

The number of dirty buffers written in each round is based on the number of new buffers that have been needed by server processes during recent rounds. The average recent need is multiplied by bgwriter_lru_multiplier to arrive at an estimate of the number of buffers that will be needed during the next round. Dirty buffers are written until there are that many clean, reusable buffers available. (However, no more than bgwriter_lru_maxpages buffers will be written per round.) Thus, a setting of 1.0 represents a “just in time” policy of writing exactly the number of buffers predicted to be needed. Larger values provide some cushion against spikes in demand, while smaller values intentionally leave writes to be done by server processes. The default is 2.0. This parameter can only be set in the postgresql.conf file or on the server command line.

前面说了bgwriter是buffer io, 脏页其实还在os dirty page中, 所以bgwrite累计刷出的dirty page超过bgwriter_flush_after时, 强制触发OS去刷os dirty page(写盘).
为什么要这么做呢?
因为OS层可能要累积到一个较大值才会去写盘(前面sysctl介绍了), 此时可能导致较大的写盘IO动作, 从而影响|争抢用户的IO(如从磁盘中读数据, 写WAL日志. 毕竟磁盘的IO带宽和IOPS能力都是有限的).
好处:
bgwriter时不时的触发OS写dirty page, 目的是让OS层别累积太多dirty page才去刷脏. 避免大的IO写盘动作影响|争抢用户的IO.

但是实际上OS也有自己的IO调度策略, 前面SYSCTL也介绍了, 可以设置dirty_background_bytes为一个较小值, 避免大IO.

另外bgwriter频繁让OS写盘, 也会带来一些弊端, 例如:
- 一个shared buffer中的page在os层刷脏的一个窗口期内如果有多次变脏并且被bgwriter多次写出,并且被bgwriter触发fsync时, 会导致同样物理存储区间的重复磁盘IO.
- 相邻的page在一个窗口期内被写出时, 由于bgwrite频繁触发fsync, 也会导致无法合并落盘, 实际上也是增加了磁盘IO。

bgwriter_flush_after (integer)

Whenever more than this amount of data has been written by the background writer, attempt to force the OS to issue these writes to the underlying storage. Doing so will limit the amount of dirty data in the kernel's page cache, reducing the likelihood of stalls when an fsync is issued at the end of a checkpoint, or when the OS writes data back in larger batches in the background. Often that will result in greatly reduced transaction latency, but there also are some cases, especially with workloads that are bigger than shared_buffers, but smaller than the OS's page cache, where performance might degrade. This setting may have no effect on some platforms. If this value is specified without units, it is taken as blocks, that is BLCKSZ bytes, typically 8kB. The valid range is between 0, which disables forced writeback, and 2MB. The default is 512kB on Linux, 0 elsewhere. (If BLCKSZ is not 8kB, the default and maximum values scale proportionally to it.) This parameter can only be set in the postgresql.conf file or on the server command line.

Smaller values of bgwriter_lru_maxpages and bgwriter_lru_multiplier reduce the extra I/O load caused by the background writer, but make it more likely that server processes will have to issue writes for themselves, delaying interactive queries.

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论