interested in the feature but they want to know the performance cost. Our answer so far has
been "we don't really know," and if an internet search is any indicator, no one really knows.
There is another challenge with this feature: for PostgreSQL versions lower than 12, it can only
be enabled when the cluster is first created. And if you create a cluster with this option enabled,
there is no way to disable it. PostgreSQL version 12 introduces a utility program pg_checksums
,
which enables or disables data checksums in a PostgreSQL cluster, but the cluster must be
offline while the utility runs. Clients want to understand the performance implications of this
feature before committing a new cluster to it, or before undertaking a migration to a new cluster
with the feature enabled.
Take note: Amazon enables data checksums on all
RDS PostgreSQL clusters, and on their
platform it is not
a configurable option. I intend to leave you with their same clarity and
confidence about this feature.
To find the answer about performance, my first instinct was to run benchmarks and get the
answer empirically. I did run benchmarks and I did get an answer. However, this was a really
tricky benchmark to set up in a meaningful way, and there are several different benchmarks
needed to understand this issue.
I took a step back.
PostgreSQL is complex software, but checksumming data is really simple. It's so simple that
you can checksum a big file and measure the time it takes, and then you can repeat the task
and be assured that it will take nearly the same amount of time. The contents of the file do not
affect the time taken, only the size of the data. This means that ultimately we could know the
CPU cost of the data checksums feature if we could know how much data would be
checksummed; this information is easy to obtain. It's more complicated than that though,
because this cost is directly linked to disk IO.
In order to translate this basic metric "total CPU time" of checksumming some data into
something of real-world value, we need to relate it to some other factors.
Let's explore:
How much data is checksummed?
PostgreSQL only checksums data as it is read or written to disk, so we want to know
specifically: how much is that? Remember this only matters for data
pages -- table data and
indexes -- not WAL files, which are not checksummed. This figure varies dramatically depending
on the system. Commonly, systems with well-sized hardware and correct configuration, 90%+
(even 99%) of pages are read from cache, not disk. Pages read/written to cache are not
checksummed. Even with caching and WAL logging, all data is eventually written to disk. Some
评论