pgBackRest 预演：验证还原时的恢复目标时间线

飞象数据 2025-04-01

210

原文地址：

https://dataegret.com/2025/02/pgbackrest-preview-verify-recovery-target-timeline-on-restore/

您是否曾在 PostgreSQL 恢复过程中遇到过时间线问题？完成恢复后才发现时间线无效，这可能令人沮丧，这通常会导致如下错误消息：

FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/7000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/600AA20.

虽然理解PostgreSQL 时间线的重要性至关重要，但对于不熟悉它的人来说，这个主题可能会令人困惑。

https://dataegret.com/2024/03/the-importance-of-postgresql-timelines/

在这篇文章中，我们将探讨为什么会发生这种情况，您可以采取什么措施，以及如何更早地发现它。

让我们从一个简单的场景开始：完整备份，然后生成一些 WAL 档案。

$ pgbackrest info
stanza: ro9pg
status: ok
cipher: aes-256-cbc


db (current)
wal archive min/max (17): 00000001000000000000000E/000000010000000000000031


full backup: 20250206-102246F
timestamp start/stop: 2025-02-06 10:22:46+00 / 2025-02-06 10:22:49+00
wal start/stop: 00000001000000000000000E / 00000001000000000000000E
database size: 22.2MB, database backup size: 22.2MB
repo1: backup size: 2.7MB

现在，有人意外地提升了备用状态或执行了测试恢复以验证备份，但忘记在临时集群上禁用 WAL 归档。

$ pgbackrest info
...
wal archive min/max (17): 000000010000000000000003/00000002000000000000003A

$ pgbackrest repo-ls archive/ro9pg/17-1
0000000100000000
00000002.history
0000000200000000


$ pgbackrest repo-ls archive/ro9pg/17-1/0000000100000000
...
000000010000000000000031-f35618b3a1176b39059b5aac8fd17f8ecd63d3be.zst
000000010000000000000032.partial-826c8a36ae1a5d4a624720382ece08f2b9f4bf29.zst


$ pgbackrest repo-ls archive/ro9pg/17-1/0000000200000000
000000020000000000000032-e22f694a0bd949cda0be235f09b5620351eb3f4c.zst
000000020000000000000033-81d92d49082a857880affca6a084f0e6d9d3bd26.zst
...


$ pgbackrest repo-get archive/ro9pg/17-1/00000002.history
1 0/32899B98 no recovery target specified

已生成新的时间线，并且在此时间线上创建的 WAL 档案的00000002
文件名中将包含。从00000002.history
文件中，我们可以看到新的时间线从第一个时间线分叉于0/32899B98
。

由于这个时间线是意外创建的，所以没有人注意到它。与此同时，每日备份 cron 照常运行，生成新的完整备份。

$ pgbackrest info
stanza: ro9pg
status: ok
cipher: aes-256-cbc


db (current)
wal archive min/max (17): 00000001000000000000000E/00000002000000000000003B


full backup: 20250206-102246F
timestamp start/stop: 2025-02-06 10:22:46+00 / 2025-02-06 10:22:49+00
wal start/stop: 00000001000000000000000E / 00000001000000000000000E
database size: 22.2MB, database backup size: 22.2MB
repo1: backup size: 2.7MB


full backup: 20250206-104949F
timestamp start/stop: 2025-02-06 10:49:49+00 / 2025-02-06 10:49:52+00
wal start/stop: 000000010000000000000033 / 000000010000000000000033
database size: 777.3MB, database backup size: 777.3MB
repo1: backup size: 36.2MB

我们可以看到，新的完整备份 ( wal start/stop: 000000010000000000000033 000000010000000000000033
) 的时间线比存储库中的内容 ( wal archive min/max (17): 00000001000000000000000E 00000002000000000000003B
) 更早。现在，让我们尝试恢复这个新的完整备份，看看会发生什么。

LOG: starting backup recovery with redo LSN 0/36000028, checkpoint LSN 0/36000080, on timeline ID 1
LOG: restored log file "00000002.history" from archive
LOG: restored log file "000000020000000000000036" from archive
LOG: invalid resource manager ID 32 at 0/36000080
LOG: invalid checkpoint record
FATAL: could not locate required checkpoint record at 0/36000080
HINT: If you are restoring from a backup, touch "recovery.signal" or "standby.signal" and add required recovery options.
If you are not restoring from a backup, try removing the file "backup_label".
Be careful: removing "backup_label" will result in a corrupt cluster if restoring from a backup.

此错误消息听起来比我们预期的更令人担忧。自 PostgreSQL 12 以来，遵循最新时间线一直是默认行为。这解释了为什么 PostgreSQL 无法在 WAL 中找到检查点 - 它正在第二个时间线而不是第一个时间线中搜索它。为了解决这个问题，我们可以清理存储库中与第二个时间线相关的所有内容，或者使用pgBackRest--target timeline=current restore选项将recovery_target_timeline
设置为current。

LOG: starting backup recovery with redo LSN 0/36000028, checkpoint LSN 0/36000080, on timeline ID 1
LOG: restored log file "000000010000000000000036" from archive
LOG: starting archive recovery
LOG: redo starts at 0/36000028
LOG: completed backup recovery with redo LSN 0/36000028 and end LSN 0/36000158
LOG: consistent recovery state reached at 0/36000158
...
LOG: selected new timeline ID: 3
LOG: archive recovery complete
LOG: database system is ready to accept connections

下一个 pgBackRest 版本将包含与此相关的新功能：验证恢复目标时间线，该功能在恢复过程中检查恢复目标时间线。这通过在恢复开始之前检测问题并提供有关潜在问题的更多信息来节省时间。让我们看看此功能如何处理我们之前的示例。

回到开始：完整备份，然后生成一些 WAL 档案。

$ pgbackrest info
stanza: ro9pg
status: ok
cipher: aes-256-cbc


db (current)
wal archive min/max (17): 000000010000000000000074/000000010000000000000098


full backup: 20250206-120502F
timestamp start/stop: 2025-02-06 12:05:02+00 / 2025-02-06 12:05:06+00
wal start/stop: 000000010000000000000074 / 000000010000000000000074
database size: 777.3MB, database backup size: 777.3MB
repo1: backup size: 36.2MB

现在，有人意外地提升了备用状态或执行了测试恢复以验证备份，但忘记在临时集群上禁用 WAL 归档，甚至随后进行了差异备份。

$ pgbackrest info
stanza: ro9pg
status: ok
cipher: aes-256-cbc


db (current)
wal archive min/max (17): 000000010000000000000074/0000000200000000000000A5


full backup: 20250206-120502F
timestamp start/stop: 2025-02-06 12:05:02+00 / 2025-02-06 12:05:06+00
wal start/stop: 000000010000000000000074 / 000000010000000000000074
database size: 777.3MB, database backup size: 777.3MB
repo1: backup size: 36.2MB


diff backup: 20250206-120502F_20250206-120917D
timestamp start/stop: 2025-02-06 12:09:17+00 / 2025-02-06 12:09:20+00
wal start/stop: 0000000200000000000000A4 / 0000000200000000000000A4
database size: 179.1MB, database backup size: 152.6MB
repo1: backup size: 6.7MB
backup reference total: 1 full

$ pgbackrest repo-ls archive/ro9pg/17-1
0000000100000000
00000002.history
0000000200000000


$ pgbackrest repo-ls archive/ro9pg/17-1/0000000100000000
...
000000010000000000000099-03270b2754c5e72013af497aa8d10b3d467abea2.zst
000000010000000000000099.partial-587f49dfa38308f08bfd9e37d7e2ddf8c09744e4.zst
00000001000000000000009A-6ec2014948eebb8de2ddc1451da839771897c8f0.zst


$ pgbackrest repo-ls archive/ro9pg/17-1/0000000200000000
000000020000000000000099-4e03a26ae3e6b002f88f4f117d5e8134c112de79.zst
00000002000000000000009A-ecdf1ad475558d5a2b01257beae8423262091884.zst
00000002000000000000009B-da16dc2716dbb264b21d36ae2e69255a948075ae.zst
...


$ pgbackrest repo-get archive/ro9pg/17-1/00000002.history
1 0/99000060 no recovery target specified

一个新的时间线已经生成，在该时间线上生成的WAL档案的文件名将有00000002。从00000002.history文件内容中，我们可以看出新的时间线是从0/99000060的第一个时间线分叉而来的。

由于该时间线是意外创建的，所以没有人会注意到它，并且每日备份 cron 将照常运行，并生成一个新的完整备份。

$ pgbackrest info
...
full backup: 20250206-121012F
timestamp start/stop: 2025-02-06 12:10:12+00 / 2025-02-06 12:10:15+00
wal start/stop: 00000001000000000000009A / 00000001000000000000009A
database size: 777.3MB, database backup size: 777.3MB
repo1: backup size: 36.2MB

我们可以看到，与存储库中的内容相比，新的完整备份（wal开始/停止：000000010000000000000000009A/00000000000000000000009A）的时间线更早（wal存档最小/最大（17）：00000001000000000000000074/0000000000000000000A5）。现在，让我们尝试还原这个新的完整备份，看看会发生什么。

$ pgbackrest --stanza=ro9pg restore --set=20250206-121012F
ERROR: [058]: target timeline 2 forked from backup timeline 1 at 0/99000060 which is before backup lsn of 0/9a000028
HINT: was the target timeline created by accidentally promoting a standby?
HINT: was the target timeline created by testing a restore without --archive-mode=off?
HINT: was the backup made after the target timeline was created?

这次，我们在开始恢复甚至从远程存储库检索文件之前就知道存在潜在的时间线问题。再次使用--target-timeline=current
可以解决问题。

一旦启动，新的临时 PostgreSQL 集群将创建新的第三(!)时间线。

$ pgbackrest repo-ls archive/ro9pg/17-1
0000000100000000
00000002.history
0000000200000000
00000003.history
0000000300000000


$ pgbackrest repo-get archive/ro9pg/17-1/00000003.history
1 0/9B000000 no recovery target specified

目前，我们的最新时间表是第三个。然而，第三个时间线并不知道第二个时间线，因为它是在不同的还原测试中创建的。那么，如果我们恢复在第二个时间线上进行的差异备份，会发生什么？PostgreSQL可能无法恢复，因为它无法定位从时间线2到时间线3的切换点。再次，将recovery_target_timeline设置为“current”将解决此问题。由于上面提到的新pgBackRest功能，我们不再需要等到还原过程完成才能发现问题。

$ pgbackrest --stanza=ro9pg restore --set=20250206-120502F_20250206-120917D
ERROR: [058]: backup timeline 2, lsn 0/a4000028 is not in the history of target timeline 3
HINT: was the target timeline created by promoting from a timeline < latest?

最后的想法

由于PostgreSQL 12中recovery_target_timeline的默认值发生了变化，我注意到在回答pgBackRest关于恢复问题的问题时，我们越来越需要推荐--target timeline=current选项。希望尽早发现潜在的时间线问题，而不需要你检查PostgreSQL日志，对你们中的许多人来说都是非常有益的。😊

文章转载自飞象数据，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。

pgBackRest 预演：验证还原时的恢复目标时间线

最后的想法

评论