原文地址:
https://dataegret.com/2025/02/pgbackrest-preview-verify-recovery-target-timeline-on-restore/
FATAL: requested timeline 2 is not a child of this server's historyDETAIL: Latest checkpoint is at 0/7000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/600AA20.
虽然理解PostgreSQL 时间线的重要性至关重要,但对于不熟悉它的人来说,这个主题可能会令人困惑。
https://dataegret.com/2024/03/the-importance-of-postgresql-timelines/
在这篇文章中,我们将探讨为什么会发生这种情况,您可以采取什么措施,以及如何更早地发现它。
在这篇文章中,我们将探讨为什么会发生这种情况,您可以采取什么措施,以及如何更早地发现它。
让我们从一个简单的场景开始:完整备份,然后生成一些 WAL 档案。
$ pgbackrest infostanza: ro9pgstatus: okcipher: aes-256-cbcdb (current)wal archive min/max (17): 00000001000000000000000E/000000010000000000000031full backup: 20250206-102246Ftimestamp start/stop: 2025-02-06 10:22:46+00 / 2025-02-06 10:22:49+00wal start/stop: 00000001000000000000000E / 00000001000000000000000Edatabase size: 22.2MB, database backup size: 22.2MBrepo1: backup size: 2.7MB
$ pgbackrest info...wal archive min/max (17): 000000010000000000000003/00000002000000000000003A
$ pgbackrest repo-ls archive/ro9pg/17-1000000010000000000000002.history0000000200000000$ pgbackrest repo-ls archive/ro9pg/17-1/0000000100000000...000000010000000000000031-f35618b3a1176b39059b5aac8fd17f8ecd63d3be.zst000000010000000000000032.partial-826c8a36ae1a5d4a624720382ece08f2b9f4bf29.zst$ pgbackrest repo-ls archive/ro9pg/17-1/0000000200000000000000020000000000000032-e22f694a0bd949cda0be235f09b5620351eb3f4c.zst000000020000000000000033-81d92d49082a857880affca6a084f0e6d9d3bd26.zst...$ pgbackrest repo-get archive/ro9pg/17-1/00000002.history1 0/32899B98 no recovery target specified
已生成新的时间线,并且在此时间线上创建的 WAL 档案的00000002
文件名中将包含 。从00000002.history
文件中,我们可以看到新的时间线从第一个时间线分叉于0/32899B98
。
由于这个时间线是意外创建的,所以没有人注意到它。与此同时,每日备份 cron 照常运行,生成新的完整备份。
$ pgbackrest infostanza: ro9pgstatus: okcipher: aes-256-cbcdb (current)wal archive min/max (17): 00000001000000000000000E/00000002000000000000003Bfull backup: 20250206-102246Ftimestamp start/stop: 2025-02-06 10:22:46+00 / 2025-02-06 10:22:49+00wal start/stop: 00000001000000000000000E / 00000001000000000000000Edatabase size: 22.2MB, database backup size: 22.2MBrepo1: backup size: 2.7MBfull backup: 20250206-104949Ftimestamp start/stop: 2025-02-06 10:49:49+00 / 2025-02-06 10:49:52+00wal start/stop: 000000010000000000000033 / 000000010000000000000033database size: 777.3MB, database backup size: 777.3MBrepo1: backup size: 36.2MB
我们可以看到,新的完整备份 ( wal start/stop: 000000010000000000000033 000000010000000000000033
) 的时间线比存储库中的内容 ( wal archive min/max (17): 00000001000000000000000E 00000002000000000000003B
) 更早。现在,让我们尝试恢复这个新的完整备份,看看会发生什么。
LOG: starting backup recovery with redo LSN 0/36000028, checkpoint LSN 0/36000080, on timeline ID 1LOG: restored log file "00000002.history" from archiveLOG: restored log file "000000020000000000000036" from archiveLOG: invalid resource manager ID 32 at 0/36000080LOG: invalid checkpoint recordFATAL: could not locate required checkpoint record at 0/36000080HINT: If you are restoring from a backup, touch "recovery.signal" or "standby.signal" and add required recovery options.If you are not restoring from a backup, try removing the file "backup_label".Be careful: removing "backup_label" will result in a corrupt cluster if restoring from a backup.
此错误消息听起来比我们预期的更令人担忧。自 PostgreSQL 12 以来,遵循最新时间线一直是默认行为。这解释了为什么 PostgreSQL 无法在 WAL 中找到检查点 - 它正在第二个时间线而不是第一个时间线中搜索它。为了解决这个问题,我们可以清理存储库中与第二个时间线相关的所有内容,或者使用pgBackRest--target timeline=current restore选项将recovery_target_timeline
设置为current。
LOG: starting backup recovery with redo LSN 0/36000028, checkpoint LSN 0/36000080, on timeline ID 1LOG: restored log file "000000010000000000000036" from archiveLOG: starting archive recoveryLOG: redo starts at 0/36000028LOG: completed backup recovery with redo LSN 0/36000028 and end LSN 0/36000158LOG: consistent recovery state reached at 0/36000158...LOG: selected new timeline ID: 3LOG: archive recovery completeLOG: database system is ready to accept connections
下一个 pgBackRest 版本将包含与此相关的新功能:验证恢复目标时间线,该功能在恢复过程中检查恢复目标时间线。这通过在恢复开始之前检测问题并提供有关潜在问题的更多信息来节省时间。让我们看看此功能如何处理我们之前的示例。
回到开始:完整备份,然后生成一些 WAL 档案。
$ pgbackrest infostanza: ro9pgstatus: okcipher: aes-256-cbcdb (current)wal archive min/max (17): 000000010000000000000074/000000010000000000000098full backup: 20250206-120502Ftimestamp start/stop: 2025-02-06 12:05:02+00 / 2025-02-06 12:05:06+00wal start/stop: 000000010000000000000074 / 000000010000000000000074database size: 777.3MB, database backup size: 777.3MBrepo1: backup size: 36.2MB
现在,有人意外地提升了备用状态或执行了测试恢复以验证备份,但忘记在临时集群上禁用 WAL 归档,甚至随后进行了差异备份。
$ pgbackrest infostanza: ro9pgstatus: okcipher: aes-256-cbcdb (current)wal archive min/max (17): 000000010000000000000074/0000000200000000000000A5full backup: 20250206-120502Ftimestamp start/stop: 2025-02-06 12:05:02+00 / 2025-02-06 12:05:06+00wal start/stop: 000000010000000000000074 / 000000010000000000000074database size: 777.3MB, database backup size: 777.3MBrepo1: backup size: 36.2MBdiff backup: 20250206-120502F_20250206-120917Dtimestamp start/stop: 2025-02-06 12:09:17+00 / 2025-02-06 12:09:20+00wal start/stop: 0000000200000000000000A4 / 0000000200000000000000A4database size: 179.1MB, database backup size: 152.6MBrepo1: backup size: 6.7MBbackup reference total: 1 full
$ pgbackrest repo-ls archive/ro9pg/17-1000000010000000000000002.history0000000200000000$ pgbackrest repo-ls archive/ro9pg/17-1/0000000100000000...000000010000000000000099-03270b2754c5e72013af497aa8d10b3d467abea2.zst000000010000000000000099.partial-587f49dfa38308f08bfd9e37d7e2ddf8c09744e4.zst00000001000000000000009A-6ec2014948eebb8de2ddc1451da839771897c8f0.zst$ pgbackrest repo-ls archive/ro9pg/17-1/0000000200000000000000020000000000000099-4e03a26ae3e6b002f88f4f117d5e8134c112de79.zst00000002000000000000009A-ecdf1ad475558d5a2b01257beae8423262091884.zst00000002000000000000009B-da16dc2716dbb264b21d36ae2e69255a948075ae.zst...$ pgbackrest repo-get archive/ro9pg/17-1/00000002.history1 0/99000060 no recovery target specified
一个新的时间线已经生成,在该时间线上生成的WAL档案的文件名将有00000002。从00000002.history文件内容中,我们可以看出新的时间线是从0/99000060的第一个时间线分叉而来的。
由于该时间线是意外创建的,所以没有人会注意到它,并且每日备份 cron 将照常运行,并生成一个新的完整备份。
$ pgbackrest info...full backup: 20250206-121012Ftimestamp start/stop: 2025-02-06 12:10:12+00 / 2025-02-06 12:10:15+00wal start/stop: 00000001000000000000009A / 00000001000000000000009Adatabase size: 777.3MB, database backup size: 777.3MBrepo1: backup size: 36.2MB
我们可以看到,与存储库中的内容相比,新的完整备份(wal开始/停止:000000010000000000000000009A/00000000000000000000009A)的时间线更早(wal存档最小/最大(17):00000001000000000000000074/0000000000000000000A5)。现在,让我们尝试还原这个新的完整备份,看看会发生什么。
$ pgbackrest --stanza=ro9pg restore --set=20250206-121012FERROR: [058]: target timeline 2 forked from backup timeline 1 at 0/99000060 which is before backup lsn of 0/9a000028HINT: was the target timeline created by accidentally promoting a standby?HINT: was the target timeline created by testing a restore without --archive-mode=off?HINT: was the backup made after the target timeline was created?
这次,我们在开始恢复甚至从远程存储库检索文件之前就知道存在潜在的时间线问题。再次使用--target-timeline=current
可以解决问题。
一旦启动,新的临时 PostgreSQL 集群将创建新的第三(!)时间线。
$ pgbackrest repo-ls archive/ro9pg/17-1000000010000000000000002.history000000020000000000000003.history0000000300000000$ pgbackrest repo-get archive/ro9pg/17-1/00000003.history1 0/9B000000 no recovery target specified
目前,我们的最新时间表是第三个。然而,第三个时间线并不知道第二个时间线,因为它是在不同的还原测试中创建的。那么,如果我们恢复在第二个时间线上进行的差异备份,会发生什么?PostgreSQL可能无法恢复,因为它无法定位从时间线2到时间线3的切换点。再次,将recovery_target_timeline设置为“current”将解决此问题。由于上面提到的新pgBackRest功能,我们不再需要等到还原过程完成才能发现问题。
$ pgbackrest --stanza=ro9pg restore --set=20250206-120502F_20250206-120917DERROR: [058]: backup timeline 2, lsn 0/a4000028 is not in the history of target timeline 3HINT: was the target timeline created by promoting from a timeline < latest?
最后的想法
由于PostgreSQL 12中recovery_target_timeline的默认值发生了变化,我注意到在回答pgBackRest关于恢复问题的问题时,我们越来越需要推荐--target timeline=current选项。希望尽早发现潜在的时间线问题,而不需要你检查PostgreSQL日志,对你们中的许多人来说都是非常有益的。😊




