Why archive_command produces a core dump during a fast shutdown of PG

Original wim1234 2023-02-27

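A tester filed a bug, BUG2020091100436.

In short: when archive_command invokes an external executable, a core dump can be produced during a fast shutdown of PG.

After investigation, the conclusion is as follows:

As things stand, core dumps are unavoidable, and there is no perfect way to prevent them.

We can add SIGQUIT handling to the code we control, but there are always some grandchild processes (grandchildren) that are outside our control.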

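1. Why

A fast shutdown is meant to be fast, so the child processes involved get no chance to exit gracefully. (In PostgreSQL's own terminology, the SIGQUIT-based mode is the immediate shutdown, pg_ctl stop -m immediate, which is the mode the thread below discusses.)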

https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FD3366B%40G01JPEXMBYT05

I agree that the user's archiver program should receive the chance for graceful stop in smart or fast shutdown. But I think in immediate shutdown, all should stop immediately. That's what I expect from the word "immediate."

If the grandchildren left running don't disturb the cleanup of PostgreSQL's resources (shared memory, file/directory access, etc.) or restart of PostgreSQL, we may well be able to just advice the grandchildren to stop immediately with SIGINT/SIGTERM. However, for example, in the failover of shared-disk HA clustering, when the clustering software stops PostgreSQL with "pg_ctl stop -m immediate" and then tries to unmount the file systems for $PGDATA and archived WAL, the unmount may take time or fail due to the access from PostgreSQL's grandchildren.

2. Verification

Compile the following simple code into an executable.

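#include <signal.h>
#include <stdlib.h>   /* system() */

/* Empty handler: this process catches SIGQUIT instead of dying from it. */
static void b(int no) { (void) no; }

int main(int argc, char *argv[])
{
    (void) argc; (void) argv;
    signal(SIGQUIT, b);
    /* system() runs "sh -c ..."; that shell is a grandchild of PG. */
    system("sleep 3;date");
    return 0;
}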

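Configure this executable as PG's archive_command.

Following the PG fast shutdown steps from the bug report, a core dump was produced every time.

Running file <core.xxx> shows that every core file was produced by [sh -c sleep 3;date].

If we remove the system() line from the source, recompile, and repeat the same PG fast shutdown steps, no core file is produced even after a number of retries.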
