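A tester filed a bug, BUG2020091100436.
In short: when archive_command runs an external executable,
an immediate shutdown of PG (the SIGQUIT-based mode, pg_ctl stop -m immediate) can produce a core dump.
After investigation, the conclusion is as follows:
Under the current design a core dump cannot be ruled out,
and there is no complete fix that avoids it.
We can add SIGQUIT handling to the code we control,
but there are always some grandchild processes (grandchildren) outside our control.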
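The reason a SIGQUIT handler in our own code does not protect the processes it launches can be shown in isolation. The following is a minimal sketch, not PostgreSQL code: the names on_quit and quit_after_exec are hypothetical, and sleep stands in for whatever the archive program launches. It relies on the POSIX rule that a caught signal is reset to its default action across exec, and uses a close-on-exec pipe only to make sure the signal is sent after exec has happened.

```c
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler, standing in for "SIGQUIT handling in code we control". */
static void on_quit(int signo) { (void) signo; }

/*
 * Installs a SIGQUIT handler, execs "sleep 3" in a child, sends SIGQUIT to
 * that child, and returns the signal number that terminated it
 * (0 if it exited normally, -1 on a setup error).
 */
int quit_after_exec(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    signal(SIGQUIT, on_quit);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(fds[0]);
        /* The write end closes automatically once exec succeeds. */
        fcntl(fds[1], F_SETFD, FD_CLOEXEC);
        execlp("sleep", "sleep", "3", (char *) NULL);
        _exit(127);                 /* only reached if exec failed */
    }

    /* Wait for EOF on the pipe: the child has passed exec (or exited). */
    close(fds[1]);
    char c;
    while (read(fds[0], &c, 1) > 0) {
    }
    close(fds[0]);

    kill(pid, SIGQUIT);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

int main(void)
{
    int sig = quit_after_exec();
    printf("child terminated by signal %d (SIGQUIT is %d)\n", sig, SIGQUIT);
    return sig == SIGQUIT ? 0 : 1;
}
```

The parent keeps its handler, but the exec'ed child dies from SIGQUIT anyway. Whether a core file is actually written then depends on the core size limit (ulimit -c) in effect.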
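1. Why
Because an immediate shutdown is meant to be immediate:
it does not give the child processes involved a chance to exit gracefully.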
https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FD3366B%40G01JPEXMBYT05
I agree that the user's archiver program should receive the chance for graceful stop in smart or fast shutdown. But I think in immediate shutdown, all should stop immediately. That's what I expect from the word "immediate."
If the grandchildren left running don't disturb the cleanup of PostgreSQL's resources (shared memory, file/directory access, etc.) or restart of PostgreSQL, we may well be able to just advice the grandchildren to stop immediately with SIGINT/SIGTERM. However, for example, in the failover of shared-disk HA clustering, when the clustering software stops PostgreSQL with "pg_ctl stop -m immediate" and then tries to unmount the file systems for $PGDATA and archived WAL, the unmount may take time or fail due to the access from PostgreSQL's grandchildren.
2. Verification
Compile the following simple code into an executable.
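    #include <signal.h>
    #include <stdlib.h>   /* system() */

    /* Empty SIGQUIT handler: this program itself survives the signal. */
    static void b(int no) { (void) no; }

    int main(void)
    {
        signal(SIGQUIT, b);
        /* system() runs "sh -c ..." as a child; across exec the handler
           above is reset to the default action (terminate and dump core). */
        system("sleep 3;date");
        return 0;
    }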
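Configure this executable as PG's archive_command.
Run the immediate-shutdown steps from the bug report;
a core dump is produced every time.
Running file <core.xxx> shows that
every core file was produced by [sh -c sleep 3;date].
If we remove the system() call from the source,
recompile, and repeat the same immediate-shutdown steps,
no core file is produced across several retries.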




