
Redis Sentinel Cluster: Analyzing a Failure Where the Master Goes Down and Automatic Failover Does Not Happen

Original · 陈举超 · 2025-11-18

Symptoms:

Master:192.168.1.101 6377
Slave0:192.168.1.102 6377
Slave1:192.168.1.103 6377

In this Redis Sentinel cluster, after the Master was shut down, neither of the two Slaves was automatically promoted to Master.

The sentinel log shows:

Next failover delay: I will not start a failover before ...

The detailed log is as follows:

26960:X 30 Oct 2025 21:59:13.162 # Next failover delay: I will not start a failover before Thu Oct 30 22:05:12 2025
26960:X 30 Oct 2025 22:01:37.514 * +reboot master cjc 192.168.1.101 6377
26960:X 30 Oct 2025 22:01:37.569 # -sdown master cjc 192.168.1.101 6377
26960:X 30 Oct 2025 22:01:37.569 # -odown master cjc 192.168.1.101 6377
26960:X 30 Oct 2025 22:04:59.399 # +sdown master cjc 192.168.1.101 6377
26960:X 30 Oct 2025 22:05:00.548 # +odown master cjc 192.168.1.101 6377 #quorum 3/2
26960:X 30 Oct 2025 22:05:12.271 # +new-epoch 7
26960:X 30 Oct 2025 22:05:12.273 # +vote-for-leader 6e925984391e935ba02b7c03b26fa5a15c77988e 7
26960:X 30 Oct 2025 22:05:12.312 # Next failover delay: I will not start a failover before Thu Oct 30 22:11:12 2025
26960:X 30 Oct 2025 22:06:38.098 * +reboot master cjc 192.168.1.101 6377
26960:X 30 Oct 2025 22:06:38.199 # -sdown master cjc 192.168.1.101 6377
26960:X 30 Oct 2025 22:06:38.199 # -odown master cjc 192.168.1.101 6377
26960:X 30 Oct 2025 22:10:35.921 # +sdown master cjc 192.168.1.101 6377
26960:X 30 Oct 2025 22:10:36.012 # +odown master cjc 192.168.1.101 6377 #quorum 2/2
......
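
The pattern in the log above (repeated +odown and "Next failover delay" entries, with no promotion ever recorded) can be detected mechanically. Below is a minimal sketch, assuming the Sentinel log format shown above and assuming that a completed failover would be logged as a `+switch-master` event; the function names are hypothetical helpers, not part of Redis.

```python
import re

# Sentinel event lines look like: "<pid>:X <date> <time> # +odown master cjc ..."
EVENT_RE = re.compile(r"^\d+:X .*? [#*] ([+-][\w-]+)")


def summarize_failover_attempts(log_text):
    """Count the events that matter when a failover appears to be stuck."""
    counts = {"odown": 0, "delayed": 0, "switched": 0}
    for line in log_text.splitlines():
        if "Next failover delay" in line:
            counts["delayed"] += 1
            continue
        match = EVENT_RE.match(line)
        if not match:
            continue
        event = match.group(1)
        if event == "+odown":
            counts["odown"] += 1
        elif event == "+switch-master":
            counts["switched"] += 1
    return counts


def looks_stalled(counts):
    # The master was objectively down at least once, yet no promotion was logged.
    return counts["odown"] > 0 and counts["switched"] == 0
```

Feeding the sentinel log into `summarize_failover_attempts` and passing the result to `looks_stalled` gives a quick yes/no signal before digging into the node configuration.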

Analysis:

1. Check the priority parameter on both slaves

In an earlier incident, someone had manually changed the priority of one slave to 0 online:

config set slave-priority 0

A manual failover was then executed:

sentinel failover cjc

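Afterwards, slave-priority in the redis.conf of the slave whose parameter had been changed was also set to 0, because the failover runs CONFIG REWRITE, which persists the in-memory value to the file.
If every slave has a priority of 0, none of them can be automatically promoted to master.
This case, however, had nothing to do with slave priority: both slaves had slave-priority set to 100.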

config get slave-priority
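
The same check can be done offline against the configuration files. This is a minimal sketch, assuming the priority is set through a `slave-priority` or `replica-priority` line in redis.conf, that a later line overrides an earlier one, and that the default is 100 when the directive is absent; `replica_priority` and `promotable` are hypothetical helper names.

```python
def replica_priority(conf_text, default=100):
    """Return the effective replica priority found in redis.conf text."""
    priority = default
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        # Both the old and the new directive name are handled.
        if parts[0].lower() in ("slave-priority", "replica-priority") and len(parts) == 2:
            priority = int(parts[1])  # assumption: the last occurrence wins
    return priority


def promotable(priorities):
    """Names of replicas that may be promoted (priority 0 is never promoted)."""
    return [name for name, prio in priorities.items() if prio != 0]
```

If `promotable` returns an empty list for the whole replica set, automatic failover has no candidate to pick, which is the situation described above.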

2. Check the redis.conf configuration file on both slaves

A suspicious parameter was found:

rename-command CONFIG cjc123456abcdefgqaz

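The original CONFIG command had been renamed. During automatic failover, Sentinel sends each node REPLICAOF (or SLAVEOF) to point it at the new master, followed by CONFIG REWRITE to persist the change to that node's redis.conf; Sentinel updates its own sentinel.conf separately. Because the slaves no longer recognized the CONFIG command, this step failed and the automatic failover never completed.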
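
A quick way to spot this class of problem is to scan redis.conf for `rename-command` directives that touch commands Sentinel relies on. The sketch below is an illustration under stated assumptions: the `SENTINEL_NEEDS` set is my assumption about the commands involved in the reconfiguration step (the node log later in this article shows the request arriving as a transaction, `cmd=exec multi=4`), and the helper names are hypothetical.

```python
# Assumption: commands Sentinel sends to a data node while reconfiguring it.
SENTINEL_NEEDS = {"MULTI", "EXEC", "SLAVEOF", "REPLICAOF", "CONFIG", "CLIENT"}


def renamed_commands(conf_text):
    """Map ORIGINAL_NAME -> new name for every rename-command directive."""
    renames = {}
    for raw in conf_text.splitlines():
        parts = raw.strip().split()
        # Commented-out directives do not match because their first token differs.
        if len(parts) == 3 and parts[0].lower() == "rename-command":
            renames[parts[1].upper()] = parts[2].strip('"')
    return renames


def failover_blockers(conf_text):
    """Renamed commands that a Sentinel unaware of the rename cannot call."""
    return sorted(set(renamed_commands(conf_text)) & SENTINEL_NEEDS)
```

Running `failover_blockers` on the redis.conf of every node before a planned failover test would have flagged the renamed CONFIG command in this case.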

Solution:

Comment out the rename-command CONFIG line and restart the slaves.
After stopping the master again, automatic failover worked and 192.168.1.102:6377 was promoted to the new Master.
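
If the rename has to stay for security reasons, a possible alternative (not tested in this case) is to tell Sentinel about the new name. Redis Sentinel documents a `sentinel rename-command <master-name> <command> <new-name>` directive for sentinel.conf; check that your Sentinel version supports it before relying on it. The sketch below only generates the directive lines from a mapping of renamed commands; `sentinel_rename_lines` is a hypothetical helper.

```python
def sentinel_rename_lines(master_name, renames):
    """Build 'sentinel rename-command' directives for sentinel.conf.

    renames maps the original command name to the name configured with
    rename-command on the data nodes.
    """
    return [
        f"sentinel rename-command {master_name} {command} {new_name}"
        for command, new_name in sorted(renames.items())
    ]


if __name__ == "__main__":
    # Values taken from this article: master name "cjc" and the renamed CONFIG.
    for line in sentinel_rename_lines("cjc", {"CONFIG": "cjc123456abcdefgqaz"}):
        print(line)
```

The generated lines would go into every Sentinel's sentinel.conf after the corresponding `sentinel monitor` line, and the same rename must be configured identically on all data nodes.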
Log on the new master:

10489:S 30 Oct 2025 22:17:11.452 * MASTER <-> REPLICA sync started
10489:S 30 Oct 2025 22:17:11.452 # Error condition on socket for SYNC: Connection refused
10489:S 30 Oct 2025 22:17:12.457 * Connecting to MASTER 192.168.1.101:6377
10489:S 30 Oct 2025 22:17:12.457 * MASTER <-> REPLICA sync started
10489:S 30 Oct 2025 22:17:12.457 # Error condition on socket for SYNC: Connection refused
10489:M 30 Oct 2025 22:17:13.391 * Discarding previously cached master state.
10489:M 30 Oct 2025 22:17:13.391 # Setting secondary replication ID to aaaxxxsssfffgghhjjj, valid up to offset: 10382. New replication ID is ggeessllaajjsdsdggg
10489:M 30 Oct 2025 22:17:13.391 * MASTER MODE enabled (user request from 'id=7 addr=192.168.1.101:63198 laddr=192.168.1.102:6377 fd=11 name=sentinel-342a55c0-cmd age=46 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=188 qbuf-free=40766 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
10489:M 30 Oct 2025 22:17:13.394 # CONFIG REWRITE executed with success.
10489:M 30 Oct 2025 22:17:13.617 * Replica 192.168.1.103:6377 asks for synchronization
10489:M 30 Oct 2025 22:17:13.617 * Partial resynchronization request from 192.168.1.103:6377 accepted. Sending 161 bytes of backlog starting from offset 10382.
10489:M 30 Oct 2025 22:21:27.050 * 10 changes in 300 seconds. Saving...
10489:M 30 Oct 2025 22:21:27.050 * Background saving started by pid 10782
10782:C 30 Oct 2025 22:21:27.054 * DB saved on disk

slave1 now points to the new master:

25637:S 30 Oct 2025 22:17:12.620 * MASTER <-> REPLICA sync started
25637:S 30 Oct 2025 22:17:12.620 # Error condition on socket for SYNC: Connection refused
25637:S 30 Oct 2025 22:17:13.619 * Connecting to MASTER 192.168.1.102:6377
25637:S 30 Oct 2025 22:17:13.619 * MASTER <-> REPLICA sync started
25637:S 30 Oct 2025 22:17:13.619 * REPLICAOF 192.168.1.102:6377 enabled (user request from 'id=6 addr=192.168.1.101:47275 laddr=192.168.1.103:6377 fd=10 name=sentinel-342a55c0-cmd age=49 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=338 qbuf-free=40616 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
25637:S 30 Oct 2025 22:17:13.622 # CONFIG REWRITE executed with success.
25637:S 30 Oct 2025 22:17:13.622 * Non blocking connect for SYNC fired the event.
25637:S 30 Oct 2025 22:17:13.622 * Master replied to PING, replication can continue...
25637:S 30 Oct 2025 22:17:13.623 * Trying a partial resynchronization (request aaaxxxsssfffgghhjjj:10382).
25637:S 30 Oct 2025 22:17:13.623 * Successful partial resynchronization with master.
25637:S 30 Oct 2025 22:17:13.623 # Master replication ID changed to ggeessllaajjsdsdggg
25637:S 30 Oct 2025 22:17:13.623 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
25637:S 30 Oct 2025 22:21:25.084 * 10 changes in 300 seconds. Saving...
25637:S 30 Oct 2025 22:21:25.084 * Background saving started by pid 26173
26173:C 30 Oct 2025 22:21:25.087 * DB saved on disk
26173:C 30 Oct 2025 22:21:25.088 * RDB: 0 MB of memory used by copy-on-write
25637:S 30 Oct 2025 22:21:25.185 * Background saving terminated with success
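
The two logs above share the signature of a successful reconfiguration: a role change requested by a client named `sentinel-...`, followed by "CONFIG REWRITE executed with success". A minimal sketch of that check, assuming the log wording shown above (`failover_applied` is a hypothetical helper):

```python
def failover_applied(node_log):
    """True if the log shows a Sentinel-issued role change that was persisted."""
    role_changed = False
    for line in node_log.splitlines():
        from_sentinel = "name=sentinel-" in line
        is_role_change = "MASTER MODE enabled" in line or (
            "REPLICAOF" in line and "enabled" in line
        )
        if from_sentinel and is_role_change:
            role_changed = True
        elif role_changed and "CONFIG REWRITE executed with success" in line:
            return True
    return False
```

When CONFIG is renamed and Sentinel does not know the new name, neither of the two lines should appear in the node log, so this function would return False.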

