
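Environment:
Redis instances:
192.168.126.128:6379 master
192.168.126.128:6380 replica
192.168.126.128:6381 replica
Redis Sentinels:
192.168.126.128:26379
192.168.126.128:26380
192.168.126.128:26381
Redis versions: 4.x, 5.x, and 6.x were tested, and all of them show similar behavior.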
Symptom 1:
In a Redis Sentinel deployment, after the master was shut down, no automatic failover took place. The Sentinel log shows:
tail -100f sentinel_26379.log
......
9569:X 08 Jul 2023 18:57:09.888 # +sdown master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:09.956 # +odown master mymaster 192.168.126.128 6379 #quorum 2/2
9569:X 08 Jul 2023 18:57:09.956 # +new-epoch 1
9569:X 08 Jul 2023 18:57:09.956 # +try-failover master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:09.966 # +vote-for-leader d3c28e6d3315e1b63cefffc09de23131f67c9309 1
9569:X 08 Jul 2023 18:57:10.268 # b98812cd6e0b3e0a3a8527244cc09bb3a0c5f251 voted for d3c28e6d3315e1b63cefffc09de23131f67c9309 1
9569:X 08 Jul 2023 18:57:10.270 # c56a629c526c80483792ecae8d80eba98796583b voted for d3c28e6d3315e1b63cefffc09de23131f67c9309 1
9569:X 08 Jul 2023 18:57:10.343 # +elected-leader master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:10.343 # +failover-state-select-slave master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:10.410 # -failover-abort-no-good-slave master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:10.463 # Next failover delay: I will not start a failover before Sat Jul 8 19:03:10 2023
The -failover-abort-no-good-slave event means that no replica was eligible for promotion, so the failover was aborted. Sentinel tries again 6 minutes later:
[redis@cjc-db-01 log]$ cat sentinel_26379.log |grep delay
9569:X 08 Jul 2023 18:57:10.463 # Next failover delay: I will not start a failover before Sat Jul 8 19:03:10 2023
9569:X 08 Jul 2023 19:03:10.337 # Next failover delay: I will not start a failover before Sat Jul 8 19:09:10 2023
9569:X 08 Jul 2023 19:09:10.515 # Next failover delay: I will not start a failover before Sat Jul 8 19:15:10 2023
9569:X 08 Jul 2023 19:15:10.697 # Next failover delay: I will not start a failover before Sat Jul 8 19:21:10 2023
9569:X 08 Jul 2023 19:21:11.081 # Next failover delay: I will not start a failover before Sat Jul 8 19:27:11 2023
9569:X 08 Jul 2023 19:27:11.709 # Next failover delay: I will not start a failover before Sat Jul 8 19:33:11 2023
......
Every retry failed, and the failover never succeeded.
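The retry pattern in the log can be checked mechanically. The following is a minimal sketch that parses a few of the log lines quoted above and measures the gap between consecutive "Next failover delay" messages. The function names are made up for this illustration. As far as I know, the 6-minute gap corresponds to twice the Sentinel failover-timeout, whose default is 180000 ms, but that link is an assumption here and is not something the log itself states.

```python
import re
from datetime import datetime

# Sample lines copied from the sentinel_26379.log excerpt above.
LOG = """\
9569:X 08 Jul 2023 18:57:10.410 # -failover-abort-no-good-slave master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:10.463 # Next failover delay: I will not start a failover before Sat Jul 8 19:03:10 2023
9569:X 08 Jul 2023 19:03:10.337 # Next failover delay: I will not start a failover before Sat Jul 8 19:09:10 2023
9569:X 08 Jul 2023 19:09:10.515 # Next failover delay: I will not start a failover before Sat Jul 8 19:15:10 2023
"""

# "<pid>:X <day> <month> <year> <time> # <message>"
LINE_RE = re.compile(r"^\d+:X (\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) # (.*)$")


def parse_events(text):
    """Return (timestamp, message) pairs for every Sentinel log line that matches."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%d %b %Y %H:%M:%S.%f")
            events.append((ts, m.group(2)))
    return events


def retry_gaps_minutes(events):
    """Minutes (rounded) between consecutive 'Next failover delay' messages."""
    stamps = [ts for ts, msg in events if msg.startswith("Next failover delay")]
    return [round((b - a).total_seconds() / 60) for a, b in zip(stamps, stamps[1:])]


events = parse_events(LOG)
aborted = any(msg.startswith("-failover-abort-no-good-slave") for _, msg in events)
print(aborted, retry_gaps_minutes(events))  # True [6, 6]
```

If the gaps stay constant and every attempt ends in the same abort event, waiting longer will not help; the replicas themselves have to be examined.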
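Symptom 2:
A manual failover through the Sentinel failover command returns an error:
127.0.0.1:26379> sentinel failover mymaster
(error) NOGOODSLAVE No suitable replica to promote
Both symptoms mean the same thing: the master and the replicas cannot be switched.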
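Cause:
A repeated "Next failover delay: I will not start a failover before Sat Jul .." message usually has one of the following causes:
1. bind parameter
The bind parameter in the configuration file is missing or wrong. Unless there are special requirements, it can be configured as follows:
bind 0.0.0.0
2. auth-pass parameter
The Sentinel configuration file does not contain the Redis password, or contains the wrong one. Add the parameter, for example:
sentinel auth-pass mymaster <password>
3. protected-mode parameter
protected-mode is enabled in the Sentinel configuration file and needs to be turned off. Add:
protected-mode no
4. rename-command CONFIG setting
redis.conf contains a rename-command CONFIG directive, for example:
rename-command CONFIG ""
This disables the CONFIG command to prevent manual changes. The directive has to be commented out:
###rename-command CONFIG ""
After checking, none of these four causes applied here.
Since the failover could not find a replica to promote to master, one more possibility remained:
the slave-priority or replica-priority parameter of the replicas is set to 0, and 0 means the replica must never be promoted to master.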
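The effect of a priority of 0 can be illustrated with a simplified model of how a promotion candidate is chosen. This is only a sketch: the Replica class, the run IDs, and the offsets are invented for illustration, and the real Sentinel also applies health checks (down state, link disconnection time, and so on) that are left out. The ordering used here (lower priority first, then larger replication offset, then smaller run ID) follows the Sentinel documentation as I understand it.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    addr: str
    priority: int      # replica-priority / slave-priority
    repl_offset: int   # replication offset already processed
    run_id: str        # hypothetical run IDs, for illustration only


def pick_promotion_candidate(replicas):
    """Simplified model: drop priority 0, then prefer lower priority,
    higher replication offset, and finally the smaller run ID."""
    eligible = [r for r in replicas if r.priority != 0]
    if not eligible:
        return None  # corresponds to -failover-abort-no-good-slave
    return min(eligible, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


# The situation in this incident: every replica has priority 0.
broken = [
    Replica("192.168.126.128:6380", 0, 317423, "aaa"),
    Replica("192.168.126.128:6381", 0, 317423, "bbb"),
]
print(pick_promotion_candidate(broken))  # None

# With one replica back at 100, a candidate exists again.
fixed = [
    Replica("192.168.126.128:6380", 0, 317423, "aaa"),
    Replica("192.168.126.128:6381", 100, 317423, "bbb"),
]
print(pick_promotion_candidate(fixed).addr)  # 192.168.126.128:6381
```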
Query slave-priority/replica-priority on each instance:
127.0.0.1:6379> config get slave-priority
1) "slave-priority"
2) "0"
127.0.0.1:6381> config get slave-priority
1) "slave-priority"
2) "0"
127.0.0.1:6380> config get slave-priority
1) "slave-priority"
2) "0"
All of them are 0, so none of them may be promoted to master. The configuration files also say 0:
[redis@cjc-db-01 conf]$ cat redis_6379.conf |grep -i priority
replica-priority 0
[redis@cjc-db-01 conf]$ cat redis_6380.conf |grep -i priority
replica-priority 0
[redis@cjc-db-01 conf]$ cat redis_6381.conf |grep -i priority
replica-priority 0
This is strange: the default value of slave-priority/replica-priority is 100, and replica-priority was also set to 100 when Redis was installed.
Why did replica-priority become 0 when nobody edited redis.conf by hand?
During earlier manual switchovers I did change slave-priority inside Redis, but only as a one-off runtime change for the switchover. I never edited redis.conf:
redis-cli -p 6379
config set slave-priority 0
Does changing a parameter in Redis behave like Oracle's scope=both, taking effect in memory and in the configuration file at the same time?
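Verification:
Before changing the parameter:
127.0.0.1:6379> config get slave-priority
1) "slave-priority"
2) "100"
Check the configuration file:
[redis@cjc-db-01 conf]$ cat redis_6379.conf |grep -i priority
replica-priority 100
Change the parameter:
127.0.0.1:6379> config set slave-priority 0
Check the configuration file again; the parameter has not changed:
[redis@cjc-db-01 conf]$ cat redis_6379.conf |grep -i priority
replica-priority 100
Now trigger a Redis failover:
redis-cli -p 26379
Check the status:
info Sentinel
Fail over:
sentinel failover mymaster
Check again; the failover has completed:
info Sentinel
Check the configuration file once more:
[redis@cjc-db-01 conf]$ cat redis_6379.conf |grep -i priority
replica-priority 0
The replica-priority parameter in redis_6379.conf has now changed on its own, from 100 to 0.
If this Redis instance is restarted, the value is still 0 afterwards.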
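The observed behavior fits a simple model in which CONFIG SET only touches the running value, while a config rewrite copies the running values into the file. A likely explanation, which I have not traced through the Redis source for this write-up, is that Sentinel asks the instance to run CONFIG REWRITE when it reconfigures it during a failover. The "# Generated by CONFIG REWRITE" marker that appears in the configuration files points in the same direction. The class below is a toy model written only to show the order of events; it does not talk to Redis, and all names in it are invented.

```python
class ToyRedisInstance:
    """Toy model: one value in memory, one value in the config file."""

    def __init__(self, file_priority=100):
        self.file_priority = file_priority      # replica-priority in redis.conf
        self.runtime_priority = file_priority   # value loaded at startup

    def config_set(self, value):
        # CONFIG SET only changes the running value.
        self.runtime_priority = value

    def config_rewrite(self):
        # A rewrite persists the running value into the file.
        self.file_priority = self.runtime_priority

    def restart(self):
        # A restart reloads whatever the file contains.
        self.runtime_priority = self.file_priority


node = ToyRedisInstance()
node.config_set(0)       # temporary change made before the switchover
node.config_rewrite()    # assumed to be triggered by the failover
node.config_set(100)     # value restored by hand afterwards
node.restart()           # a later restart picks up the file value
print(node.runtime_priority, node.file_priority)  # 0 0
```

The manual restore to 100 is lost because it only ever existed in memory, which matches what the restart showed.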
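Solution:
In a Redis Sentinel deployment, automatic failover needs replicas that may be promoted, so it is recommended to set slave-priority or replica-priority to 100 on every node.
That is:
The running value should be 100:
127.0.0.1:6379> config get slave-priority
1) "slave-priority"
2) "100"
If it is not 100, change it manually:
127.0.0.1:6379> config set slave-priority 100
The replica-priority parameter in the configuration file should also be 100:
cat redis_6379.conf |grep -i priority
If it is not, edit the file by hand:
replica-priority 100
When rehearsing a failover, make sure that at least one node has a non-zero slave-priority/replica-priority.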
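The file-side check can be scripted so that a stale 0 is caught before the next restart. The sketch below works on configuration text that has already been read into memory; the sample contents and the function names are hypothetical, and it assumes that a missing directive means the default of 100 and that the last occurrence of a directive wins.

```python
def read_priority(conf_text, default=100):
    """Return replica-priority (or the older slave-priority) from redis.conf text."""
    value = default
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) == 2 and parts[0].lower() in ("replica-priority", "slave-priority"):
            value = int(parts[1])  # assumption: the last occurrence wins
    return value


def nodes_blocked_from_promotion(configs):
    """Names of nodes whose config file would pin the priority to 0."""
    return sorted(name for name, text in configs.items() if read_priority(text) == 0)


# Hypothetical file contents, for illustration only.
configs = {
    "redis_6379.conf": "port 6379\nreplica-priority 0\n",
    "redis_6380.conf": "port 6380\n# replica-priority 0\nreplica-priority 100\n",
    "redis_6381.conf": "port 6381\n",  # directive absent, default assumed
}
print(nodes_blocked_from_promotion(configs))  # ['redis_6379.conf']
```

The running value still has to be checked separately with config get, because the file and the memory value can disagree in either direction.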
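Summary:
I had rehearsed Redis Sentinel failover several times before. To steer the master role to a specific node, I changed slave-priority on a Redis instance from 100 to 0, so that a master failure would move the master role to the other node, whose slave-priority was untouched. Afterwards I set the changed parameter back to its original value of 100.
However:
After the failover completes, replica-priority in redis.conf on the node whose slave-priority was changed is automatically updated to 0. If that node is restarted later, it loads replica-priority 0 from redis.conf and runs with 0. If both replicas go through this, both end up with a running slave-priority/replica-priority of 0, and failover can no longer succeed.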

The experiment went as follows:
Start the instances:
redis-server redis/conf/redis_6379.conf
redis-server redis/conf/redis_6380.conf
redis-server redis/conf/redis_6381.conf
Check the processes:
ps -ef|grep redis|grep redis
redis 8984 1 0 17:56 ? 00:00:00 redis-server 0.0.0.0:6379
redis 8990 1 0 17:56 ? 00:00:00 redis-server 0.0.0.0:6380
redis 8998 1 0 17:56 ? 00:00:00 redis-server 0.0.0.0:6381
Check the replication status:
[redis@cjc-db-01 conf]$ redis-cli -p 6379
127.0.0.1:6379> auth 111
OK
127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=192.168.126.128,port=6380,state=online,offset=112,lag=0
slave1:ip=192.168.126.128,port=6381,state=online,offset=112,lag=1
master_failover_state:no-failover
master_replid:009ba10699b6689dbce7d1919495ca5786fe9ba4
master_replid2:0000000000000000000000000000000000000000
Write test data:
127.0.0.1:6379> get xxx
(nil)
127.0.0.1:6379> set xxx cjc
OK
127.0.0.1:6379> get xxx
"cjc"
Check that the data has been replicated:
[redis@cjc-db-01 conf]$ redis-cli -p 6380
127.0.0.1:6380> auth 111
OK
127.0.0.1:6380> get xxx
"cjc"
[redis@cjc-db-01 conf]$ redis-cli -p 6381
127.0.0.1:6381> auth 111
OK
127.0.0.1:6381> get xxx
"cjc"
First, without starting Sentinel, perform a manual switchover:
Simulate a master failure:
[redis@cjc-db-01 conf]$ ps -ef|grep 6379
[redis@cjc-db-01 conf]$ kill -9 8984
Check the replica status:
[redis@cjc-db-01 conf]$ redis-cli -p 6380
127.0.0.1:6380> auth 111
OK
127.0.0.1:6380> info Replication
# Replication
role:slave
master_host:192.168.126.128
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
......
Without Sentinel there is no automatic failover.
The replicas reject writes:
127.0.0.1:6380> get xxx
"cjc"
127.0.0.1:6380> set yyy aaa
(error) READONLY You can't write against a read only replica.
[redis@cjc-db-01 conf]$ redis-cli -p 6381
127.0.0.1:6381> auth 111
OK
127.0.0.1:6381> info Replication
# Replication
role:slave
master_host:192.168.126.128
master_port:6379
master_link_status:down
......
127.0.0.1:6381> get xxx
"cjc"
127.0.0.1:6381> set yyy aaa
(error) READONLY You can't write against a read only replica.
Manual switchover
Promote 6380 to master
6380:
[redis@cjc-db-01 conf]$ redis-cli -p 6380
127.0.0.1:6380> auth 111
OK
Break the replication link. The role becomes master, and the data set already synchronized is not discarded:
127.0.0.1:6380> slaveof no one
OK
127.0.0.1:6380> info Replication
# Replication
role:master
connected_slaves:0
min_slaves_good_slaves:0
master_failover_state:no-failover
master_replid:43bb3d9953e3af5b34a8c83193043c55ad45fb09
master_replid2:009ba10699b6689dbce7d1919495ca5786fe9ba4
master_repl_offset:390
second_repl_offset:391
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:390
Point node 6381 at the new master 6380:
[redis@cjc-db-01 conf]$ redis-cli -p 6381
127.0.0.1:6381> auth 111
OK
127.0.0.1:6381> info Replication
# Replication
role:slave
master_host:192.168.126.128
master_port:6379
master_link_status:down
......
127.0.0.1:6381> slaveof 192.168.126.128 6380
OK
127.0.0.1:6381> info Replication
# Replication
role:slave
master_host:192.168.126.128
master_port:6380
master_link_status:up
master_last_io_seconds_ago:2
......
Replication is working again:
[redis@cjc-db-01 conf]$ redis-cli -p 6380
127.0.0.1:6380> auth 111
OK
127.0.0.1:6380> info Replication
# Replication
role:master
connected_slaves:1
min_slaves_good_slaves:1
slave0:ip=192.168.126.128,port=6381,state=online,offset=488,lag=0
master_failover_state:no-failover
master_replid:43bb3d9953e3af5b34a8c83193043c55ad45fb09
master_replid2:009ba10699b6689dbce7d1919495ca5786fe9ba4
127.0.0.1:6380> set zzz iii
Start 6379 and add it to the replication group:
redis-server redis/conf/redis_6379.conf
[redis@cjc-db-01 conf]$ redis-cli -p 6379
127.0.0.1:6379> auth 111
OK
127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:0
min_slaves_good_slaves:0
Join it to the replication group:
127.0.0.1:6379> get zzz
(nil)
127.0.0.1:6379> slaveof 192.168.126.128 6380
OK
127.0.0.1:6379> get zzz
"iii"
[redis@cjc-db-01 conf]$ redis-cli -p 6380
127.0.0.1:6380> auth 111
OK
127.0.0.1:6380> info Replication
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=192.168.126.128,port=6381,state=online,offset=2068,lag=1
slave1:ip=192.168.126.128,port=6379,state=online,offset=2068,lag=0
master_failover_state:no-failover
Manually switch back to the original master:
[redis@cjc-db-01 conf]$ redis-cli -p 6379
127.0.0.1:6379> auth 111
OK
127.0.0.1:6379> SLAVEOF NO ONE
[redis@cjc-db-01 conf]$ redis-cli -p 6380
127.0.0.1:6380> auth 111
OK
127.0.0.1:6380> slaveof 192.168.126.128 6379
OK
[redis@cjc-db-01 conf]$ redis-cli -p 6381
127.0.0.1:6381> auth 111
OK
127.0.0.1:6381> slaveof 192.168.126.128 6379
OK
Check:
[redis@cjc-db-01 conf]$ redis-cli -p 6379
127.0.0.1:6379> auth 111
OK
127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=192.168.126.128,port=6380,state=online,offset=2222,lag=0
slave1:ip=192.168.126.128,port=6381,state=online,offset=2222,lag=0
master_failover_state:no-failover
That is the whole manual failover procedure. With Sentinel, these steps are carried out automatically:
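The manual procedure boils down to two kinds of commands: one SLAVEOF NO ONE on the node being promoted, then one SLAVEOF pointing every other node at it. The sketch below only builds that command list; it does not connect to Redis, and the function name is invented for this example. Newer Redis versions also accept REPLICAOF as the name of the same command, but the sketch keeps SLAVEOF to match the sessions above.

```python
def manual_failover_plan(new_master, other_nodes):
    """Return (target, command) steps that mirror the manual procedure."""
    host, port = new_master
    # Step 1: the chosen node stops replicating and becomes a master.
    steps = [(f"{host}:{port}", "SLAVEOF NO ONE")]
    # Step 2: every remaining node is pointed at the new master.
    for node_host, node_port in other_nodes:
        steps.append((f"{node_host}:{node_port}", f"SLAVEOF {host} {port}"))
    return steps


plan = manual_failover_plan(
    ("192.168.126.128", 6380),
    [("192.168.126.128", 6381), ("192.168.126.128", 6379)],
)
for target, command in plan:
    print(target, "->", command)
```

Keeping the promotion as the first step matters: the other nodes should only be repointed once the new master has actually left the replica role.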
Start Sentinel:
redis-sentinel /redis/conf/redis_26379.conf
After Sentinel 26379 starts, a "# Generated by CONFIG REWRITE" section is appended to its configuration file, and the two replicas are discovered automatically:
Original file content:
[redis@cjc-db-01 conf]$ cat redis_26379.conf
port 26379
logfile "/redis/log/sentinel_26379.log"
dir "/redis/26379/data"
pidfile "/redis/26379/pid/redis_26379.pid"
bind 0.0.0.0
protected-mode no
daemonize yes
sentinel monitor mymaster 192.168.126.128 6380 2
sentinel down-after-milliseconds mymaster 8000
sentinel auth-pass mymaster 111
File content after startup:
[redis@cjc-db-01 conf]$ cat redis_26379.conf
port 26379
logfile "/redis/log/sentinel_26379.log"
dir "/redis/26379/data"
pidfile "/redis/26379/pid/redis_26379.pid"
bind 0.0.0.0
protected-mode no
daemonize yes
sentinel monitor mymaster 192.168.126.128 6379 2
sentinel down-after-milliseconds mymaster 8000
sentinel auth-pass mymaster 111
# Generated by CONFIG REWRITE
user default on nopass ~* &* +@all
sentinel myid d3c28e6d3315e1b63cefffc09de23131f67c9309
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
sentinel current-epoch 0
sentinel known-replica mymaster 192.168.126.128 6381
sentinel known-replica mymaster 192.168.126.128 6380
Start the other Sentinels:
redis-sentinel /redis/conf/redis_26380.conf
redis-sentinel /redis/conf/redis_26381.conf
Check the final Sentinel configuration files
Section written automatically by 26379:
# Generated by CONFIG REWRITE
user default on nopass ~* &* +@all
sentinel myid d3c28e6d3315e1b63cefffc09de23131f67c9309
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
sentinel current-epoch 0
sentinel known-replica mymaster 192.168.126.128 6381
sentinel known-replica mymaster 192.168.126.128 6380
sentinel known-sentinel mymaster 192.168.126.128 26380 b98812cd6e0b3e0a3a8527244cc09bb3a0c5f251
sentinel known-sentinel mymaster 192.168.126.128 26381 c56a629c526c80483792ecae8d80eba98796583b
Section written automatically by 26380:
# Generated by CONFIG REWRITE
user default on nopass ~* &* +@all
sentinel myid b98812cd6e0b3e0a3a8527244cc09bb3a0c5f251
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
sentinel current-epoch 0
sentinel known-replica mymaster 192.168.126.128 6380
sentinel known-replica mymaster 192.168.126.128 6381
sentinel known-sentinel mymaster 192.168.126.128 26381 c56a629c526c80483792ecae8d80eba98796583b
sentinel known-sentinel mymaster 192.168.126.128 26379 d3c28e6d3315e1b63cefffc09de23131f67c9309
Section written automatically by 26381:
# Generated by CONFIG REWRITE
user default on nopass ~* &* +@all
sentinel myid c56a629c526c80483792ecae8d80eba98796583b
sentinel config-epoch mymaster 0
sentinel leader-epoch mymaster 0
sentinel current-epoch 0
sentinel known-replica mymaster 192.168.126.128 6380
sentinel known-replica mymaster 192.168.126.128 6381
sentinel known-sentinel mymaster 192.168.126.128 26380 b98812cd6e0b3e0a3a8527244cc09bb3a0c5f251
sentinel known-sentinel mymaster 192.168.126.128 26379 d3c28e6d3315e1b63cefffc09de23131f67c9309
As shown, once a Sentinel node has started it automatically writes the IP address and port of each replica, and of the other two Sentinel nodes, into its configuration file.
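The topology that each Sentinel has recorded can be pulled out of these generated lines with a few lines of code, which makes it easy to confirm that all three Sentinels agree. The sketch below parses the section written by 26379 as quoted above. The function name is invented, and accepting known-slave as an alias is an assumption about how older Redis versions spell the directive.

```python
# Generated section of redis_26379.conf, copied from the listing above.
CONF = """\
sentinel myid d3c28e6d3315e1b63cefffc09de23131f67c9309
sentinel config-epoch mymaster 0
sentinel known-replica mymaster 192.168.126.128 6381
sentinel known-replica mymaster 192.168.126.128 6380
sentinel known-sentinel mymaster 192.168.126.128 26380 b98812cd6e0b3e0a3a8527244cc09bb3a0c5f251
sentinel known-sentinel mymaster 192.168.126.128 26381 c56a629c526c80483792ecae8d80eba98796583b
"""


def parse_sentinel_topology(conf_text):
    """Collect (ip, port) pairs for the replicas and Sentinels a Sentinel knows."""
    topo = {"replicas": [], "sentinels": []}
    for line in conf_text.splitlines():
        parts = line.split()
        # Expected shape: sentinel <directive> <master-name> <ip> <port> [...]
        if len(parts) >= 5 and parts[0] == "sentinel":
            if parts[1] in ("known-replica", "known-slave"):
                topo["replicas"].append((parts[3], int(parts[4])))
            elif parts[1] == "known-sentinel":
                topo["sentinels"].append((parts[3], int(parts[4])))
    return topo


topo = parse_sentinel_topology(CONF)
print(sorted(port for _, port in topo["replicas"]))   # [6380, 6381]
print(sorted(port for _, port in topo["sentinels"]))  # [26380, 26381]
```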
Stop the master and watch how Sentinel reacts:
[redis@cjc-db-01 conf]$ redis-cli -p 6379
127.0.0.1:6379> auth 111
OK
127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=192.168.126.128,port=6380,state=online,offset=317423,lag=1
slave1:ip=192.168.126.128,port=6381,state=online,offset=317423,lag=0
master_failover_state:no-failover
master_replid:fc7968519b15a9090407133791ea3551cdd67716
master_replid2:43bb3d9953e3af5b34a8c83193043c55ad45fb09
......
127.0.0.1:6379> shutdown save
not connected>
Check the Sentinel log:
tail -100f sentinel_26379.log
......
9569:X 08 Jul 2023 18:57:09.888 # +sdown master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:09.956 # +odown master mymaster 192.168.126.128 6379 #quorum 2/2
9569:X 08 Jul 2023 18:57:09.956 # +new-epoch 1
9569:X 08 Jul 2023 18:57:09.956 # +try-failover master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:09.966 # +vote-for-leader d3c28e6d3315e1b63cefffc09de23131f67c9309 1
9569:X 08 Jul 2023 18:57:10.268 # b98812cd6e0b3e0a3a8527244cc09bb3a0c5f251 voted for d3c28e6d3315e1b63cefffc09de23131f67c9309 1
9569:X 08 Jul 2023 18:57:10.270 # c56a629c526c80483792ecae8d80eba98796583b voted for d3c28e6d3315e1b63cefffc09de23131f67c9309 1
9569:X 08 Jul 2023 18:57:10.343 # +elected-leader master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:10.343 # +failover-state-select-slave master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:10.410 # -failover-abort-no-good-slave master mymaster 192.168.126.128 6379
9569:X 08 Jul 2023 18:57:10.463 # Next failover delay: I will not start a failover before Sat Jul 8 19:03:10 2023
The log shows that Sentinel aborted the failover because it found no suitable replica, postponed it, and reported the earliest time at which it will attempt the next failover.
After waiting another 6 minutes, the automatic failover still had not completed.
Check the replication status:
[redis@cjc-db-01 conf]$ redis-cli -p 6380
127.0.0.1:6380> auth 111
OK
127.0.0.1:6380> info Replication
# Replication
role:slave
master_host:192.168.126.128
master_port:6379
master_link_status:down
[redis@cjc-db-01 conf]$ redis-cli -p 6381
127.0.0.1:6381> auth 111
OK
127.0.0.1:6381> info Replication
# Replication
role:slave
master_host:192.168.126.128
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
Check the Sentinel status:
127.0.0.1:26379> info Sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=odown,address=192.168.126.128:6379,slaves=2,sentinels=3
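The master0 line is the part of this output worth watching during a failover: here it still reports status=odown with the old address, even though two replicas and three Sentinels are known. A small helper can turn that line into a dictionary for monitoring scripts. This is only a sketch; the function name is invented, and it assumes the comma-separated key=value layout shown above.

```python
def parse_master_line(line):
    """Parse 'master0:name=...,status=...,address=...,slaves=N,sentinels=N'."""
    # Split off the 'master0' label at the first colon only, because the
    # address field contains a colon of its own.
    _, _, fields = line.partition(":")
    info = dict(item.split("=", 1) for item in fields.split(","))
    info["slaves"] = int(info["slaves"])
    info["sentinels"] = int(info["sentinels"])
    return info


line = "master0:name=mymaster,status=odown,address=192.168.126.128:6379,slaves=2,sentinels=3"
master = parse_master_line(line)
print(master["status"], master["address"], master["slaves"], master["sentinels"])
# odown 192.168.126.128:6379 2 3

# A master stuck in odown while replicas are known suggests the failover is blocked.
stuck = master["status"] == "odown" and master["slaves"] > 0
print(stuck)  # True
```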
Start the original master, 6379:
[redis@cjc-db-01 conf]$ redis-server redis_6379.conf
[redis@cjc-db-01 conf]$ redis-cli -p 6379
127.0.0.1:6379> auth 111
OK
After it starts, the original master takes up the master role again and both replicas reconnect to it:
127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=192.168.126.128,port=6380,state=online,offset=6658,lag=1
slave1:ip=192.168.126.128,port=6381,state=online,offset=6658,lag=1
master_failover_state:no-failover
master_replid:361d438f4fae84a4c9501954f515d02cf31669ca
master_replid2:0000000000000000000000000000000000000000
Adjust the priority. Connect to each instance in turn:
redis-cli -p 6379
redis-cli -p 6380
redis-cli -p 6381
In each session, run:
config set slave-priority 100
Shut down the master again.
Check: automatic failover now works.
127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:2
min_slaves_good_slaves:2
slave0:ip=192.168.126.128,port=6380,state=online,offset=63656,lag=0
slave1:ip=192.168.126.128,port=6381,state=online,offset=63510,lag=1
After the failover, the redis.conf files are updated automatically.
Check the automatic changes in the Redis configuration file:
cat redis_6381.conf
After the automatic failover, this line was removed:
slaveof 192.168.126.128 6379
and the following lines were added:
replicaof 192.168.126.128 6380
# Generated by CONFIG REWRITE
user default on #f6e0a1e2ac41945a9aa7ff8a8aaa0cebc12a3bcc981a929ad5cf810a090e11ae ~* &* +@all
Check node 6380:
cat redis_6380.conf
This line was removed:
slaveof 192.168.126.128 6379
and these lines were added:
# Generated by CONFIG REWRITE
user default on #f6e0a1e2ac41945a9aa7ff8a8aaa0cebc12a3bcc981a929ad5cf810a090e11ae ~* &* +@all

###chenjuchao 20230708 17:00###




