暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

MySQL高可用MHA原理及测试

陶老师运维笔记 2020-03-22
1280

MySQL高可用MHA原理及测试

1. MHA 简介

  • MHA 介绍:https://github.com/yoshinorim/mha4mysql-manager/wiki/Architecture

  • github下载地址:https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads

MHA 简介:

MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由日本的youshimaton开发,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件。在MySQL故障切换过程中,MHA能做到在0~30秒之内自动完成数据库的故障切换操作,并且在进行故障切换的过程中,MHA能在最大程度上保证数据的一致性,以达到真正意义上的高可用。

MHA优点:

  1. master failover和slave promotion非常快速。2. 自动探测,多重检测,切换过程中支持调用其他脚本的接口。

  2. master crash不会导致数据不一致,自动补齐数据,维护数据一致性。

  3. 不需要修改复制的任何设置,简单易部署,对现有架构无影响。

  4. 不需要增加很多额外的机器来部署MHA,支持多实例集中管理。

  5. 没有任何性能影响。

  6. 跨存储引擎,支持任何引擎

MHA不支持的场景:

  • 多层次复制 (M1->M2->Slave)

  • MySQL5.0.45或更低版本

  • 复制规则(replication filteing rules(binlog-do-db,replicate-ignore-db,etc) 必须一致

  • Load Data [Local] Infile with SBR 不支持。

2. MHA工作原理

  • https://github.com/yoshinorim/mha4mysql-manager/wiki/Architecture

  • https://www.slideshare.net/matsunobu/automated-master-failover

MHA工作原理总结为以下几条:(1)从宕机崩溃的master保存二进制日志事件(binlog events); (2)识别含有最新更新的slave; (3)应用差异的中继日志(relay log) 到其他slave; (4)应用从master保存的二进制日志事件(binlog events); (5)提升一个slave为新master; (6)使用其他的slave连接新的master进行复制。

在MHA自动故障切换过程中,MHA试图从宕机的主服务器上保存二进制日志,最大程度的保证数据的不丢失,但这并不总是可行的。例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失了最新的数据。使用MySQL 5.5的半同步复制,可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来。如果只有一个slave已经收到了最新的二进制日志,MHA可以将最新的二进制日志应用于其他所有的slave服务器上,因此可以保证所有节点的数据一致性。

2.1 主库故障处理场景

主库故障:

  • 场景1: 全部从库都有全部binlog event

这是一种最理想的情况,但是事情常不可能这样幸运。


  • 场景2:Master有事务没有同步到从库使用了半同步复制可以避免这个风险。



  • 场景3:部分从库缺失binlog event


2.2 问题及困难

主库Failover的困难点,最近的从库还是缺失了主库binlog event。

2.3 问题解决

目标实现:

保存binlog event

找出最近master 的slave:

识别出各从库丢失的event

实施恢复

3. MHA组件架构

3.1 MHA组件架构


  • MHA manager:管理节点,通常单独部署在一台独立的服务器上,用来管理多个master/slave集群,也可部署在一台slave节点上,每个master/slave集群称为一个application。MHA Manager会定时探测集群中的master节点,当发现master节点出现故障时,它可以自动将具有最新数据的slave节点提升为新的master节点,然后将所有其它 的slave节点重新指向新的master节点。



  • MHA node:数据节点,运行在每台MariaDB服务器上(manager/master/slave),它通过监控具备解析和清理logs功能的脚本来加快故障转移。


3.2 软件包功能介绍

Manager工具:

  1. Manager工具包:

  2. masterha_manger 启动MHA

  3. masterha_check_ssh 检查MHASSH配置状况

  4. masterha_check_repl 检查MySQL复制状况

  5. masterha_master_monitor 检测master是否宕机

  6. masterha_check_status 检测当前MHA运行状态

  7. masterha_master_switch 控制故障转移(自动或者手动)

  8. masterha_conf_host 添加或删除配置的server信息

Node工具:

  1. Node工具(所有集群节点):

  2. 这些工具通常由MHA Manager的脚本触发,无需人为操作

  3. save_binary_logs 保存和复制master的二进制日志

  4. apply_diff_relay_logs 识别差异的中继日志事件并将其差异的事件应用于其他的

  5. purge_relay_logs 清除中继日志(不会阻塞SQL线程)

3.3 MHA处理过程

  1. ====== monitor node 监控节点======

  2. (1) 监控所有节点,重点是master

  3. (2) 监控到master宕机(实例(ssh能),主机(ssh不能连))

  4. (3) 监控主从状态

  5. ====== failover 故障转移 ======

  6. (3) 对比各节点的GTID号码。

  7. (3) 数据补偿1:如果ssh能连,从节点立即保存自己缺失部分的二进制日志

  8. (4) 选主:对比各节点的GTID号码即可,选一个最接近于主库数据的从节点,恢复缺失的日志,并将从库切换为主库 stop slave reset slave all

  9. (5) 数据补偿2:如果ssh不能连,计算两个从库的relaylog的差异,恢复到数据少的从库中.

  10. (6) 2号从库change master to 新主,开启新的主从关系

  11. ====== 应用透明=====

  12. (7) 使用vip机制实现应用透明

  13. ====== 补充功能 ======

  14. (8) 自动修复主库(加入集群)待开发...

  15. (9) 二次数据补偿的问题 (binlog server)

  16. (10) 提醒功能(send_report)

  17. (11) 权重的问题

4. MHA环境搭建

  • https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads

  • https://github.com/yoshinorim/mha4mysql-manager/wiki/Installation

4.1 环境规划

使用三台机器来做一个简易的MHA环境,MHA软件版本为mha-0.56。

IPPortDB角色MHA角色软件版本
192.124.64.2123307DB1 mastermha-nodecentos6,mha-0.56
192.124.64.2133307DB2 slavemha-nodecentos6,mha-0.56
192.124.64.2143307DB3 slavemha-node node-managercentos6,mha-0.56

安装建议:

1.manager可以单独装在任意一台机器上;2.一个manager可以管理多套mysql集群;3.建议不要将manager装在主库上(防止主库断电,断网);4.所有数据库必须安装node包;5.manager的依赖有node

4.2 各节点SSH互信

  1. #各节点执行如下操作

  2. ssh-keygen -t rsa -P '' -f /root/.ssh/id_rsa

  3. #

  4. ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.124.64.213

  5. ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.124.64.212

  6. ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.124.64.214

4.3 MySQL主从安装

1.安装MySQL

  1. #使用自已写的脚本安装MySQL

  2. mysql_install -P 3307 -r m -b 2G -v 5.6.27

2.搭建主从关系

  1. DB1做为主,DB2,DB3为从库

  2. #授权

  3. grant replication client,replication slave on *.* to 'repl'@'10.%' IDENTIFIED BY 'repl123';

  4. grant all privileges on *.* to mha@'10.%' identified by 'mha123';


  5. #DB2,DB3建立主从关系

  6. CHANGE MASTER TO

  7. MASTER_HOST='192.124.64.212',

  8. MASTER_PORT=3307,

  9. MASTER_USER='repl',

  10. MASTER_PASSWORD='repl123',

  11. MASTER_AUTO_POSITION = 1;

  12. #

  13. start slave ;

  14. show slave status\G

说明:

  • 从库建议用命令设为只读 set global read_only=1,不要将该参数写进配置文件中.

  • 从库关闭中继日志的清除 set global relaylogpurge=0. MHA在发生切换的过程中,从库的恢复过程中依赖于relay log的相关信息,所以这里要将relay log的自动清除设置为OFF,采用手动清除relay log的方式。

4.4 MHA软件安装

下载软件并安装:所有节点(数据库master,slave,MHA manager节点)都需要安装MHA node。因为MHA manager也需要依赖MHA node。

  1. #软件下载

  2. mha官网:https://code.google.com/archive/p/mysql-master-ha/

  3. github下载地址:https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads


  4. 所有节点安装Node软件依赖包

  5. yum install perl-DBD-MySQL -y

  6. rpm -ivh mha4mysql-node-0.56-0.el6.noarch.rpm


  7. #在DB3节点上安装mha-manager

  8. yum install mha4mysql-manager-0.56-0.el6.noarch.rpm

5. MHA配置

  • https://github.com/yoshinorim/mha4mysql-manager/wiki/Configuration

为保证MHA正常工作,需要配置MHA的配置文件,为参数设置合理正确的值,这些参数包括服务器IP,数据库用户名密码,工作目录与日志等。MHA源码安装,则会有两个配置文件模板,在路径 $MHABASE/samples/conf/ 下的app1.cnf 和 masterhadefault .cnf。

创建目录:

  1. mkdir /etc/mha/script -p

  2. 创建日志目录

  3. mkdir -p /var/log/mha/

编辑mha配置文件:

  1. vim /etc/mha/mysql3307.cnf

  2. [server default]

  3. manager_log=/var/log/mha/mysql3307/manager

  4. manager_workdir=/var/log/mha/mysql3307

  5. master_binlog_dir=/data1/mysql_3307/

  6. user=mha

  7. password=mha123

  8. ping_interval=2

  9. repl_user=repl

  10. repl_password=repl123

  11. ssh_user=root

  12. #master_ip_failover_script=/etc/mha/script/master_ip_failover

  13. #shutdown_script= etc/mha/script/power_manager

  14. #report_script= etc/mha/script/send_master_failover_mail


  15. [server1]

  16. hostname=192.124.64.212

  17. port=3307

  18. [server2]

  19. hostname=192.124.64.213

  20. port=3307

  21. [server3]

  22. hostname=192.124.64.214

  23. port=3307

6. 检查状态

1.互信检查

  1. $masterha_check_ssh --conf=/etc/mha/mysql3307.cnf

  2. Sat Mar 21 23:14:28 2020 - [warning] Global configuration file /etc/masterha_default .cnf not found. Skipping.

  3. Sat Mar 21 23:14:28 2020 - [info] Reading application default configuration from /etc/mha/mysql3307.cnf..

  4. Sat Mar 21 23:14:28 2020 - [info] Reading server configuration from /etc/mha/mysql3307.cnf..

  5. Sat Mar 21 23:14:28 2020 - [info] Starting SSH connection tests..

  6. ...

  7. Sat Mar 21 23:14:29 2020 - [info] All SSH connection tests passed successfully.

2.检查复制

  1. masterha_check_repl --conf=/etc/mha/mysql3307.cnf

  2. Sat Mar 21 23:17:00 2020 - [warning] Global configuration file /etc/masterha_default .cnf not found. Skipping.

  3. Sat Mar 21 23:17:00 2020 - [info] Reading application default configuration from /etc/mha/mysql3307.cnf..

  4. Sat Mar 21 23:17:00 2020 - [info] Reading server configuration from /etc/mha/mysql3307.cnf..

  5. Sat Mar 21 23:17:00 2020 - [info] MHA::MasterMonitor version 0.56.

  6. Sat Mar 21 23:17:01 2020 - [info] GTID failover mode = 1

  7. Sat Mar 21 23:17:01 2020 - [info] Dead Servers:

  8. Sat Mar 21 23:17:01 2020 - [info] Alive Servers:

  9. Sat Mar 21 23:17:01 2020 - [info] 192.124.64.212(192.124.64.212:3307)

  10. Sat Mar 21 23:17:01 2020 - [info] 192.124.64.213(192.124.64.213:3307)

  11. Sat Mar 21 23:17:01 2020 - [info] 192.124.64.214(192.124.64.214:3307)

  12. Sat Mar 21 23:17:01 2020 - [info] Alive Slaves:

  13. Sat Mar 21 23:17:01 2020 - [info] 192.124.64.213(192.124.64.213:3307) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled

  14. Sat Mar 21 23:17:01 2020 - [info] GTID ON

  15. Sat Mar 21 23:17:01 2020 - [info] Replicating from 192.124.64.212(192.124.64.212:3307)

  16. Sat Mar 21 23:17:01 2020 - [info] 192.124.64.214(192.124.64.214:3307) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled

  17. Sat Mar 21 23:17:01 2020 - [info] GTID ON

  18. Sat Mar 21 23:17:01 2020 - [info] Replicating from 192.124.64.212(192.124.64.212:3307)

  19. Sat Mar 21 23:17:01 2020 - [info] Current Alive Master: 192.124.64.212(192.124.64.212:3307)

  20. Sat Mar 21 23:17:01 2020 - [info] Checking slave configurations..

  21. Sat Mar 21 23:17:01 2020 - [info] read_only=1 is not set on slave 192.124.64.213(192.124.64.213:3307).

  22. Sat Mar 21 23:17:01 2020 - [info] read_only=1 is not set on slave 192.124.64.214(192.124.64.214:3307).

  23. Sat Mar 21 23:17:01 2020 - [info] Checking replication filtering settings..

  24. Sat Mar 21 23:17:01 2020 - [info] binlog_do_db= , binlog_ignore_db=

  25. Sat Mar 21 23:17:01 2020 - [info] Replication filtering check ok.

  26. Sat Mar 21 23:17:01 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.

  27. Sat Mar 21 23:17:01 2020 - [info] Checking SSH publickey authentication settings on the current master..

  28. Warning: Permanently added '192.124.64.212' (RSA) to the list of known hosts.

  29. Sat Mar 21 23:17:01 2020 - [info] HealthCheck: SSH to 192.124.64.212 is reachable.

  30. Sat Mar 21 23:17:01 2020 - [info]

  31. 192.124.64.212(192.124.64.212:3307) (current master)

  32. +--192.124.64.213(192.124.64.213:3307)

  33. +--192.124.64.214(192.124.64.214:3307)


  34. Sat Mar 21 23:17:01 2020 - [info] Checking replication health on 192.124.64.213..

  35. Sat Mar 21 23:17:01 2020 - [info] ok.

  36. Sat Mar 21 23:17:01 2020 - [info] Checking replication health on 192.124.64.214..

  37. Sat Mar 21 23:17:01 2020 - [info] ok.

  38. Sat Mar 21 23:17:01 2020 - [warning] master_ip_failover_script is not defined.

  39. Sat Mar 21 23:17:01 2020 - [warning] shutdown_script is not defined.

  40. Sat Mar 21 23:17:01 2020 - [info] Got exit code 0 (Not master dead).


  41. MySQL Replication Health is OK.

7. 启动MHA

启动MHA:

  1. #查看MHA manager监控状态,这里没有运行

  2. # masterha_check_status --conf=/etc/mha/mysql3307.cnf

  3. mysql3307 is stopped(2:NOT_RUNNING).


  4. #启动MHA监控 --remove_dead_master_conf --ignore_last_failover

  5. $nohup masterha_manager --conf=/etc/mha/mysql3307.cnf --remove_dead_master_conf --ignore_last_failover >> /var/log/mha/mysql3307/mha-3307.log 2>&1 &


  6. #检查状态

  7. $masterha_check_status --conf=/etc/mha/mysql3307.cnf

  8. mysql3307 (pid:10265) is running(0:PING_OK), master:192.124.64.212

停止MHA监控:

  1. 停止MHA监控

  2. masterha_stop --conf=/etc/mha/mysql3307.cnf

8. 测试MHA

8.1 检查当前状态

  1. $masterha_check_status --conf=/etc/mha/mysql3307.cnf

  2. mysql3307 (pid:10265) is running(0:PING_OK), master:192.124.64.212


  3. $mysql -h 192.124.64.214 -P 3307 -e "set global relay_log_purge=0"

  4. $mysql -h 192.124.64.214 -P 3307 -e "show global variables like '%relay_log_purge%'"

  5. +-----------------+-------+

  6. | Variable_name | Value |

  7. +-----------------+-------+

  8. | relay_log_purge | OFF |

  9. +-----------------+-------+


  10. $mysql -h 192.124.64.214 -P 3307 -e "show slave status\G" |egrep 'Master_Host|Master_Port|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master' -i

  11. Master_Host: 192.124.64.212

  12. Master_Port: 3307

  13. Slave_IO_Running: Yes

  14. Slave_SQL_Running: Yes

  15. Seconds_Behind_Master: 0

  16. Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it


8.2 测试自动切换

1. 测试主库故障,自动切换。

  1. #kill master DB1

  2. $kill mysql_pid;

2. 查看详细日志:

观察manager 日志,末尾必须显示successfully,才算正常切换成功。tail -f var/log/mha/mysql3307/manager.log

  1. cat /var/log/mha/mysql3307/manager

  2. Sat Mar 21 23:24:59 2020 - [info] MHA::MasterMonitor version 0.56.

  3. Sat Mar 21 23:25:01 2020 - [info] GTID failover mode = 1

  4. Sat Mar 21 23:25:01 2020 - [info] Dead Servers:

  5. Sat Mar 21 23:25:01 2020 - [info] Alive Servers:

  6. Sat Mar 21 23:25:01 2020 - [info] 192.124.64.212(192.124.64.212:3307)

  7. Sat Mar 21 23:25:01 2020 - [info] 192.124.64.213(192.124.64.213:3307)

  8. Sat Mar 21 23:25:01 2020 - [info] 192.124.64.214(192.124.64.214:3307)

  9. Sat Mar 21 23:25:01 2020 - [info] Alive Slaves:

  10. Sat Mar 21 23:25:01 2020 - [info] 192.124.64.213(192.124.64.213:3307) Version=5.6.27-log (oldest major version between slaves) log-bin:enabled

  11. Sat Mar 21 23:25:01 2020 - [info] GTID ON

  12. ...

  13. ----- Failover Report -----


  14. mysql3307: MySQL Master failover 192.124.64.212(192.124.64.212:3307) to 192.124.64.213(192.124.64.213:3307) succeeded


  15. Master 192.124.64.212(192.124.64.212:3307) is down!


  16. Check MHA Manager logs at LeDB-VM-124064214:/var/log/mha/mysql3307/manager for details.


  17. Started automated(non-interactive) failover.

  18. Invalidated master IP address on 192.124.64.212(192.124.64.212:3307)

  19. Selected 192.124.64.213(192.124.64.213:3307) as a new master.

  20. 192.124.64.213(192.124.64.213:3307): OK: Applying all logs succeeded.

  21. 192.124.64.213(192.124.64.213:3307): OK: Activated master IP address.

  22. 192.124.64.214(192.124.64.214:3307): OK: Slave started, replicating from 192.124.64.213(192.124.64.213:3307)

  23. 192.124.64.213(192.124.64.213:3307): Resetting slave info succeeded.

  24. Master failover to 192.124.64.213(192.124.64.213:3307) completed successfully.

3.检查结果

DB2变为了主库,DB3成为了DB2的从库。

  1. masterha_check_status --conf=/etc/mha/mysql3307.cnf

  2. mysql3307 is stopped(2:NOT_RUNNING).

  3. $mysql -h 192.124.64.213 -P 3307 -e "show slave status\G"

  4. $mysql -h 192.124.64.214 -P 3307 -e "show slave status\G" |egrep 'Master_Host|Master_Port|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master' -i

  5. Master_Host: 192.124.64.213

  6. Master_Port: 3307

  7. Slave_IO_Running: Yes

  8. Slave_SQL_Running: Yes

  9. Seconds_Behind_Master: 0

8.3 手动切换

MHA Manager 必须没有运行。手动failover,这种场景意味着在业务上没有启用MHA自动切换功能,当主服务器故障时,人工手动调用MHA来进行故障切换操作,具体命令如下:注意:如果,MHA manager检测到没有dead的server,将报错,并结束failover。

  1. #当前DB2 是master。关停MHA并kill 主库192.124.64.213:3307。

  2. #手动切换

  3. $masterha_master_switch --master_state=dead --conf=/etc/mha/mysql3307.cnf --dead_master_host=192.124.64.213 --dead_master_port=3307 --new_master_host=192.124.64.212 --new_master_port=3307 --ignore_last_failover

输出信息是交互式,会询问你是否进行切换:建议阅读输出以理解切换手动切换过程。

  1. Sat Mar 21 23:55:43 2020 - [info] MHA::MasterFailover version 0.56.

  2. Sat Mar 21 23:55:43 2020 - [info] Starting master failover.

  3. Sat Mar 21 23:55:43 2020 - [info]

  4. Sat Mar 21 23:55:43 2020 - [info] * Phase 1: Configuration Check Phase..

  5. ...

  6. Sat Mar 21 23:55:46 2020 - [info] ** Phase 1: Configuration Check Phase completed.

  7. Sat Mar 21 23:55:46 2020 - [info]

  8. Sat Mar 21 23:55:46 2020 - [info] * Phase 2: Dead Master Shutdown Phase..

  9. ----- Failover Report -----


  10. mysql3307: MySQL Master failover 192.124.64.213(192.124.64.213:3307) to 192.124.64.212(192.124.64.212:3307) succeeded


  11. Master 192.124.64.213(192.124.64.213:3307) is down!


  12. Check MHA Manager logs at LeDB-VM-124064214 for details.


  13. Started manual(interactive) failover.

  14. Invalidated master IP address on 192.124.64.213(192.124.64.213:3307)

  15. Selected 192.124.64.212(192.124.64.212:3307) as a new master.

  16. 192.124.64.212(192.124.64.212:3307): OK: Applying all logs succeeded.

  17. 192.124.64.212(192.124.64.212:3307): OK: Activated master IP address.

  18. 192.124.64.214(192.124.64.214:3307): OK: Slave started, replicating from 192.124.64.212(192.124.64.212:3307)

  19. 192.124.64.212(192.124.64.212:3307): Resetting slave info succeeded.

  20. Master failover to 192.124.64.212(192.124.64.212:3307) completed successfully.


参考:

  • 官方文档:MHA 架构 https://github.com/yoshinorim/mha4mysql-manager/wiki/

  • MHA作者PPT https://www.slideshare.net/matsunobu/automated-master-failover

  • MySQL高可用架构MHA https://blog.csdn.net/qq_35209838/article/details/86497864

  • MySQL-MHA高可用技术

  • 陶老师运维笔记 MySQL GTID介绍及搭建GTID从库


文章转载自陶老师运维笔记,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论