

点击上方
蓝色
文字 关注我吧!

在MySQL官方版本中,为了保证其的高可用性,一般情况我们会采用主从复制的方式来解决。当然,方法很多。而我们今天所要处理的是采用GTID方式并且开了多线程复制后,仍然延迟的情况,糟糕的是,延迟还在不断扩大!
环境概要
环境概要
| 序号 | 清单 | 说明 |
| 1 | 系统 | Redhat 6.x(4c,32g) |
| 2 | 数据库 | MySQL-5.7.25 |
| 3 | 同步方式 | 基于GTID主从同步 |
环境排查
环境排查
1)已经配置的重要参数:
#relay for slave slave_parallel_type = LOGICAL_CLOCK slave_parallel_workers = 6 master_info_repository = TABLE relay_log_info_repository = TABLE relay_log_recovery = on sync_relay_log = 10000
注:此时没有设置```slave_preserve_commit_order```参数。 2)从库延迟状态查询
mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Queueing master event to the relay log Master_Host: xxx.xxx.xxx.xxx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.008978 Read_Master_Log_Pos: 696914605 Relay_Log_File: DB41-relay-bin.001259 Relay_Log_Pos: 207377582 Relay_Master_Log_File: mysql-bin.008970 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: neteagle3 Replicate_Ignore_DB: mysql Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 1068770059 Relay_Log_Space: 8425484286 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 187358Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 42 Master_UUID: eab7fcac-3cda-11e6-ada8-fa163e648db2 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Waiting for dependent transaction to commit Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:58031191-59927276 Executed_Gtid_Set: eab7fcac-3cda-11e6-ada8-fa163e648db2:1-58080239:58080241 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: 1 row in set (0.00 sec)
简单介绍一下几个指标信息:
1. Master_Log_File 1. Read_Master_Log_Pos 1. Seconds_Behind_Master 1. Relay_Log_File 1. Relay_Log_Pos 1. Relay_Master_Log_File 1. Exec_Master_Log_Pos Master_Log_File , Read_Master_Log_Pos :这两个参数是成对的,表示的是从库IO thread传输主库的binlog日志号及具体位置。 Relay_Log_File , Relay_Log_Pos :这两个参数也是成对的,表示的是从库sql thread应用中继日志(relay log)号及具体位置。 Relay_Master_Log_File , Relay_Log_Pos :这两个参数也是成对的,表示的是上一项中的中继日志对应的主库binlog日志及具体位置(有点绕)。 Seconds_Behind_Master :此参数可简单理解为主从延迟时间,单位为秒。
从上面这段MySQL从状态信息中,我们可以看到,```Seconds_Behind_Master: 187358```这是从库sql应用延迟主库的时间为187358秒,转换成天,大概两天多。这说明,我们从库复制的数据是两天前的。3)验证并行复制
mysql> show full processlist; +----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+ | 1 | system user | | NULL | Connect | 18204 | Waiting for master to send event | NULL | | 2 | system user | | NULL | Connect | 0 | Waiting for dependent transaction to commit | NULL | | 3 | system user | | NULL | Connect | 154914 | System lock | NULL | | 4 | system user | | NULL | Connect | 154914 | Waiting for an event from Coordinator | NULL | | 5 | system user | | NULL | Connect | 154918 | Waiting for an event from Coordinator | NULL | | 6 | system user | | NULL | Connect | 155525 | Waiting for an event from Coordinator | NULL | | 7 | system user | | NULL | Connect | 180427 | Waiting for an event from Coordinator | NULL | | 8 | system user | | NULL | Connect | 18204 | Waiting for an event from Coordinator | NULL | | 10 | root | localhost | neteagle3 | Query | 0 | starting | show full processlist | | 11 | repl | DBSlave:9683 | NULL | Binlog Dump GTID | 18156 | Master has sent all binlog to slave; waiting for more updates | NULL | | 13 | root | localhost | neteagle3 | Sleep | 4962 | | NULL | +----+-------------+--------------+-----------+------------------+--------+---------------------------------------------------------------+-----------------------+
mysql> select * from performance_schema.replication_applier_status_by_worker ; +--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+ | CHANNEL_NAME | WORKER_ID | THREAD_ID | SERVICE_STATE | LAST_SEEN_TRANSACTION | LAST_ERROR_NUMBER | LAST_ERROR_MESSAGE | LAST_ERROR_TIMESTAMP | +--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+ | | 1 | 51 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240805 | 0 | | 0000-00-00 00:00:00 | | | 2 | 52 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:80240210 | 0 | | 0000-00-00 00:00:00 | | | 3 | 53 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:80235089 | 0 | | 0000-00-00 00:00:00 | | | 4 | 54 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:80191268 | 0 | | 0000-00-00 00:00:00 | | | 5 | 55 | ON | eab7fcac-3cda-11e6-ada8-fa163e648db2:75296683 | 0 | | 0000-00-00 00:00:00 | | | 6 | 56 | ON | | 0 | | 0000-00-00 00:00:00 | +--------------+-----------+-----------+---------------+-----------------------------------------------+-------------------+--------------------+----------------------+ 6 rows in set (0.00 sec)通过本条查询,可以看到开启了6个并行进行复制。
细找瓶颈?
细找瓶颈?
通过以上来看,所有一切似乎都正常,并行复制开了,并且查看到CPU,IO,内存均没有达到瓶颈地步。主库写binlog日志大概为2MB/s,这样的日质量并不是非常高。
从库也同样查看了是否存在锁的情况,也没有发现。
因此反复的在查看slave状态,看能否发现一些细节,结果还真看到了一些异常现象。Relay_Log_Pos
这个参数在频繁的刷slave状态时,发现时常会卡着不动(此时已确认没有看到锁)。或许问题真正的原因正在这里!
顺藤摸瓜!
顺藤摸瓜!
分析binlog或者relay log日志,看有啥线索:
[mysql@xxx data]$ mysqlbinlog --no-defaults -v -v --base64-output=DECODE-ROWS relay-bin.001384 --start-position=420090430|more //; //; DELIMITER //;
我们查看了中继日志
relay-bin.001384卡住的位置号420090430,为设置GTID_NEXT,信息没什么用。
继续看在卡住时刻,数据库open的是什么表?
mysql> show open tables where In_use=1; +-----------+---------------------+--------+-------------+ | Database | Table | In_use | Name_locked | +-----------+---------------------+--------+-------------+ | net | f_currentxxx | 1 | 0 | +-----------+---------------------+--------+-------------+
这张表有什么特殊的么?查看其表结构
mysql> show create table net.f_currentxxx\G 1. row Table: f_currentxxx Create Table: CREATE TABLE ( int(20) NOT NULL COMMENT 'xxx', bigint(20) NOT NULL COMMENT 'xxx', int(11) DEFAULT NULL COMMENT 'xxx', int(11) DEFAULT NULL COMMENT 'xxx, KEY () ) ENGINE=MEMORY DEFAULT CHARSET=gbk COMMENT='xxx' 1 row in set (0.00 sec)
有没有看到什么与众不同?
没错,就是表的存储引擎
ENGINE=MEMORY。MEMORY的表进行主从复制,首先来说如果从库不做查询,一点意义没有,另外对Memory表做复制,性能是非常堪忧的。如果必须进行同步,考虑将表的存储引擎改为InnoDB
mysql> select table_name from information_schema.tables where TABLE_SCHEMA='net' and ENGINE='memory'; +----------------------+ | table_name | +----------------------+ | f_currentxxx | +----------------------+ 1 row in set, 6 warnings (0.01 sec)
措施
措施
停止复制进程,将选项中添加replicate-ignore-table=net.f_currentxxx
,重新启动复制进程,观察slave状态。
mysql> show slave status\G 1. row Slave_IO_State: Waiting for master to send event Master_Host: xxx.xxx.xxx.xxx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.009194 Read_Master_Log_Pos: 939698255 Relay_Log_File: relay-bin.001964 Relay_Log_Pos: 444060572 Relay_Master_Log_File: mysql-bin.009027 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: net Replicate_Ignore_DB: mysql Replicate_Do_Table:
我们可以看到net.f_currentxxx表已经被忽略复制。持续观察一段时间后,Seconds_Behind_Master在逐渐缩小。
中继日志应用速度大约5分钟一个(每个中继日志为1GB大小),而主库binlog日志大约为10分钟一个(每个binlog日志为1GB大小)。
总结
总结
在梳理了整个处理过程后,其实难度不高,主要是要细心,细心去排查每一个想到的点。 在非轻量级的数据库中问题发生的概率也会随着量级的增加而增多。而这恰恰是能够磨练个人的成长。
同时,知识的储备也要充足,这是进阶高手的必要前提!




