

现象
org.springframework.dao.DeadlockLoserDataAccessException:### Error updating database. Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction### The error may exist in com/atome/ops/mapper/CampaignJobInstanceMapper.java (best guess)### The error may involve com.atome.ops.mapper.CampaignJobInstanceMapper.update-Inline### The error occurred while setting parameters### SQL: update `campaign_job_instance` SET `total_count`=?, `status`=?, `fetch_start_time`=?, `fetch_end_time`=?, `send_start_time`=?, `send_end_time`=?, `update_timestamp`=? where id=?### Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction; Deadlock found when trying to get lock; try restarting transaction; nested exception is com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction

Mysql锁争用分析
死锁是指两个或两个以上的进程在执行过程中,由于竞争资源或者由于彼此通信而造成的一种阻塞的现象。若无外力作用,它们都将无法推进下去。此时称系统处于死锁状态或系统产生了死锁,这些永远在互相等待的进程称为死锁进程。
在数据库MySQL的并发读写请求场景中。当两个及以上的事务,双方都在等待对方释放已经持有的锁或因为加锁顺序不一致造成循环等待锁资源,就会出现“死锁”。常见的报错信息为 Deadlock found when trying to get lock; try restarting transaction...。
从死锁的定义来看,MySQL 出现死锁的几个要素为:
两个或者两个以上事务
每个事务都已经持有锁并且申请新的锁
锁资源同时只能被同一个事务持有或者不兼容
事务之间因为持有锁和申请锁导致彼此循环等待
在Mysql中死锁主要体现在InnoDB存储引擎,锁争用主要体现在next-key lock、insert intention lock
CREATE TABLE `t` (`id` int(11) NOT NULL,`c` int(11) DEFAULT NULL,`d` int(11) DEFAULT NULL,PRIMARY KEY (`id`),KEY `c` (`c`)) ENGINE=InnoDB;insert into t values(0,0,0),(5,5,5),(10,10,10),(15,15,15),(20,20,20),(25,25,25);

原则 1:加锁的基本单位是 next-key lock,前开后闭区间
原则 2:查找过程中访问到的对象才会加锁
优化 1:索引上的等值查询,给唯一索引加锁的时候,匹配上数据,next-key lock 退化为行锁
优化 2:索引上的等值查询,向右遍历时且最后一个值不满足等值条件的时候,next-key lock 退化为间隙锁
一个 bug:唯一索引上的范围查询会访问到不满足条件的第一个值为止
以上规则,其实可以理解为数据查找过程中,扫描到的对象应该加锁,排除逻辑上明显不需要加锁的对象,即为加锁范围
重点:
加锁是分步进行的,例如:c>=10 and c<=11,分解为c=10、c>10 and c<11、c=11依次进行锁申请
间隙由右边的间隙记录,这也导致了不同方向上扫描加锁范围不一样
从扫描到的第一个记录上作为起点,例如:普通索引c取值为[0,5,10,15,20],c>10和c>=10其分别第一个扫描到的数为15、10,因此第一个间隙锁为(10,15]、(5,10]
3.insert intention lock
插入意向锁,仅用于insert语句,表明将在某间隙插入记录,与间隙锁互斥关系如下:

间隙锁之间不存在互斥关系(X、S表示是什么语句导致的间隙锁) 间隙锁可以堵塞区间内的插入意向锁,但插入意向锁不会堵塞后续的间隙锁 唯一键冲突,如果是主键加记录锁,如果是唯一索引加next-key lock
mysql> CREATE TABLE `t` (`id` int(11) NOT NULL,`c` int(11) DEFAULT NULL,`d` int(11) DEFAULT NULL,PRIMARY KEY (`id`),KEY `c` (`c`)) ENGINE=InnoDB;mysql> insert into t values(0,0,0),(5,5,5),(10,10,10),(15,15,15),(20,20,20),(25,25,25);-- 开启事务Amysql> begin;/**在事务A中执行修改语句,id=7不存在,添加(5,10)的间隙锁,LOCK_MODE=X,GAP LOCK_DATA=10可以验证两个观点:1、间隙锁是加在右边间隙上的2、此处X并不代表10上加行锁,仅代表什么语句造成的,若改为select * from t where id=7 lock in share mode,LOCK_MODE就变为S,GAP**/mysql> update t set d=d+1 where id=7;-- 在事务B中插入id=6的数据,需要申请插入意向锁,进入堵塞状态mysql> insert into t values(6,6,6);/**事务A中已经添加了间隙锁,相同间隙的插入意向锁堵塞,LOCK_MODE=X,GAP,INSERT_INTENTION,LOCK_STATUS=WAITINGv8.0.11时,LOCK_MODE=X,GAP,INSERT_INTENTION标识是高版本新加的(此处使用的是8.0.21),插入意向锁是一种特殊的间隙锁**/mysql> select THREAD_ID,OBJECT_SCHEMA,OBJECT_NAME,INDEX_NAME,LOCK_TYPE,LOCK_MODE,LOCK_STATUS,LOCK_DATAfrom performance_schema.data_locks;+-----------+---------------+-------------+------------+-----------+------------------------+-------------+-----------+| THREAD_ID | OBJECT_SCHEMA | OBJECT_NAME | INDEX_NAME | LOCK_TYPE | LOCK_MODE | LOCK_STATUS | LOCK_DATA |+-----------+---------------+-------------+------------+-----------+------------------------+-------------+-----------+| 54 | demo | t | NULL | TABLE | IX | GRANTED | NULL || 54 | demo | t | PRIMARY | RECORD | X,GAP | GRANTED | 10 || 53 | demo | t | NULL | TABLE | IX | GRANTED | NULL || 53 | demo | t | PRIMARY | RECORD | X,GAP,INSERT_INTENTION | WAITING | 10 |+-----------+---------------+-------------+------------+-----------+------------------------+-------------+-----------+-- 开启事务Cmysql> begin;-- 在事务C中插入id=16的数据,由于该间隙上没有间隙锁,申请插入意向锁成功mysql> insert into t values(16,16,16);/**查询当前加锁情况,并没有发现插入意向锁,为什么?插入意向锁是为了配合间隙锁解决幻读问题,在有间隙锁的情况下进行堵塞。此时没有间隙锁,不需要堵塞,所以就不用加插入意向锁吗?但其他事务中相同行插入会产生冲突,说明这里还是有其他约束的,只是不用堵塞的插入意向锁转换成另外一种约束了**/mysql> select THREAD_ID,OBJECT_SCHEMA,OBJECT_NAME,INDEX_NAME,LOCK_TYPE,LOCK_MODE,LOCK_STATUS,LOCK_DATAfrom performance_schema.data_locks;+-----------+---------------+-------------+------------+-----------+-----------+-------------+-----------+| THREAD_ID | OBJECT_SCHEMA | OBJECT_NAME | INDEX_NAME | LOCK_TYPE | LOCK_MODE | LOCK_STATUS | LOCK_DATA |+-----------+---------------+-------------+------------+-----------+-----------+-------------+-----------+| 53 | demo | t | NULL | TABLE | IX | GRANTED | NULL |+-----------+---------------+-------------+------------+-----------+-----------+-------------+-----------+/**证明其他约束的存在,新启一个事务,同样执行insert into t values(16,16,16),可以看到申请S锁堵塞,正常上一个事务中的插入有其他约束这里需要进行唯一约束验证,获取id=16的读锁**/mysql> select THREAD_ID,OBJECT_SCHEMA,OBJECT_NAME,INDEX_NAME,LOCK_TYPE,LOCK_MODE,LOCK_STATUS,LOCK_DATAfrom performance_schema.data_locks;+-----------+---------------+-------------+------------+-----------+---------------+-------------+-----------+| THREAD_ID | OBJECT_SCHEMA | OBJECT_NAME | INDEX_NAME | LOCK_TYPE | LOCK_MODE | LOCK_STATUS | LOCK_DATA |+-----------+---------------+-------------+------------+-----------+---------------+-------------+-----------+| 53 | demo | t | NULL | TABLE | IX | GRANTED | NULL || 56 | demo | t | NULL | TABLE | IX | GRANTED | NULL || 56 | demo | t | PRIMARY | RECORD | X,REC_NOT_GAP | GRANTED | 16 || 56 | demo | t | PRIMARY | RECORD | S,REC_NOT_GAP | WAITING | 16 |+-----------+---------------+-------------+------------+-----------+---------------+-------------+-----------+-- 开启事务Dmysql> begin;-- 在事务D中插入id=10,mysql> insert into t values(10,10,10);1062 - Duplicate entry '10' for key 'PRIMARY'-- 在事务E中插入id=9mysql> insert into t values(9,9,9);(blocked)/**V8.0.11查看当前加锁情况,事务D插入语句检测到唯一冲突后在id=10上加了一个S锁事务E中插入id=9,等待插入意向锁,没有间隙锁冲突,为什么会堵塞呢?唯一键冲突加的应该不是一个记录S锁,应该是一个next-key lock (5,10],因为已经存在间隙锁,所以插入意向锁才会堵塞这是MySQL的一个bug,在V8.0.16已经修复,事务E中插入不会堵塞(主键唯一冲突就是一个单纯的记录锁)https://bugs.mysql.com/bug.php?id=93806**/mysql> select THREAD_ID,OBJECT_SCHEMA,OBJECT_NAME,INDEX_NAME,LOCK_TYPE,LOCK_MODE,LOCK_STATUS,LOCK_DATAfrom performance_schema.data_locks;+--------+-----------+---------------+-------------+------------+-----------+-----------+-------------+-----------+| ENGINE | THREAD_ID | OBJECT_SCHEMA | OBJECT_NAME | INDEX_NAME | LOCK_TYPE | LOCK_MODE | LOCK_STATUS | LOCK_DATA |+--------+-----------+---------------+-------------+------------+-----------+-----------+-------------+-----------+| INNODB | 109 | demo | t | NULL | TABLE | IX | GRANTED | NULL || INNODB | 109 | demo | t | PRIMARY | RECORD | S | GRANTED | 10 || INNODB | 108 | demo | t | NULL | TABLE | IX | GRANTED | NULL || INNODB | 108 | demo | t | PRIMARY | RECORD | X,GAP | WAITING | 10 |+--------+-----------+---------------+-------------+------------+-----------+-----------+-------------+-----------+
原因及解决
show engine innodb status查看最近一次的死锁日志,如下所示:
## 执行以下语句查看最近一次死锁日志show engine innodb status;...------------------------LATEST DETECTED DEADLOCK------------------------2022-02-07 21:00:00 0x7f3972be1700*** (1) TRANSACTION:TRANSACTION 401368548, ACTIVE 0 sec starting index readmysql tables in use 1, locked 1LOCK WAIT 4 lock struct(s), heap size 1136, 4 row lock(s), undo log entries 3MySQL thread id 24621163, OS thread handle 139885409634048, query id 1074586194 172.21.32.245 root Updatingupdate `campaign_job_instance`SET `total_count`=6,`status`='SENDING',`fetch_start_time`=1644238560063,`fetch_end_time`=1644238560115,`send_start_time`=1644238800550,`send_end_time`=null,`update_timestamp`=1644238800550where id=5193*** (1) WAITING FOR THIS LOCK TO BE GRANTED:RECORD LOCKS space id 10003 page no 263 n bits 104 index PRIMARY of table `user_label`.`campaign_job_instance` trx id 401368548 lock_mode X locks rec but not gap waitingRecord lock, heap no 35 PHYSICAL RECORD: n_fields 17; compact format; info bits 00: len 4; hex 00001449; asc I;;1: len 6; hex 000017ec6739; asc g9;;2: len 7; hex 390002800f1934; asc 9 4;;3: len 28; hex 6a6f622d363166653865333464636531613630303031346636333166; asc job-61fe8e34dce1a600014f631f;;4: len 24; hex 363230313136653033333863656530303031326637346337; asc 620116e0338cee00012f74c7;;5: len 30; hex 3631666538653334646365316136303030313466363331642d3230323230; asc 61fe8e34dce1a600014f631d-20220; (total 37 bytes);6: len 3; hex 534d53; asc SMS;;7: len 24; hex 363166653865333464636531613630303031346636333164; asc 61fe8e34dce1a600014f631d;;8: len 30; hex 66353464623661372d336263662d346361312d613630302d643337616234; asc f54db6a7-3bcf-4ca1-a600-d37ab4; (total 36 bytes);9: len 4; hex 53454e54; asc SENT;;...*** (2) TRANSACTION:TRANSACTION 401368889, ACTIVE 0 sec starting index readmysql tables in use 1, locked 13 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1MySQL thread id 24621104, OS thread handle 139884714923776, query id 1074586215 172.21.32.245 root Updatingupdate `campaign_job_instance`SET `total_count`=6,`status`='SENT',`fetch_start_time`=1644238560026,`fetch_end_time`=1644238560120,`send_start_time`=1644238800280,`send_end_time`=1644238800554,`update_timestamp`=1644238800554where id=5192*** (2) HOLDS THE LOCK(S):RECORD LOCKS space id 10003 page no 263 n bits 104 index PRIMARY of table `user_label`.`campaign_job_instance` trx id 401368889 lock_mode X locks rec but not gapRecord lock, heap no 35 PHYSICAL RECORD: n_fields 17; compact format; info bits 00: len 4; hex 00001449; asc I;;1: len 6; hex 000017ec6739; asc g9;;2: len 7; hex 390002800f1934; asc 9 4;;3: len 28; hex 6a6f622d363166653865333464636531613630303031346636333166; asc job-61fe8e34dce1a600014f631f;;4: len 24; hex 363230313136653033333863656530303031326637346337; asc 620116e0338cee00012f74c7;;5: len 30; hex 3631666538653334646365316136303030313466363331642d3230323230; asc 61fe8e34dce1a600014f631d-20220; (total 37 bytes);6: len 3; hex 534d53; asc SMS;;7: len 24; hex 363166653865333464636531613630303031346636333164; asc 61fe8e34dce1a600014f631d;;8: len 30; hex 66353464623661372d336263662d346361312d613630302d643337616234; asc f54db6a7-3bcf-4ca1-a600-d37ab4; (total 36 bytes);9: len 4; hex 53454e54; asc SENT;;...*** (2) WAITING FOR THIS LOCK TO BE GRANTED:RECORD LOCKS space id 10003 page no 263 n bits 104 index PRIMARY of table `user_label`.`campaign_job_instance` trx id 401368889 lock_mode X locks rec but not gap waitingRecord lock, heap no 23 PHYSICAL RECORD: n_fields 17; compact format; info bits 00: len 4; hex 00001448; asc H;;1: len 6; hex 000017ec65e4; asc e ;;2: len 7; hex 0e000002a31da0; asc ;;3: len 28; hex 6a6f622d363230303832346464636531613630303031346636333261; asc job-6200824ddce1a600014f632a;;4: len 24; hex 363230313136653033333863656530303031326637346334; asc 620116e0338cee00012f74c4;;5: len 30; hex 3632303038323464646365316136303030313466363332382d3230323230; asc 6200824ddce1a600014f6328-20220; (total 37 bytes);6: len 3; hex 534d53; asc SMS;;7: len 24; hex 363230303832346464636531613630303031346636333238; asc 6200824ddce1a600014f6328;;8: len 30; hex 33333839633535662d663262612d346630392d393432632d353839633865; asc 3389c55f-f2ba-4f09-942c-589c8e; (total 36 bytes);9: len 7; hex 53454e44494e47; asc SENDING;;...*** WE ROLL BACK TRANSACTION (2)------------TRANSACTIONS------------...
锁争用体现在主键行锁(RECORD LOCKS... index PRIMARY...lock_mode X locks rec but not gap waiting) 事务1更新id=5193的数据,且等待其行锁 事务2更新id=5192的数据,且等待其行锁 事务2已持有id=5193的行锁
LOCK WAIT 4 lock struct(s), heap size 1136, 4 row lock(s), undo log entries 3还是可以看出事务1中持有其他行锁。
通过上述日志分析,基本上可以直接定位到业务代码中锁冲突发生的位置
/*** 读取文件写入数据明细表,最后通过主键更新campaign_job_instance*/class SaveUserJob implements SimpleJob {@Autowired@TransactionalCampaignJobInstanceService campaignJobInstanceService@Overridevoid execute(ShardingContext shardingContext) {log.info("save user:${JsonUtil.toString(shardingContext)}")List<CampaignJobInstance> jobInstances = campaignJobInstanceService.findByStatus(CampaignJobInstanceStatusEnum.SAVING)jobInstances.each { jobInstance ->...updateById(jobInstance)...}}}/*** 发送SaveUserJob写入的数据明细,最后主键更新campaign_job_instance*/class SendJob implements DataflowJob<CampaignUserLog> {...@Override@Transactionalvoid processData(ShardingContext shardingContext, List<CampaignUserLog> logs) {connectorExecuteService.execute(logs.findAll { it.status == CampaignUserJobStatusEnum.SENDING })logs.groupBy { it.jobInstanceId }.each {CampaignJobInstance jobInstance = findByInstanceId(it.key)...updateById(jobInstance)}}...}
SaveUserJob、SendJob中均涉及到遍历主键更新campaign_job_instance,且均开启事务。造成死锁的原因就很清楚了,两个事务中更新顺序不同带来的加锁顺序不同,造成相互持有对方需要的锁,从而死锁。
但是为什么之前测试时没有出发?
圈选的用户量过少,save user的瞬间还没来得及更新状态就发送完成了
SendJob的频次由2分钟改为了5分钟,会累积多个jobInstance到一个事务中
知道了死锁发生的原因,解决方案就很简单了,减小锁粒度即可。
class SaveUserJob implements SimpleJob {@AutowiredCampaignJobInstanceService campaignJobInstanceService@Override// 外层仅查询,去除事务void execute(ShardingContext shardingContext) {log.info("save user:${JsonUtil.toString(shardingContext)}")List<CampaignJobInstance> jobInstances = campaignJobInstanceService.findByStatus(CampaignJobInstanceStatusEnum.SAVING)jobInstances.each { jobInstance ->// 抽取到service层,加上事务campaignJobInstanceService.saveUser(jobInstance)}}}class SendJob implements DataflowJob<CampaignUserLog> {...@Override// 外层仅查询,去除事务void processData(ShardingContext shardingContext, List<CampaignUserLog> logs) {...logs.groupBy { it.jobInstanceId }.each {...// 抽取到service层,加上事务CampaignJobInstance jobInstance = findByInstanceId(it.key)campaignJobInstanceService.checkJobInstance(jobInstance)}}...}
TODO










