之前的一篇文章中介绍了如何去看信号量等待日志,这篇结合一个实际例子去看下
背景介绍
一台线上MySQL实例运行中Hang死情况,导致数据库不可用, 当时对应的 show engine innodb status\G日志如下:
020-11-25T07:45:35.437594+08:00 2887925 [Note] Aborted connection 2887925 to db: 'xxx' user: 'xxx' host: 'xxxx' (Got an error reading communication packets)2020-11-25T09:33:15.302725+08:00 0 [Warning] InnoDB: A long semaphore wait:--Thread 140477271320320 has waited at ibuf0ibuf.cc line 3439 for 241.00 seconds the semaphore:S-lock on RW-latch at 0x7fc5a5faa520 created in file buf0buf.cc line 1460a writer (thread id 140477635114752) has reserved it in mode exclusivenumber of readers 0, waiters flag 1, lock_word: fffffffff0000000Last time read locked in file ibuf0ibuf.cc line 4552Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/ibuf/ibuf0ibuf.cc line 4062020-11-25T09:33:15.302783+08:00 0 [Warning] InnoDB: A long semaphore wait:--Thread 140477265372928 has waited at btr0cur.cc line 976 for 241.00 seconds the semaphore:SX-lock on RW-latch at 0x7fc3189b57b0 created in file dict0dict.cc line 2687a writer (thread id 140477291595520) has reserved it in mode SXnumber of readers 0, waiters flag 1, lock_word: 10000000Last time read locked in file btr0cur.cc line 1008Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/btr/btr0cur.cc line 9762020-11-25T09:33:15.302798+08:00 0 [Warning] InnoDB: A long semaphore wait:--Thread 140477883373312 has waited at ibuf0ibuf.cc line 2518 for 241.00 seconds the semaphore:S-lock on RW-latch at 0x7fc5a5faa520 created in file buf0buf.cc line 1460a writer (thread id 140477635114752) has reserved it in mode exclusivenumber of readers 0, waiters flag 1, lock_word: fffffffff0000000Last time read locked in file ibuf0ibuf.cc line 4552Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/ibuf/ibuf0ibuf.cc line 406
数据库中报大量的信号量等待,时间均在240s以上
故障分析:
从日志最开始报信号量等待看起:
--Thread 140477271320320 has waited at ibuf0ibuf.cc line 3439 for 241.00 seconds the semaphore:S-lock on RW-latch at 0x7fc5a5faa520 created in file buf0buf.cc line 1460a writer (thread id 140477635114752) has reserved it in mode exclusivenumber of readers 0, waiters flag 1, lock_word: fffffffff0000000Last time read locked in file ibuf0ibuf.cc line 4552Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/ibuf/ibuf0ibuf.cc line 406
从上述信息可以得出以下内容:
线程140477271320320 在:ibuf0ibuf.cc line 3439 添加S锁时 发生了等待
锁创建在:buf0buf.cc line 1460
这里可以再看下之前《MySQL信号量等待日志阅读》这篇
线程140477635114752以独占方式,占有了这个锁
lock_word: fffffffff0000000 = - 268435456
此问题中 lock_word: fffffffff0000000 = - 268435456 即- X_LOCK_HALF_DECR,代表添加了X和SX锁,线程
140477271320320再次申请S锁时就发生了等待。
已知线程140477271320320等待线程140477635114752,查询线程140477635114752持有锁和锁等待情况:
--Thread 140477635114752 has waited at btr0sea.ic line 128 for 241.00 seconds the semaphore:S-lock on RW-latch at 0xffbbc08 created in file btr0sea.cc line 195a writer (thread id 140458712454912) has reserved it in mode wait exclusivenumber of readers 1, waiters flag 1, lock_word: ffffffffffffffffLast time read locked in file btr0sea.ic line 146Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/btr/btr0sea.cc line 1300
lock_word: ffffffffffffffff = -1,说明存在S类型的锁且有写者在等待,等待的线程是140458712454912
继续查看140458712454912持有锁情况
--Thread 140458712454912 has waited at btr0cur.cc line 3874 for 241.00 seconds the semaphore:X-lock (wait_ex) on RW-latch at 0xffbbc08 created in file btr0sea.cc line 195a writer (thread id 140458712454912) has reserved it in mode wait exclusivenumber of readers 1, waiters flag 1, lock_word: ffffffffffffffffLast time read locked in file btr0sea.ic line 146Last time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/btr/btr0sea.cc line 1300
140458712454912线程要申请latch的x-lock发生了等待
--Thread 140458712454912 has waited at btr0cur.cc line 3874 for 272.00 seconds the semaphore:rw_lock_x_lock(btr_get_search_latch(index));锁创建在:X-lock (wait_ex) on RW-latch at 0xffbbc08 created in file btr0sea.cc line 195a writer (thread id 140458712454912) has reserved it in mode wait exclusive对应代码如下:rw_lock_create(btr_search_latch_key, btr_search_latches[i], SYNC_SEARCH_SYS); /** Latches protecting access to adaptive hash index. */extern rw_lock_t** btr_search_latches;根据信息知道是一个AHI的latch,是在数据库启动时候创建的
这个锁最后一次读取和写入分别是在:
Last time read locked in file btr0sea.ic line 146对应的函数是btr_search_s_lock_allLast time write locked in file /dba/app/mysql_home/5.7.19/storage/innobase/btr/btr0sea.cc line 1300对应的函数是btr_search_drop_page_hash_index
但不是140458712454912读取和写入这个锁,因为 lock_word: ffffffffffffffff = -1 ,说明存在S类型的锁且写者在等待,等待的线程就是140458712454912。
查看错误日志发现,有一个truncate操作,与客户工程师确认,truncate的表100G,而且是在truncate操作执行之后,数据库就开始处于hang住状态
原因分析
通过gdb跟踪truncate 堆栈信息,发现truncate操作时会扫描page并删除对应的自适应hash索引:
#0 btr_search_drop_page_hash_index (block=0x7f186bb88760) at /mydata/Project/mysql-server/storage/innobase/btr/btr0sea.cc:1175#1 0x0000000001b30adc in btr_search_drop_page_hash_when_freed (page_id=..., page_size=...) at /mydata/Project/mysql-server/storage/innobase/btr/btr0sea.cc:1387#2 0x0000000001be8b2c in fseg_free_extent (seg_inode=0x7f1863d400f2 "", space=37, page_size=..., page=3072, ahi=true, mtr=0x7f192027c3b0) at /mydata/Project/mysql-server/storage/innobase/fsp/fsp0fsp.cc:3795#3 0x0000000001be90e2 in fseg_free_step (header=0x7f18713d004a "", ahi=true, mtr=0x7f192027c3b0) at /mydata/Project/mysql-server/storage/innobase/fsp/fsp0fsp.cc:3887#4 0x0000000001afdfe8 in btr_free_but_not_root (block=0x7f186bf7c0a0, log_mode=MTR_LOG_NO_REDO) at /mydata/Project/mysql-server/storage/innobase/btr/btr0btr.cc:1152#5 0x0000000001afe18a in btr_free_if_exists (page_id=..., page_size=..., index_id=55, mtr=0x7f192027cae0) at /mydata/Project/mysql-server/storage/innobase/btr/btr0btr.cc:1200#6 0x0000000001b82112 in dict_drop_index_tree (rec=0x7f190ba68cfd "", pcur=0x7f192027cff0, mtr=0x7f192027cae0) at /mydata/Project/mysql-server/storage/innobase/dict/dict0crea.cc:1160#7 0x0000000001a4f926 in DropIndex::operator() (this=0x7f192027d150, mtr=0x7f192027cae0, pcur=0x7f192027cff0) at /mydata/Project/mysql-server/storage/innobase/row/row0trunc.cc:947#8 0x0000000001a57bd5 in IndexIterator::for_each<DropIndex> (this=0x7f192027cae0, callback=...) at /mydata/Project/mysql-server/storage/innobase/row/row0trunc.cc:102#9 0x0000000001a5604d in SysIndexIterator::for_each<DropIndex> (this=0x7f192027d304, callback=...) at /mydata/Project/mysql-server/storage/innobase/row/row0trunc.cc:160#10 0x0000000001a51b3c in row_truncate_table_for_mysql (table=0x7f17b0947c00, trx=0x7f1921d6e8c0) at /mydata/Project/mysql-server/storage/innobase/row/row0trunc.cc:2007#11 0x00000000018be415 in ha_innobase::truncate (this=0x7f17b0de3a30) at /mydata/Project/mysql-server/storage/innobase/handler/ha_innodb.cc:12382#12 0x0000000000f2d0e3 in handler::ha_truncate (this=0x7f17b0de3a30) at /mydata/Project/mysql-server/sql/handler.cc:4675#13 0x000000000176c94d in Sql_cmd_truncate_table::handler_truncate (this=0x7f17b0006680, thd=0x7f17b0000b70, table_ref=0x7f17b00060f0, is_tmp_table=false) at /mydata/Project/mysql-server/sql/sql_truncate.cc:244#14 0x000000000176d1ae in Sql_cmd_truncate_table::truncate_table (this=0x7f17b0006680, thd=0x7f17b0000b70, table_ref=0x7f17b00060f0) at /mydata/Project/mysql-server/sql/sql_truncate.cc:504#15 0x000000000176d315 in Sql_cmd_truncate_table::execute (this=0x7f17b0006680, thd=0x7f17b0000b70) at /mydata/Project/mysql-server/sql/sql_truncate.cc:560#16 0x0000000001533b80 in mysql_execute_command (thd=0x7f17b0000b70, first_level=true) at /mydata/Project/mysql-server/sql/sql_parse.cc:4856#17 0x00000000015358ce in mysql_parse (thd=0x7f17b0000b70, parser_state=0x7f192027e690)
同时在删除表上对应的buffer_pool中的page时,需要对所有的AHI latch添加S类型锁。
/* Lock the search latch in shared mode to prevent userfrom disabling AHI during the scan */btr_search_s_lock_all();DEBUG_SYNC_C("simulate_buffer_pool_scan");buf_LRU_flush_or_remove_pages(id, BUF_REMOVE_ALL_NO_WRITE, 0);btr_search_s_unlock_all();
删除的表比较大,所以会频繁的在AHI latch上添加x-lock和s-lock,140458712454912线程同样在latch上添加x-lock时,就
会进入自旋状态,通过错误日志中内容,可以看到数据库自旋状态非常严重:
OS WAIT ARRAY INFO: signal count 81353664 4191: RW-shared spins 0, rounds 192090016, OS waits 65971716 4192: RW-excl spins 0, rounds 294354293, OS waits 4971351 4193: RW-sx spins 2999653, rounds 34961484, OS waits 846918
140458712454912线程通过日志查看,是update操作,gdb调试看到,当buffpool不够用时,会LRU链表中释放page,这时
就需要删除AHI自适应hash索引
0 btr_search_drop_page_hash_index (block=0x7f6303580678) at /mydata/Project/mysql-server/storage/innobase/btr/btr0sea.cc:1300#1 0x0000000001b6fa18 in buf_LRU_free_page (bpage=0x7f6303580678, zip=true) at /mydata/Project/mysql-server/storage/innobase/buf/buf0lru.cc:2062#2 0x0000000001b6d038 in buf_LRU_free_from_common_LRU_list (buf_pool=0x44af3d8, scan_all=false) at /mydata/Project/mysql-server/storage/innobase/buf/buf0lru.cc:1078#3 0x0000000001b6d2a7 in buf_LRU_scan_and_free_block (buf_pool=0x44af3d8, scan_all=false) at /mydata/Project/mysql-server/storage/innobase/buf/buf0lru.cc:1120#4 0x0000000001b6dd59 in buf_LRU_get_free_block (buf_pool=0x44af3d8) at /mydata/Project/mysql-server/storage/innobase/buf/buf0lru.cc:1353#5 0x0000000001b4d843 in buf_page_init_for_read (err=0x7f62f838c1b4, mode=132, page_id=..., page_size=..., unzip=0) at /mydata/Project/mysql-server/storage/innobase/buf/buf0buf.cc:5145#6 0x0000000001b74299 in buf_read_page_low (err=0x7f62f838c1b4, sync=false, type=64, mode=132, page_id=..., page_size=..., unzip=false) at /mydata/Project/mysql-server/storage/innobase/buf/buf0rea.cc:149#7 0x0000000001b755bc in buf_read_ahead_linear (page_id=..., page_size=..., inside_ibuf=0) at /mydata/Project/mysql-server/storage/innobase/buf/buf0rea.cc:723#8 0x0000000001b4c077 in buf_page_get_gen (page_id=..., page_size=..., rw_latch=1, guess=0x0, mode=10, file=0x2185b88 "/mydata/Project/mysql-server/storage/innobase/btr/btr0pcur.cc", line=448, mtr=0x7f62f838cdb0, dirty_with_no_latch=false) at /mydata/Project/mysql-server/storage/innobase/buf/buf0buf.cc:4651#9 0x0000000001b261c7 in btr_block_get_func (page_id=..., page_size=..., mode=1, file=0x2185b88 "/mydata/Project/mysql-server/storage/innobase/btr/btr0pcur.cc", line=448, index=0x7f629c019d10, mtr=0x7f62f838cdb0) at /mydata/Project/mysql-server/storage/innobase/include/btr0btr.ic:63#10 0x0000000001b27f55 in btr_pcur_move_to_next_page (cursor=0x7f627003c628, mtr=0x7f62f838cdb0) at /mydata/Project/mysql-server/storage/innobase/btr/btr0pcur.cc:448#11 0x0000000001a3be3a in btr_pcur_move_to_next (cursor=0x7f627003c628, mtr=0x7f62f838cdb0) at /mydata/Project/mysql-server/storage/innobase/include/btr0pcur.ic:360#12 0x0000000001a48b1c in row_search_mvcc (buf=0x7f627003bb00 "\377\362\251\006", mode=PAGE_CUR_G, prebuilt=0x7f627003c410, match_mode=0, direction=1) at /mydata/Project/mysql-server/storage/innobase/row/row0sel.cc:6079#13 0x00000000018b8d5a in ha_innobase::general_fetch (this=0x7f627003b6f0, buf=0x7f627003bb00 "\377\362\251\006", direction=1, match_mode=0) at /mydata/Project/mysql-server/storage/innobase/handler/ha_innodb.cc:8996#14 0x00000000018b8f83 in ha_innobase::index_next (this=0x7f627003b6f0, buf=0x7f627003bb00 "\377\362\251\006") at /mydata/Project/mysql-server/storage/innobase/handler/ha_innodb.cc:9065#15 0x0000000000f29627 in handler::ha_index_next (this=0x7f627003b6f0, buf=0x7f627003bb00 "\377\362\251\006")#16 0x0000000001456e37 in rr_index (info=0x7f62f838de00) at /mydata/Project/mysql-server/sql/records.cc:475#17 0x00000000015e2a20 in mysql_update (thd=0x7f6270000b70, fields=..., values=..., limit=18446744073709114893, handle_duplicates=DUP_ERROR, found_return=0x7f62f838e458, updated_return=0x7f62f838e450) at /mydata/Project/mysql-server/sql/sql_update.cc:689#18 0x00000000015e9428 in Sql_cmd_update::try_single_table_update (this=0x7f6270006410, thd=0x7f6270000b70, switch_to_multitable=0x7f62f838e4ff) at /mydata/Project/mysql-server/sql/sql_update.cc:2883#19 0x00000000015e9959 in Sql_cmd_update::execute (this=0x7f6270006410, thd=0x7f6270000b70) at /mydata/Project/mysql-server/sql/sql_update.cc:3010#20 0x000000000153018e in mysql_execute_command (thd=0x7f6270000b70, first_level=true) at /mydata/Project/mysql-server/sql/sql_parse.cc:3579#21 0x00000000015358ce in mysql_parse (thd=0x7f6270000b70, parser_state=0x7f62f838f690) at /mydata/Project/mysql-server/sql/sql_parse.cc:5574#22 0x000000000152b1a0 in dispatch_command (thd=0x7f6270000b70, com_data=0x7f62f838fdf0, command=COM_QUERY) at /mydata/Project/mysql-server/sql/sql_parse.cc:1461#23 0x000000000152a092 in do_command (thd=0x7f6270000b70) at /mydata/Project/mysql-server/sql/sql_parse.cc:999#24 0x0000000001659c80 in handle_connection (arg=0x5180df0) at /mydata/Project/mysql-server/sql/conn_handler/connection_handler_per_thread.cc:300#25 0x0000000001cddf8b in pfs_spawn_thread (arg=0x5ae6b80) at /mydata/Project/mysql-server/storage/perfschema/pfs.cc:2188#26 0x00007f6317173dd5 in start_thread (arg=0x7f62f8390700) at pthread_create.c:307#27 0x00007f6315e37ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:11
从现象来看,是应为truncate 时候,释放AHI需要频繁添加x-lock,同时其它线程也需要添加x-lock,这时发生了阻塞现象。




