下面的案例是几年前老白在一个金融客户那边遇到的,死锁问题是我们十分常见的一种问题,在很多用户那边,几乎每天都会出现死锁报错。绝大多数情况下,死锁是因为应用软件本身的逻辑导致的,很多情况下是无法彻底解决的,应用程序可以通过设置行锁超时,使用nowait锁等方式主动解开死锁,但是这种处置方式十分低效,而且业务逻辑较难控制,所以数据库对死锁情况采取主动检测的方式,一旦发现死锁,自动解锁,从而避免真正的锁死情况。从死锁的产生与解决上来看,绝大多数的死锁问题并不会引起应用系统的故障,因此绝大多数用户对死锁也就采取忽视的态度了。
---------Blocker(s)-------- ---------Waiter(s)---------
Resource Name process session holds waits process session holds waits
TX-0018001c-0009869a 167 921 X 162 915 S
TX-002c002d-0009972b 162 915 X 167 921 S
session 921: DID 0001-00A7-000001D2session 915: DID 0001-00A2-000004F0
session 915: DID 0001-00A2-000004F0session 921: DID 0001-00A7-000001D2
Rows waited on:
Session 915: obj - rowid = 000029A0 - AAAkiyAAiAABzOfAAA
(dictionary objn - 10656, file - 34, block - 471967, slot - 0)
Session 921: obj - rowid = 000029A0 -AAAkiyAAMAAC4DfAAA
(dictionary objn - 10656, file - 12, block - 753887, slot - 0)
Information on the OTHER waiting sessions:
Session 915:
pid=162serial=17986 audsid=37348943 user: 27/OPS$CBS
O/S info:user: cbs, term: pts/0, ospid: 1008342, machine: SOP-P
program: ddddP (TNS V1-V3)dddd
applicationname:d'd'd'd (TNS V1-V3), hash value=0
Current SQ LStatement:
select * from XXXX where ((((XXXX='4' and XXXXin('0','2')) and XXXX='0') and XXXX=:b0) and XX=:b1) order XXXX,XXXX,XXXX,XXXX for update
End of information on OTHER waiting sessions.
很明显可以看到一个SELECT ... FOR UPDATE语句,这种语句也是最容易出现死锁的语句。比较幸运的是这是一个单表查询。只要去检查一下相关的表的情况,就比较容易发现问题了。
从TRACE文件中看到,有两个会话出现了死锁:
Session 915: obj - rowid = 000029A0 -AAAkiyAAiAABzOfAAA
(dictionary objn- 10656, file - 34, block - 471967, slot - 0)
Session 921: obj - rowid = 000029A0 -AAAkiyAAMAAC4DfAAA
(dictionaryobjn - 10656, file - 12, block - 753887, slot - 0)
SO: 70000060e50e660, type: 4, owner:70000060fcf59d8, flag: INIT/-/-/0x00
(session) sid: 921 trans: 700000605fbdc38,creator: 70000060fcf59d8, flag: (41) USR/- BSY/-/-/-/-/-
DID: 0001-00A7-000001D2,short-term DID: 0000-0000-00000000
txn branch: 0 oct: 3, prv: 0, sql:7000005e371d398, psql: sss7000005fc981008, user: 27/ssss
O/S info: user: s's's, term: pts/0, ospid:311304, machine: ddd
program: ddd (TNS V1-V3)
application name: ddd (TNS V1-V3),hash value=0
last wait for
'enq: TX - allocate ITL entry' blocking sess=0x70000060e506a40 seq=57854 wait_time=2929708 secondssince wait started=51 name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
Dumping Session Wait History
for 'enq: TX - allocate ITL entry' count=1wait_time=2929708 name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929729
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929719
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929719
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929719
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929739
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929719
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929711
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929736
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
for 'enq: TX - allocate ITL entry' count=1wait_time=2929714
name|mode=54580004,usn<<16 | slot=2c002d, sequence=9972b
temporary object counter: 1
会话一直在等待itl事务槽。从这上面推断,一种很大的可能性是由于ITL不足,导致死锁的几率增加(由于没有SYSTEM STATE DUMP,因此仅仅是推断,不是完全的确认)。从业务角度分析,类似的进程有几十个,同时会访问objn=10656,通过数据字典检查,发现这张表和相关索引上面都没有设置initrans参数。一般情况下,ORACLE的事务锁是行锁,只会锁定某一行。不过某个事务在某一块中需要修改某条记录,首先需要占用一个事物槽(事物槽的数量由initrans参数确定,如果事物槽数量不足,可以自动扩展,不过如果块中无空闲空间,则事物槽无法扩展),如果事物槽数量不足,则可能在未锁定某行前,先等待事物槽,这种等待一旦比较严重,事务锁的时间会大幅度增加,加大了死锁的可能性,甚至可能直接导致死锁。
原因定位后,我们修改了这张表和表上的索引的INITRANS和PCTFREE参数,然后对这张表做了MOVE操作。优化后,这个问题消失了。




