DB CACHE是Oracle十分重要的数据结构。老白刚刚开始学习ORACLE数据库的时候,有个前辈指导老白,看oracle性能好不好,检查下DB CACHE的命中率就行了,保证DB CACHE的命中率在90%以上,确保不低于85%,你的系统性能就不会有太大的问题了。后来在很多项目中验证过,发现这一招还真的挺灵的。不过那时候是90年代初,应用软件还比较简单,并发量还不大。确实如果确保大部分数据是在CACHE中获得的,并且通过DB CACHE命中率发现一些有问题的SQL,并进行优化,效果也挺好的。于是凭着这手绝活,老白在圈子里渐渐有了优化高手的名号。
随着计算机技术的发展,海量内存的使用,现在绝大多数系统的DB CACHE命中率都已经高达99%以上了。很多初学者对DB CACHE为何有如此高的命中率十分疑惑,实际上,这和Oracle DB CACHE巧妙的数据结构和卓越的算法是分不开的。今天我们来探究一下DB CACHE里的一些重要的数据结构和算法。
DB CACHE的算法是基于一系列链的,DB CACHE中有很多重要的链,这些链或者和DB CACHE的分配、换出、刷新优化,或者和DB BLOCK的查找有关。下面我们先来了解这些链的基本情况,然后我们来探讨一下这些链在各种DB CACHE活动中所起的作用。
LRU list 和LRU-AUX list
LRU-W list (write list)
LRU-W list就是我们常说的脏数据链,对于LRU-W链,也存在一个AUX链LRUW-AUX。引入AUX链的目的是为了提高LRU-W的效率,支持异步DBWR作业。在这个链没有引入以前,当一个DBWR作业没有完成前,是无法进行下一个DBWR作业的。引入AUX后,AUX链存放的是当前DBWR正在写入的数据,LRU-W就腾出来处理下一个DBWR作业了。前台进程在查找空闲CACHE的时候,如果扫描到了一个脏块,就会主动将脏块移到LRU-W LIST上,从而避免这个脏块被多次无效的扫描到。Oracle这种将一些脏活累活分散到某个前台会话中的做法,是一种十分有效的提高性能的方法,十分值得我们数据库开发厂家学习。
LRU-XO list (reuse object list)
LRU-XR list (reuse range list)
和LRU-XO链类似,XO是针对某个SEGMENT的,而LRU-XR主要针对表空间这类对象。LRU-XR list也被称为Reuse range链,当reuse range cross instance call事件发生的时候,CKPT搜索这些BUFFER,把脏数据块放入LRU-XR链,然后由DBWR将这些数据写盘。当CKPT发现这个链已经空的时候,本次CHECKPOINT结束。如果我们要将某个表空间OFFLINE或者将某个表空间设置为READ ONLY的时候,这个时候就会产生ruse range cross instance call事件,因为要安全的完成这些操作,必须要将表空间上的脏数据全部写入磁盘。
struct kcbbh { * 10201 struct kcbbh */ kgglk kcbbhha; * hash chain buffer is on */ ktsn kcbbhtsn; * tablespace number */ krdba kcbbhrdba; * relative DBA */ ub4 kcbbhflg; * flags: all changes require hash latch */ kobjd kcbbhobj; * Object # (disk )for block (if known) */ kobjn kcbbhobjn; * dictionary object # (if known) */ BitMask for "struct kcbbh.kcbbhobjn" KCBBHFBD BIT 0x00000001 // Buffer Dirty KCBBHREU BIT 0x00000002 // buffer has been reused KCBBHFMS BIT 0x00000004 // modification started, no new writes KCBBHFPB BIT 0x00000008 // private buffer KCBBHFTD BIT 0x00000010 // temporary data, no redo for changes KCBBHFBW BIT 0x00000020 // being written, cannot modify KCBBHFFG BIT 0x00000040 // foreground waiting for buffer KCBBHFCP BIT 0x00000080 // clone of a private buffer KCBBHFIM BIT 0x00000100 // pinned for IMU, do not write KCBBHFUL BIT 0x00000200 // unlink from lock element and make cr KCBBHFDG BIT 0x00000400 // down grade lock KCBBHFCW BIT 0x00000800 // a clone copy is being written KCBBHFCR BIT 0x00001000 // reading from disk into KCBBHCR buffer KCBBHFGC BIT 0x00002000 // has been gotten in current mode KCBBHFST BIT 0x00004000 // stale, unused CR buf made from current KCBBHFDP BIT 0x00008000 // deferred ping KCBBHFDA BIT 0x00010000 // direct Access to buffer contents KCBBHFVB BIT 0x00020000 // buffer being evicted for cache shrink KCBBHFPW BIT 0x00040000 // prewarmed buffer KCBBHFSQ BIT 0x00080000 // sequential scan only flag KCBBHFBP BIT 0x00100000 // Buffer was Prefetched KCBBHFFW BIT 0x00200000 // buffer has been written once KCBBHFFB BIT 0x00400000 // buffer is logically flushed KCBBHFTP BIT 0x01000000 // dma transfer in progress KCBBHFRG BIT 0x02000000 // Redo Generated since block read KCBBHFWC BIT 0x04000000 // Write has been Completed KCBBHFSI BIT 0x08000000 // block transferred from another instance KCBBHFDB BIT 0x20000000 // buffer is directly from a foreign DB KCBBHFAW BIT 0x40000000 // flush after writing b1 kcbbhst; /* state of the buffer */ BitMask for "struct kcbbh.kcbbhst" KCBBHFREE BIT 0 // buffer free KCBBHEXLCUR BIT 1 // buffer current (and is DFS locked X) KCBBHSHRCUR BIT 2 // buffer current (and is DFS locked S) KCBBHCR BIT 3 // buffer consistant read KCBBHREADING BIT 4 // being read KCBBHMRECOVERY BIT 5 // media recovery KCBBHIRECOVERY BIT 6 // crash or instance recovery KCBBHWRITING BIT 7 // write clone KCBBHPI BIT 8 // past image KCBBHMEMORY BIT 9 // allocated memory buffer KCBBHMWRITING BIT 10 // media recovery write clone KCBBHDONATED BIT 11 // donated to another memory component b1 kcbbhmd; /* mode owned in (KCBMNULL, KCBMSHR, KCBMEXCL) */ ub2 kcbbhcla; /* block class */ ub2 kcbbhafn; /* absolute file number */ b1 kcbbhccnt; /* number of changes to buffer in single kcbchg() */ b1 kcbbhcst; /* change state for recovery if failure during kcbchg() */ BitMask for "struct kcbbh.kcbbhcst" KCBBHCNUL BIT 0 // No change in progress. KCBBHFMS should be off KCBBHCNAY BIT 1 // changes not active yet: redo may have been written KCBBHCACT BIT 2 // changes active: buffer contents unpredictable KCBBHCRCV BIT 3 // changes recovering: block recovery in progress KCBBHCCMP BIT 4 // changes complete: buffer cleanup still needed ub2 kcbbhssid; /* simulation set id */ ub1 kcbbhsflg; /* rac flags */ BitMask for "struct kcbbh.kcbbhsflg" KCBBHFMW BIT 0x01 // multiple waiters when gc lock acquired KCBBHFAK BIT 0x02 // locked with an affinity lock KCBBHFPA BIT 0x04 // PI buffer has been aged once KCBBHFSP BIT 0x08 // recovery suspect resolution in progress KCBBHFRE BIT 0x10 // renounce in progress KCBBHFSU BIT 0x20 // write suspect resolution in progress ub1 kcbbhcqid; /* which ckpt queue in working set buffer is */ kgglk kcbbhckql; /* link for checkpoint queue */ kgglk kcbbhfql; /* link for per-file checkpoint queue */ ptr_t kcbbhba; /* buffer base address (set when mapped) */ kscn kcbbhdscn; /* incremental Transactional DSCN */ kgglk kcbbhus; /* list of users */ kgglk kcbbhwa; /* list of waiters */ kgglk kcbbhrpl; /* link for maintaining position on replacement chain */ b1 kcbbhfoq; /* TRUE iff the buffer is on a write list */ b1 kcbbhlpf; /* LRU latch protected flags */ BitMask for "struct kcbbh.kcbbhlpf" KCBBHLDF BIT 0x01 // LRU Dump Flag used in debug print routine KCBBHLMT BIT 0x02 // moved to tail of lru (for extended stats) KCBBHLAL BIT 0x04 // on auxiliary list KCBBHLHB BIT 0x08 // hot buffer - not in cold portion of lru KCBBHOAX BIT 0x10 // buffer in auxiliary object queue KCBBHOWR BIT 0x20 // buffer in write object queue ub2 kcbbhtch; /* touch count */ ub4 kcbbhtim; /* time of last touch count increment */ union kcbbh.UNK_10526973 kcbbhu; void * kcbbhxcb; /* pointer to transaction state object */ kcocv * kcbbhlcv; /* pointer to last chg in a series of chg's */ struct kcbwds * kcbbhds; /* system set descriptor */ krfgda kcbbhgda; /* flashback generation disk address */ kgglk kcbbhoq; /* object queue the buffer is on */ struct kcbbh.UNK_10526983 kcbbhsh; /* all fields that are needed in shared mode */ kcbtrcbuf kcbbhhashtrc; /* trace of events under hash latch */ } |
首先,我们来看看CHECKPOINT QUEUE。CHECKPOINT QUEUE是通过LOW RBA排序的一个链表,当某个BLOCK发生变化的时候,会将该BLOCK按照LOW RBA地址链入相关的CHECKPOINT QUEUE,这些链表包括标准CKPT QUEUE(老白在这里用了“标准”两个字,实际上这不是Oracle 内部的术语,这部分内容公布的很少,因此老白用“标准”两字来将其区别于RANGE和OBJECT两个CKPT QUEUE)、XO CKPT QUEUE(OBJECT CKPT,每个对象一个)和XR CKPT QUEUE(RANGE CKPT,术语称为PER-FILE CKPT QUEUE,每个数据文件一个)。
在做普通的CHECKPOINT操作的时候,CKPT进程找到CKPT链的头部,将CKPT QUEUE交给DBWR,由DBWR组织CHECKPOINT DBWR批处理,DBWR进程开始写入操作,由于CKPT QUEUE是按照LOW RBA的顺序排序的,因此在做CHECKPOINT DBWR批处理的时候,是按照RBA的顺序写入脏数据的。当DBWR完成写操作后,CKPT进程更高控制文件和相关的文件头的SCN数据,记录本次CKPT操作的结果。
由于CKPT QUEUE上的BUFFER有可能在LRU链上,也有可能在LRU-W链上,在组织DBWR批处理的时候,LRU链上的BUFFER是否要摘下放入LRU-W,这些算法目前CHECKPOINT 写入的时候,这些BLOCK不从LRU LIST上移走,写完后,从CKPTQ里清除,这和普通的LRUW中的块被写入数据库是不同的。
当TRUNCATE,DROP之类的操作发生的时候,也会发生CHECKPOINT,这类的CHECKPOINT被称为是REUSE OBJECT CHECKPOINT,发生这个操作的时候的具体操作目前很少有资料说明。老白猜测当REUSE OBJECT CHECKPOINT发生时,CKPT进程将kcbbhoq指向的OBJECT CHECKPOINT链上的脏块链入LRU-XO,交给DBWR处理。DBWR组织写批处理,进行写入操作。只有当所有的LRU-XO上的块全部被写入后,DDL操作才能完成。这种算法也造成了TRUNCATE一个大对象可能会很慢(如果这个对象很多块是脏块的话)。kcbbhoq字段是一个对象的DB CACHE链。这条链的存在对于REUSED OBJECT CROSS INSTANCE CALL的性能有很大的帮助。
当表空间设置READ ONLY或者OFFLINE的时候,会触发REUSE RANGE的CHECKPOINT,其原理和REUSE OBJECT的CHECKPOINT类似。通过在BUFFER HEAD中的kcbbhfql将每个数据文件的脏块串成一条链。这样,当REUSE RANGE CHECKPOINT发生时,CKPT将这条链链入LRU-XR,然后交给DBWR处理。




