Extreme Performance: Optimizing Oracle Redo Log Writes with Flash Storage

Dai Mingming (戴明明) 2016-04-06

Dai Mingming, database solutions architect at 宝存科技

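The figure below shows a HammerDB stress test against 5,000 warehouses, driven by 200 virtual users. The transaction rate reached 67,000 transactions per second, which is a very high number. Redo generation (Redo Size) reached 340 MB per second, or more than 1 TB of redo per hour, a level that is very high for an Oracle database and rarely reached in practice.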
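The per-hour figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the reported "340M" means decimal megabytes (1 TB = 1,000,000 MB); the function names are illustrative and not part of any Oracle tool.

```python
def redo_per_hour_tb(mb_per_sec: float) -> float:
    """Convert a redo rate in MB/s to TB/hour (decimal units, an assumption)."""
    return mb_per_sec * 3600 / 1_000_000


def redo_per_tx_kb(mb_per_sec: float, tx_per_sec: float) -> float:
    """Average redo generated per transaction, in KB (decimal units)."""
    return mb_per_sec * 1000 / tx_per_sec


# Figures quoted in the article: 340 MB/s of redo at 67,000 transactions/s.
print(round(redo_per_hour_tb(340), 2))        # 1.22
print(round(redo_per_tx_kb(340, 67_000), 1))  # 5.1
```

Under these assumptions the test generates about 1.2 TB of redo per hour, at roughly 5 KB of redo per transaction.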

The zData solution from Enmotech (云和恩墨) builds a distributed storage solution on exactly this kind of flash performance:

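Let us continue by analyzing how this architecture optimizes the core processing path. In extreme stress tests of an Oracle database, the central performance bottleneck comes from redo log writes and the resulting log file sync waits. Below we look at how the database's online redo log can be further optimized when flash cards are used.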

Oracle's official recommendations

A MOS note (Doc ID 857576.1) contains the following sentence:

Also putting the SLOG on an SSD (Solid State Disk) will reduce redo log latency further. This will help improve the performance of synchronous writes.

Another MOS note (Doc ID 1376916.1) says:

If the proportion of the 'log file sync' time spent on 'log file parallel write' times is high, then most of the wait time is due to IO (waiting for the redo to be written). The performance of LGWR in terms of IO should be examined. As a rule of thumb, an average time for 'log file parallel write' over 20 milliseconds suggests a problem with IO subsystem.

Recommendations

Work with the system administrator to examine the filesystems where the redologs are located with a view to improving the performance of IO.
Do not place redo logfiles on a RAID configuration which requires the calculation of parity, such as RAID-5 or RAID-6.
Do not put redo logs on Solid State Disk (SSD)
Although generally, Solid State Disks write performance is good on average, they may endure write peaks which will highly increase waits on 'log file sync'.
(Exception to this would be for Engineered Systems (Exadata, SuperCluster and Oracle Database Appliance) which have been optimized to use SSDs for REDO)
Look for other processes that may be writing to that same location and ensure that the disks have sufficient bandwidth to cope with the required capacity. If they don't then move the activity or the redo.
Ensure that the log_buffer is not too big. A very large log_buffer can have an adverse affect as waits will be longer when flushes occur. When the buffer fills up, it has to write all the data into the redo log file and the LGWR will wait until the last I/O is completed.
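The rule of thumb quoted above can be applied to the average wait times reported in an AWR report. The following is a minimal sketch: the 20 ms threshold comes from the MOS note, while the 0.5 cutoff for a "high" proportion, the function name, and the sample values are assumptions made for illustration only.

```python
def diagnose_redo_io(sync_ms: float, lfpw_ms: float,
                     slow_io_ms: float = 20.0, high_share: float = 0.5):
    """Compare average 'log file sync' and 'log file parallel write' waits.

    Returns (share, io_dominated, slow_io):
      share        - fraction of the sync wait explained by the LGWR write
      io_dominated - True if that fraction exceeds high_share (assumed cutoff)
      slow_io      - True if the LGWR write exceeds slow_io_ms (20 ms in the note)
    """
    if sync_ms <= 0:
        raise ValueError("sync_ms must be positive")
    share = lfpw_ms / sync_ms
    return share, share > high_share, lfpw_ms > slow_io_ms


# Hypothetical averages, not measurements from this article.
print(diagnose_redo_io(10.0, 8.0))   # (0.8, True, False)
print(diagnose_redo_io(30.0, 25.0))  # I/O dominated and above the 20 ms threshold
```

In the first case most of the commit wait is I/O time but the device is still within the rule of thumb; in the second the I/O subsystem itself would be the suspect.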

Oracle does not recommend placing redo logs on SSD, yet on Exadata systems redo is stored on SSD. The stated reason for the recommendation is:

Although generally, Solid State Disks write performance is good on average, they may endure write peaks which will highly increase waits on 'log file sync'.

What Oracle is worried about is that possible write peaks will increase log file sync waits.

Flash cards use three kinds of flash media: SLC, MLC, and TLC.

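Consumer-grade SSDs use MLC and TLC, and their OP (over-provisioning) space is usually kept below 10% to control cost. A low OP value raises the write amplification factor and hurts the overall performance of the flash device. In that situation, the performance drop from write peaks that Oracle worries about can indeed occur.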

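Enterprise-grade PCIe flash cards, by contrast, use MLC with an OP value of 20% or more. A higher OP value keeps the write amplification factor lower and gives the card better performance, so the write-peak problem Oracle worries about does not arise in this case.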

4K Online Redo Log

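① Sector size

Previous-generation storage mostly used 512-byte sectors, while current storage uses 4k sectors. The sector is the smallest unit of a single I/O.

4k sectors have two operating modes: native mode and emulation mode.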

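Native mode: true 4k mode, in which the physical and logical block sizes are the same, both 4096 bytes. The drawback of native mode is that it needs support from the operating system and from software such as the database. Oracle supports 4k I/O from 11gR2 onward, and the Linux kernel supports 4k I/O after 2.6.32.
Emulation mode: the physical block is 4k, but the logical block is 512 bytes. In this mode the underlying physical I/O is still done in 4k units, which leads to partial I/O and 4k alignment problems.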

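In emulation mode each logical I/O is 512 bytes, but the underlying storage must operate in 4k units. Reading 512 bytes of data actually requires reading 4k, 8 times as much; this is partial I/O. A write first reads the 4k physical block, updates the 512 bytes within it, and then writes the 4k block back. This extra work in emulation mode adds latency and reduces performance.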
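The cost of partial I/O can be illustrated with a small model. This is a simplified sketch, not a description of how any particular device firmware behaves: it only counts how many 4k physical sectors a logical write touches and whether a read-modify-write cycle is needed. The function name is hypothetical.

```python
def physical_io(offset: int, length: int, physical: int = 4096):
    """Model a logical write on a device with `physical`-byte sectors.

    Returns (bytes_touched, needs_rmw):
      bytes_touched - total size of the physical sectors covering the write
      needs_rmw     - True if the write is not sector-aligned, so the device
                      must read, modify, and write back whole sectors
    """
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // physical
    last = (offset + length - 1) // physical
    bytes_touched = (last - first + 1) * physical
    aligned = offset % physical == 0 and length % physical == 0
    return bytes_touched, not aligned


print(physical_io(0, 512))      # (4096, True): 8x the logical size, with RMW
print(physical_io(0, 4096))     # (4096, False): aligned 4k write, no RMW
print(physical_io(3584, 1024))  # (8192, True): misaligned write spans two sectors
```

The last case shows why alignment matters as well as size: a 1 KB write that straddles a sector boundary touches two physical sectors.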

② Online Redo Logs

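Among Oracle database files, the default datafile block size is 8 KB and the control file block size is 16 KB, so neither suffers from partial I/O. Only the online redo log, with its default block size of 512 bytes, has the partial I/O problem.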

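Starting with Oracle 11gR2, when the storage supports 4k sectors, redo logs can be created with a block size of 512, 1024, or 4096 bytes.

For example: alter database add logfile group 5 size 100m blocksize 4096;

With 4k sectors in emulation mode, creating a 4k redo log may raise the following error: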

ORA-01378: The logical block size (4096) of file +DATA is not compatible with the disk sector size (media sector size is 512 and host sector size is 512)

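As long as you have confirmed that the storage physically uses 4k sectors, you can set the _disk_sector_size_override parameter to true to override the reported sector size. The parameter can be changed dynamically, for example:

ALTER SYSTEM SET "_DISK_SECTOR_SIZE_OVERRIDE"="TRUE";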

Test results

Database: 12.1.0.2, with the online redo log stored on a PCIe flash card

SQL> select group#,bytes/1024/1024||'M' from v$log;

    GROUP# BYTES/1024/1024||'M'
---------- -----------------------------------------
         4 2000M
         5 2000M
         6 2000M
         7 2000M

From the AWR data we look at only two sections here: Load Profile and Top 10 Foreground Events by Total Wait Time:

Online redo log stored on SAS disks

Using 4k redo logs stored on a PCIe flash card

Create 4k online redo logs: move the redo logs to the PCIe SSD and use a 4k block size.

SQL> alter database add logfile group 5 ('/u01/app/oracle/oradata/DAVE/onlinelog/dave05.log') size 2000M blocksize 4096;
alter database add logfile group 5 ('/u01/app/oracle/oradata/DAVE/onlinelog/dave05.log') size 2000M blocksize 4096
*
ERROR at line 1:
ORA-01378: The logical block size (4096) of file
/u01/app/oracle/oradata/DAVE/onlinelog/dave05.log is not compatible with
the disk sector size (media sector size is 512 and host sector size is 512)

[root@dave ~]# fdisk -lu /dev/dfa
Note: sector size is 4096 (not 512)

Disk /dev/dfa: 3454.0 GB, 3454011441152 bytes
32 heads, 32 sectors/track, 823500 cylinders, total 843264512 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 65536 bytes
Disk identifier: 0x00000000

The device here does indeed have a 4k sector size.
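The same check can be scripted before touching the override parameter. The sketch below reads the logical and physical block sizes that Linux exposes under sysfs and classifies the device. It assumes the flash card's driver publishes the standard queue/logical_block_size and queue/physical_block_size attributes; the function names and mode labels are illustrative.

```python
from pathlib import Path


def read_sector_sizes(device: str, sysfs_root: str = "/sys/block"):
    """Return (logical, physical) block sizes in bytes as reported by sysfs."""
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text().strip())
    physical = int((queue / "physical_block_size").read_text().strip())
    return logical, physical


def sector_mode(logical: int, physical: int) -> str:
    """Classify a device by its logical/physical sector sizes."""
    if logical == 4096 and physical == 4096:
        return "4k native"
    if logical == 512 and physical == 4096:
        return "4k emulation"
    if logical == 512 and physical == 512:
        return "512-byte"
    return "other"
```

On the test machine one would call sector_mode(*read_sector_sizes("dfa")), where "dfa" is the device name shown in the fdisk output above. A result of "4k native" matches what fdisk reports; anything else would be a reason to stop before overriding the sector size check.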

SQL> ALTER SYSTEM SET "_DISK_SECTOR_SIZE_OVERRIDE"=true;
System altered.

SQL> alter database add logfile group 5 ('/u01/app/oracle/oradata/DAVE/onlinelog/dave05.log') size 2000M blocksize 4096;
Database altered.
SQL> alter database add logfile group 6 ('/u01/app/oracle/oradata/DAVE/onlinelog/dave06.log') size 2000M blocksize 4096;
Database altered.
SQL> alter database add logfile group 7 ('/u01/app/oracle/oradata/DAVE/onlinelog/dave07.log') size 2000M blocksize 4096;
Database altered.
SQL> alter database add logfile group 8 ('/u01/app/oracle/oradata/DAVE/onlinelog/dave08.log') size 2000M blocksize 4096;
Database altered.
SQL> select group#, member from v$logfile;
    GROUP# MEMBER
---------- -----------------------------------------------------------
         5 /u01/app/oracle/oradata/DAVE/onlinelog/dave05.log
         6 /u01/app/oracle/oradata/DAVE/onlinelog/dave06.log
         7 /u01/app/oracle/oradata/DAVE/onlinelog/dave07.log
         8 /u01/app/oracle/oradata/DAVE/onlinelog/dave08.log

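TPC-C test run with 20 virtual users:

AWR data

Data comparison of the three test scenarios

The stress test method is to drive system CPU to 100% and observe the maximum TPM. The final comparison is shown in the table below:

The comparison shows that with SSD, the share of DB time spent on log file sync drops sharply, from 63.6% to 18.5%, and performance improves noticeably. The test results also show that TPM fluctuates more with SAS disks. Based on the measured data, using 4k online redo logs together with an enterprise-grade PCIe flash card can noticeably improve the performance of an Oracle database.

The test data above is provided for reference.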
