Interval Partition(s)
Original author: Jonathan Lewis
Original post: https://jonathanlewis.wordpress.com/2020/02/18/interval-partitions-2/
A quirky little feature of interval partitioning showed up on Twitter today: a parallel insert that would only use a single PX slave to do the inserting. With 1.1 billion rows and the option of running parallel 32, this made the loading process rather slower than it ought to have been.
Fortunately it's quite easy to model (and work around) the oddity. So here's a small data set and an empty partitioned table to work with:
rem
rem Script: pt_int_load_anomaly.sql
rem Author: Jonathan Lewis
rem Dated: Jan 2020
rem
create table t1
nologging
as
select
ao.*
from
all_Objects ao,
(select rownum id from dual connect by level <= 20)
;
create table pt1
partition by range (object_id) interval (1000000) (
partition p1 values less than (1)
)
as
select *
from all_Objects
where rownum = 0
/
I've created a table by copying all_objects 20 times which, for my little sandbox, gives me a total of about 1.2M rows. Then I've created an empty interval-partitioned clone of all_objects, with the first partition defined to hold all rows where the object_id is less than 1 (and there's no object in the database that could match that criterion). I've defined the interval to be 1,000,000 and, since the highest object_id in my database is about 90,000, the first partition that gets added to this table will be able to hold all the data from t1.
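As a quick check (not part of the original script): once the load shown below has been run and committed, a look at user_tab_partitions confirms that the interval clause adds just one system-generated partition, since every object_id falls well below 1,000,000:
rem
rem Not part of the original script: run after the load below has been
rem committed to see the single system-generated interval partition.
rem
select
        partition_name, high_value, partition_position
from
        user_tab_partitions
where
        table_name = 'PT1'
order by
        partition_position
/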
So now we try a parallel insert from t1 into pt1, and check the execution plan and the parallel execution statistics:
set serveroutput off
insert /*+ append enable_parallel_dml parallel(6) */ into pt1 select * from t1;
select * from table(dbms_xplan.display_cursor);
start pq_tqstat
Note how I've used the hint /*+ enable_parallel_dml */ (possibly a 12c hint back-ported to 11.2.0.4) rather than an "alter session" command; it's just a little convenience to be able to embed the directive in the SQL. The pq_tqstat script is one I published some time ago to report the contents of the session-specific dynamic performance view v$pq_tqstat immediately after running a parallel statement.
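The pq_tqstat script itself isn't reproduced in this post, so here is a rough stand-in (my own sketch, not the published script). It simply reports the same view, adding percentage columns along the lines of the ROW_SHARE and DATA_SHARE figures in the output below, and it has to run in the same session, immediately after the parallel statement:
rem
rem A stand-in for pq_tqstat (not the published script): report the
rem session-specific v$pq_tqstat immediately after the parallel statement.
rem
break on dfo_number on tq_id on server_type

select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes,
        round(ratio_to_report(num_rows) over (
                partition by dfo_number, tq_id, server_type) * 100, 2) row_share,
        round(ratio_to_report(bytes) over (
                partition by dfo_number, tq_id, server_type) * 100, 2) data_share,
        waits, timeouts, avg_latency
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, process
/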
Here's the plan:
SQL_ID 25hub68pf1z1s, child number 0
-------------------------------------
insert /*+ append enable_parallel_dml parallel(6) */ into pt1 select *
from t1
Plan hash value: 2888707464
-------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | | | 631 (100)| | | | |
| 1 | PX COORDINATOR | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10001 | 1235K| 159M| 631 (10)| 00:00:01 | Q1,01 | P->S | QC (RAND) |
| 3 | LOAD AS SELECT (HIGH WATER MARK BROKERED)| PT1 | | | | | Q1,01 | PCWP | |
| 4 | OPTIMIZER STATISTICS GATHERING | | 1235K| 159M| 631 (10)| 00:00:01 | Q1,01 | PCWP | |
| 5 | PX RECEIVE | | 1235K| 159M| 631 (10)| 00:00:01 | Q1,01 | PCWP | |
| 6 | PX SEND RANDOM LOCAL | :TQ10000 | 1235K| 159M| 631 (10)| 00:00:01 | Q1,00 | P->P | RANDOM LOCA|
| 7 | PX BLOCK ITERATOR | | 1235K| 159M| 631 (10)| 00:00:01 | Q1,00 | PCWC | |
|* 8 | TABLE ACCESS FULL | T1 | 1235K| 159M| 631 (10)| 00:00:01 | Q1,00 | PCWP | |
-------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
8 - access(:Z>=:Z AND :Z<=:Z)
Note
-----
- Degree of Parallelism is 6 because of hint
The most important detail of this plan is that the PX slaves do the load as select (operation 3), then send a message to the query coordinator (PX SEND QC, operation 2) to tell it about the data load. They do not send their data to the QC for the QC to do the load.
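As an aside (my addition, not part of the original post): a quick way to confirm that the insert really did run as direct-path parallel DML is that the loading transaction cannot re-read the table it has just loaded until it commits:
rem
rem My addition, not in the original post: a direct-path / parallel DML
rem insert leaves the freshly loaded segment unreadable within the same
rem transaction, so the first query should fail with ORA-12838 ("cannot
rem read/modify an object after modifying it in parallel") and the second
rem should succeed.
rem
select count(*) from pt1;
commit;
select count(*) from pt1;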
So the plan says we will be doing parallel DML, but here's what v$pq_tqstat tells us:
DFO_NUMBER TQ_ID SERVER_TYPE INSTANCE PROCESS NUM_ROWS BYTES ROW_SHARE DATA_SHARE WAITS TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
1 0 Producer 1 P006 215880 34785363 17.47 16.86 16 0 0
1 P007 202561 34436325 16.39 16.69 17 0 0
1 P008 207519 34564496 16.79 16.75 17 0 0
1 P009 208408 34594770 16.86 16.77 17 0 0
1 P00A 198915 33529627 16.10 16.25 16 0 0
1 P00B 202537 34430603 16.39 16.69 16 0 0
Consumer 1 P000 0 144 0.00 0.00 51 47 0
1 P001 0 144 0.00 0.00 51 47 0
1 P002 1235820 206340464 100.00 100.00 75 47 0
1 P003 0 144 0.00 0.00 51 47 0
1 P004 0 144 0.00 0.00 1138 1134 0
1 P005 0 144 0.00 0.00 1137 1133 0
1 Producer 1 P000 0 24 0.00 5.91 51 42 0
1 P001 0 24 0.00 5.91 50 41 0
1 P002 2 286 100.00 70.44 58 14 0
1 P003 0 24 0.00 5.91 51 43 0
1 P004 0 24 0.00 5.91 51 42 0
1 P005 0 24 0.00 5.91 51 43 0
Consumer 1 QC 2 406 100.00 100.00 311 179 0
19 rows selected.
The query did run parallel 6 as hinted, and 6 PX slaves scanned the t1 table; but they all sent all their data to one PX slave in the second slave set, and that one PX slave did all the inserts. The plan was parallel, but the execution was effectively serial. (You'll notice there is something a little odd about the waits and timeouts for P004 and P005 when they are acting as consumers. I may worry about that later, but it could be a host-based side effect of running parallel 6 on a VM with 4 CPUs.)
This serialization leads to two questions:
(1) What went wrong?
(2) How do we work around it and make the insert "truly" parallel?
My answer to (1) is "I don't know, but I'll look at it if necessary", combined with a guess: it's something to do with the table having only one partition at the outset, which has an unexpected side effect on the randomising function for the PX distribution.
My answer to (2) is "if I'm right about (1), why not try pre-defining two partitions, and I'll even let both of them stay empty".
So here's my new definition for pt1:
create table pt1
partition by range (object_id) interval (1000000) (
partition p0 values less than (0),
partition p1 values less than (1)
)
as
select *
from all_Objects
where rownum = 0
/
Re-running the test with the completely redundant, and permanently empty, p0 partition, the plan doesn't change but the results from v$pq_tqstat change dramatically:
DFO_NUMBER TQ_ID SERVER_TYPE INSTANCE PROCESS NUM_ROWS BYTES ROW_SHARE DATA_SHARE WAITS TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
1 0 Producer 1 P006 207897 34581153 16.82 16.76 23 4 0
1 P007 215669 34786429 17.45 16.86 30 5 0
1 P008 221474 36749626 17.92 17.81 28 5 0
1 P009 204959 34497164 16.58 16.72 22 2 0
1 P00A 177755 30141002 14.38 14.61 21 0 0
1 P00B 208066 35585810 16.84 17.25 25 2 0
Consumer 1 P000 213129 35612973 17.25 17.26 82 57 0
1 P001 200516 33570586 16.23 16.27 84 55 0
1 P002 203395 33950449 16.46 16.45 83 56 0
1 P003 205458 34235575 16.63 16.59 82 54 0
1 P004 204111 33999932 16.52 16.48 581 555 0
1 P005 209211 34971669 16.93 16.95 580 553 0
1 Producer 1 P000 2 286 16.67 16.67 422 149 0
1 P001 2 286 16.67 16.67 398 130 0
1 P002 2 286 16.67 16.67 405 128 0
1 P003 2 286 16.67 16.67 437 161 0
1 P004 2 286 16.67 16.67 406 116 0
1 P005 2 286 16.67 16.67 440 148 0
Consumer 1 QC 12 1716 100.00 100.00 242 111 0
19 rows selected.
Every consumer receives and inserts roughly 200,000 rows; it's a totally fair parallel DML. Timings are pretty irrelevant for such a small data set, but the execution time did drop from 7 seconds to 4 seconds when the parallelism was working "properly".
I've tested this script on 12.2.0.1 and 19.3.0.0; the same anomaly appears in both versions, though it may be worth noting that the strange skew in the waits and timeouts doesn't appear in 19.3.0.0.