聚簇因子
原文作者:Jonathan lewis
原文地址: https://jonathanlewis.wordpress.com/2019/10/23/clustering_factor-7/
初稿2019年10月
几天前,我发表了一小段以前写的脚本的注释,该脚本用于在索引构建之前估计它的clustering_factor。当时我指出,它的一个限制是它不会处理你设置了table_cached_blocks这个参数值时的情况,但几天后我决定写另一个版本的代码,来满足这个新功能—那是一个令我尴尬的发现之旅。
在回顾了一些我发表的关于table_cached_blocks参数对聚簇因子的影响的博客后,我意识到我所写的可以有两种解释:一种是我写作时所认为的,另一种是正确的。我发现这个问题是因为我编写了一个简单的SQL语句—它使用match_recognition()机制–来执行我认为是正确的计算。在使用几组测试数据测试,并生成了正确的结果后,我发邮件给Stew Ashton(有关match_recognition()机制我总是会问他)他是否可以对代码做个检查,因为它执行很慢,我想知道是否有更好的方法去写这个SQL。
他的回答很直接:
“我读了你和Richard Foote写的关于clustering_factor和table_cached_blocks的博客,但它们并不是像你们描述中所说的那样去处理的。”
然后他解释了他从我写的东西中推断出来的东西,这比我写的时候想的更有道理。他还提供了一些代码来实现他的解释,所以我设计了两个会对实现错误解释的代码产生错误的预测的数据模型。他的代码给出了正确的答案,而我的没有。
以下是上面(两种)解释的区别,错误的解释是第一个(这里使用16作为table_cached_blocks的值来讨论)
错误的解释:
当您按照顺序遍历索引条目时,请记住您所看到的最后16行rowid(这是rowid,表示索引所指向的表中的行)。如果当前rowid的块id与所记住的16个rowid中的其中一个不匹配,则增加clustering_factor的计数。
这个算法的简单性意味着您可以准备一个包含16个条目的“循环”数组,并在每次读取一个新条目时覆盖最老的条目。遗憾的是,这是一个错误的想法,因为这是一个过于简单的(虽然match_recognition()是CPU密集型的)策略来实现它——如果您使用gather_index_stats()所用到的内部的数据库机制,它可能会令人难以置信的高效。
正确解释:
为16个块id设置一个数组,每个id都有一个相关的“行号”。按顺序遍历索引—为每个条目提供一个行号。从当前条目中提取块id并在数组中搜索匹配的块id。如果你找到匹配项,就用当前的行号更新它(译者注:数组中的行号)(所以你可以回忆一下你最近是怎么看到块id的);如果没有找到匹配项,那么用当前块id和行号替换行号最小(即到过去的最大距离)的条目,并增加clustering_factor的计数。
Stew Ashton给我第一段代码是一个匿名PL / SQL块,其中包括一些硬编码片段和嵌入式SQL,以使用我定义的测试表和索引,然而他之后给我发送的第二段代码创建了一个通用函数,它根据你想测试的表和索引定义,使用动态SQL构建一个查询。我将(第二段)代码发布(经允许)如下:
create or replace function predict_clustering_factor(
/*
Function to predict the clustering factor of an index,
taking into account the intended value of
the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS.
Input is the table name, the list of column names
and the intended value of TABLE_CACHED_BLOCKS.
The function collects the last N block ids (not the last N entries).
When there is no more room, it increments the clustering factor
and replaces the least recently used block id with the current one.
Note: here a “block id” is a rowid with the row_number portion set to 0.
It is effectively a “truncated” rowid.
*/
p_table_name in varchar2,
p_column_list in varchar2,
p_table_cached_blocks in number
) return number authid current_user is
rc sys_refcursor;
type tt_rids is table of rowid;
lt_rids tt_rids;
type t_block_list is record(
rid rowid,
last_hit number
);
type tt_block_list is table of t_block_list;
lt_block_list tt_block_list := new tt_block_list();
l_rid rowid;
l_clustering_factor number := 0;
b_block_found boolean;
l_rn number := 0;
l_oldest_hit number;
i_oldest_hit binary_integer := 0;
function truncated_rid(p_rid in rowid) return rowid is
rowid_type number;
object_number NUMBER;
relative_fno NUMBER;
block_number NUMBER;
row_number NUMBER;
rid rowid;
begin
DBMS_ROWID.ROWID_INFO (
p_rid,
rowid_type,
object_number,
relative_fno,
block_number,
row_number
);
rid := DBMS_ROWID.ROWID_CREATE (
rowid_type,
object_number,
relative_fno,
block_number,
0
);
return rid;
end truncated_rid;
begin
if p_table_cached_blocks != trunc(p_table_cached_blocks)
or p_table_cached_blocks not between 1 and 255 then
raise_application_error(
-20001,
‘input parameter p_table_cached_blocks must be an integer between 1 and 255’
);
end if;
open rc for ‘select rowid from ‘||p_table_name||’ order by ‘||p_column_list||’, rowid’;
loop
fetch rc bulk collect into lt_rids limit 1000;
for irid in 1..lt_rids.count loop
l_rn := l_rn + 1;
l_rid := truncated_rid(lt_rids(irid));
b_block_found := false;
l_oldest_hit := l_rn;
if l_rn = 1 then
l_clustering_factor := l_clustering_factor + 1;
lt_block_list.extend;
lt_block_list(1).rid := l_rid;
lt_block_list(1).last_hit := l_rn;
else
for i in 1..lt_block_list.count loop
if l_oldest_hit > lt_block_list(i).last_hit then
l_oldest_hit := lt_block_list(i).last_hit;
i_oldest_hit := i;
end if;
if lt_block_list(i).rid = l_rid then
b_block_found := true;
lt_block_list(i).last_hit := l_rn;
exit;
end if;
end loop;
if not b_block_found then
l_clustering_factor := l_clustering_factor + 1;
if lt_block_list.count < p_table_cached_blocks then
lt_block_list.extend;
lt_block_list(lt_block_list.count).rid := l_rid;
lt_block_list(lt_block_list.count).last_hit := l_rn;
else
lt_block_list(i_oldest_hit).rid := l_rid;
lt_block_list(i_oldest_hit).last_hit := l_rn;
end if;
end if;
end if;
end loop;
exit when rc%notfound;
end loop;
close rc;
return l_clustering_factor;
exception when others then
if rc%isopen then
close rc;
end if;
raise;
end predict_clustering_factor;
/
在执行上述函数创建后,下面是一个用法示例:
rem
rem Script: clustering_factor_est_2.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2019
rem
rem Last tested
rem 19.3.0.0
rem 12.2.0.1
rem
create table t1
as
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 – > comment to avoid WordPress format issue
)
select
rownum id,
cast(rownum as varchar2(10)) v1,
trunc(dbms_random.value(0,10000)) rand,
rpad(‘x’,100,‘x’) padding
from
generator v1,
generator v2
where
rownum <= 1e6 – > comment to avoid WordPress format issue
/
SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ’ || predict_clustering_factor(‘t1’,‘rand, id’,16))
Predicted cf for t1(rand, id): 997218
Elapsed: 00:00:07.54
SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ’ || predict_clustering_factor(‘t1’,‘rand, id’,255))
Predicted cf for t1(rand, id): 985607
Elapsed: 00:00:50.61
您会注意到,“table_cached_blocks”参数设置的值越大,预测clustering_factor所需的时间就越多,在我的示例中,这都是CPU时间。考虑到需要搜索保存以前历史的数组,这并不奇怪。在这个示例表t1 中有1000000行,不同值的数量和数据的离散,使得代码将很难找到缓存中的块id—基本上 这类索引不会给优化器带来太多混乱,也不太可能需要特别注意让优化器在应该使用它的时候使用它,在不合适的时候忽略它。
最后(译者注:把执行结果)剪切粘贴上来,以展示两次预测的准确性:
SQL> create index t1_i on t1(rand, id);
Elapsed: 00:00:02.96
SQL> execute dbms_stats.set_table_prefs(null,‘t1’,‘table_cached_blocks’,16)
Elapsed: 00:00:00.01
SQL> execute dbms_stats.gather_index_stats(null,‘t1_i’)
Elapsed: 00:00:09.55
SQL> select clustering_factor from user_indexes where index_name = ‘T1_I’;
CLUSTERING_FACTOR
985607
Elapsed: 00:00:00.11
SQL> execute dbms_stats.set_table_prefs(null,‘t1’,‘table_cached_blocks’,255)
Elapsed: 00:00:00.01
SQL> execute dbms_stats.gather_index_stats(null,‘t1_i’)
Elapsed: 00:00:07.80
SQL> select clustering_factor from user_indexes where index_name = ‘T1_I’;
CLUSTERING_FACTOR
985607
Elapsed: 00:00:00.00
两者完全匹配–但是您可能会注意到,对于我们设置table_cached_blocks = 255的情况,创建索引和收集统计信息所需要的时间比预测集群因子需要的时间短得多。
以下为原文:
Oracle Scratchpad
October 23, 2019
Clustering_Factor
Filed under: CBO,Indexing,Oracle — Jonathan Lewis @ 9:56 pm BST Oct 23,2019
A few days ago I published a little note of a script I wrote some time ago to estimate the clustering_factor of an index before it had been built. At the time I pointed out that one of its limitations was that it would not handle cases where you were planning to set the table_cached_blocks preference, but a couple of days later I decided that I’d write another version of the code that would cater for the new feature – and that’s how I made an embarrassing discovery.
Having reviewed a number of notes I’ve published about the table_cached_blocks preference and its impact on the clustering_factor I’ve realised the what I’ve written has always been open to two interpretations – the one that I had in mind as I was writing, and the correct one. I made this discovery because I had written a simple SQL statement – using the match_recognize() mechanism – to do what I considered to be the appropriate calculation. After testing the query with a few sets of sample data that produced the correct results I emailed Stew Ashton (my “go-to” person for match_recognize() questions) asking if he would do a sanity check on the code because it was rather slow and I wondered if there was a better way of writing it.
His reply was roughly:
“I’ve read the notes you and Richard Foote have written about the clustering_factor and table_cached_blocks, and this isn’t doing what your description says it should.”
Then he explained what he had inferred from what I had written … and it made more sense than what I had been thinking when I wrote it. He also supplied some code to implement his interpretation – so I designed a couple of data models that would produce the wrong prediction for whichever piece of code implemented the wrong interpretation. His code gave the right answers, mine didn’t.
So here’s the difference in interpretation – the wrong one first – using 16 as a discussion value for the table_cached_blocks:
• WRONG interpretation: As you walk through index entries in order remember the last 16 rowids (that’s rowid for the rows in the table that the index is pointing to) you’ve seen. If the current rowid has a block id component that doesn’t match the block id from one of the remembered 16 rowids then increment the counter for the clustering_factor.
o The simplicity of this algorithm means you can fix a “circular” array of 16 entries and keep walking around the circle overwriting the oldest entry each time you read a new one. It’s a pity that it’s the wrong idea because there’s a simple (though massively CPU -intensive match_recognize() strategy for implementing it – and if you were using an internal library mechanism during a proper gather_index_stats() it could be incredibly efficient.
• RIGHT interpretation: set up an array for 16 block ids, each with an associated “row-number”. Walk through the index in order – giving each entry a row-number as you go. Extract the block id from the current entry and search through the array for a matching block id. If you find a match then update its entry with the current row-number (so you can remembr how recently you saw the block id); if you don’t find a match then replace the entry that has the smallest (i.e. greatest distance into the past) row-number with the current block id and row-number and increment the counter for the clustering_factor.
The first piece of code that Stew Ashton sent me was an anonymous PL/SQL block that included some hard-coded fragments and embedded SQL to use a test table and index that I had defined, but he then sent a second piece of code that creates a generic function that uses dynamic SQL to construct a query against a table and an index definition that you want to test. The latter is the code I’ve published (with permission) below:
create or replace function predict_clustering_factor(
/*
Function to predict the clustering factor of an index,
taking into account the intended value of
the TABLE_CACHED_BLOCKS parameter of DBMS_STATS.SET_TABLE_PREFS.
Input is the table name, the list of column names
and the intended value of TABLE_CACHED_BLOCKS.
The function collects the last N block ids (not the last N entries).
When there is no more room, it increments the clustering factor
and replaces the least recently used block id with the current one.
Note: here a “block id” is a rowid with the row_number portion set to 0.
It is effectively a “truncated” rowid.
*/
p_table_name in varchar2,
p_column_list in varchar2,
p_table_cached_blocks in number
) return number authid current_user is
rc sys_refcursor;
type tt_rids is table of rowid;
lt_rids tt_rids;
type t_block_list is record(
rid rowid,
last_hit number
);
type tt_block_list is table of t_block_list;
lt_block_list tt_block_list := new tt_block_list();
l_rid rowid;
l_clustering_factor number := 0;
b_block_found boolean;
l_rn number := 0;
l_oldest_hit number;
i_oldest_hit binary_integer := 0;
function truncated_rid(p_rid in rowid) return rowid is
rowid_type number;
object_number NUMBER;
relative_fno NUMBER;
block_number NUMBER;
row_number NUMBER;
rid rowid;
begin
DBMS_ROWID.ROWID_INFO (
p_rid,
rowid_type,
object_number,
relative_fno,
block_number,
row_number
);
rid := DBMS_ROWID.ROWID_CREATE (
rowid_type,
object_number,
relative_fno,
block_number,
0
);
return rid;
end truncated_rid;
begin
if p_table_cached_blocks != trunc(p_table_cached_blocks)
or p_table_cached_blocks not between 1 and 255 then
raise_application_error(
-20001,
‘input parameter p_table_cached_blocks must be an integer between 1 and 255’
);
end if;
open rc for ‘select rowid from ‘||p_table_name||’ order by ‘||p_column_list||’, rowid’;
loop
fetch rc bulk collect into lt_rids limit 1000;
for irid in 1..lt_rids.count loop
l_rn := l_rn + 1;
l_rid := truncated_rid(lt_rids(irid));
b_block_found := false;
l_oldest_hit := l_rn;
if l_rn = 1 then
l_clustering_factor := l_clustering_factor + 1;
lt_block_list.extend;
lt_block_list(1).rid := l_rid;
lt_block_list(1).last_hit := l_rn;
else
for i in 1..lt_block_list.count loop
if l_oldest_hit > lt_block_list(i).last_hit then
l_oldest_hit := lt_block_list(i).last_hit;
i_oldest_hit := i;
end if;
if lt_block_list(i).rid = l_rid then
b_block_found := true;
lt_block_list(i).last_hit := l_rn;
exit;
end if;
end loop;
if not b_block_found then
l_clustering_factor := l_clustering_factor + 1;
if lt_block_list.count < p_table_cached_blocks then
lt_block_list.extend;
lt_block_list(lt_block_list.count).rid := l_rid;
lt_block_list(lt_block_list.count).last_hit := l_rn;
else
lt_block_list(i_oldest_hit).rid := l_rid;
lt_block_list(i_oldest_hit).last_hit := l_rn;
end if;
end if;
end if;
end loop;
exit when rc%notfound;
end loop;
close rc;
return l_clustering_factor;
exception when others then
if rc%isopen then
close rc;
end if;
raise;
end predict_clustering_factor;
/
After executing the above to create the function, here’s an example of usage:
rem
rem Script: clustering_factor_est_2.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2019
rem
rem Last tested
rem 19.3.0.0
rem 12.2.0.1
rem
create table t1
as
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 – > comment to avoid WordPress format issue
)
select
rownum id,
cast(rownum as varchar2(10)) v1,
trunc(dbms_random.value(0,10000)) rand,
rpad(‘x’,100,‘x’) padding
from
generator v1,
generator v2
where
rownum <= 1e6 – > comment to avoid WordPress format issue
/
SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ’ || predict_clustering_factor(‘t1’,‘rand, id’,16))
Predicted cf for t1(rand, id): 997218
Elapsed: 00:00:07.54
SQL> execute dbms_output.put_line('Predicted cf for t1(rand, id): ’ || predict_clustering_factor(‘t1’,‘rand, id’,255))
Predicted cf for t1(rand, id): 985607
Elapsed: 00:00:50.61
You’ll notice that the larger the setting for the “table_cached_blocks” parameter the more time it takes to predict the clustering_factor – and it was all CPU time in my example. This isn;t surprising given the need to search through an array holding the previous history. In this example the table t1 holds 1,000,000 rows, and the number and scatter of distinct values is so arranged that the code will hardly ever find a cached block id – essentially it’s the sort of index that isn’t going to cause much of confusion to the optimizer and isn’t likely to need special attention to make the optimizer use it when it should and ignore it when it’s inappropriate.
Finally a cut-n-paste to show the accuracy of the two predictions:
SQL> create index t1_i on t1(rand, id);
Elapsed: 00:00:02.96
SQL> execute dbms_stats.set_table_prefs(null,‘t1’,‘table_cached_blocks’,16)
Elapsed: 00:00:00.01
SQL> execute dbms_stats.gather_index_stats(null,‘t1_i’)
Elapsed: 00:00:09.55
SQL> select clustering_factor from user_indexes where index_name = ‘T1_I’;
CLUSTERING_FACTOR
997218
Elapsed: 00:00:00.11
SQL> execute dbms_stats.set_table_prefs(null,‘t1’,‘table_cached_blocks’,255)
Elapsed: 00:00:00.01
SQL> execute dbms_stats.gather_index_stats(null,‘t1_i’)
Elapsed: 00:00:07.80
SQL> select clustering_factor from user_indexes where index_name = ‘T1_I’;
CLUSTERING_FACTOR
985607
Elapsed: 00:00:00.00
Both match perfectly – but you might notice that creating the index and gathering the stats was much faster than predicting the clustering factor for the case where we set table_cached_blocks = 255.




