聚簇因子

原创胡佳伟 2020-02-19

1179

聚簇因子
原文作者：Jonathan lewis
原文地址：https://jonathanlewis.wordpress.com/2019/10/16/clustering_factor-6/

初稿2018年7月
“如果不首先创建索引，如何查找索引的聚簇因子？”
我不确认这是不是一个值得被讨论的问题[1]，但既然有人问了它（截止编写该博客的时候，我也不知道具体是谁，在哪儿，什么时候，为什么提问了），在Oracle还没有公开table_cached_blocks参数之前的美好时代，以下算是对简单堆表的一个回答吧。它的工作原理是把你想要创建的索引中的索引列与表的rowid一起排序，接下来对比当前行的rowid中的file/block（从表示rowid的字符串中截取相关的字符部分）部分，去看是不是与前一行在同一个块中。如果这行在一个不同的块中那我们计1，否则为0。最后把该位相加（就是聚簇因子的值）。
在接下来的演示中，我拷贝了all_oobjects视图中的一小部分数据，目的是在object_name列创建索引。在运行完我的计数查询后，接下来我会创建这个索引来检查它的聚簇因子与我的计数查询是吻合的。

rem
rem     Script: clustering_factor_est.sql
rem     Author: J.P.Lewis
rem     Dated:  July 2018
rem
 
create table t1
as
select  *
from    all_objects
where   rownum <= 10000
;
 
prompt  ====================
prompt  Tablescan with lag()
prompt  ====================
 
select
        sum(count_chg)
from    (
        select
                case
                        when substrb(rowid,1,15) <> lag(substrb(rowid,1,15),1,'000000000') over(order by object_name, rowid)
                                then 1
                                else 0
                end     count_chg
        from
                t1
        where
                object_name is not null
        )
;
 
prompt  ======================
prompt  create index and check
prompt  ======================
 
create index t1_i1 on t1(object_name);
 
select  clustering_factor
from    user_indexes 
where   table_name = 'T1'
and     index_name = 'T1_I1'
;

如下展示为上述代码执行的输出：

Table created.
 
====================
Tablescan with lag()
====================
 
SUM(COUNT_CHG)
--------------
          3901
 
1 row selected.
 
======================
create index and check
======================
Index created.
CLUSTERING_FACTOR
-----------------
             3901
 
1 row selected.

这个脚本对分区表的全局索引是起作用的，但是对全局分区索引和本地索引会得出没有意义的结果。而且，如果你设置table_cached_blocks参数为一个不是1的数，那么它将使用另外一种计算聚簇因子的机制。

[1] :考虑到运行语句来查询聚簇因子的工作量，你不妨创建一个索引（不可见，如果不会导致问题或许还可以采用nologging）-这也提供给你设置table_cached_blcoks然后收集统计信息的选项，以便查来看聚簇因子是如何变化的。

更新（发布不久之后）
我突然意识到，如果你创建了一个用来测试优化器是否会选择使用这个索引的“nosegment”索引，那么会有一条信息在数据字典中是没有的，那就是索引的聚簇因子（对于多列索引，你可以通过在列上创建列组来得到distinct_keys的有效值-否则在创建nosegment 索引时数据会失真），上面的代码也许就可以让你把聚簇因子写入索引的定义中，这将使你更好的了解如果你创建了索引优化器会不会用它。（待测试）

更新（几天以后）
我用三种方式修改了主查询:

将substr()改为substrb()
将rowid的截取改为(1,15)而不是(7,9)——这样我就有了(data_object_id, file_id, block_id)而不仅仅是(file_id, block_id)
在内联查询中添加了一个(冗余的)谓词“object_name is not null”

原文：
October 16, 2019
Clustering_Factor
Filed under: Indexing,Oracle — Jonathan Lewis @ 2:07 pm BST Oct 16,2019
Originally drafted July 2018
“How do you find out what the clustering_factor of an index would be without first creating the index ?”
I’m not sure this is really a question worth asking or answering[1], but since someone asked it (and given the draft date I have no idea who, where, when or why), here’s an answer for simple heap tables in the good old days before Oracle made public the table_cached_blocks preference. It works by sorting the columns you want in the index together with the table rowid, and then comparing the file/block component of the rowid (cutting the relevant characters from the string representation of the rowid) with the previous one to see if the current row is in the same block as the previous row. If the row is in a different block we count one, otherwise zero. Finally we sum the ones.
In the demo I’ve copied a little data from view all_objects, with the intention of creating an index on object_name. After running my counting query I’ve then created the index and checked its clustering_factor to show that I’ve got a match.

rem
rem     Script: clustering_factor_est.sql
rem     Author: J.P.Lewis
rem     Dated:  July 2018
rem
 
create table t1
as
select  *
from    all_objects
where   rownum <= 10000
;
 
prompt  ====================
prompt  Tablescan with lag()
prompt  ====================
 
select
        sum(count_chg)
from    (
        select
                case
                        when substrb(rowid,1,15) <> lag(substrb(rowid,1,15),1,'000000000') over(order by object_name, rowid)
                                then 1
                                else 0
                end     count_chg
        from
                t1
        where
                object_name is not null
        )
;
 
prompt  ======================
prompt  create index and check
prompt  ======================
 
create index t1_i1 on t1(object_name);
 
select  clustering_factor
from    user_indexes 
where   table_name = 'T1'
and     index_name = 'T1_I1'
;

Pasting the output from running the above:

Table created.
 
====================
Tablescan with lag()
====================
 
SUM(COUNT_CHG)
--------------
          3901
 
1 row selected.
 
======================
create index and check
======================
Index created.
CLUSTERING_FACTOR
-----------------
             3901
 
1 row selected.

This will work for a global index on a partitioned table, but will give meaningless answers for globally partitioned indexes and local indexes. Furthermore it’s not a mechanism that lends itself to calculating the clustering_factor if you’ve set the table_cached_blocks preference to a value other than 1.
[1] Given the amount of work it would take to run the query to check the clustering_factor you might as well create the index (invisible, perhaps and nologging if that doesn’t cause problems) – which also gives you the option for setting the table_cached_blocks and gathering_stats to see how the clustering_factor varies.
Update (shortly after posting)
It has occurred to me that if you have a “nosegment” index that you’ve been using to test whether or not the optimizer would use it IF you created it, then the one piece of information that is missing from the data dicionary for that index is its clustering_factor (For a multicolumn index you can get a valid value for distinct_keys by creating a column group on the set of columns – which would otherwise be faked when you create the nosegment index.) This code might allow you to write a clustering_factor to the index definition that would give you a better idea of whether the optimizer would use it if you created it. (To be tested.)
Update (a few days later)
I’ve modified the principle query in three ways:

changed the substr() to substrb()
changed the extract from the rowid to be (1,15) instead of (7,9) – so that I have the (data_object_id, file_id, block_id) not just (file_id, block_id)
added a (redundant) predicate “object_name is not null” to the inline query

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

聚簇因子

评论