Original author: Jonathan Lewis
Original article: Index Engineering
The translation follows:
This case study is based on a question that appeared on the Oracle Developer Community forum a few days earlier.
The question was:
I have a table of more than 200 million rows that is used for inserts, updates and queries. I have a query on this table and want to know what index I could create to speed up the query.
The table definition is as follows:
R_ID NOT NULL VARCHAR2(30)
C_ID NUMBER
N_ID VARCHAR2(40)
CREATED_BY NOT NULL VARCHAR2(30)
LAST_MODIFIED_BY VARCHAR2(30)
C_STATUS CHAR(1)
A_ACTION CHAR(1)
R_CREATION_DT TIMESTAMP(6)
CNT NUMBER(38)
The current indexes are as follows:
CREATE index tbl_1 on tbl(cnt,r_creation_dt,c_id,a_action,last_modified_by);
CREATE index tbl_2 on tbl(cnt,c_status,r_creation_dt);
CREATE bitmap index tbl_3 on tbl(c_status);
The query is as follows:
SELECT c_id, a_action, cnt, last_modified_by
FROM tbl
WHERE c_status IN ('N', 'F')
AND cnt <= 5
GROUP BY cnt, r_creation_dt, case_id, anonymize_action, last_modified_by
ORDER BY r_creation_dt FETCH FIRST 1000 ROWS ONLY;
What I'm aiming to present in this note is the pattern of thinking you should adopt in cases like this. The final suggestion isn't necessarily the best answer to the question posed (at the time of writing, the OP hadn't supplied enough information for anyone to come up with a best solution), but the point of the exercise is to talk about the journey and (perhaps) remind you of some of the extreme engineering you can do with indexes.
The (massaged) problem statement is as follows:
"I have a table of more than 200 million rows that is used for inserts, updates and queries. I have a query on this table and want to know what index I could create to speed up the query."
The supplied table definition was not consistent with the column names used in the query, so I had to do a little editing.
The tidied-up table definition, indexes, and query are as follows:
rem
rem Script: extreme_indexing.sql
rem Author: Jonathan Lewis
rem Dated: Jan 2020
rem
create table tbl (
r_id varchar2(30) not null,
c_id number,
n_id varchar2(40),
created_by varchar2(30) not null,
last_modified_by varchar2(30),
c_status char(1),
a_action char(1),
r_creation_dt timestamp(6),
cnt number(38)
)
;
create index tbl_1 on tbl(cnt, r_creation_dt, c_id, a_action, last_modified_by);
create index tbl_2 on tbl(cnt, c_status, r_creation_dt);
create bitmap index tbl_3 on tbl(c_status);
select
/*+ index(tbl) */
c_id,
a_action,
cnt,
last_modified_by
from
tbl
where
c_status in(
'N',
'F'
)
and cnt <= 5 -- > comment to avoid wordpress format issue
and r_creation_dt is not null
group by
cnt,
r_creation_dt,
c_id,
a_action,
last_modified_by,
c_status
order by
r_creation_dt
fetch
first 1000 rows only;
Analysis:
The first thing to point out is that the bitmap index tbl_3 is almost certainly a bad idea: bitmap indexes and transactional activity do not mix. It seems likely that the OP had read one of the many Internet notes that make the "not totally wrong" but very misleading claim that "bitmap indexes work well on columns with a small number of distinct values", and applied it to a "status"-like column holding only a few distinct values. The detail that matters here is that this table also sees a lot of update activity: when a bitmap-indexed column changes, the cost of maintaining the index (and the contention it introduces) can be enormous.
With that error out of the way, we can start to think about the query. It uses the (fairly new) "fetch first N rows only" syntax, which means we may have to find a lot of data and sort it before returning a subset. Performance issues can be very deceptive in cases like this: we may want only a small result set but have to do a large amount of work to get it.
In this case we're after the first 1,000 rows, which suggests there may be a lot of data satisfying the query. So we have two targets to meet to optimise it:
- acquire the data we need as efficiently as possible
- post-process the data we acquire to derive the 1,000 rows as efficiently as possible
The query is a single-table access, which means we're either going to do a full tablescan or find a good indexed access path; we don't have to worry about join strategies. So the first thing to consider is the volume (and scatter) of the data matching the predicates. If only a "small" amount of data satisfies "c_status in ('N','F') and cnt <= 5", then an index on, or starting with, (c_status, cnt) may be very helpful. (Note that I've put the column with the equality/IN predicate first; that's part of a generic strategy for creating multi-column indexes.)
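The equality-first idea can be seen in miniature outside Oracle too. The following is my own sketch, not from the original post: the table contents, row counts, and the use of Python's built-in sqlite3 engine are all invented stand-ins for the real 200M-row Oracle table.

```python
import sqlite3

# Toy stand-in for the 200M-row table from the post; the data volume and
# value distribution are invented purely for illustration.
con = sqlite3.connect(":memory:")
con.execute("""
    create table tbl (
        c_id integer, c_status text, cnt integer,
        a_action text, last_modified_by text, r_creation_dt integer
    )
""")
con.executemany(
    "insert into tbl values (?,?,?,?,?,?)",
    [(i, "NFCX"[i % 4], i % 10, "A", "user", i) for i in range(1, 2001)])

# Equality/IN column first, range column second, per the strategy above.
con.execute("create index tbl_nf on tbl(c_status, cnt)")

plan = con.execute("""
    explain query plan
    select c_id from tbl where c_status in ('N','F') and cnt <= 5
""").fetchall()
for row in plan:
    print(row[-1])  # expect a SEARCH ... USING INDEX tbl_nf line
```

With this column order the engine can probe the index once per IN-list value and range-scan the cnt portion within each probe, which is exactly why the equality column goes first.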
This, though, raises several questions that need to be answered:
- How small is "small"? In the context of 200 million rows, 100,000 is small; but if you had to visit 100,000 different blocks in the table and do 100,000 real single-block reads from disc, that would still be a very bad thing.
- How many rows have status 'N', how many have status 'F', and how many have cnt <= 5? Maybe a really tiny number of rows have cnt <= 5 while lots have c_status in ('N','F'), which could make this a case where ignoring the generic column-ordering strategy (translator's note: the strategy of putting equality/IN columns before range columns in a composite index) would be very effective. Maybe the number of rows satisfying each individual condition is high but the number satisfying the combination is very low.
- Is this the ONLY combination of c_status and cnt that is of interest, or (for example) was 5 just a number picked as an example? Would different c_status values be of interest? Would some required combinations of c_status and cnt need completely different execution paths for the best performance?
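The questions above can be answered with a handful of counts before committing to an index design. A small sketch of my own (the data distribution is entirely invented, and sqlite3 stands in for the real Oracle table):

```python
import random
import sqlite3

random.seed(1)
con = sqlite3.connect(":memory:")
con.execute("create table tbl (c_status text, cnt integer)")
# Invented distribution: 'N' and 'F' rare, small cnt values common.
# The real numbers would have to come from counts against the real table.
con.executemany(
    "insert into tbl values (?,?)",
    [(random.choices("NFCX", weights=[1, 1, 49, 49])[0], random.randint(0, 9))
     for _ in range(10_000)])

q = "select count(*) from tbl where "
counts = {
    "c_status in ('N','F')": con.execute(q + "c_status in ('N','F')").fetchone()[0],
    "cnt <= 5": con.execute(q + "cnt <= 5").fetchone()[0],
    "both": con.execute(q + "c_status in ('N','F') and cnt <= 5").fetchone()[0],
}
print(counts)
```

With numbers like these in hand, "how small is small" stops being a guess: the combined count bounds the work any index on these columns can save.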
I'm going to make some decisions in order to proceed (they may be totally wrong as far as the OP is concerned), so remember that this note is for discussion purposes only. Let's assume the common query is always exactly as stated: perhaps it runs every few minutes to clear up some outstanding work, with new rows matching the query appearing all the time while older rows are processed, change status, and disappear from the result set. Let's also assume the result set is always "small", and that it's small because 'N' and 'F' are rare (even if the total number of rows with cnt <= 5 is large).
With these assumptions we can start by creating an index on (c_status, cnt), which gets us exactly the rows we want from the table with no "throwaway" after visiting it. Here's the execution plan with that index (running on 12.2.0.1, with an index() hint to force its use when necessary):
[execution plan output not preserved in this copy]
I enabled rowsource execution statistics (alter session set statistics_level = all) and pulled the execution plan from memory (translator's note: this collects per-step runtime statistics, so the plan shows the actual rows and buffer gets of each operation). As the A-Rows figures for the "index range scan" and "table access by index rowid" steps show, I identified and acquired exactly the rows from the table that might be relevant, all 13,142 of them. Then I did a "sort group by" of that data, sorting in a way that means the rows are produced in exactly the order needed by the window function Oracle uses to select the 1,000 rows I want.
If you're curious, here is the transformed SQL that was actually optimised and executed (obtained via dbms_utility.expand_sql_text() and cosmetically enhanced):
SELECT A1.C_ID C_ID,
A1.A_ACTION A_ACTION,
A1.CNT CNT,
A1.LAST_MODIFIED_BY LAST_MODIFIED_BY
FROM (SELECT
/*+ INDEX (A2) */
A2.C_ID C_ID,
A2.A_ACTION A_ACTION,
A2.CNT CNT,
A2.LAST_MODIFIED_BY LAST_MODIFIED_BY,
A2.R_CREATION_DT rowlimit_$_0,
ROW_NUMBER() OVER(ORDER BY A2.R_CREATION_DT) rowlimit_$$_rownumber
FROM TEST_USER.TBL A2
WHERE (A2.C_STATUS = 'N' OR A2.C_STATUS = 'F')
AND A2.CNT <= 5
AND A2.R_CREATION_DT IS NOT NULL
GROUP BY A2.CNT,
A2.R_CREATION_DT,
A2.C_ID,
A2.A_ACTION,
A2.LAST_MODIFIED_BY,
A2.C_STATUS) A1
WHERE A1.rowlimit_$$_rownumber <= 1000
ORDER BY A1.rowlimit_$_0
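The rewrite above is the standard ROW_NUMBER() pattern that "fetch first N rows only" expands into. As a sanity check of the equivalence, here is my own sketch on invented data, using SQLite's window functions (available since SQLite 3.25) rather than Oracle:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (dt integer, val text)")
# 100 rows with distinct, shuffled dt values (invented data).
con.executemany("insert into t values (?,?)",
                [(i * 7 % 100, f"v{i}") for i in range(100)])

fetch_first = con.execute(
    "select dt, val from t order by dt limit 10").fetchall()

# The shape the row-limiting clause is rewritten into:
rewritten = con.execute("""
    select dt, val
    from (select dt, val,
                 row_number() over (order by dt) as rn
          from t)
    where rn <= 10
    order by dt
""").fetchall()

print(fetch_first == rewritten)  # the two forms return the same rows
```

The inner block must still visit and order every matching row before the outer rn <= 10 filter discards most of them, which is exactly the "small result, large workload" trap described earlier.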
There are three main drawbacks to this choice of index:
- I've acquired all the rows in the table that match the predicate even though I only really needed a subset
- I've done a massive sort
- I've created an index that includes every row in the table
Remember that the OP has a table of 200M rows, and we are assuming (pretending) that only a very small fraction of them match the initial predicates. Creating an index on 200M rows because we're interested in only a few tens of thousands is wasteful of space and, given that we have a "status" column, probably wasteful of processing resources as the status moves through several values. So I'm going to address that issue first: let's create a "function-based" index that ignores most of the data, and change the code to take advantage of it. Since this is 12c, we can do that by adding a virtual column and indexing the column.
alter table tbl add nf_r_creation_dt invisible
generated always as (
case
when c_status in ('N','F') and cnt <= 5
then r_creation_dt
end
) virtual
/
create index tbl_i2 on tbl(nf_r_creation_dt)
/
I've introduced an invisible virtual column called nf_r_creation_dt (nf_ for status N/F) which uses a CASE expression matching the original predicate to return r_creation_dt for rows that match, and null for all the other (roughly 200M) rows. Since Oracle creates no index entry when every indexed column is null, the only entries in the index are for rows I might want to see.
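Other engines expose this "index only the interesting rows" idea more directly: SQLite, for instance, has partial indexes, whose WHERE clause plays the same role as the CASE-based virtual column. A hedged sketch of my own (invented data, not from the original post):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table tbl (c_status text, cnt integer, r_creation_dt integer)")
con.executemany("insert into tbl values (?,?,?)",
                [("NFCX"[i % 4], i % 10, i) for i in range(1000)])

# The partial-index WHERE clause stands in for Oracle's CASE-based
# virtual column: only rows matching the predicate get index entries.
con.execute("""
    create index tbl_nf_dt on tbl(r_creation_dt)
    where c_status in ('N','F') and cnt <= 5
""")

query = """
    select r_creation_dt from tbl
    where c_status in ('N','F') and cnt <= 5
    order by r_creation_dt limit 10
"""
plan = con.execute("explain query plan " + query).fetchall()
rows = con.execute(query).fetchall()
for line in plan:
    print(line[-1])  # expect the tiny index tbl_nf_dt to be used
```

The partial index can only be used because the query's WHERE clause implies the index's WHERE clause, which mirrors the Oracle rewrite below: the query has to be phrased in terms the small index can satisfy.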
I also have to edit the SQL to match, which simply means changing every occurrence of r_creation_dt to nf_r_creation_dt and eliminating the original predicate, giving the following text and execution plan:
select
/*+ index(tbl) */
c_id,
a_action,
cnt,
last_modified_by
from
tbl
where
nf_r_creation_dt is not null
group by
nf_r_creation_dt,
cnt,
c_id,
a_action,
last_modified_by,
c_status
order by
nf_r_creation_dt
fetch
first 1000 rows only -- 1,000 rows in the original
[execution plan output not preserved in this copy]
The plan shows an index full scan of the new index. Since the index holds only the rows that might be interesting, that isn't a threat; however we still have to visit every matching row in the table to pick up the columns that aren't in the index, and that might result in more random I/O than we'd like. So the next step in enhancing performance is to consider adding all the columns we need to the index. There's a little problem with that: if we add the columns as they are, we're back to having an index entry for every single row in the table, so we need to use the same CASE mechanism to create more virtual columns:
alter table tbl add nf_c_status invisible
generated always as (
case
when c_status in ('N','F') and cnt <= 5
then c_status
end
) virtual
/
alter table tbl add nf_last_modified_by invisible
generated always as (
case
when c_status in ('N','F') and cnt <= 5
then last_modified_by
end
) virtual
/
alter table tbl add nf_a_action invisible
generated always as (
case
when c_status in ('N','F') and cnt <= 5
then a_action
end
) virtual
/
alter table tbl add nf_c_id invisible
generated always as (
case
when c_status in ('N','F') and cnt <= 5
then c_id
end
) virtual
/
alter table tbl add nf_cnt invisible
generated always as (
case
when c_status in ('N','F') and cnt <= 5
then cnt
end
) virtual
/
create index tbl_i3 on tbl(
nf_r_creation_dt,
nf_cnt,
nf_c_id,
nf_a_action,
nf_last_modified_by,
nf_c_status
)
;
It looks like a bit of a pain to go through all this rigmarole to get columns that are null most of the time but echo the original values when a row matches our original predicate; and then we have to modify the query to match:
select
/*+ index(tbl) */
nf_c_id,
nf_a_action,
nf_cnt,
nf_last_modified_by
from
tbl
where
nf_r_creation_dt is not null
group by
nf_r_creation_dt,
nf_cnt,
nf_c_id,
nf_a_action,
nf_last_modified_by,
nf_c_status
order by
nf_r_creation_dt
fetch
first 1000 rows only -- 1,000 rows in the original
But the big payoff comes from the execution plan:
[execution plan output not preserved in this copy]
Notice how the SORT GROUP BY operation is a NOSORT, and the WINDOW operation is both NOSORT and STOPKEY?
We've got the smallest index possible, one that is modified only as rows move into, or out of, the interesting state. When we run the query, Oracle does a full scan of that index, in which every entry is known to satisfy the predicates and already arrives in the required order, and it stops as soon as it has aggregated enough results.
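The NOSORT and STOPKEY combination has a small-scale analogue that is easy to demonstrate: when an index already delivers rows in ORDER BY order, the plan needs no sort step, and the row limit can stop the scan early. My own sketch on invented data in sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (dt integer, payload text)")
con.executemany("insert into t values (?,?)",
                [(i, "x") for i in range(5000)])
con.execute("create index t_dt on t(dt)")

# With the index, rows already come out in dt order: no sort step,
# and the LIMIT can stop the scan early (Oracle's NOSORT + STOPKEY).
with_index = con.execute(
    "explain query plan select dt from t order by dt limit 100").fetchall()
# Forcing a plain table scan brings the sort back.
no_index = con.execute(
    "explain query plan select dt from t not indexed "
    "order by dt limit 100").fetchall()

print([line[-1] for line in with_index])
print([line[-1] for line in no_index])
```

The unindexed plan includes a "USE TEMP B-TREE FOR ORDER BY" step that the indexed plan does not need, which is the same distinction Oracle's plan makes with NOSORT.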
tl;dr: For very special cases it's amazing what you can (sometimes) do, if you can modify the code, with carefully engineered indexes that minimise both the work done by a query and the work needed to maintain the infrastructure for it. Virtual columns are a fantastic aid, especially now that 12c allows them to be invisible.
Original post: "Index Engineering" by Jonathan Lewis, filed under Indexing, Oracle, Tuning, 4:53 pm GMT Jan 20, 2020.