
前言
今天带来的是一个比较经典的问题,group by字段上有索引到底能不能使用?。我们分别在Oracle和PostgreSQL中创建一张表,然后在sum求和的列和group by的列上创建一个组合索引。执行分组查询求和,看看两大数据库是什么表现。
Oracle
SQL> CREATE TABLE gtable (id NOT NULL, groupby NOT NULL, orderby NOT NULL, padding NOT NULL) 2 AS 3 SELECT level, 4 mod(level, 100), 5 mod(level * 7 , 10000), 6 CAST(RPAD('Value ' || level || ' ', 500, '*') AS VARCHAR2(500)) 7 FROM dual 8 CONNECT BY 9 level <= 1000000;Table created.SQL> CREATE INDEX ix_gtable_groupby_orderby ON gtable(groupby, orderby);Index created.BEGIN DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SYS',tabname => 'GTABLE', estimate_percent => 100, method_opt => 'for all columns size auto', granularity => 'ALL',degree => 16, cascade => TRUE, no_invalidate => FALSE, force => TRUE);END;/SQL> alter session set statistics_level=all; SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));PLAN_TABLE_OUTPUT-----------------------------------------------------------------------------------------------------------------------------------SQL_ID 0uxruvhgy6wac, child number 0-------------------------------------SELECT groupby, max(orderby) gmax FROM gtable GROUP BY groupbyPlan hash value: 3942815790----------------------------------------------------------------------------------------------------------------------------------------| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |----------------------------------------------------------------------------------------------------------------------------------------| 0 | SELECT STATEMENT | | 1 | | 100 |00:00:00.26 | 2527 | | | || 1 | HASH GROUP BY | | 1 | 100 | 100 |00:00:00.26 | 2527 | 1200K| 1200K| 1380K (0)|| 2 | INDEX FAST FULL SCAN| IX_GTABLE_GROUPBY_ORDERBY | 1 | 1001K| 1000K|00:00:00.09 | 2527 | | | |----------------------------------------------------------------------------------------------------------------------------------------14 rows selected.
可以看到Oracle针对这类查询使用了INDEX FAST FULL SCAN
的方法获取数据。并没有使用全表扫描。
SELECT /*+full(gtable)*/ groupby, max(orderby) gmax FROM gtable GROUP BY groupby;SQL> select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));PLAN_TABLE_OUTPUT-----------------------------------------------------------------------------------------------------------------------------------SQL_ID d34t0d021gwt1, child number 0-------------------------------------SELECT /*+full(gtable)*/ groupby, max(orderby) gmax FROM gtable GROUPBY groupbyPlan hash value: 1428069678------------------------------------------------------------------------------------------------------------------| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |------------------------------------------------------------------------------------------------------------------| 0 | SELECT STATEMENT | | 1 | | 100 |00:00:00.35 | 76939 | | | || 1 | HASH GROUP BY | | 1 | 100 | 100 |00:00:00.35 | 76939 | 1200K| 1200K| 2542K (0)|| 2 | TABLE ACCESS FULL| GTABLE | 1 | 1001K| 1000K|00:00:00.19 | 76939 | | | |------------------------------------------------------------------------------------------------------------------
我们手动加hint让它强制走全表扫描,可以发现Oracle消耗的时间和逻辑读,都要远远的高于走索引的情况。
PostgreSQL
同样,我们在PostgreSQL中创建一模一样的表。当前我的PostgreSQL版本为13.2
CREATE TABLE GTable ( id INT NOT NULL, groupby INT NOT NULL, orderby INT NOT NULL, padding VARCHAR(1000) NOT NULL);INSERT INTO GTableSELECT s, s % 100, s % 10000, RPAD('Value ' || s || ' ' , 500, '*')FROM generate_series(1, 100000) s;CREATE INDEX ix_gtable_groupby_orderby ON gtable(groupby, orderby);testdb=# analyze GTable; postgres=# explain (analyze,buffers) SELECT groupby, max(orderby) gmax from gtable group by groupby; QUERY PLAN -------------------------------------------------------------------------------------------------------------------- HashAggregate (cost=8643.00..8644.00 rows=100 width=8) (actual time=38.129..38.144 rows=100 loops=1) Group Key: groupby Batches: 1 Memory Usage: 32kB Buffers: shared hit=7143 -> Seq Scan on gtable (cost=0.00..8143.00 rows=100000 width=8) (actual time=0.006..13.046 rows=100000 loops=1) Buffers: shared hit=7143 Planning: Buffers: shared hit=8 read=1 Planning Time: 0.161 ms Execution Time: 38.190 ms(10 rows)
可以看到PostgreSQL默认直接使用了Seq Scan
,它并没有走索引。而我们其实也收集了表的统计信息,他就是不会走索引。这主要是因为默认收集统计信息并没有对列精细收集。我们需要把列的统计信息收集的更加细一点,然后它就会走索引了。
通过对列groupby和orderby调整统计信息值为10000后,再次收集统计信息。执行1-2次后可以使用索引,走索引的效率明显好于全表扫描。
testdb=# ALTER TABLE GTable ALTER COLUMN groupby SET STATISTICS 10000;testdb=# ALTER TABLE GTable ALTER COLUMN orderby SET STATISTICS 10000;testdb=# analyze GTable;postgres=# explain (analyze,buffers) SELECT groupby, max(orderby) gmax from gtable group by groupby; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------- GroupAggregate (cost=0.29..2469.29 rows=100 width=8) (actual time=0.270..23.682 rows=100 loops=1) Group Key: groupby Buffers: shared hit=33 read=84 -> Index Only Scan using ix_gtable_groupby_orderby on gtable (cost=0.29..1968.29 rows=100000 width=8) (actual time=0.035..9.674 rows=100000 loops=1) Heap Fetches: 0 Buffers: shared hit=33 read=84 Planning: Buffers: shared hit=11 Planning Time: 0.135 ms Execution Time: 23.705 ms(10 rows)
可以看到执行时间从38ms下降到23ms。
后记
我们在PostgreSQL中执行sum和group by做统计查询,默认往往不能使用group by列上的索引。此时需要把列的统计信息值调高一点,再次收集才能和Oracle表现一样。
注:以上结论是在PostgreSQL 13.2版本上测试结果。
参考链接
1.https://stackoverflow.com/questions/48382860/why-covering-index-is-not-used-in-postgresql


励志成为PostgreSQL大神
长按关注吧
文章转载自励志成为postgresql大神,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




