DWS优化实战：40亿行大表点查从秒级到毫秒级的优化之路

原创 Rune悠然 2026-01-04

457

DWS环境下，客户要求将40亿行大表的点查SQL从秒级优化至毫秒级。虽然是听起来有些离谱的需求，但笔者经多次尝试，最终仍实现了秒级到毫秒级的性能飞跃，完成挑战。下面分享该案例的操作过程。

一、问题与背景

环境配置：DWS3节点8.2.1版本。
表信息：查询仅涉及一个单表，以下简称target_table_name，该表数据量40亿行，大小约3.5TB，132列，行存表，无分区，分布键为 id ，在 batch_no 和 date_col 上有组合索引，统计信息最新。
业务查询：固定点查SQL，使用 batch_no 和 date_col （范围2-3个月）过滤，带 ORDER BY 和 LIMIT 20 。
性能痛点：首次执行耗时3~5秒，缓存后虽可降至毫秒级，但要求首次查询即稳定达到毫秒级。
脱敏后SQL内容如下：

--原始SQL
select
column_name1, column_name2, ... column_name73 --省略71个字段
from target_table_name
where date_col between 'date_start' and 'date_end' --时间范围
and batch_no = 'filter_val1'
and filter_col2 in ('filter_val2')
and filter_col3 = 'filter_val3'
and filter_col4 in ('filter_val4')
and filter_col5 in ('filter_val5')
order by order_col1, order_col2, order_col3, order_col4, order_col5, order_col6 desc
limit 20
;

原SQL执行计划（这里A-time毫秒级源于客户提供的是多次查询之后的执行计划，实际首次查询为秒级）

| id  | operation                                                                 		 | A-rows | E-rows | E-cost       | A-time             |
|-----|------------------------------------------------------------------------------------------|--------|--------|--------------|--------------------|
| 1   | Limit                                                                     		 | 20     | 20     | 20933.08     | 289.211            |
| 2   |  -> Streaming (type: GATHER)                                              		 | 20     | 20     | 20933.08     | 289.204            |
| 3   |     -> Limit                                                              		 | 20     | 20     | 20922.90     | [279.336, 279.336] |
| 4   |        -> Sort                                                            		 | 20     | 3354   | 20932.52     | [279.331, 279.331] |
| 5   |           -> Index Scan using index_name on target_table_name                 		 | 54096  | 20124  | 20819.87     | [164.052, 164.052] |

二、优化历程

1. 第一次尝试：日分区 + 日期前导索引（失败）

操作：按 date_col 创建日分区，并建立 (date_col, batch_no) 的本地索引（中间写入原表数据的操作省略）。

-- 创建日分区表
CREATE TABLE test_day_part (...) 
DISTRIBUTE BY HASH(id)
PARTITION BY RANGE(date_col)
(
    PARTITION p20250101 VALUES LESS THAN ('2025-01-02'::date),
    PARTITION p20250102 VALUES LESS THAN ('2025-01-03'::date),
    PARTITION p20250103 VALUES LESS THAN ('2025-01-04'::date),
    ... -- 省略其他日分区
    PARTITION pmax VALUES LESS THAN (maxvalue)
);

--写入测试数据
--在业务空闲期间，抽取原表数据写入,耗时1小时，分批Insert，具体步骤在此不计。

-- 写入数据后再创建本地分区索引，以date_col为前导列 耗时1小时
CREATE INDEX test_day_part_idx ON test_day_part(date_col, batch_no) LOCAL;

-- 收集统计信息
ANALYZE test_day_part;

结果：查询劣化至62秒。
EXPLAIN ANALYZE/PERFORMANCE 执行计划展示：

| id  | operation                                                                        | A-rows | E-rows | E-cost       | A-time                 |
|-----|--------------------------------------------------------------------------- ------|--------|--------|--------------|------------------------|
| 1   | Limit                                                                            | 20     | 20     | 6000009.41   | 62567.305              |
| 2   |  -> Streaming (type: GATHER)                                                     | 20     | 20     | 6000009.41   | 62567.292              |
| 3   |     -> Limit                                                                     | 20     | 20     | 5999999.40   | [62559.373, 62559.373] |
| 4   |        -> Sort                                                                   | 20     | 20124  | 6000000.93   | [62556.767, 62556.767] |
| 5   |           -> Partition Iterator                                                  | 54096  | 3354   | 5999909.90   | [62433.361, 62433.361] |
| 6   |              -> Partitioned Index Scan using test_day_part_idx on test_day_part  | 54096  | 20124  | 5999909.90   | [62421.604, 62421.604] |

个人复盘：O记经验主义，陷入了之前O记运维的经典优化手段，大表日分区剪枝+本地索引。

分区粒度过细：查询跨2-3个月，需扫描60-90个分区，引入大量元数据与调度开销。
索引前导列错误： date_col 作为前导列，无法利用 batch_no 的高选择性快速过滤。

2. 第二次尝试：月分区 + batch_no前导索引（改善但未达标）

操作：改为月分区，建立 (batch_no, date_col) 的本地索引。

-- 创建日分区表
CREATE TABLE test_day_part (...) 
DISTRIBUTE BY HASH(id)
PARTITION BY RANGE(date_col)
(
    PARTITION p20250101 VALUES LESS THAN ('2025-01-02'::date),
    PARTITION p20250102 VALUES LESS THAN ('2025-01-03'::date),
    PARTITION p20250103 VALUES LESS THAN ('2025-01-04'::date),
    ... -- 省略其他日分区
    PARTITION pmax VALUES LESS THAN (maxvalue)
);

--写入测试数据
--在业务空闲期间，抽取原表数据写入,耗时1小时，分批Insert，具体步骤在此不计。

-- 写入数据后再创建本地分区索引，以date_col为前导列 耗时1小时
CREATE INDEX test_day_part_idx ON test_day_part(date_col, batch_no) LOCAL;

-- 收集统计信息
ANALYZE test_day_part;

结果：首次查询耗时约4秒，与原性能持平。
EXPLAIN ANALYZE/PERFORMANCE 执行计划展示：

| id  | operation                                                                          | A-rows | E-rows | E-cost       | A-time               |
|-----|------------------------------------------------------------------------------------|--------|--------|--------------|----------------------|
| 1   | Limit                                                                              | 20     | 20     | 15704.06     | 4027.172             |
| 2   |  -> Streaming (type: GATHER)                                                       | 20     | 20     | 15704.06     | 4027.165             |
| 3   |     -> Limit                                                                       | 20     | 20     | 15694.14     | [4012.219, 4012.219] |
| 4   |        -> Sort                                                                     | 20     | 3354   | 15702.42     | [4012.208, 4012.208] |
| 5   |           -> Partition Iterator                                                    | 54096  | 20212  | 15605.40     | [3879.718, 3879.718] |
| 6   |              -> Partitioned Index Scan using test_mon_part_idx on test_mon_part	   | 54096  | 20124  | 15605.40     | [3859.981, 3859.981] |

瓶颈分析：

物理存储仍无序：索引可定位约5万行数据，但回表需随机读取分散的数据页。
排序开销未消除： ORDER BY 字段与索引无关，仍需内存排序。

3. 第三次优化：月分区 + 索引聚簇（成功）

操作：在月分区及 (batch_no, date_col) 索引基础上，执行：

CLUSTER test_mon_part USING test_mon_part_idx;
ANALYZE test_mon_part;

EXPLAIN ANALYZE/PERFORMANCE 执行计划展示：
注：为了实验结果合理，尽量减少buffer命中对执行效率提升的干扰因素，第三次优化间隔一段时间后再次进行EXPLAIN PERFORMANCE

| id  | operation                                                                          | A-rows | E-rows | E-cost       | A-time             |
|-----|------------------------------------------------------------------------------------|--------|--------|--------------|--------------------|
| 1   | Limit                                                                              | 20     | 20     | 15704.06     | 301.906            |
| 2   |  -> Streaming (type: GATHER)                                                       | 20     | 20     | 15704.06     | 301.892            |
| 3   |     -> Limit                                                                       | 20     | 20     | 15694.14     | [298.010, 298.010] |
| 4   |        -> Sort                                                                     | 20     | 3354   | 15702.42     | [297.997, 297.997] |
| 5   |           -> Partition Iterator                                                    | 54096  | 20212  | 15605.40     | [170.168, 170.168] |
| 6   |              -> Partitioned Index Scan using test_mon_part_idx on test_mon_part	   | 54096  | 20124  | 15605.40     | [156.252, 156.252] |

原理： CLUSTER 命令按索引顺序物理重组表数据，使符合查询条件的数据在磁盘上连续存储。
效果：首次查询耗时降至约300毫秒，且替换条件和日期区间多次执行依然保持毫秒级，实现毫秒级稳定响应。
性能提升原因分析：

第二次优化执行计划Datanode Information：

Datanode Information (identified by plan id)
---------------------------------------------
1 --Limit
  (actual time=4027.163..4027.172 rows=20 loops=1)
  (CPU: ex c/r=43, ex row=20, ex cyc=879, inc cyc=402714094)

2 --Streaming (type: GATHER)
  (actual time=4027.159..4027.165 rows=20 loops=1)
  (Buffers: shared hit=4)
  (CPU: ex c/r=20135601, ex row=20, ex cyc=402713215, inc cyc=402713215)

3 --Limit
  dn_xxx_xxx (actual time=4012.209..4012.219 rows=20 loops=1)
  dn_xxx_xxx (CPU: ex c/r=46, ex row=20, ex cyc=926, inc cyc=401211712)

4 --Sort
  dn_xxx_xxx (actual time=4012.205..4012.208 rows=20 loops=1)
  dn_xxx_xxx (Buffers: shared hit=27 read=7931)
  dn_xxx_xxx (CPU: ex c/r=250, ex row=54096, ex cyc=13531916, inc cyc=401210784)

5 --Partition Iterator
  dn_xxx_xxx (actual time=0.175..3879.718 rows=54096 loops=1)
  dn_xxx_xxx (CPU: ex c/r=39, ex row=54096, ex cyc=12828549, inc cyc=387678669)

6 --Partitioned Index Scan using index_name on table_name
  dn_xxx_xxx (actual time=1.673..3859.981 rows=54096 loops=1) (filter time=65.727 projection time=30.799)
  dn_xxx_xxx (Buffers: shared hit=10 read=7931)
  dn_xxx_xxx (CPU: ex c/r=7130, ex row=54096, ex cyc=385750319, inc cyc=385750319)

第三次优化执行计划Datanode Information：

Datanode Information (identified by plan id)
---------------------------------------------
1 --Limit
  (actual time=317.899..317.911 rows=20 loops=1)
  (CPU: ex c/r=51, ex row=20, ex cyc=1035, inc cyc=31790600)

2 --Streaming (type: GATHER)
  (actual time=317.894..317.903 rows=20 loops=1)
  (Buffers: shared hit=1)
  (CPU: ex c/r=1589478, ex row=20, ex cyc=31789565, inc cyc=31789565)

3 --Limit
  dn_xxx_xxx (actual time=306.557..306.568 rows=20 loops=1)
  dn_xxx_xxx (CPU: ex c/r=52, ex row=20, ex cyc=1053, inc cyc=30655802)

4 --Sort
  dn_xxx_xxx (actual time=306.553..306.556 rows=20 loops=1)
  dn_xxx_xxx (Buffers: shared hit=20 read=5997)
  dn_xxx_xxx (CPU: ex c/r=212, ex row=54096, ex cyc=11497009, inc cyc=30654749)

5 --Partition Iterator
  dn_xxx_xxx (actual time=0.177..193.910 rows=54096 loops=1)
  dn_xxx_xxx (CPU: ex c/r=24, ex row=54096, ex cyc=1341027, inc cyc=19157740)

6 --Partitioned Index Scan using index_name on table_name
  dn_xxx_xxx (actual time=0.624..180.625 rows=54096 loops=4) (filter time=54.221 projection time=28.917)
  dn_xxx_xxx (Buffers: shared hit=3 read=5997)
  dn_xxx_xxx (CPU: ex c/r=329, ex row=54096, ex cyc=17816713, inc cyc=17816713)

数据物理有序后，对比Datanode Information的Plan id 4和6发现，排序和索引扫描时间 actual time 大幅减少。
关注 Buffers: shared hit 与 read 的比例，第一个计划需要从磁盘读取约7931个数据块，而第二个计划仅需读取约5997个，减少了近25%。这直接导致了第二个计划各层算子的I/O等待时间大幅缩短，进而使得CPU能够更高效地工作，整体响应时间更快（cyc执行周期数的降低）。

三、局部聚簇适用条件与操作指南

适用场景

查询模式固定（WHERE、ORDER BY条件稳定）。
返回数据量小（如 LIMIT 20 ）。
表为分析型（AP）负载，极少有DML操作（INSERT/UPDATE/DELETE会破坏聚簇顺序）。

操作步骤

创建索引：按查询条件创建本地分区索引，等值字段在前（如 (batch_no, date_col) ）。
执行聚簇：

历史数据：全表 CLUSTER （需在业务低峰期进行，锁表耗时较长）。
增量数据：对单个分区执行 CLUSTER （推荐，影响小）。

更新统计信息：聚簇后立即执行 ANALYZE 。
验证监控：检查执行计划，定期监控查询性能。

避坑要点

维护成本：聚簇非一劳永逸，新增数据会破坏顺序，需定期对增量分区执行聚簇。
索引匹配：聚簇依赖的索引必须与高频查询条件高度匹配。
锁与资源： CLUSTER 需要 ACCESS EXCLUSIVE 锁，且消耗大量I/O与临时空间（尤其注意防止集群只读）。
分区策略：先确保分区粒度（如月分区）与查询范围匹配，再实施聚簇。

四、总结

对于海量数据表的固定点查优化，在正确设计分区和索引的基础上，通过局部聚簇（CLUSTER）将数据物理重组，是实现毫秒级稳定查询的关键。该方案仅适用于极少更新、查询模式固定的分析型AP场景，但必须配套定期的增量数据聚簇维护，以保持性能。

数据库实操性能优化墨力计划 gaussdb(dws)大表点查

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

文章被以下合辑收录

墨天轮数据库性能优化征文-优质文章合集（共37篇）

本合辑汇总了「实操看我的」数据库性能优化篇征文活动中“最佳实操奖” 文章及部分高阅读量合格文章，欢迎参考~

DWS数仓服务专题（共1篇）

DWS数仓服务专题，分享一些运维案例