发现隐患SQL
之前阅读德哥文章,本文正好实践下
https://github.com/digoal/blog/blob/master/201612/20161231_01.md
某次巡检发现隐患SQL
SELECT name, VAR_TEXT FROM PT_WF_RU_VARIABLE varWhere 1=1 and var.NAME ='prociName'and VAR_TEXT like '%变XXXXXXXX票%';
执行成本较高,经过了解其中prociName 比较固定,而值:变XXXXXXX票是用户输入的值,该值用户比较关心,用户查询比较频繁,用户也会查询其他关键字,但是总的来说返回的数据量不多。
Gather (cost=1000.00..31695.89 rows=4 width=34) (actual time=27.070..329.595 rows=399 loops=1)Workers Planned: 3Workers Launched: 0Buffers: shared hit=25730-> Parallel Seq Scan on pt_wf_ru_variable var (cost=0.00..30695.49 rows=1 width=34) (actual time=26.625..329.001 rows=399 loops=1)Filter: (((var_text)::text ~~ '%变XXXXXXXX票%'::text) AND ((name)::text = 'prociName'::text))Rows Removed by Filter: 1025572Buffers: shared hit=25730Planning Time: 0.287 msExecution Time: 329.657 ms
2.5万逻辑读, 耗时420ms. 表数据总量102W,其中能匹配上'%变XXXXXX票%'只有399条数据。
优化方案
创建gin 索引
GIN(Generalized Inverted Index, 通用倒排索引) 是一个存储对(key, posting list)集合的索引结构适用于数组、全文检索等数据类型,可以扩展支持更多的数据类型。添加extension。
Create extension if not exists pg_trgm;create extension if not EXISTS btree_gin;-- 新建gin索引CREATE INDEX idx_pt_wt_var_name_text_gin ON power_bpm.PT_WF_RU_VARIABLE using gin( var_text gin_trgm_ops );
执行计划
Bitmap Heap Scan on pt_wf_ru_variable var (cost=60.13..127.07 rows=4 width=34) (actual time=0.715..1.897 rows=399 loops=1)Recheck Cond: ((var_text)::text ~~ '%变XXXXXXXX票%'::text)Filter: ((name)::text = 'prociName'::text)Heap Blocks: exact=272Buffers: shared hit=290-> Bitmap Index Scan on idx_pt_wt_var_name_text_gin (cost=0.00..60.13 rows=17 width=0) (actual time=0.618..0.619 rows=399 loops=1)Index Cond: ((var_text)::text ~~ '%变XXXXXXXX票%'::text)Buffers: shared hit=18Planning Time: 0.378 msExecution Time: 2.000 ms
成本:290逻辑读, 耗时2ms.
短语无法使用gin索引
Gather (cost=1000.00..32260.48 rows=5661 width=34) (actual time=5.015..344.088 rows=24341 loops=1)Workers Planned: 3Workers Launched: 0Buffers: shared hit=25730-> Parallel Seq Scan on pt_wf_ru_variable var (cost=0.00..30694.38 rows=1826 width=34) (actual time=0.191..337.010 rows=24341 loops=1)Filter: (((var_text)::text ~~ '%XX%'::text) AND ((name)::text = 'prociName'::text))Rows Removed by Filter: 1001630Buffers: shared hit=25730Planning Time: 0.432 msExecution Time: 345.292 ms
创建gist 索引
Gist(Generalized Search Tree 归纳树) 用于解决一些b-tree, gin难以解决的数据减少问题.GiST具有数据聚集的功能,聚集后,在单个组内包含的KEY+HEAP行号会放到单个INDEX PAGE中存储在GiST的entry 中,便于检索。
CREATE INDEX idx_pt_wt_var_name_text_gist ON power_bpm.PT_WF_RU_VARIABLE using gist( var_text gist_trgm_ops);
使用 '%变XXXXXXX票%' 查询时:执行计划:
Bitmap Heap Scan on pt_wf_ru_variable var (cost=4.54..71.48 rows=4 width=34) (actual time=9.430..10.395 rows=399 loops=1)Recheck Cond: ((var_text)::text ~~ '%变XXXXXXXX票%'::text)Filter: ((name)::text = 'prociName'::text)Heap Blocks: exact=272Buffers: shared hit=1094-> Bitmap Index Scan on idx_pt_wt_var_name_text_gist (cost=0.00..4.54 rows=17 width=0) (actual time=9.344..9.345 rows=399 loops=1)Index Cond: ((var_text)::text ~~ '%变XXXXXXXX票%'::text)Buffers: shared hit=822Planning Time: 0.323 msExecution Time: 10.504 ms
使用索引,逻辑读1094 耗时10ms并且对比全表扫描,性能提升20倍左右
短语带来大量的recheck removed
使用 '%XX%' 查询时:
Bitmap Heap Scan on pt_wf_ru_variable var (cost=13291.07..26964.26 rows=5661 width=34) (actual time=203.993..367.774 rows=24341 loops=1)Recheck Cond: (((var_text)::text ~~ '%XX%'::text) AND ((name)::text = 'prociName'::text))Rows Removed by Index Recheck: 216922Heap Blocks: exact=25464Buffers: shared hit=45874-> BitmapAnd (cost=13291.07..13291.07 rows=5661 width=0) (actual time=199.172..199.173 rows=0 loops=1)Buffers: shared hit=20410-> Bitmap Index Scan on idx_pt_wt_var_name_text_gist (cost=0.00..1776.85 rows=24059 width=0) (actual time=164.508..164.508 rows=985082 loops=1)Index Cond: ((var_text)::text ~~ '%XX%'::text)Buffers: shared hit=16760-> Bitmap Index Scan on pt_wf_idx_variable_nametext (cost=0.00..11511.13 rows=241411 width=0) (actual time=33.550..33.551 rows=261285 loops=1)Index Cond: ((name)::text = 'prociName'::text)Buffers: shared hit=3650Planning Time: 0.320 msExecution Time: 368.991 ms
分析:可以使用gist 索引,但是recheck非常严重,导致执行成本过高, 2.5万逻辑读 368ms。第一次过滤后又近100W数据,大量的数据在最后recheck后 removed。因为第一次过滤时token命中的数据组合后的block都符合条件导致。
创建rum索引
RUM 参考了GIN的实现,并改进了GIN在全文检索时的一些弱点,比如在INDEX 中加入了附加信息,比如TOKEN位置等信息,是增强版本的gin。
create extension rum;CREATE INDEX idx_pt_wt_var_name_text_rum ON power_bpm.PT_WF_RU_VARIABLE using rum ( string_to_tsvector(var_text) rum_tsvector_ops );
Varchar 类型无法直接建立rum 索引,需要转换全文检索类型,检索是需要tsquery类型。同时需要把字符串转换成 tsvector类型
create or replace function string_to_tsvector(v text) returns tsvector as $$declarex int := 1;res text := '';i text;beginfor i in select regexp_split_to_table(v,'')loopres := res||' '||chr(92)||i||':'||x;x := x+1;end loop;return res::tsvector;end;$$ language plpgsql strict immutable;----把字符串转换成带有位置标记的ts_query( < ->符号是用来带上相邻信息的 )create or replace function string_to_tsquery(v text) returns tsquery as $$declarex int := 1;res text := '';i text;beginfor i in select regexp_split_to_table(v,'')loopif x>1 thenres := res||' <-> '||chr(92)||i; -- 这一句<->符号是用来带上相邻信息的elseres := chr(92)||i;end if;x := x+1;end loop;return res::tsquery;end;$$ language plpgsql strict immutable;
使用 '%变XXXXXXXXX票%' 查询时:需要改成
and string_to_tsvector(VAR_TEXT) @@ string_to_tsquery('变XXXXXXXXX票')
执行计划:
Index Scan using idx_pt_wt_var_name_text_rum on pt_wf_ru_variable var(cost=144.00..152.01 rows=1 width=34) (actual time=6.997..7.710 rows=399 loops=1)Index Cond: (string_to_tsvector((var_text)::text) @@ '''变'' <-> ''X'' <-> ''X'' <-> ''X'' <-> ''X'' <-> ''X种'' <-> ''X'' <-> ''X'' <-> ''票'''::tsquery)Filter: ((name)::text = 'prociName'::text)Buffers: shared hit=411Planning Time: 0.336 msExecution Time: 7.782 ms
成本:402逻辑读, 耗时7.73ms.
Rum索引支持短语检索
使用 '%XX%' 查询时:需要改成
and string_to_tsvector(VAR_TEXT) @@ string_to_tsquery('XX')
执行计划:
Index Scan using idx_pt_wt_var_name_text_rum on pt_wf_ru_variable var (cost=32.00..140.52 rows=6 width=34) (actual time=16.766..35.053 rows=24341 loops=1)Index Cond: (string_to_tsvector((var_text)::text) @@ '''X'' <-> ''X'''::tsquery)Filter: ((name)::text = 'prociName'::text)Buffers: shared hit=3857Planning Time: 0.437 msExecution Time: 36.393 ms
成本:3857逻辑读, 耗时36ms
总结
非短语检索 | 短语检索 | |
全表扫描 | 2.5万逻辑读- 329ms | 2.5万逻辑读 329ms |
Gin索引 | 290逻辑读 2ms | 未使用索引 |
Gist索引 | 1094逻辑读 10ms | 2.5万逻辑读 368ms |
Rum索引 | 402逻辑读 7.7ms. | 3857逻辑读 36ms |
通过测试性能对比,以及综合考虑数据总量,增量。以及书写SQL习惯。需要在数据表上面建立两个索引。
---符合SQL书写习惯,适合非短语检索。CREATE INDEX idx_pt_wt_var_name_text_gin ON power_bpm.PT_WF_RU_VARIABLE using gin( var_text gin_trgm_ops );---- 针对短语检索,但是需要调整检索条件书写格式。CREATE INDEX idx_pt_wt_var_name_text_rum ON power_bpm.PT_WF_RU_VARIABLE using rum ( string_to_tsvector(var_text) rum_tsvector_ops );




