暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

Pg中模糊查询摸索与思考

godba 2024-04-22
236

发现隐患SQL

之前阅读德哥文章,本文正好实践下 

https://github.com/digoal/blog/blob/master/201612/20161231_01.md

某次巡检发现隐患SQL

    SELECT name,  VAR_TEXT   FROM  PT_WF_RU_VARIABLE var 
    Where 1=1 and var.NAME ='prociName'
    and VAR_TEXT like '%变XXXXXXXX票%';

    执行成本较高,经过了解其中prociName 比较固定,而值:变XXXXXXX票是用户输入的值,该值用户比较关心,用户查询比较频繁,用户也会查询其他关键字,但是总的来说返回的数据量不多。

      Gather (cost=1000.00..31695.89 rows=4 width=34) (actual time=27.070..329.595 rows=399 loops=1)
      Workers Planned: 3
      Workers Launched: 0
      Buffers: shared hit=25730
      -> Parallel Seq Scan on pt_wf_ru_variable var (cost=0.00..30695.49 rows=1 width=34) (actual time=26.625..329.001 rows=399 loops=1)
      Filter: (((var_text)::text ~~ '%变XXXXXXXX票%'::text) AND ((name)::text = 'prociName'::text))
      Rows Removed by Filter: 1025572
      Buffers: shared hit=25730
      Planning Time: 0.287 ms
      Execution Time: 329.657 ms

      2.5万逻辑读, 耗时420ms. 表数据总量102W,其中能匹配上'%变XXXXXX票%'只有399条数据。 

      优化方案

      因为最终返回的结果不多,可以考虑索引扫描。而pg中只有全文索引才能高效查询关键字。

      创建gin 索引

      GIN(Generalized Inverted Index, 通用倒排索引) 是一个存储对(key, posting list)集合的索引结构适用于数组、全文检索等数据类型,可以扩展支持更多的数据类型。添加extension。

        Create extension  if not exists  pg_trgm;
        create extension if not EXISTS btree_gin;
        -- 新建gin索引
        CREATE INDEX idx_pt_wt_var_name_text_gin ON power_bpm.PT_WF_RU_VARIABLE using gin( var_text gin_trgm_ops );

        执行计划

          Bitmap Heap Scan on pt_wf_ru_variable var  (cost=60.13..127.07 rows=4 width=34) (actual time=0.715..1.897 rows=399 loops=1)
          Recheck Cond: ((var_text)::text ~~ '%变XXXXXXXX票%'::text)
          Filter: ((name)::text = 'prociName'::text)
          Heap Blocks: exact=272
          Buffers: shared hit=290
          -> Bitmap Index Scan on idx_pt_wt_var_name_text_gin (cost=0.00..60.13 rows=17 width=0) (actual time=0.618..0.619 rows=399 loops=1)
          Index Cond: ((var_text)::text ~~ '%变XXXXXXXX票%'::text)
          Buffers: shared hit=18
          Planning Time: 0.378 ms
          Execution Time: 2.000 ms

          成本:290逻辑读, 耗时2ms.

          短语无法使用gin索引

          饶有兴致尝试其他频繁查询的值,发现典型问题:VAR_TEXT  like  '%XX%' 不能使用gin索引。 XX表示只有2个关键字
            Gather  (cost=1000.00..32260.48 rows=5661 width=34) (actual time=5.015..344.088 rows=24341 loops=1)
            Workers Planned: 3
            Workers Launched: 0
            Buffers: shared hit=25730
            -> Parallel Seq Scan on pt_wf_ru_variable var (cost=0.00..30694.38 rows=1826 width=34) (actual time=0.191..337.010 rows=24341 loops=1)
            Filter: (((var_text)::text ~~ '%XX%'::text) AND ((name)::text = 'prociName'::text))
            Rows Removed by Filter: 1001630
            Buffers: shared hit=25730
            Planning Time: 0.432 ms
            Execution Time: 345.292 ms
            从数据量看2%左右结果数据,是符合走索引条件的,但是 pg_trgm以3个连续的字符作为token. 当查询词是2个字符时,检索效果不好,不能使用索引。

            创建gist 索引

            Gist(Generalized Search Tree 归纳树) 用于解决一些b-tree, gin难以解决的数据减少问题.GiST具有数据聚集的功能,聚集后,在单个组内包含的KEY+HEAP行号会放到单个INDEX PAGE中存储在GiSTentry 中,便于检索

              CREATE INDEX idx_pt_wt_var_name_text_gist ON power_bpm.PT_WF_RU_VARIABLE  using gist( var_text  gist_trgm_ops);

              使用 '%变XXXXXXX票%' 查询时:执行计划:

                Bitmap Heap Scan on pt_wf_ru_variable var  (cost=4.54..71.48 rows=4 width=34) (actual time=9.430..10.395 rows=399 loops=1)
                Recheck Cond: ((var_text)::text ~~ '%变XXXXXXXX票%'::text)
                Filter: ((name)::text = 'prociName'::text)
                Heap Blocks: exact=272
                Buffers: shared hit=1094
                -> Bitmap Index Scan on idx_pt_wt_var_name_text_gist (cost=0.00..4.54 rows=17 width=0) (actual time=9.344..9.345 rows=399 loops=1)
                Index Cond: ((var_text)::text ~~ '%变XXXXXXXX票%'::text)
                Buffers: shared hit=822
                Planning Time: 0.323 ms
                Execution Time: 10.504 ms

                使用索引,逻辑读1094 耗时10ms并且对比全表扫描,性能提升20倍左右

                短语带来大量的recheck removed

                使用 '%XX%' 查询时:

                  Bitmap Heap Scan on pt_wf_ru_variable var  (cost=13291.07..26964.26 rows=5661 width=34) (actual time=203.993..367.774 rows=24341 loops=1)
                  Recheck Cond: (((var_text)::text ~~ '%XX%'::text) AND ((name)::text = 'prociName'::text))
                  Rows Removed by Index Recheck: 216922
                  Heap Blocks: exact=25464
                  Buffers: shared hit=45874
                  -> BitmapAnd (cost=13291.07..13291.07 rows=5661 width=0) (actual time=199.172..199.173 rows=0 loops=1)
                  Buffers: shared hit=20410
                  -> Bitmap Index Scan on idx_pt_wt_var_name_text_gist (cost=0.00..1776.85 rows=24059 width=0) (actual time=164.508..164.508 rows=985082 loops=1)
                  Index Cond: ((var_text)::text ~~ '%XX%'::text)
                  Buffers: shared hit=16760
                  -> Bitmap Index Scan on pt_wf_idx_variable_nametext (cost=0.00..11511.13 rows=241411 width=0) (actual time=33.550..33.551 rows=261285 loops=1)
                  Index Cond: ((name)::text = 'prociName'::text)
                  Buffers: shared hit=3650
                  Planning Time: 0.320 ms
                  Execution Time: 368.991 ms

                  分析:可以使用gist 索引,但是recheck非常严重,导致执行成本过高, 2.5万逻辑读 368ms。第一次过滤后又近100W数据,大量的数据在最后recheckremoved。因为第一次过滤时token命中的数据组合后的block都符合条件导致。

                  创建rum索引

                  RUM 参考了GIN的实现,并改进了GIN在全文检索时的一些弱点,比如在INDEX 中加入了附加信息,比如TOKEN位置等信息,是增强版本的gin。

                    create extension  rum;
                    CREATE INDEX idx_pt_wt_var_name_text_rum ON power_bpm.PT_WF_RU_VARIABLE using rum ( string_to_tsvector(var_text) rum_tsvector_ops );

                    Varchar 类型无法直接建立rum 索引,需要转换全文检索类型,检索是需要tsquery类型。同时需要把字符串转换成 tsvector类型

                      create or replace function string_to_tsvector(v text) returns tsvector as $$  
                      declare
                      x int := 1;
                      res text := '';
                      i text;
                      begin
                      for i in select regexp_split_to_table(v,'')
                      loop
                      res := res||' '||chr(92)||i||':'||x;
                      x := x+1;
                      end loop;
                      return res::tsvector;
                      end;
                      $$ language plpgsql strict immutable;
                      ----把字符串转换成带有位置标记的ts_query( < ->符号是用来带上相邻信息的 )
                      create or replace function string_to_tsquery(v text) returns tsquery as $$
                      declare
                      x int := 1;
                      res text := '';
                      i text;
                      begin
                      for i in select regexp_split_to_table(v,'')
                      loop
                      if x>1 then
                      res := res||' <-> '||chr(92)||i; -- 这一句<->符号是用来带上相邻信息的
                      else
                      res := chr(92)||i;
                      end if;
                      x := x+1;
                      end loop;
                      return res::tsquery;
                      end;
                      $$ language plpgsql strict immutable;

                      使用 '%变XXXXXXXXX票%' 查询时:需要改成

                        and string_to_tsvector(VAR_TEXT) @@ string_to_tsquery('变XXXXXXXXX票')

                        执行计划:

                          Index Scan using idx_pt_wt_var_name_text_rum on pt_wf_ru_variable var
                          (cost=144.00..152.01 rows=1 width=34) (actual time=6.997..7.710 rows=399 loops=1)
                          Index Cond: (string_to_tsvector((var_text)::text) @@ '''变'' <-> ''X'' <-> ''X'' <-> ''X'' <-> ''X'' <-> ''X种'' <-> ''X'' <-> ''X'' <-> ''票'''::tsquery)
                          Filter: ((name)::text = 'prociName'::text)
                          Buffers: shared hit=411
                          Planning Time: 0.336 ms
                          Execution Time: 7.782 ms

                          成本:402逻辑读, 耗时7.73ms.

                          Rum索引支持短语检索

                          使用 '%XX%' 查询时:需要改成

                            and string_to_tsvector(VAR_TEXT) @@ string_to_tsquery('XX')

                            执行计划:

                              Index Scan using idx_pt_wt_var_name_text_rum on pt_wf_ru_variable var (cost=32.00..140.52 rows=6 width=34) (actual time=16.766..35.053 rows=24341 loops=1)
                              Index Cond: (string_to_tsvector((var_text)::text) @@ '''X'' <-> ''X'''::tsquery)
                              Filter: ((name)::text = 'prociName'::text)
                              Buffers: shared hit=3857
                              Planning Time: 0.437 ms
                              Execution Time: 36.393 ms

                              成本:3857逻辑读, 耗时36ms

                              总结


                              非短语检索

                              短语检索

                              全表扫描

                              2.5万逻辑读- 329ms

                              2.5万逻辑读 329ms

                              Gin索引

                              290逻辑读 2ms

                              未使用索引

                              Gist索引

                              1094逻辑读 10ms

                               2.5万逻辑读 368ms

                              Rum索引

                              402逻辑读 7.7ms.

                              3857逻辑读 36ms

                              通过测试性能对比,以及综合考虑数据总量,增量。以及书写SQL习惯。需要在数据表上面建立两个索引。

                                ---符合SQL书写习惯,适合非短语检索。
                                CREATE INDEX idx_pt_wt_var_name_text_gin ON power_bpm.PT_WF_RU_VARIABLE using gin( var_text gin_trgm_ops );
                                ---- 针对短语检索,但是需要调整检索条件书写格式。
                                CREATE INDEX idx_pt_wt_var_name_text_rum ON power_bpm.PT_WF_RU_VARIABLE using rum ( string_to_tsvector(var_text) rum_tsvector_ops );


                                文章转载自godba,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

                                评论