Author: Yang Xiangbo
DBA at Ping An Technology and open-source PostgreSQL enthusiast. Three years of database operations experience, currently responsible for Ping An Cloud database operations, with a strong interest in SQL optimization and source-code analysis.
I. Problem Background
Recently, a postgres process in a test environment repeatedly exhausted the host's memory and triggered the OOM killer, even causing the host to reboot several times and interrupting some services. The OOM messages in /var/log/messages showed the process's anonymous memory usage had reached tens of gigabytes.
The process appeared to be running nothing more than a simple SELECT, shown below.
(For information-security reasons, the table names and other details in the SQL examples in this article come from a personal machine and have nothing to do with Ping An's business.)
select * from qsump_pacloud_oscginfo_activity_detail_info_day
where id = 'zbngjih5xd'
  and add_time >= to_timestamp('2020-01-09 00:00:00', 'yyyy-mm-dd hh24-mi-ss')
  and add_time < to_timestamp('2020-08-19 00:00:00', 'yyyy-mm-dd hh24-mi-ss');
Printing the execution plan for this SQL showed that it scanned more than 20,000 partitions; the table uses pg_pathman range partitioning. That raises two questions:
(1) Why were the matching partitions not selected, so that every partition was scanned instead?
(2) The range interval is 7 days and the table holds data from 2018 to the present. Why are there 20,000+ partitions?
For question 1, we had run into similar cases before: it is known that if the right-hand value of a predicate on the partition key is wrapped in a function expression, or in a conversion function such as to_date() or to_timestamp(), the matching partitions are not selected and all partitions are scanned. Casts of the form ::date or ::timestamp, however, do work, as the following examples show:
Using the ::timestamp cast, the plan scans only the two partitions covering the query range:

postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= '2020-01-09 00:00:00'::timestamp and add_time < '2020-01-19 00:00:00'::timestamp;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..71.00 rows=1360 width=12)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_2  (cost=0.00..35.50 rows=680 width=12)
         Filter: (add_time >= '2020-01-09 00:00:00'::timestamp without time zone)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_3  (cost=0.00..35.50 rows=680 width=12)
         Filter: (add_time < '2020-01-19 00:00:00'::timestamp without time zone)
(5 rows)
Using to_timestamp(), the plan scans all 11 partitions:

postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= to_timestamp('2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-01-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..558.80 rows=110 width=12)
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_1  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_2  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_3  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_4  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_5  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_6  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_7  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_8  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_9  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_10  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
   ->  Seq Scan on qsump_pacloud_oscginfo_activity_detail_info_day_11  (cost=0.00..50.80 rows=10 width=12)
         Filter: ((add_time >= to_timestamp('2020-01-09 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)) AND (add_time < to_timestamp('2020-01-19 00:00:00'::text, 'yyyy-mm-dd hh24-mi-ss'::text)))
(23 rows)
postgres=#
So changing to_timestamp() to ::timestamp worked around the problem at the time, but we had never dug into the cause. The rest of this article analyzes the root cause of question 1.
Now for question 2: why do 20,000+ partitions exist? Looking up the physical files via the tables' relfilenode showed that they had all been created at a particular moment the previous afternoon. The logs for that time showed that a single row had been inserted into the table, with a partition-key value far in the future. The table happens to use range auto-expansion: if an inserted row falls outside all existing partitions, new partitions are created automatically, filling in every gap along the way. That single INSERT produced the 20,000+ partitions. Truly a disaster caused by one SQL statement.
II. Partition Analysis
So why does the behavior in question 1 occur? This is where open source shows its advantages: we can find the answer in the source code ourselves.
As is well known, pg_pathman works through hooks that modify the original query tree and plan tree. The PostgreSQL source already provides entry points for such hook extensions: at startup, process_shared_preload_libraries() locates the library for each configured extension and runs its _PG_init() function, which performs some initialization and installs the extension's hook functions; when execution reaches a hook entry point, the hook is simply invoked.
How does pg_pathman determine which range partitions are needed? Through the following function:
/* Given 'value' and 'ranges', return selected partitions list */
void
select_range_partitions(const Datum value,
                        const Oid collid,
                        FmgrInfo *cmp_func,
                        const RangeEntry *ranges,
                        const int nranges,
                        const int strategy,
                        WrapperNode *result)   /* returned partitions */
{
    /* The function body is rather long and is omitted here; the gdb
       trace below walks through its general logic. */
}
After skimming the source: pg_pathman is not a lot of code, but for someone with my limited code-reading ability, the logic could not be untangled in a hurry, and I hardly knew where to start. So I chose the clumsiest approach, but the one that works best for me: tracing with gdb.
III. GDB Tracing
1. Preparation
session1: runs the SQL
session2: attaches gdb for tracing
The tables' OIDs are as follows:
  oid  |                     relname
-------+-------------------------------------------------
 16781 | qsump_pacloud_oscginfo_activity_detail_info_day      -- parent table, 11 partitions
 16863 | qsump_pacloud_oscginfo_activity_detail_info_day_2
 16869 | qsump_pacloud_oscginfo_activity_detail_info_day_3
 ....
 16917 | qsump_pacloud_oscginfo_activity_detail_info_day_11
2. Tracing the ::timestamp form of the statement
session1:
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          31698

postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= '2020-01-09 00:00:00'::timestamp and add_time < '2020-01-19 00:00:00'::timestamp;
session2:
To observe the whole flow, breakpoints were set on nearly all of pathman's hook functions as well as on the key plan-tree-generation functions. Only the trace of the range-partition-selection logic is shown here.
In the debug output, comments are marked as: ##comment##
[postgres@postgres_zabbix ~]$ gdb --pid 31698
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
This GDB was configured as "x86_64-redhat-linux-gnu".
Attaching to process 31698
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres...done.
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so...done.
##(symbol loading for system libraries omitted)##
0x00007fef912c15e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) b exec_simple_query
Breakpoint 1 at 0x7b9bc2: file postgres.c, line 867.
(gdb) b pg_plan_queries
Breakpoint 2 at 0x7b9b25: file postgres.c, line 834.
(gdb) b pathman_rel_pathlist_hook
Breakpoint 3 at 0x7fef8a86594c: file src/hooks.c, line 263.
(gdb) b pathman_join_pathlist_hook
Breakpoint 4 at 0x7fef8a8652d7: file src/hooks.c, line 79.
(gdb) b pathman_shmem_startup_hook
Breakpoint 5 at 0x7fef8a866772: file src/hooks.c, line 687.
(gdb) b pathman_post_parse_analysis_hook
Breakpoint 6 at 0x7fef8a86650b: file src/hooks.c, line 587.
(gdb) b pathman_planner_hook
Breakpoint 7 at 0x7fef8a8662d3: file src/hooks.c, line 524.
(gdb) b pathman_process_utility_hook
Breakpoint 8 at 0x7fef8a86696b: file src/hooks.c, line 795.
(gdb) b pg_plan_query
Breakpoint 9 at 0x7b9a8e: file postgres.c, line 778.
(gdb) b planner
Breakpoint 10 at 0x6fc596: file planner.c, line 175.
(gdb) b add_partition_filters
Breakpoint 11 at 0x7fef8a86a713: file src/planner_tree_modification.c, line 378.
(gdb) b partition_filter_visitor
Breakpoint 12 at 0x7fef8a86a74a: file src/planner_tree_modification.c, line 390.
(gdb) b get_pathman_relation_info
Breakpoint 13 at 0x7fef8a851c05: file src/relation_info.c, line 361.
(gdb) b cache_parent_of_partition
Breakpoint 14 at 0x7fef8a852e6d: file src/relation_info.c, line 1015.
(gdb) b handle_modification_query
Breakpoint 15 at 0x7fef8a86a3bf: file src/planner_tree_modification.c, line 255.
(gdb) b select_range_partitions
Breakpoint 16 at 0x7fef8a85940d: file src/pg_pathman.c, line 531.
(gdb) b walk_expr_tree
Breakpoint 17 at 0x7fef8a85992d: file src/pg_pathman.c, line 717.
(gdb) b handle_opexpr
Breakpoint 18 at 0x7fef8a85ab0e: file src/pg_pathman.c, line 1317.
(gdb) b IsConstValue
Breakpoint 19 at 0x7fef8a858811: file src/pg_pathman.c, line 114.
(gdb) n
(gdb) set print pretty
The key call path into select_range_partitions:
(gdb) list      ##print the handle_opexpr function body##
1328                    int                     strategy;
1329
1330                    tce = lookup_type_cache(prel->ev_type, TYPECACHE_BTREE_OPFAMILY);
1331                    strategy = get_op_opfamily_strategy(expr->opno, tce->btree_opf);
1332
##when IsConstValue is true, handle_const is called##
1333                    if (IsConstValue(param, context))
1334                    {
1335                            handle_const(ExtractConst(param, context),
1336                                         expr->inputcollid,
1337                                         strategy, context, result);
(gdb) n

Breakpoint 19, IsConstValue (node=0x20682f0, context=0x7ffe113be790) at src/pg_pathman.c:114
##hit breakpoint 19: check whether the operands of >= and < in our SQL are of type T_Const##
114             switch (nodeTag(node))
(gdb) p *node   ##the expr we passed in has node_type T_Const##
$4 = {type = T_Const}
(gdb) list      ##print the function body##
109
110      * Can we transform this node into a Const? */
111     static bool
112     IsConstValue(Node *node, const WalkerContext *context)
113     {
114             switch (nodeTag(node))
115             {       ##returns true when the type is T_Const##
116                     case T_Const:
117                             return true;
118
(gdb) list
119                     case T_Param:
120                             return WcxtHasExprContext(context);
121
122                     case T_RowExpr:
123                             {
124                                     RowExpr    *row = (RowExpr *) node;
125                                     ListCell   *lc;
126
127                                      * Can't do anything about RECORD of wrong type */
128                                     if (row->row_typeid != context->prel->ev_type)
(gdb) n
117                             return true;
(gdb)
141     }
(gdb)   ##the function returned true; we enter handle_const, which is about to call select_range_partitions##
handle_opexpr (expr=0x2068360, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:1335
1335                            handle_const(ExtractConst(param, context),
(gdb)   ##now inside select_range_partitions##
Breakpoint 15, select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4, result=0x2070a10) at src/pg_pathman.c:531
(gdb) bt        ##print the stack to see the call chain##
#0  select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4, result=0x2070a10) at src/pg_pathman.c:540
#1  0x00007fef8a859ea3 in handle_const (c=0x20682f0, collid=0, strategy=4, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:929
#2  0x00007fef8a85abd7 in handle_opexpr (expr=0x2068360, context=0x7ffe113be790, result=0x2070a10) at src/pg_pathman.c:1335
#3  0x00007fef8a8599d4 in walk_expr_tree (expr=0x2068360, context=0x7ffe113be790) at src/pg_pathman.c:734
#4  0x00007fef8a865bfd in pathman_rel_pathlist_hook (root=0x2067e40, rel=0x206fcc8, rti=1, rte=0x20674b8) at src/hooks.c:345
##(lower stack frames omitted)##
The partition-selection logic inside select_range_partitions:
(gdb)
##'value' here is the left end of our query range, i.e. the >= value '2020-01-09 00:00:00'; note nranges=11, i.e. there are 11 partitions##
Breakpoint 15, select_range_partitions (value=631843200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=4, result=0x2070a10) at src/pg_pathman.c:531
531             bool    lossy = false,
(gdb)
540             int     startidx = 0,
(gdb)
541                     endidx = nranges - 1,
(gdb)
546             Bound   value_bound = MakeBound(value);  * convert value to Bound */
(gdb)
550             result->found_gap = false;
(gdb) p *cmp_func       ##the comparison function used is timestamp_cmp##
$40 = {
  fn_addr = 0x8b2ffa <timestamp_cmp>,
  fn_oid = 2045,
  fn_nargs = 2,
  fn_strict = 1 '\001',
  fn_retset = 0 '\000',
  fn_stats = 2 '\002',
  fn_extra = 0x0,
  fn_mcxt = 0x1f82948,
  fn_expr = 0x0
}
(gdb)
553             if (nranges == 0)
(gdb)
##cmp_func is timestamp_cmp, whose return value is: return (dt1 < dt2) ? -1 : ((dt1 > dt2) ? 1 : 0);##
##cmp_bounds is a callback that uses cmp_func to compare value_bound (the query constants, i.e. the >= and < values in our SQL) against each partition's range_min and range_max##
##cmp_min = -1 means value_bound is less than ranges[i].min##
##cmp_max = -1 means value_bound is less than ranges[i].max; together the two determine whether ranges[i] is the partition we are looking for##
##a neat trick here: first compare against ranges[startidx].min, the first partition's lower bound, and ranges[endidx].max, the last partition's upper bound, to check whether the value falls inside the overall partitioned range at all; if not, return immediately, otherwise keep searching##
566             cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[startidx].min);
(gdb)
567             cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[endidx].max);
(gdb)
569             if ((cmp_min <= 0 && strategy == BTLessStrategyNumber) ||
(gdb) p nranges
$35 = 11        ##11 partitions in total##
(gdb) p value_bound     ##the current constant is the value after >= in the SQL##
$36 = {
  value = 631843200000000,
  is_infinite = 0 '\000'
}
(gdb) p ranges[startidx]        ##ranges[0] is the first partition##
$37 = {
  child_oid = 16857,
  min = {
    value = 631152000000000,
    is_infinite = 0 '\000'
  },
  max = {
    value = 631756800000000,
    is_infinite = 0 '\000'
  }
}
(gdb) n
(gdb)
646                     else if (is_greater)
(gdb)
647                             startidx = i + 1;
(gdb)
651             }
(gdb)   ##here the partition indexes come into play: essentially a binary search over the partition index##
611                     i = startidx + (endidx - startidx) / 2;
(gdb)
615                     cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].min);
(gdb)
616                     cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].max);
(gdb)   ##if is_less is false, the left value (the >= constant) is not below the current partition's lower bound range_min##
618                     is_less = (cmp_min < 0 || (cmp_min == 0 && strategy == BTLessStrategyNumber));
(gdb)   ##if is_greater is false, the left value (the >= constant) is below the current partition's upper bound range_max##
619                     is_greater = (cmp_max > 0 || (cmp_max >= 0 && strategy != BTLessStrategyNumber));
(gdb)   ##the if condition is true: the partition containing the query's lower bound has been found##
621                     if (!is_less && !is_greater)
(gdb)
623                             if (strategy == BTGreaterEqualStrategyNumber && cmp_min == 0)
(gdb)
625                             else if (strategy == BTLessStrategyNumber && cmp_max == 0)
(gdb)
628                                     lossy = true;
(gdb) p ranges[i]
$38 = {
  child_oid = 16863,    ##the query's lower bound falls in partition oid=16863, i.e. qsump_pacloud_oscginfo_activity_detail_info_day_2##
  min = {
    value = 631756800000000,
    is_infinite = 0 '\000'
  },
  max = {
    value = 632361600000000,
    is_infinite = 0 '\000'
  }
}
(gdb) n
633                             break;
(gdb)
657             switch(strategy)
(gdb)
680                     if (lossy)
(gdb)
682                             result->rangeset = list_make1_irange(make_irange(i, i, IR_LOSSY));
(gdb) n
683                             if (i < nranges - 1)
(gdb)   ##append the matched lower-bound partition to the result->rangeset node##
685                                     lappend_irange(result->rangeset,
(gdb)
684                             result->rangeset =
(gdb)
697                     break;
(gdb)
##now match the right bound; 'value' is the < constant in the SQL, '2020-01-19 00:00:00'##
Breakpoint 20, select_range_partitions (value=632707200000000, collid=0, cmp_func=0x7ffe113be650, ranges=0x1ff0040, nranges=11, strategy=1, result=0x2036d30) at src/pg_pathman.c:531
531             bool    lossy = false,
(gdb) n
611                     i = startidx + (endidx - startidx) / 2;
(gdb)
615                     cmp_min = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].min);
(gdb)
616                     cmp_max = cmp_bounds(cmp_func, collid, &value_bound, &ranges[i].max);
(gdb)   ##if is_less is false, the value (the < constant) is not below the current partition's lower bound range_min##
618                     is_less = (cmp_min < 0 || (cmp_min == 0 && strategy == BTLessStrategyNumber));
(gdb)   ##if is_greater is false, the value (the < constant) is below the current partition's upper bound range_max##
619                     is_greater = (cmp_max > 0 || (cmp_max >= 0 && strategy != BTLessStrategyNumber));
(gdb) p is_less
$45 = 0 '\000'
(gdb) p is_greater
$46 = 0 '\000'
(gdb)   ##the if condition is true: the partition containing the query's upper bound has been found##
621                     if (!is_less && !is_greater)
(gdb)
623                             if (strategy == BTGreaterEqualStrategyNumber && cmp_min == 0)
(gdb)
625                             else if (strategy == BTLessStrategyNumber && cmp_max == 0)
(gdb)
628                                     lossy = true;
(gdb) p i
$44 = 2
(gdb) p ranges[i]       ##the upper bound falls in partition oid=16869, i.e. qsump_pacloud_oscginfo_activity_detail_info_day_3##
$43 = {
  child_oid = 16869,
  min = {
    value = 632361600000000,
    is_infinite = 0 '\000'
  },
  max = {
    value = 632966400000000,
    is_infinite = 0 '\000'
  }
}
633                             break;
(gdb)
657             switch(strategy)
(gdb)
661                     if (lossy)
(gdb)   ##append the matched upper-bound partition to the result->rangeset node##
663                             result->rangeset = list_make1_irange(make_irange(i, i, IR_LOSSY));
(gdb)
664                     if (i > 0)
##at this point all partitions needed by the query have been matched##
(gdb)
384             parent_rel = heap_open(rte->relid, NoLock);
(gdb)
387             if (prel->enable_parent)
(gdb)
393             foreach(lc, ranges)
(gdb)
395                     IndexRange      irange = lfirst_irange(lc);
(gdb)
397                     for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                             append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                     for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                             append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
(gdb)
397                     for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
393             foreach(lc, ranges)
(gdb) p i
$49 = 3
(gdb) p children[2]
$51 = 16869
(gdb) p children[1]
$52 = 16863
pg_plan_query (querytree=0x1fefd80, cursorOptions=256, boundParams=0x0) at postgres.c:792
792             if (log_planner_stats)
(gdb)   ##if the configuration parameter debug_print_plan is on, the full plantree is written to the log##
817             if (Debug_print_plan)
(gdb)
818                     elog_node_display(LOG, "plan", plan, Debug_pretty_print);
(gdb) p *plan
$59 = {
  type = T_PlannedStmt,
  commandType = CMD_SELECT,
  queryId = 0,
  hasReturning = 0 '\000',
  hasModifyingCTE = 0 '\000',
  canSetTag = 1 '\001',
  transientPlan = 0 '\000',
  dependsOnRole = 0 '\000',
  parallelModeNeeded = 0 '\000',
  planTree = 0x206fde8,
  rtable = 0x20700b8,
  resultRelations = 0x0,
  utilityStmt = 0x0,
  subplans = 0x0,
  rewindPlanIDs = 0x0,
  rowMarks = 0x0,
  relationOids = 0x2070108,
  invalItems = 0x0,
  nParamExec = 0
}
##the final plan->relationOids list contains three nodes: the parent table and the two partitions, matching the "good" execution plan we saw earlier##
(gdb) p *plan->relationOids
$60 = {
  type = T_OidList,
  length = 3,
  head = 0x20700e8,
  tail = 0x2070428
}
(gdb) p *plan->relationOids->head       ##the parent table##
$61 = {
  data = {
    ptr_value = 0x418d,
    int_value = 16781,
    oid_value = 16781
  },
  next = 0x20702d8
}
##partition qsump_pacloud_oscginfo_activity_detail_info_day_2##
(gdb) p *plan->relationOids->head->next
$62 = {
  data = {
    ptr_value = 0x41df,
    int_value = 16863,
    oid_value = 16863
  },
  next = 0x2070428
}
##partition qsump_pacloud_oscginfo_activity_detail_info_day_3##
(gdb) p *plan->relationOids->tail
$63 = {
  data = {
    ptr_value = 0x41e5,
    int_value = 16869,
    oid_value = 16869
  },
  next = 0x0
}
(gdb) n
822             return plan;
(gdb)
823     }
##end of trace##
3. Tracing the to_timestamp() form of the statement
session1:

postgres=# explain select * from qsump_pacloud_oscginfo_activity_detail_info_day where add_time >= to_timestamp('2020-01-09 00:00:00','yyyy-mm-dd hh24-mi-ss') and add_time < to_timestamp('2020-01-19 00:00:00','yyyy-mm-dd hh24-mi-ss');
session2:
##unlike before, where node_type was T_Const, it is now T_FuncExpr##
Breakpoint 20, IsConstValue (node=0x2071350, context=0x7ffe113be790) at src/pg_pathman.c:114
114             switch (nodeTag(node))
(gdb) p *node
$2 = {type = T_FuncExpr}
(gdb) list
109
110     /* Can we transform this node into a Const? */
111     static bool
112     IsConstValue(Node *node, const WalkerContext *context)
113     {
114             switch (nodeTag(node))
115             {
116                     case T_Const:
117                             return true;
118
(gdb)
119                     case T_Param:
120                             return WcxtHasExprContext(context);
121
122                     case T_RowExpr:
123                             {
124                                     RowExpr    *row = (RowExpr *) node;
125                                     ListCell   *lc;
126
127                                     /* Can't do anything about RECORD of wrong type */
128                                     if (row->row_typeid != context->prel->ev_type)
(gdb)
129                                             return false;
130
131                                     /* Check that args are const values */
132                                     foreach (lc, row->args)
133                                             if (!IsConstValue((Node *) lfirst(lc), context))
134                                                     return false;
135                             }
136                             return true;
137
##IsConstValue has no case branch that handles T_FuncExpr, so it falls through to default and returns false##
138                     default:
(gdb)
139                             return false;
140             }
141     }
142
143     /* Extract a Const from node that has been checked by IsConstValue() */
144     static Const *
145     ExtractConst(Node *node, const WalkerContext *context)
146     {
147             ExprState  *estate;
148             ExprContext *econtext = context->econtext;
(gdb) n
139                             return false;
(gdb)
141     }
(gdb)
handle_opexpr (expr=0x20713c0, context=0x7ffe113be790, result=0x2079e40) at src/pg_pathman.c:1342
1342            else if (IsA(param, Param) || IsA(param, Var))
(gdb) n
##because IsConstValue returned false, handle_const never runs and select_range_partitions is never called to pick partitions; instead, ALL partitions are appended to the result->rangeset node##
1352            result->rangeset = list_make1_irange_full(prel, IR_LOSSY);
(gdb) list
1347                            return; /* done, exit */
1348                    }
1349            }
1350    }
1351
1352            result->rangeset = list_make1_irange_full(prel, IR_LOSSY);
1353            result->paramsel = 1.0;
1354    }
1355
1356
(gdb) n
1353            result->paramsel = 1.0;
(gdb)
1354    }
(gdb)
Every partition is appended to the root node, and our WHERE condition is treated merely as a filter; no partitions are selected by the condition beforehand.
##all 11 partitions are appended to the root node##
(gdb)
384             parent_rel = heap_open(rte->relid, NoLock);
(gdb)
387             if (prel->enable_parent)
(gdb)
393             foreach(lc, ranges)
(gdb)
395                     IndexRange      irange = lfirst_irange(lc);
(gdb)
397                     for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
398                             append_child_relation(root, parent_rel, rti, i, children[i], wrappers);
##(the 397/398 pair above repeats 11 times, once per partition)##
(gdb)
397                     for (i = irange_lower(irange); i <= irange_upper(irange); i++)
(gdb)
393             foreach(lc, ranges)
(gdb) p i
$3 = 11
Now look directly at the final plan tree:
(gdb)
pg_plan_query (querytree=0x1fefd80, cursorOptions=256, boundParams=0x0) at postgres.c:792
792             if (log_planner_stats)
(gdb)
817             if (Debug_print_plan)
(gdb)
818                     elog_node_display(LOG, "plan", plan, Debug_pretty_print);
(gdb) p *plan
$5 = {
  type = T_PlannedStmt,
  commandType = CMD_SELECT,
  queryId = 0,
  hasReturning = 0 '\000',
  hasModifyingCTE = 0 '\000',
  canSetTag = 1 '\001',
  transientPlan = 0 '\000',
  dependsOnRole = 0 '\000',
  parallelModeNeeded = 0 '\000',
  planTree = 0x208f028,
  rtable = 0x208f2f8,
  resultRelations = 0x0,
  utilityStmt = 0x0,
  subplans = 0x0,
  rewindPlanIDs = 0x0,
  rowMarks = 0x0,
  relationOids = 0x208f348,
  invalItems = 0x0,
  nParamExec = 0
}
##this time the plan->relationOids list contains 12 nodes: the parent table plus all 11 partitions##
(gdb) p *plan->relationOids
$4 = {
  type = T_OidList,
  length = 12,
  head = 0x2091338,
  tail = 0x2092218
}
(gdb) p *plan->relationOids->head       ##the parent table##
$5 = {
  data = {
    ptr_value = 0x418d,
    int_value = 16781,
    oid_value = 16781
  },
  next = 0x20914b8
}
##the 11th partition, qsump_pacloud_oscginfo_activity_detail_info_day_11##
(gdb) p *plan->relationOids->tail
$6 = {
  data = {
    ptr_value = 0x4215,
    int_value = 16917,
    oid_value = 16917
  },
  next = 0x0
}
(gdb)
IV. The Fix
1. Fix ideas
The gdb trace makes the cause clear: IsConstValue has no case branch for T_FuncExpr, so a T_FuncExpr node falls through to default and no partition pruning takes place.
So could the fix be one of the following?
(1) Add a T_FuncExpr case branch to IsConstValue so that partition selection can happen;
(2) Skimming the core PostgreSQL source shows it already handles such cases in several places, for example by simplifying a T_FuncExpr into a simple expression. Could pg_pathman likewise preprocess the node type, folding a T_FuncExpr into a T_Const?
Testing showed that the latest pg_pathman release at the time, 1.5.11, has the same problem, whereas PostgreSQL's native declarative partitioning handles both forms without trouble. (Note that to_timestamp() is only marked STABLE, not IMMUTABLE, which is why native partitioning relies on pruning at executor startup rather than at plan time for such predicates.)
/* Can we transform this node into a Const? */
static bool
IsConstValue(Node *node, const WalkerContext *context)
{
    switch (nodeTag(node))
    {
        case T_Const:
            return true;

        case T_Param:
            return WcxtHasExprContext(context);

        case T_RowExpr:
            {
                RowExpr    *row = (RowExpr *) node;
                ListCell   *lc;

                /* Can't do anything about RECORD of wrong type */
                if (row->row_typeid != context->prel->ev_type)
                    return false;

                /* Check that args are const values */
                foreach (lc, row->args)
                    if (!IsConstValue((Node *) lfirst(lc), context))
                        return false;
            }
            return true;

        default:
            return false;
    }
}
2. Is a fix necessary?
That is a matter of perspective. In my view the current behavior is somewhat unfriendly to application developers, who are restricted to casts of the ::date style.
I consulted experts from the community and learned that native declarative partitioning in PostgreSQL 12 already rivals pg_pathman in performance, with almost nothing between them. So for version 12 and later, consider preferring declarative partitioning and dropping the extension. For versions between 9.5 and 11, whose built-in partitioning performs poorly, pg_pathman may still be needed.





