较少的分区也报错too many range table entries

原创 liuzhilong62 2023-07-02

275

问题现象

postgresql中update执行语句报错too many range table entries
源sql

with t as (select id from LZLTAB where id=8723 limit 100 )
update	LZLTAB set
	STATUS = '00',
	FILE_ID = null,
	DATE_UPDATED = localtimestamp(0)
where id in (select	id from t)

如果把update改写成select，可以执行成功

with t as (select	id from	LZLTAB where	id=8723 limit 100 )
select * from 	LZLTAB where	id in (select id	from t)
  id  |        date_created                                               
------+----------------------------+...
 8723 | 2023-06-21 18:02:21.161687 
(1 row)

主键和分区是这样，分区总共有400个

Partition key: RANGE (partition_key)
Indexes:
    "pk_lzl" PRIMARY KEY, btree (id, partition_key)
...
Partitions: lzl_p20230601 FOR VALUES FROM ('20230601') TO ('20230602'),
            lzl_p20230602 FOR VALUES FROM ('20230602') TO ('20230603'),
            lzl_p20230603 FOR VALUES FROM ('20230603') TO ('20230604')

sql逻辑有很多优化点，但是这里不讨论优化。重点是分析为什么update会报错，和为什么select和update会有区别。
执行explain update报错如下：

explain with t as (selec tid from LZLTAB where id=8723 limit 100 )
update LZLTAB set
STATUS = '00',
FILE_ID = null,
DATE_UPDATED = localtimestamp(0)
where id in (select id from t);

ERROR:  54000: too many range table entries
LOCATION:  add_rte_to_flat_rtable, setrefs.c:451
Time: 18341.171 ms (00:18.341)

explain卡了18秒，然后报错

源码分析

报错直接抛出了源码的位置LOCATION: add_rte_to_flat_rtable, setrefs.c:451，直接找到该源码

src/backend/optimizer/plan/setrefs.c
其注释是说setrefs.c是完成计划树后的相关工作的

/*
 *Post-processing of a completed plan tree: fix references to subplan
 *	  vars, compute regproc values for operators, etc
 */

找到第451行的函数：

/*
 * Add (a copy of) the given RTE to the final rangetable
 *
 * In the flat rangetable, we zero out substructure pointers that are not
 * needed by the executor; this reduces the storage space and copying cost
 * for cached plans.  We keep only the ctename, alias and eref Alias fields,
 * which are needed by EXPLAIN, and the selectedCols, insertedCols,
 * updatedCols, and extraUpdatedCols bitmaps, which are needed for
 * executor-startup permissions checking and for trigger event checking.
 */
static void
add_rte_to_flat_rtable(PlannerGlobal *glob, RangeTblEntry *rte)
{
...
	/*
	 * Check for RT index overflow; it's very unlikely, but if it did happen,
	 * the executor would get confused by varnos that match the special varno
	 * values.
	 */
	if (IS_SPECIAL_VARNO(list_length(glob->finalrtable)))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("too many range table entries")));
...
}

errmsg()就是第451行。add_rte_to_flat_rtable()看注释跟RTE有关，什么RTE呢？下面再分析。
报错的判断是IS_SPECIAL_VARNO()，直接搜索该函数，在src/include/nodes/primnodes.h中找到宏定义

/*
 * Var - expression node representing a variable (ie, a table column)
 *
 * In the parser and planner, varno and varattno identify the semantic
 * referent, which is a base-relation column unless the reference is to a join
 * USING column that isn't semantically equivalent to either join input column
 * (because it is a FULL join or the input column requires a type coercion).
 * In those cases varno and varattno refer to the JOIN RTE.  (Early in the
 * planner, we replace such join references by the implied expression; but up
 * till then we want join reference Vars to keep their original identity for
 * query-printing purposes.)
 ...
 */
#define    INNER_VAR		65000	/* reference to inner subplan */
#define    OUTER_VAR		65001	/* reference to outer subplan */
#define    INDEX_VAR		65002	/* reference to index column */

#define IS_SPECIAL_VARNO(varno)		((varno) >= INNER_VAR)

上面一段注释有点难懂，但是有段话很重要：In those cases varno and varattno refer to the JOIN RTE。varno与RTE有关系。
然后varno>=65000时，会抛出报错。（这里不扩展INNER_VAR,OUTER_VAR,INDEX_VAR的区别，因为他们的值差别不大，不影响我们继续分析）

什么是RTE？
在执行计划源码中的各个位置都能找到RTE（rangetable或RangeTblEntry）的描述，并且报错也很明显ERROR: 54000: too many range table entries也是说的RTE。那么什么是RTE？
在src/include/nodes/parsenodes.h中有一段对RTE的描述

/*--------------------
 * RangeTblEntry -
 *	  A range table is a List of RangeTblEntry nodes.
 *
 *	  A range table entry may represent a plain relation, a sub-select in
 *	  FROM, or the result of a JOIN clause.  (Only explicit JOIN syntax
 *	  produces an RTE, not the implicit join resulting from multiple FROM
 *	  items.  This is because we only need the RTE to deal with SQL features
 *	  like outer joins and join-output-column aliasing.)  Other special
 *	  RTE types also exist, as indicated by RTEKind.
 *
 *	  Note that we consider RTE_RELATION to cover anything that has a pg_class
 *	  entry.  relkind distinguishes the sub-cases.
 */

简单的说，RTE是执行计划中的“表”，可以是具体的表也可以是生成类的“表”，比如子查询、join结果等等。RTE超出限制65000，也就是说执行计划中生成了太多的RTE。

查看update的执行计划

由于我们知道了RTE的啥，所以查看sql的执行计划可能会有所帮助。但是由于源sql（400个分区）没有生成执行计划，我们创建一个30个分区的表，期待它能explain出来，然后观察它的执行计划。
30个分区的表执行相同的update语句

explain with t as (select id from lzl where id=8723 limit 100 )
update lzl set
STATUS = '00',
FILE_ID = null,
DATE_UPDATED = localtimestamp(0)
where id in ( select id from t);

生成的执行计划如下

                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Update on lzl  (cost=155.48..4980.00 rows=600 width=3042)
   Update on lzl_p20230601 lzl_1
   Update on lzl_p20230602 lzl_2
	...
   Update on lzl_p20230630 lzl_30
   ->  Hash Semi Join  (cost=155.48..166.00 rows=20 width=3042)
         Hash Cond: (lzl_1.id = t.id)
         ->  Seq Scan on lzl_p20230601 lzl_1  (cost=0.00..10.20 rows=20 width=2912)
         ->  Hash  (cost=155.10..155.10 rows=30 width=40)
               ->  Subquery Scan on t  (cost=0.14..155.10 rows=30 width=40)
                     ->  Limit  (cost=0.14..154.80 rows=30 width=8)
                           ->  Append  (cost=0.14..154.80 rows=30 width=8)
                                 ->  Index Only Scan using lzl_p20230601_pkey on lzl_p20230601 lzl_32  (cost=0.14..5.16 rows=1 width=8)
                                       Index Cond: (id = 8723)
                                 ->  Index Only Scan using lzl_p20230602_pkey on lzl_p20230602 lzl_33  (cost=0.14..5.16 rows=1 width=8)
                                       Index Cond: (id = 8723)
                                ...
                                 ->  Index Only Scan using lzl_p20230630_pkey on lzl_p20230630 lzl_61  (cost=0.14..5.16 rows=1 width=8)
                                       Index Cond: (id = 8723)
		...
   ->  Hash Semi Join  (cost=155.48..166.00 rows=20 width=3042)
         Hash Cond: (lzl_30.id = t_29.id)
         ->  Seq Scan on lzl_p20230630 lzl_30  (cost=0.00..10.20 rows=20 width=2912)
         ->  Hash  (cost=155.10..155.10 rows=30 width=40)
               ->  Subquery Scan on t_29  (cost=0.14..155.10 rows=30 width=40)
                     ->  Limit  (cost=0.14..154.80 rows=30 width=8)
                           ->  Append  (cost=0.14..154.80 rows=30 width=8)
                                 ->  Index Only Scan using lzl_p20230601_pkey on lzl_p20230601 lzl_931  (cost=0.14..5.16 rows=1 width=8)
                                       Index Cond: (id = 8723)
                                 ->  Index Only Scan using lzl_p20230602_pkey on lzl_p20230602 lzl_932  (cost=0.14..5.16 rows=1 width=8)
                                       Index Cond: (id = 8723)
								...
                                 ->  Index Only Scan using lzl_p20230630_pkey on lzl_p20230630 lzl_960  (cost=0.14..5.16 rows=1 width=8)
                                       Index Cond: (id = 8723)
(2041 rows)

执行计划非常长，总共有2041行。这个执行计划非常笨，每个分区在更新的时候，都把谓词条件在分区表中运行了一次，由于sql没有分区键，每次运行又扫描了所有分区。本身30个分区的分区表，每个分区扫描了30次，总共扫描900次分区。
从执行计划也能看到RTE表刚开始有30个，分配给update直到lzl_30；后面每次hash匹配时每个分区扫描也分配了30个RTE，比如lzl_1对应的hash下的分区扫描从lzl_32到lzl_61。这里为什么是32而不是31？因为整个分区扫描是一个子查询也是一个RTE，为t(所有的为t,t1-t_29)，也是总共30个。所以执行计划中生成的RTE表总共有30+30+30*30=960个。

如果用select来看执行计划，发现跟update的执行计划有很大的不同

explain with t as (select id from lzl where id=8723 limit 100 )
select  STATUS ,FILE_ID ,DATE_UPDATED  from lzl where id in ( select id from t);

 Hash Semi Join  (cost=155.48..467.05 rows=90 width=98)
   Hash Cond: (lzl.id = lzl_31.id)
   ->  Append  (cost=0.00..309.00 rows=600 width=106)
         ->  Seq Scan on lzl_p20230601 lzl_1  (cost=0.00..10.20 rows=20 width=106)
         ->  Seq Scan on lzl_p20230602 lzl_2  (cost=0.00..10.20 rows=20 width=106)
...
         ->  Seq Scan on lzl_p20230630 lzl_30  (cost=0.00..10.20 rows=20 width=106)
   ->  Hash  (cost=155.10..155.10 rows=30 width=8)
         ->  Limit  (cost=0.14..154.80 rows=30 width=8)
               ->  Append  (cost=0.14..154.80 rows=30 width=8)
                     ->  Index Only Scan using lzl_p20230601_pkey on lzl_p20230601 lzl_32  (cost=0.14..5.16 rows=1 width=8)
                           Index Cond: (id = 8723)
                     ->  Index Only Scan using lzl_p20230602_pkey on lzl_p20230602 lzl_33  (cost=0.14..5.16 rows=1 width=8)
                           Index Cond: (id = 8723)
...
                     ->  Index Only Scan using lzl_p20230630_pkey on lzl_p20230630 lzl_61  (cost=0.14..5.16 rows=1 width=8)
                           Index Cond: (id = 8723)
(96 rows)

没有反复（笛卡尔集式地）访问表，RTE只到了61个。这也是为什么400个分区的select可以查询出来的原因，因为400*400次访问实在太多了。
所以关于之前的sql，update执行报错，而select正常的情况，可以得出结论：

400个分区的select，它的执行计划中的RTE有801个，没超过INNER_VAR的值65000，它可以生成执行计划并且执行
400个分区的update，它的执行计划中的RTE有160160400个，远远超过INNER_VAR的值65000，不能成功生成执行计划，抛出RTE超限的报错。

原因分析的差不多了，但是select和update的执行计划差别很大，仍然感到疑惑。下面横向对比oracle和mysql的执行计划，看看有什么差别。

ORACLE的行为

oracle库创建分区表，并使用本地索引

CREATE TABLE lzl (
    id number NOT NULL,
    partition_key number DEFAULT 0 NOT NULL,
...
)
PARTITION BY RANGE (partition_key)
(
PARTITION lzl_p20230601 VALUES LESS THAN ('20230602'),
PARTITION lzl_p20230602 VALUES LESS THAN ('20230603'),
...
PARTITION lzl_p20230630 VALUES LESS THAN ('20230631'));

create  index PKLZL on lzl(id, partition_key) local;
alter table lzl add constraint pklzl primary key (id, partition_key) using index pklzl;

执行计划：

with t as (select id from lzl where id=8723 and rownum<= 100 )
select  STATUS ,FILE_ID ,DATE_UPDATED  from lzl where id in ( select id from t)

update lzl set
STATUS = '00',
FILE_ID = null,
DATE_UPDATED = sysdate
where id in (select id from lzl where id=8723 and rownum<= 100)

oracle里，select和update都使用了nest loop，访问所有分区partition range all，所以oracle无论是select和update，t表为驱动表，因为是in所以结果进行了排序去重，所以oracle的执行计划不是30*30次访问，而是跟驱动表里的结果集有关，如果是n条数据，那么访问n*30次分区。因为驱动表t没有什么数据，所以这个执行计划没什么问题。

mysql的行为

因为mysql只支持本地索引，所以直接建主键就行了

CREATE TABLE test (
    id bigint NOT NULL,
    date_created timestamp  ,
...
)
PARTITION BY RANGE (partition_key) 
(
PARTITION lzl_p20230601 VALUES LESS THAN (20230602),
PARTITION lzl_p20230602 VALUES LESS THAN (20230603),
...
PARTITION lzl_p20230630 VALUES LESS THAN (20230631));

alter table lzl add primary key pklzl(id,partition_key);

mysql从5.7开始执行计划会显示扫描了哪些分区，这里的版本是8.0的。
select的执行计划：


> explain with t as (select id from lzl where id=8723 limit 100 )
    -> select  STATUS ,FILE_ID ,DATE_UPDATED  from lzl where id in ( select id from t);
+----+-------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------------+---------+---------+-------+------+----------+-----------------+
| id | select_type | table      | partitions                                                                                                                                                                                                                                                                                                                                                                                                                          | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra           |
+----+-------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------------+---------+---------+-------+------+----------+-----------------+
|  1 | PRIMARY     | <derived3> | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                | ALL  | NULL          | NULL    | NULL    | NULL  |    2 |   100.00 | Start temporary |
|  1 | PRIMARY     | lzl        | lzl_p20230601,lzl_p20230602,lzl_p20230603,lzl_p20230604,lzl_p20230605,lzl_p20230606,lzl_p20230607,lzl_p20230608,lzl_p20230609,lzl_p20230610,lzl_p20230611,lzl_p20230612,lzl_p20230613,lzl_p20230614,lzl_p20230615,lzl_p20230616,lzl_p20230617,lzl_p20230618,lzl_p20230619,lzl_p20230620,lzl_p20230621,lzl_p20230622,lzl_p20230623,lzl_p20230624,lzl_p20230625,lzl_p20230626,lzl_p20230627,lzl_p20230628,lzl_p20230629,lzl_p20230630 | ref  | PRIMARY       | PRIMARY | 8       | t.id  |    1 |   100.00 | End temporary   |
|  3 | DERIVED     | lzl        | lzl_p20230601,lzl_p20230602,lzl_p20230603,lzl_p20230604,lzl_p20230605,lzl_p20230606,lzl_p20230607,lzl_p20230608,lzl_p20230609,lzl_p20230610,lzl_p20230611,lzl_p20230612,lzl_p20230613,lzl_p20230614,lzl_p20230615,lzl_p20230616,lzl_p20230617,lzl_p20230618,lzl_p20230619,lzl_p20230620,lzl_p20230621,lzl_p20230622,lzl_p20230623,lzl_p20230624,lzl_p20230625,lzl_p20230626,lzl_p20230627,lzl_p20230628,lzl_p20230629,lzl_p20230630 | ref  | PRIMARY       | PRIMARY | 8       | const |    1 |   100.00 | Using index     |

update的执行计划：

> explain with t as (select id from lzl where id=8723 limit 100 )
    -> update lzl set
    -> STATUS = '00',
    -> FILE_ID = null,
    -> DATE_UPDATED = localtimestamp(0)   where id in ( select id from t);
+----+-------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------------+---------+---------+-------+------+----------+-----------------+
| id | select_type | table      | partitions                                                                                                                                                                                                                                                                                                                                                                                                                          | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra           |
+----+-------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------------+---------+---------+-------+------+----------+-----------------+
|  1 | PRIMARY     | <derived3> | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                | ALL  | NULL          | NULL    | NULL    | NULL  |    2 |   100.00 | Start temporary |
|  1 | UPDATE      | lzl        | lzl_p20230601,lzl_p20230602,lzl_p20230603,lzl_p20230604,lzl_p20230605,lzl_p20230606,lzl_p20230607,lzl_p20230608,lzl_p20230609,lzl_p20230610,lzl_p20230611,lzl_p20230612,lzl_p20230613,lzl_p20230614,lzl_p20230615,lzl_p20230616,lzl_p20230617,lzl_p20230618,lzl_p20230619,lzl_p20230620,lzl_p20230621,lzl_p20230622,lzl_p20230623,lzl_p20230624,lzl_p20230625,lzl_p20230626,lzl_p20230627,lzl_p20230628,lzl_p20230629,lzl_p20230630 | ref  | PRIMARY       | PRIMARY | 8       | t.id  |    1 |   100.00 | End temporary   |
|  3 | DERIVED     | lzl        | lzl_p20230601,lzl_p20230602,lzl_p20230603,lzl_p20230604,lzl_p20230605,lzl_p20230606,lzl_p20230607,lzl_p20230608,lzl_p20230609,lzl_p20230610,lzl_p20230611,lzl_p20230612,lzl_p20230613,lzl_p20230614,lzl_p20230615,lzl_p20230616,lzl_p20230617,lzl_p20230618,lzl_p20230619,lzl_p20230620,lzl_p20230621,lzl_p20230622,lzl_p20230623,lzl_p20230624,lzl_p20230625,lzl_p20230626,lzl_p20230627,lzl_p20230628,lzl_p20230629,lzl_p20230630 | ref  | PRIMARY       | PRIMARY | 8       | const |    1 |   100.00 | Using index     |

mysql的2个执行计划都一致。但是执行计划驱动表选择有问题，const的应该为驱动表扫描次数会更少。

BUG?

bug描述

https://postgrespro.com/list/thread-id/2482006
通过报错很容易就能搜到这个bug。这个bug还是德哥在2020年提交的，后面是两个源码大佬对这个bug的讨论，讨论内容比较长，总结一下：pg并不支持无限制的分区表，这在真实世界中也是能理解的，如果分区过多性能可能会急速下降。但是社区还是认为需要调整这个限制，并对源码中的INNER_VAR、Var.varno等进行了讨论。

误导性

这个bug标题有一定的误导性，BUG #16302: too many range table entries - when count partition table(65538 childs)
bug看上去说分区表的分区不能超过65538个，在讨论中也有PG can handle up to 64K relations in a query，一个查询不能有超过64K的relation。
这个就很奇怪，因为我这里的表是400个分区，然后就抛出报错了。实际上上面两个描述不太准确。因为64K的限制指的是执行计划中“表”，不全等于真实的表。当然如果表或者分区已经超过这个数，那么当然会有问题。但是如果没有超过64K，也可能是有问题的，就像我这里的案例一样，它只有400个分区。

修复

bug提交的是12.2版本，我的环境是13.2版本的。
这个bug在pg15中修复，src/include/nodes/primnodes.h源码跟之前不一样了

#define    INNER_VAR		(-1)	/* reference to inner subplan */
#define    OUTER_VAR		(-2)	/* reference to outer subplan */
#define    INDEX_VAR		(-3)	/* reference to index column */
#define    ROWID_VAR		(-4)	/* row identity column during planning */

#define IS_SPECIAL_VARNO(varno)		((int) (varno) < 0)

跟社区讨论的一样，15中不仅把几个VAR都改成了负数，还把varno转换成了32位(40亿），之前是16位的（也就是65536）。
而在抛出报错的函数中，src/backend/optimizer/plan/setrefs.c中的add_rte_to_flat_rtable()函数中的报错代码已经被删除了！整个15的源码都没有too many range table entries！

总结

pg对于分区表的优化还有提升空间。pg对于分区表的分区，跟oracle、mysql不一样，它仍然子分区当成普通表来处理，而oracle只是把子分区当成一个段来处理，跟表是有差别的。这也导致pg在生产分区表执行计划时把每一个分区的访问方式都写了出来（在不会裁剪的情况下），分区特别多时执行计划会非常的长；而oracle只需要写partition range all就行了；mysql也会打印所有分区，但是不会像pg那样把每个分区的访问当成一个子查询，从而降低了执行计划的复杂度。
即使分区没有达到64K，也可能报错too many range table entries。这个限制其实是执行计划RTE个数，而不是分区个数（当然分区达到这个数，RTE也到达了，就像上面说的，pg打印了所有分区的访问方式）
too many range table entries报错在pg15解决
如果是15以下的版本，确实不要建太多的分区！当然也可以利用分区裁剪来减少分区的访问，就像这个案例，把where逻辑中加上分区键条件就可以了。

postgresql

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者