ICDE 2025_Query Weak Equivalence and Its Verification in Analytical Databases_OceanBase.pdf

迹部景吾

13页

2次

2025-06-10

免费下载

Query Weak Equivalence and Its Veriﬁcation in

Analytical Databases

Jinguo You*

Kunming University of Science

and Technology

Kunming, China

jgyou@kust.edu.cn

Wanting Fu

Kunming University of Science

and Technology

Kunming, China

wtfu@stu.kust.edu.cn

Yuxuan Wang

Kunming University of Science

and Technology

Kunming, China

yxwang@stu.kust.edu.cn

Peilei He

Kunming University of Science

and Technology

Kunming, China

peilei@stu.kust.edu.cn

Kaiqi Liu

Kunming University of Science

and Technology

Kunming, China

liukaiqi w@163.com

Quanqing Xu*

Ant Group OceanBase

Beijing, China

xuquanqing.xqq@oceanbase.com

Abstract—Modern database applications operate on massive

data and support a range of complex queries, especially OLAP

queries which are time-consuming. To accelerate query pro-

cessing, a variety of methods for automatically verifying query

equivalence have been proposed to avoid redundant executions

of equivalent queries, mainly in a semantic sense. However, we

have observed some queries that are not semantically equivalent

also return the same tuples under the speciﬁc data distribution,

which cannot be detected by most current automated veriﬁcation

of query equivalence.

To deal with this issue, this paper proposes weak equivalence

for identifying queries that are not semantically equivalent but

produce the same results under the read-mostly scenarios such

as OLAP. Speciﬁcally, for posed queries, we extract their ﬁlter

condition expressions, which are then transformed into symbolic

representations, namely ﬁrst-order logic formulae. In terms of

their partial order, i.e. containment relationship, we introduce

Query Lattice, a novel structure that is constructed as a lattice

which is partitioned into equivalence classes that are convex

to answer queries if we determine they belong to the classes.

The equivalence class enables stored queries to respond to

future unseen queries so that redundant generation of query

plan and execution can be bypassed. Experimental evaluation of

Query Lattice built on top of a prevailing open-source DBMS,

PostgreSQL shows that the maximum improvement that Query

Lattice can achieve is 44.95% over the original PostgreSQL,

when running on the datasets of both TPC-H and TPC-H Skew

benchmarks.

Index Terms—OLAP, Multiple-query optimization, Data cube,

Query lattice, Weak equivalence.

I. INTRODUCTION

With the proliferation of database applications, improving

the performance of database management systems has become

a major objective of database research and there is a long

line of work focusing on the optimization of queries [1]–

[5]. The growing number of data analytic systems, like Myria

[6], Dremel [7], Spark SQL [8], and so on, enable people

*Corresponding authors.

to deploy complex data processing which has exhibited sig-

niﬁcant redundant operations including complex computations

and reading tuples from disk due to unnecessary execution of

queries that return the same tuples. These resources wasted are

caused by redundant execution of repeated queries, equivalent

queries with different forms, and not equivalent queries but

with the same outputs on a given data distribution. In practice,

it is difﬁcult to manually clarify these redundant, especially

the not equivalent queries returning the same answer. Thus,

automated methods for minimizing unwanted execution are

required.

In this paper, we say two queries are similar if they select the

same columns and have the same aggregate functions with the

same inputs. The difference between two similar queries is that

they have different ﬁlter predicates. Except for semantically

equivalent queries, these similar but not equivalent queries

can still produce identical results depending on the data

distribution of the speciﬁc database instance. For the case

that queries have the same results only under some speciﬁc

data distributions, we say these queries have weak equivalence

and the queries that have weak equivalence are similar ﬁrst.

In read-mostly environments, like data warehouses, most are

long-running analytical queries [9]. The methods mentioned

above are dedicated to verifying equivalent SQL queries. As

a result, the opportunity to optimize these similar queries that

have weak equivalence is missed.

Researchers have extensively studied the problem of de-

tecting equivalent queries [10], [11]. Previous research has

concentrated on identifying a subset of relational algebra

where it is feasible to determine query equivalence under

set and bag semantics [12], [13]. [14] focused solely on the

theoretical aspects of this problem. Recently, two pragmatic

approaches have been proposed for determining the semantic

equivalence of queries. The algebraic approaches translate

queries into algebraic expressions and then apply a set of

rewrite rules to standardize the expressions [15]–[17]. Finally,

1882

2025 IEEE 41st International Conference on Data Engineering (ICDE)

DOI 10.1109/ICDE65448.2025.00144

BBAAD9C20180234D78A0072836F0B970A2B9B2091AE1FBA0AED98436B1142B0C7B48B93861513B0122392408984659EB40E9217A61D04B311BBFC236724E32D124124DAD8521D9A7941D2A276F2749B7CA37E83875DB437E48D6019C18B00CD8DAC625989E3

they perform a sequence of tests to check for isomorphisms

and homomorphisms between the rewritten expressions to

determine the equivalence of the original queries. COSETTE

utilizes K-relations [18] to represent queries while UDP em-

ploys U-semirings [16], [17], [19]. These algebraic approaches

are used to verify the correctness of query rewrite rules

in query optimizers. They can therefore validate complex

structural transformations of queries. However, these algebraic

approaches still have limitations. First, they are unable to

model the semantics of widely used SQL features, such as

complex query predicates, arithmetic operations, and three-

valued logic for supporting NULL [20]. This limits their ability

to support a wide range of real-world queries. Secondly, its

veriﬁcation procedure is computationally intensive due to a lot

of rewrite rules that need to be applied to the given algebraic

expressions to determine their equivalence. These limitations

restrict their efﬁciency in DBaaS offerings.

To address these issues, the symbolic approach EQUITAS

[21] determines query equivalence using two steps. It initially

converts a query to a symbolic representation (SR) (i.e.,

a set of ﬁrst-order logic formulae) and thereafter uses a

general-purpose Satisﬁability Modulo Theory (SMT) solver

to ascertain the containment relationship between two SRs. If

the containment relationship holds in both directions, then the

two queries are considered equivalent. The symbolic method

overcomes the limitations of the algebraic method by reducing

the problem of proving query equivalence to a constraint

satisfaction problem [21]. Since the symbolic approach can

model more SQL features, more SQL equivalence solvers

are proposed to discover more equivalent queries [22], [23].

While this symbolic approach addresses the drawbacks of the

algebraic approach, both of them have one limitation. They

only optimize the repetitive execution of equivalent queries

(i.e., these queries must return the same tuples for any database

instances).

Therefore, we propose to take advantage of this optimization

opportunity by leveraging a novel approach for determining

weak equivalence for similar queries. We present a lattice

structure called Query Lattice to automatically detect similar

queries having weak equivalence and answer them. The sym-

bolic representations of queries’ ﬁlter conditions are derived

and stored in the Query Lattice, which is then partitioned

into equivalence classes based on whether queries have partial

order and their results are the same. This approach can

model various complex query predicates including arithmetic

operations and three-valued logic. We then utilize the com-

putationally efﬁcient SMT solver to establish containment

among SRs of queries, since the problem of determining the

weak equivalence of queries is reduced to that of deciding

the satisﬁability of formulae in ﬁrst-order logic. Deciding the

satisﬁability of FOL formulae is an NP-complete problem.

However, in practice, modern solvers employ heuristics from

satisﬁability theory [24] to efﬁciently solve FOL formulae

[25], [26].

If new queries have weak equivalence with cached queries

in our lattice, we employ cached results from that equivalence

class to answer the queries. Our method eliminates the need

for any succeeding query plan generation and execution when

a query can be answered by an equivalence class. Furthermore,

the lattice structure minimizes the search space for a query to

ﬁnd a matching class by leveraging the partial order among

equivalence classes in the lattice. Though both problems of

deciding query equivalence and containment are undecidable

[27], [28], these two problems can now be reduced to that of

deciding the satisﬁability of formulae in ﬁrst-order logic. This

allows for the use of computationally efﬁcient SMT solvers

[24].

In summary, we make the following contributions:

• We illustrate the limitations of veriﬁcation of query

equivalence and motivate the need for a new method to

determine the equivalence of query results.

• We introduce the concepts of input-containment and

weak equivalence to describe those queries that are

not semantically equivalent but have the same results.

The weak equivalence can be proven through input-

containment and equivalence class.

• We present Query Lattice, a novel structure composed

of symbolic representations of ﬁlter conditions of

queries. We demonstrate that this structure improves the

performance of PostgreSQL by handling not semantically

equivalent queries which have the same outputs.

• We implement our Query Lattice in PostgreSQL, a widely

used database, and Query Lattice is evaluated on both

TPC-H and TPC-H Skew benchmarks. We demonstrate

that this structure can ﬁnd similar queries with weak

equivalence in both benchmarks which improves the

response time of query answering compared to the

original PostgreSQL.

The rest of this paper is organized as follows: Section 2

describes the motivation of Query Lattice. Section 3 presents

the process by which Query Lattice veriﬁes query results

equivalence between queries. Section 4 describes the hierar-

chical structure of Query Lattice. Section 5 demonstrates two

main algorithms used for comparing and updating. Section 6

contains experiments to validate the efﬁcacy of Query Lattice

in improving database performance and lays out the limitations

of the current implementation of Query Lattice. Finally, the

paper is concluded in Section 7.

II. BACKGROUND & MOTIVATION

In this section, we ﬁrst use two examples to show similar

queries with weak equivalence that former validation methods

for equivalent queries are not applicable. We then present

an overview of how Query Lattice solves this problem and

illustrate this approach with some deﬁnitions.

A. Containment and Equivalence Problem

A pair of queries Q

and Q

is deemed to have a contain-

ment relationship provided the following constraint is satisﬁed:

1883

of 13

免费下载

文档被以下合辑收录

数据库顶会 ICDE 2025 论文下载（共16篇）

本合辑收集了数据库顶会 ICDE 2025 的论文，可以免费下载。

关注

文档被以下合辑收录

评论