predicates involving UDFs. As shown in Figure 1, with the
presence of UDFs, it can be actually beneficial to pull up a
UDF filter predicate if its computational cost can be reduced as
the UDF needs to be applied to fewer rows (69k rows instead
of 4.5m rows) as shown in the example.
Our Proposal: Learned Cost Estimator for UDFs. In
recent years, machine learning has shown promising results
for the task of cost estimation and query optimization, opening
a new avenue of database components [9]–[19]. These ML
techniques for cost estimation prove to capture complex effects
of data and workloads. Thus far, they do not capture the
complexities of UDFs in queries. UDFs have inner struc-
ture themselves—e.g., loops, branches, library calls—calling
for a new class of more cost estimation ML models. We
hence propose GRACEFUL (GRAph-based Cost Estimator
For User-defined Logic), a learned cost model for query plans
containing UDFs.
In general, estimating the runtime of arbitrary program code
is a hard problem. This stems from the fact that determin-
ing runtime in the most general sense is equivalent to the
Halting Problem, the undecidable problem of determining the
termination of a Turing machine. To solve this challenge and
predict costs for UDFs accurately, we utilize the following
observations: (1) UDFs typically come with code of manage-
able structure, as shown by [1]. They are typically composed
of a comparably simple control flow of a limited number of
loops and branches and have a limited size of only dozens
to hundreds of operations. (2) In contrast to execution code
outside a query plan, for UDFs inside a query plan we have
additional information on the data the UDFs are applied. This
allows reasoning on how often a certain operation in the UDF
will be executed or for how many rows an if/else branch
condition will evaluate to true.
The Need to Generalize to Unseen UDFs. Although UDFs
share structural similarities, the actual code can vary widely.
Due to this diversity among UDFs, it is crucial that the model
can generalize across UDFs and predict runtimes accurately
for unseen elements. To address this, with GRACEFUL we
provide a learned cost estimation model that aims to capture
the general runtime complexity of UDFs based on additional
statistics on how often certain code paths are executed, which
we derive from the base tables. As we show in our paper, our
model can thus generalize out-of-the-box to (1) unseen UDFs
(unseen program code), (2) unseen SQL workloads (unseen
SQL query patterns), and (3) unseen datasets.
We achieve this by pre-training the cost model on a wide
spectrum of queries with synthesized UDFs across various
datasets. Furthermore, we introduce a new representation for
the UDF code based on the control flow graph of the UDF.
With this representation, we train a Graph Neural Network
(GNN) model to produce a joint embedding of the UDF
and query plan and predict costs for UDFs with varying
complexities, including loops, branches, arithmetic as well
as string operators, and even library calls. Further, based on
GRACEFUL, we show that a pull-up advisor can be realized
that operates on top of our cost model, achieving very robust
and close to optimal query runtimes, highlighting the benefits
of accurate UDF cost estimates for optimizing query plans.
Adaptive Execution is Not to be Preferred. An alternative
route to our UDF cost estimator is adaptive execution. With
adaptive execution, the DBMS monitors the execution of the
operators and would, in flight, e.g., depending on observed
intermediate cardinalities, reconsider optimization decisions
made so far, like the join order. This allows the DBMS
to achieve fast execution times without needing a classical
cost estimator since it will adapt the query plan during the
execution. However, we’d like to highlight that the adaptive
execution of queries is highly complicated when integrated
into a DBMS. It would require a completely different exe-
cution paradigm, which is, in most cases, incompatible with
DBMS. Since cost models suit today’s architecture, achieving
the same speedups with a cost model is to be preferred. We,
therefore, focus on non-adaptive systems in this work.
Contributions. The main contributions of this paper are
summarized as follows:
(1) We introduce GRACEFUL, a GNN based cost estimator
capable of accurately predicting query runtimes for User-
Defined Functions (UDFs).
(2) We propose a novel transferrable representation for UDFs. It
captures the structure and input data distribution of UDFs
and includes a novel method to leverage database statis-
tics to estimate hit-ratios of UDF branches. This enables
accurate cost estimation for previously unseen UDFs and
databases.
(3) Our evaluation of GRACEFUL demonstrates superior ac-
curacy and significant end-to-end impact when optimizing
pull-up / push-down for UDF queries.
(4) Finally, we facilitate further research by releasing our code
and benchmark dataset, comprising over 90,000 queries
across 20 different databases.
2
Outline. The remainder of this paper is structured as follows:
We first provide an overview of our approach in Section II.
Next, we present our model’s details, including the representa-
tion of UDFs (Section III-A) and statistics annotations required
for precise cost estimation (Section III-B). Further, we describe
the joint representation of UDF and query (Section III-C) and
details of the model training and inference (Section III-D)
followed by the pull-up advisor Section IV, which builds
on our cost model. Afterward, we discuss our novel UDF
benchmark in Section V, which we used for evaluation.
Finally, we evaluate both the cost-estimator as well as the pull-
up advisor in Section VI using this benchmark and conclude
the paper with related work (Section VII) and a summary
(Section VIII).
II. OVERVIEW OF GRACEFUL
In this work, we introduce GRACEFUL, a novel learned
cost estimator designed to predict the runtime of SQL queries
which include UDFs as shown in Figure 2.
A. Key Ideas of Cost Model
To facilitate cost estimation, we propose a novel represen-
tation that abstracts UDFs, SQL queries, and the data they
2
https://github.com/DataManagementLab/Graceful.git
文档被以下合辑收录
评论