暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
ICDE 2025_GRACEFUL:A Learned Cost Estimator For UDFs.pdf
100
14页
1次
2025-06-17
免费下载
GRACEFUL: A Learned Cost Estimator For UDFs
Johannes Wehrstein
, Tiemo Bang
1
, Roman Heinrich
, Carsten Binnig
Technical University of Darmstadt
Microsoft - Gray Systems Lab
DFKI
Abstract—User-Defined-Functions (UDFs) are a pivotal feature
in modern DBMS, enabling the extension of native DBMS
functionality with custom logic. However, the integration of UDFs
into query optimization processes poses significant challenges,
primarily due to the difficulty of estimating UDF execution
costs. Consequently, existing cost models in DBMS optimizers
largely ignore UDFs or rely on static assumptions, resulting
in suboptimal performance for queries involving UDFs. In this
paper, we introduce GRACEFUL, a novel learned cost model
to make accurate cost predictions of query plans with UDFs
enabling optimization decisions for UDFs in DBMS. For example,
as we show in our evaluation, using our cost model, we can
achieve 50× speedups through informed pull-up/push-down filter
decisions of the UDF compared to the standard case where always
a filter push-down is applied. Additionally, we release a synthetic
dataset of over 90,000 UDF queries to promote further research
in this area.
I. INTRODUCTION
Modern Databases Increasingly Face UDFs. Rising data
volumes and constantly evolving computing requirements in-
crease the need to move computation as close as possible to
the data. One important way to meet this demand in Database
Management Systems (DBMS) is to utilize User-Defined-
Functions (UDFs)—versatile stored procedures. UDFs are
crucial to encapsulate complex logic that cannot be efficiently
expressed by traditional SQL constructs alone. In that way,
they enable developers and data scientists to extend the native
functionality of DBMS with custom operations tailored to
specific application requirements. For example, UDFs are
frequently used for formatting values using string operations,
custom aggregation operators, or complex conversion or filter-
ing functions (e.g. conversion from Celsius to Fahrenheit or
formatting timestamps). Therefore UDFs have emerged as a
pivotal feature in modern DBMS and are executed billions of
times daily in data centers [1], [2].
UDF Optimizations are Crucial. The integration of UDFs
into database systems plays a key role in achieving fast and
efficient query execution. Crucial for efficient execution is
the query optimizer, responsible for selecting an efficient
execution plan. However, UDFs pose significant challenges for
query optimization, as their impact on plan runtime is often
poorly understood. For instance, while predicate push-down
is a common heuristic that typically improves performance, in
fact predicates involving UDFs contradict this heuristic. Since
UDFs can be computationally expensive as they can include
loops or external function calls, deferring their execution to
later plan stages might improve performance.
Limited UDF Support in DBMS Cost Estimators. Con-
sidering UDFs in query optimization is a decades-old problem,
1
Work done while at UC Berkeley.
Pull-Up!
Join
UDF
Join mi_idx
UDF title
mk
GRACEFUL
w/ Default
Push-Down
w/ GRACEFUL
Pull-Up Optimization
21.86s
0.48s
Runtime
SELECT * FROM movie_keyword as mk JOIN title as t ON mk.movie_id = t.id
JOIN movie_info_idx as mi_idx ON t.id = mi_idx.movie_id
WHERE t.series_years = '1987-1997'
AND udf(mk.movie_id,mk.keyword_id) <= 26026;
4.5m
3.8m
1.4m
69k
51k
90k
Fig. 1: The impact of pull-up optimization an a SQL query with a
UDF: carefully chosen pull-ups of UDF filters in a query plan can
drastically reduce query runtime. The numbers along the query edges
annotate the intermediate cardinalities.
already raised in the late 90s, e.g., by Stonebraker in [3],
which discusses query optimization with complex predicates.
Since then, several approaches focused on optimizing UDF
computations to increase their performance [2], [4]. However,
the task of estimating their costs in order to optimize query
plans without having them seen before has not yet been
addressed. This leads query optimizers to often fail in plan
selection and have them fall back to na
¨
ıve solutions if UDFs
are involved. To the best of our knowledge, all available
commercial and non-commercial systems today either use a
static constant for the UDF cost or ignore the UDF cost
completely. This is because DBMSs largely rely on traditional
cost functions handcrafted for a pre-defined set of well-studied
DBMS operators [5], [6].
The Importance of Optimizing UDF Queries. The lack
of precise cost estimates significantly degrades performance
when queries include UDFs. While basic optimizations, like
applying static rule sets on the UDF operator, are applicable,
more sophisticated cost-based optimizations are not supported,
as they lack cost estimates for the UDF. However, making
informed cost-based optimization decisions of SQL queries
which include a UDF can make a 10x, 50x, or even larger
difference on the total query runtime. This is shown in Figure 1
where pulling up the UDF-filter to the very top of the query
plan (instead of pushing it down to the table) reduced the
runtime from 21.86 to 0.48 seconds.
Let us now look at the example in Figure 1 in more detail
to understand where this effect comes from. As mentioned
before, textbook-style rule-based query optimization assumes
that a push-down of filter operation is always beneficial
[7], [8]. Though this assumption holds for non-UDF filter
predicates, as the filter predicate evaluation is not a domi-
nating factor, this rule does not necessarily hold for complex
arXiv:2503.23863v1 [cs.DB] 31 Mar 2025
predicates involving UDFs. As shown in Figure 1, with the
presence of UDFs, it can be actually beneficial to pull up a
UDF filter predicate if its computational cost can be reduced as
the UDF needs to be applied to fewer rows (69k rows instead
of 4.5m rows) as shown in the example.
Our Proposal: Learned Cost Estimator for UDFs. In
recent years, machine learning has shown promising results
for the task of cost estimation and query optimization, opening
a new avenue of database components [9]–[19]. These ML
techniques for cost estimation prove to capture complex effects
of data and workloads. Thus far, they do not capture the
complexities of UDFs in queries. UDFs have inner struc-
ture themselves—e.g., loops, branches, library calls—calling
for a new class of more cost estimation ML models. We
hence propose GRACEFUL (GRAph-based Cost Estimator
For User-defined Logic), a learned cost model for query plans
containing UDFs.
In general, estimating the runtime of arbitrary program code
is a hard problem. This stems from the fact that determin-
ing runtime in the most general sense is equivalent to the
Halting Problem, the undecidable problem of determining the
termination of a Turing machine. To solve this challenge and
predict costs for UDFs accurately, we utilize the following
observations: (1) UDFs typically come with code of manage-
able structure, as shown by [1]. They are typically composed
of a comparably simple control flow of a limited number of
loops and branches and have a limited size of only dozens
to hundreds of operations. (2) In contrast to execution code
outside a query plan, for UDFs inside a query plan we have
additional information on the data the UDFs are applied. This
allows reasoning on how often a certain operation in the UDF
will be executed or for how many rows an if/else branch
condition will evaluate to true.
The Need to Generalize to Unseen UDFs. Although UDFs
share structural similarities, the actual code can vary widely.
Due to this diversity among UDFs, it is crucial that the model
can generalize across UDFs and predict runtimes accurately
for unseen elements. To address this, with GRACEFUL we
provide a learned cost estimation model that aims to capture
the general runtime complexity of UDFs based on additional
statistics on how often certain code paths are executed, which
we derive from the base tables. As we show in our paper, our
model can thus generalize out-of-the-box to (1) unseen UDFs
(unseen program code), (2) unseen SQL workloads (unseen
SQL query patterns), and (3) unseen datasets.
We achieve this by pre-training the cost model on a wide
spectrum of queries with synthesized UDFs across various
datasets. Furthermore, we introduce a new representation for
the UDF code based on the control flow graph of the UDF.
With this representation, we train a Graph Neural Network
(GNN) model to produce a joint embedding of the UDF
and query plan and predict costs for UDFs with varying
complexities, including loops, branches, arithmetic as well
as string operators, and even library calls. Further, based on
GRACEFUL, we show that a pull-up advisor can be realized
that operates on top of our cost model, achieving very robust
and close to optimal query runtimes, highlighting the benefits
of accurate UDF cost estimates for optimizing query plans.
Adaptive Execution is Not to be Preferred. An alternative
route to our UDF cost estimator is adaptive execution. With
adaptive execution, the DBMS monitors the execution of the
operators and would, in flight, e.g., depending on observed
intermediate cardinalities, reconsider optimization decisions
made so far, like the join order. This allows the DBMS
to achieve fast execution times without needing a classical
cost estimator since it will adapt the query plan during the
execution. However, we’d like to highlight that the adaptive
execution of queries is highly complicated when integrated
into a DBMS. It would require a completely different exe-
cution paradigm, which is, in most cases, incompatible with
DBMS. Since cost models suit today’s architecture, achieving
the same speedups with a cost model is to be preferred. We,
therefore, focus on non-adaptive systems in this work.
Contributions. The main contributions of this paper are
summarized as follows:
(1) We introduce GRACEFUL, a GNN based cost estimator
capable of accurately predicting query runtimes for User-
Defined Functions (UDFs).
(2) We propose a novel transferrable representation for UDFs. It
captures the structure and input data distribution of UDFs
and includes a novel method to leverage database statis-
tics to estimate hit-ratios of UDF branches. This enables
accurate cost estimation for previously unseen UDFs and
databases.
(3) Our evaluation of GRACEFUL demonstrates superior ac-
curacy and significant end-to-end impact when optimizing
pull-up / push-down for UDF queries.
(4) Finally, we facilitate further research by releasing our code
and benchmark dataset, comprising over 90,000 queries
across 20 different databases.
2
Outline. The remainder of this paper is structured as follows:
We first provide an overview of our approach in Section II.
Next, we present our model’s details, including the representa-
tion of UDFs (Section III-A) and statistics annotations required
for precise cost estimation (Section III-B). Further, we describe
the joint representation of UDF and query (Section III-C) and
details of the model training and inference (Section III-D)
followed by the pull-up advisor Section IV, which builds
on our cost model. Afterward, we discuss our novel UDF
benchmark in Section V, which we used for evaluation.
Finally, we evaluate both the cost-estimator as well as the pull-
up advisor in Section VI using this benchmark and conclude
the paper with related work (Section VII) and a summary
(Section VIII).
II. OVERVIEW OF GRACEFUL
In this work, we introduce GRACEFUL, a novel learned
cost estimator designed to predict the runtime of SQL queries
which include UDFs as shown in Figure 2.
A. Key Ideas of Cost Model
To facilitate cost estimation, we propose a novel represen-
tation that abstracts UDFs, SQL queries, and the data they
2
https://github.com/DataManagementLab/Graceful.git
of 14
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。
关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜