ICDE 2025_GRACEFUL：A Learned Cost Estimator For UDFs.pdf

芥川慈郎

100

14页

1次

2025-06-17

免费下载

GRACEFUL: A Learned Cost Estimator For UDFs

Johannes Wehrstein

∗

, Tiemo Bang

1†

, Roman Heinrich

∗‡

, Carsten Binnig

∗‡

∗

Technical University of Darmstadt

†

Microsoft - Gray Systems Lab

‡

DFKI

Abstract—User-Deﬁned-Functions (UDFs) are a pivotal feature

in modern DBMS, enabling the extension of native DBMS

functionality with custom logic. However, the integration of UDFs

into query optimization processes poses signiﬁcant challenges,

primarily due to the difﬁculty of estimating UDF execution

costs. Consequently, existing cost models in DBMS optimizers

largely ignore UDFs or rely on static assumptions, resulting

in suboptimal performance for queries involving UDFs. In this

paper, we introduce GRACEFUL, a novel learned cost model

to make accurate cost predictions of query plans with UDFs

enabling optimization decisions for UDFs in DBMS. For example,

as we show in our evaluation, using our cost model, we can

achieve 50× speedups through informed pull-up/push-down ﬁlter

decisions of the UDF compared to the standard case where always

a ﬁlter push-down is applied. Additionally, we release a synthetic

dataset of over 90,000 UDF queries to promote further research

in this area.

I. INTRODUCTION

Modern Databases Increasingly Face UDFs. Rising data

volumes and constantly evolving computing requirements in-

crease the need to move computation as close as possible to

the data. One important way to meet this demand in Database

Management Systems (DBMS) is to utilize User-Deﬁned-

Functions (UDFs)—versatile stored procedures. UDFs are

crucial to encapsulate complex logic that cannot be efﬁciently

expressed by traditional SQL constructs alone. In that way,

they enable developers and data scientists to extend the native

functionality of DBMS with custom operations tailored to

speciﬁc application requirements. For example, UDFs are

frequently used for formatting values using string operations,

custom aggregation operators, or complex conversion or ﬁlter-

ing functions (e.g. conversion from Celsius to Fahrenheit or

formatting timestamps). Therefore UDFs have emerged as a

pivotal feature in modern DBMS and are executed billions of

times daily in data centers [1], [2].

UDF Optimizations are Crucial. The integration of UDFs

into database systems plays a key role in achieving fast and

efﬁcient query execution. Crucial for efﬁcient execution is

the query optimizer, responsible for selecting an efﬁcient

execution plan. However, UDFs pose signiﬁcant challenges for

query optimization, as their impact on plan runtime is often

poorly understood. For instance, while predicate push-down

is a common heuristic that typically improves performance, in

fact predicates involving UDFs contradict this heuristic. Since

UDFs can be computationally expensive as they can include

loops or external function calls, deferring their execution to

later plan stages might improve performance.

Limited UDF Support in DBMS Cost Estimators. Con-

sidering UDFs in query optimization is a decades-old problem,

Work done while at UC Berkeley.

Pull-Up!

Join

UDF

Join mi_idx

UDF title

GRACEFUL

w/ Default

Push-Down

w/ GRACEFUL

Pull-Up Optimization

21.86s

0.48s

Runtime

SELECT * FROM movie_keyword as mk JOIN title as t ON mk.movie_id = t.id

JOIN movie_info_idx as mi_idx ON t.id = mi_idx.movie_id

WHERE t.series_years = '1987-1997'

AND udf(mk.movie_id,mk.keyword_id) <= 26026;

4.5m

3.8m

1.4m

69k

51k

90k

Fig. 1: The impact of pull-up optimization an a SQL query with a

UDF: carefully chosen pull-ups of UDF ﬁlters in a query plan can

drastically reduce query runtime. The numbers along the query edges

annotate the intermediate cardinalities.

already raised in the late 90s, e.g., by Stonebraker in [3],

which discusses query optimization with complex predicates.

Since then, several approaches focused on optimizing UDF

computations to increase their performance [2], [4]. However,

the task of estimating their costs in order to optimize query

plans without having them seen before has not yet been

addressed. This leads query optimizers to often fail in plan

selection and have them fall back to na

ıve solutions if UDFs

are involved. To the best of our knowledge, all available

commercial and non-commercial systems today either use a

static constant for the UDF cost or ignore the UDF cost

completely. This is because DBMSs largely rely on traditional

cost functions handcrafted for a pre-deﬁned set of well-studied

DBMS operators [5], [6].

The Importance of Optimizing UDF Queries. The lack

of precise cost estimates signiﬁcantly degrades performance

when queries include UDFs. While basic optimizations, like

applying static rule sets on the UDF operator, are applicable,

more sophisticated cost-based optimizations are not supported,

as they lack cost estimates for the UDF. However, making

informed cost-based optimization decisions of SQL queries

which include a UDF can make a 10x, 50x, or even larger

difference on the total query runtime. This is shown in Figure 1

where pulling up the UDF-ﬁlter to the very top of the query

plan (instead of pushing it down to the table) reduced the

runtime from 21.86 to 0.48 seconds.

Let us now look at the example in Figure 1 in more detail

to understand where this effect comes from. As mentioned

before, textbook-style rule-based query optimization assumes

that a push-down of ﬁlter operation is always beneﬁcial

[7], [8]. Though this assumption holds for non-UDF ﬁlter

predicates, as the ﬁlter predicate evaluation is not a domi-

nating factor, this rule does not necessarily hold for complex

arXiv:2503.23863v1 [cs.DB] 31 Mar 2025

predicates involving UDFs. As shown in Figure 1, with the

presence of UDFs, it can be actually beneﬁcial to pull up a

UDF ﬁlter predicate if its computational cost can be reduced as

the UDF needs to be applied to fewer rows (69k rows instead

of 4.5m rows) as shown in the example.

Our Proposal: Learned Cost Estimator for UDFs. In

recent years, machine learning has shown promising results

for the task of cost estimation and query optimization, opening

a new avenue of database components [9]–[19]. These ML

techniques for cost estimation prove to capture complex effects

of data and workloads. Thus far, they do not capture the

complexities of UDFs in queries. UDFs have inner struc-

ture themselves—e.g., loops, branches, library calls—calling

for a new class of more cost estimation ML models. We

hence propose GRACEFUL (GRAph-based Cost Estimator

For User-deﬁned Logic), a learned cost model for query plans

containing UDFs.

In general, estimating the runtime of arbitrary program code

is a hard problem. This stems from the fact that determin-

ing runtime in the most general sense is equivalent to the

Halting Problem, the undecidable problem of determining the

termination of a Turing machine. To solve this challenge and

predict costs for UDFs accurately, we utilize the following

observations: (1) UDFs typically come with code of manage-

able structure, as shown by [1]. They are typically composed

of a comparably simple control ﬂow of a limited number of

loops and branches and have a limited size of only dozens

to hundreds of operations. (2) In contrast to execution code

outside a query plan, for UDFs inside a query plan we have

additional information on the data the UDFs are applied. This

allows reasoning on how often a certain operation in the UDF

will be executed or for how many rows an if/else branch

condition will evaluate to true.

The Need to Generalize to Unseen UDFs. Although UDFs

share structural similarities, the actual code can vary widely.

Due to this diversity among UDFs, it is crucial that the model

can generalize across UDFs and predict runtimes accurately

for unseen elements. To address this, with GRACEFUL we

provide a learned cost estimation model that aims to capture

the general runtime complexity of UDFs based on additional

statistics on how often certain code paths are executed, which

we derive from the base tables. As we show in our paper, our

model can thus generalize out-of-the-box to (1) unseen UDFs

(unseen program code), (2) unseen SQL workloads (unseen

SQL query patterns), and (3) unseen datasets.

We achieve this by pre-training the cost model on a wide

spectrum of queries with synthesized UDFs across various

datasets. Furthermore, we introduce a new representation for

the UDF code based on the control ﬂow graph of the UDF.

With this representation, we train a Graph Neural Network

(GNN) model to produce a joint embedding of the UDF

and query plan and predict costs for UDFs with varying

complexities, including loops, branches, arithmetic as well

as string operators, and even library calls. Further, based on

GRACEFUL, we show that a pull-up advisor can be realized

that operates on top of our cost model, achieving very robust

and close to optimal query runtimes, highlighting the beneﬁts

of accurate UDF cost estimates for optimizing query plans.

Adaptive Execution is Not to be Preferred. An alternative

route to our UDF cost estimator is adaptive execution. With

adaptive execution, the DBMS monitors the execution of the

operators and would, in ﬂight, e.g., depending on observed

intermediate cardinalities, reconsider optimization decisions

made so far, like the join order. This allows the DBMS

to achieve fast execution times without needing a classical

cost estimator since it will adapt the query plan during the

execution. However, we’d like to highlight that the adaptive

execution of queries is highly complicated when integrated

into a DBMS. It would require a completely different exe-

cution paradigm, which is, in most cases, incompatible with

DBMS. Since cost models suit today’s architecture, achieving

the same speedups with a cost model is to be preferred. We,

therefore, focus on non-adaptive systems in this work.

Contributions. The main contributions of this paper are

summarized as follows:

(1) We introduce GRACEFUL, a GNN based cost estimator

capable of accurately predicting query runtimes for User-

Deﬁned Functions (UDFs).

(2) We propose a novel transferrable representation for UDFs. It

captures the structure and input data distribution of UDFs

and includes a novel method to leverage database statis-

tics to estimate hit-ratios of UDF branches. This enables

accurate cost estimation for previously unseen UDFs and

databases.

(3) Our evaluation of GRACEFUL demonstrates superior ac-

curacy and signiﬁcant end-to-end impact when optimizing

pull-up / push-down for UDF queries.

(4) Finally, we facilitate further research by releasing our code

and benchmark dataset, comprising over 90,000 queries

across 20 different databases.

Outline. The remainder of this paper is structured as follows:

We ﬁrst provide an overview of our approach in Section II.

Next, we present our model’s details, including the representa-

tion of UDFs (Section III-A) and statistics annotations required

for precise cost estimation (Section III-B). Further, we describe

the joint representation of UDF and query (Section III-C) and

details of the model training and inference (Section III-D)

followed by the pull-up advisor Section IV, which builds

on our cost model. Afterward, we discuss our novel UDF

benchmark in Section V, which we used for evaluation.

Finally, we evaluate both the cost-estimator as well as the pull-

up advisor in Section VI using this benchmark and conclude

the paper with related work (Section VII) and a summary

(Section VIII).

II. OVERVIEW OF GRACEFUL

In this work, we introduce GRACEFUL, a novel learned

cost estimator designed to predict the runtime of SQL queries

which include UDFs as shown in Figure 2.

A. Key Ideas of Cost Model

To facilitate cost estimation, we propose a novel represen-

tation that abstracts UDFs, SQL queries, and the data they

https://github.com/DataManagementLab/Graceful.git

of 14

免费下载

icde2025 paper

文档被以下合辑收录

数据库顶会 ICDE 2025 论文下载（共16篇）

本合辑收集了数据库顶会 ICDE 2025 的论文，可以免费下载。

关注

文档被以下合辑收录

评论