
Artemis: A Customizable Workload Generation
Toolkit for Benchmarking Cardinality Estimation
Zirui Hu¹,², Rong Zhang¹,²,*, Chengcheng Yang¹,², Xuan Zhou¹,², Quanqing Xu³, Chuanhui Yang³
¹ School of Data Science and Engineering, East China Normal University
² Engineering Research Center of Blockchain Data Management, Ministry of Education
³ OceanBase, Ant Group
zrhu@stu.ecnu.edu.cn, {rzhang, ccyang, xzhou}@dase.ecnu.edu.cn, {xuquanqing.xqq, rizhao.ych}@oceanbase.com
Abstract—Cardinality Estimation (CardEst) is crucial for query
optimization. Despite remarkable progress in DBMSs, there is a pressing
need to test or tune CardEst. To satisfy this need, we introduce ARTEMIS,
a customizable workload generator that can generate various scenarios
with the features CardEst is sensitive to, including various data
dependencies, complex SQL structures, and diverse cardinalities. ARTEMIS
designs a PK-oriented deterministic data generation mechanism to plot
various data characteristics, proposes a search-based workload generation
for composing queries of various complexities, and takes a constraint
optimization-guided approach to achieve cost-effective cardinality
calculation. In this demonstration, users can explore the core features
of ARTEMIS in generating workloads.
Index Terms—cardinality estimation, benchmarking.
I. INTRODUCTION
Cardinality estimation (CardEst), known as the Achilles' heel of query
optimization, is one of the critical tasks in a query optimizer: it
estimates the intermediate result size of each operator so that an
optimal query plan can be chosen. However, intricate data dependencies
and complicated selection/join operators still make CardEst an unsolved
hard problem [1]. Traditionally, DBMSs use histograms to collect data
statistics with low construction and maintenance costs. Because storing
detailed statistics is costly, the histogram-based method generally
relies on simplifying assumptions, e.g., uniformity within a histogram
bucket and independence of attributes.
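To make the independence assumption concrete, the following minimal Python sketch compares an independence-based estimate with the true cardinality on a toy table whose two columns are perfectly correlated. The table contents, the predicates, and the helper name `estimate_vs_truth` are illustrative assumptions and are not taken from ARTEMIS or any particular DBMS.

```python
import random


def estimate_vs_truth(rows, pred_a, pred_b):
    """Return (independence-based estimate, true cardinality) for a AND b."""
    n = len(rows)
    # Per-column selectivities, as a single-column statistic would provide.
    sel_a = sum(1 for a, _ in rows if pred_a(a)) / n
    sel_b = sum(1 for _, b in rows if pred_b(b)) / n
    # Independence assumption: multiply the selectivities.
    estimated = sel_a * sel_b * n
    # Ground truth: evaluate the conjunction on every row.
    actual = sum(1 for a, b in rows if pred_a(a) and pred_b(b))
    return estimated, actual


rng = random.Random(0)
# Hypothetical table in which column b is fully determined by column a (b == a).
correlated = [(a, a) for a in (rng.randrange(100) for _ in range(10_000))]

est, act = estimate_vs_truth(correlated, lambda a: a < 10, lambda b: b < 10)
print(f"estimated={est:.1f} actual={act}")
```

Because both columns carry the same value, the true selectivity of the conjunction equals the single-column selectivity s, whereas the independence-based estimate uses s * s and therefore underestimates the result.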
To move beyond these assumptions, learning-based methods are leveraged
to sketch the underlying data distribution and attribute correlations.
Learning-based methods are mainly categorized into data-driven and
query-driven. The data-driven method employs generative models for
unsupervised learning of the proportion of rows in the joint domain,
represented as $P(D_M) = P(D_1, D_2, \ldots, D_n)$ [2], where $D_M$ is
the joint domain from the full outer join of all tables, and $D_i$ is
the sub-domain of the $i$-th attribute column. Supposing the size of
table $T$ is $|T|$, the new query is $Q$, and the joint domain covered
by $Q$ is $D_Q$, the cardinality of query $Q$ is $P(D_Q) \cdot |T|$.
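The relation between the joint distribution and the cardinality can be illustrated with a small Python sketch. Here the exact empirical distribution of a toy table stands in for the learned generative model, which is an assumption made purely for illustration; the function names and the sample rows are hypothetical.

```python
from collections import Counter


def joint_distribution(rows):
    """Empirical P(D_1, ..., D_n): fraction of rows at each point of the joint domain."""
    n = len(rows)
    return {point: count / n for point, count in Counter(rows).items()}


def cardinality(dist, table_size, predicate):
    """Card(Q) = P(D_Q) * |T|, where D_Q is the part of the domain matching Q."""
    p_dq = sum(p for point, p in dist.items() if predicate(point))
    return p_dq * table_size


# Hypothetical two-column table T.
rows = [(1, "a"), (1, "a"), (2, "b"), (3, "b")]
dist = joint_distribution(rows)

# Q: SELECT COUNT(*) FROM T WHERE col2 = 'b'
card = cardinality(dist, len(rows), lambda point: point[1] == "b")
print(card)
```

A data-driven estimator replaces the exact dictionary above with a compact model that approximates the same probabilities, so its accuracy depends on how well that model captures the correlations between columns.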
The query-driven method uses discriminative models trained on annotated
query-cardinality pairs $(Q, Card(Q))$ to predict $Card(\bar{Q})$ for a
new query $\bar{Q}$. Though learning-based methods have demonstrated
strong capability on some classical benchmarks, e.g., TPC-H, JOB-light
[3], and STATS-CEB [4], they cannot be easily generalized to obtain high
accuracy across diversified scenarios, which hinders their practical use
in industry.
*Rong Zhang is the corresponding author.
However, providing diverse workloads for CardEst has always been
difficult. The core challenges are:
C1: Insufficient Data Dependency and Distribution. Both data-driven and
traditional methods fundamentally try to identify joint domains with
strong correlations and model their joint distribution, which requires
large datasets with varied data dependencies and distributions.
Traditional OLAP benchmarks, e.g., TPC-H, are meticulously designed for
evaluating execution engines and assume uniform distributions and
independence between attributes, which is too simple for the CardEst
task. Other CardEst benchmarks [4] typically rely on a fixed schema over
publicly available datasets, which limits their representativeness and
generalizability; additionally, maintaining the established dependency
relationships and data distributions when scaling the data remains a
great challenge.
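One way to picture why deterministic generation helps with scaling is the toy Python sketch below, in which every attribute is derived from the primary key alone. This is only an illustration of the general idea and not ARTEMIS's actual generator; the derivation rules, column names, and function names are all hypothetical.

```python
def generate_row(pk: int) -> tuple[int, int, str]:
    """Derive every attribute from the primary key (hypothetical rules)."""
    city_id = (pk * 7919) % 50           # deterministic spread over 50 cities
    region = f"region_{city_id // 10}"   # functional dependency: city_id -> region
    return pk, city_id, region


def generate_table(num_rows: int) -> list[tuple[int, int, str]]:
    """Generate a table of any size; rows depend only on their own pk."""
    return [generate_row(pk) for pk in range(num_rows)]


def holds_fd(rows) -> bool:
    """Check that each city_id maps to exactly one region."""
    seen = {}
    return all(seen.setdefault(city, region) == region for _, city, region in rows)


small = generate_table(1_000)
large = generate_table(100_000)
print(holds_fd(small), holds_fd(large))
```

Since each row depends only on its own key, the dependency holds at any scale, and the smaller table is a prefix of the larger one.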
C2: Demanding Training Label Generation. For query-driven methods,
effectiveness is fundamentally determined by the number and diversity of
annotated query-cardinality pairs available for training, which can
generally be obtained in two ways. One is to extract training data from
real-life industry logs, but collecting such data is rarely possible due
to privacy concerns [5]. The other is to construct self-defined training
data. However, this may consume a significant amount of CPU and memory
resources, along with tedious and time-consuming manual effort to
construct queries and run them in a full-fledged system to gather
labels. For example, collecting cardinality labels for 500 queries over
1 TB of input data could take nearly 10 days [6]. Therefore, the heavy
resource consumption and burdensome manual effort of label generation
make model training challenging.
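The execution-based labeling described above can be sketched in Python with the standard `sqlite3` module. The schema, data, and queries are hypothetical and tiny; the point is only that each label requires actually running its query, which is what becomes expensive at terabyte scale.

```python
import sqlite3


def label_queries(conn, queries):
    """Run each COUNT(*) query and return (query, cardinality) training pairs."""
    labelled = []
    for sql in queries:
        (card,) = conn.execute(sql).fetchone()
        labelled.append((sql, card))
    return labelled


# Hypothetical in-memory table standing in for a full-fledged system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(i, i % 10, i % 3) for i in range(1000)],
)

queries = [
    "SELECT COUNT(*) FROM t WHERE a < 5",
    "SELECT COUNT(*) FROM t WHERE a = 2 AND b = 0",
]
pairs = label_queries(conn, queries)
for sql, card in pairs:
    print(card, sql)
```

Every additional training query adds one full execution, so the labeling cost grows with both the number of queries and the data size.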
C3: Limited Complexity of Workloads. The diversity of training workloads
greatly influences the robustness of learning-based methods, while the
diversity of testing workloads verifies the effectiveness of methods
across various scenarios. Workload diversity can be represented along
three dimensions [4], i.e., 1) query templates composed of operators,
2) access distribution of the predicates, and 3) query cardinalities. On
the one hand, existing benchmarks have fundamental shortcomings in their
template design. For instance, the state-of-the-art CardEst benchmark
STATS-CEB [4] lacks support for some join types, including cyclic joins,
non-key joins, and outer/semi/anti-joins, and for selection predicates
such as disjunctive logical predicates (∨) and arithmetic predicates.
On the other hand, existing benchmarks do not guarantee
2025 IEEE 41st International Conference on Data Engineering (ICDE)
2375-026X/25/$31.00 ©2025 IEEE
DOI 10.1109/ICDE65448.2025.00369