
SIGMOD ’19, June 30-July 5, 2019, Amsterdam, Netherlands Mark Raasveldt and Hannes Mühleisen
system. The following requirements for embedded analytical databases were identified:
• High efficiency for OLAP workloads, but without completely sacrificing OLTP performance. For example, concurrent data modification is a common use case in dashboard scenarios where multiple threads update the data using OLTP queries while other threads simultaneously run the OLAP queries that drive visualizations.
• Efficient transfer of tables to and from the database is essential. Since both database and application run in the same process and thus address space, there is a unique opportunity for efficient data sharing which needs to be exploited.
• High degree of stability: if the embedded database crashes, for example due to an out-of-memory situation, it takes the host down with it. This can never happen. Queries need to be cleanly abortable if they run out of resources, and the system needs to gracefully adapt to resource contention.
• Practical “embeddability” and portability: the database needs to run in whatever environment the host does. Dependencies on external libraries (e.g. openssl) for either compile- or runtime have been found to be problematic. Signal handling, calls to exit(), and modification of singular process state (locale, working directory, etc.) are forbidden.
In this demonstration, we present the capabilities of our new system, DuckDB. DuckDB is a new purpose-built embeddable relational database management system. DuckDB is available as Open-Source software under the permissive MIT license¹. To the best of our knowledge, there currently exists no purpose-built embeddable analytical database despite the clear need outlined above. DuckDB is no research prototype but built to be widely used, with millions of test queries run on each commit to ensure correct operation and completeness of the SQL interface. Our first-ever demonstration of DuckDB will pit it against other systems on a small device. We will allow viewers to increase the size of the dataset processed, and observe various metrics such as CPU load and memory pressure as the dataset size changes. This will demonstrate the performance of DuckDB for embedded analytical workloads.
2 DESIGN AND IMPLEMENTATION
DuckDB’s design decisions are informed by its intended use
case: embedded analytics. Overall, we follow the “textbook”
separation of components: Parser, logical planner, optimizer,
physical planner, execution engine. Orthogonal components
are the transaction and storage managers. While DuckDB
is first in a new class of data management systems, none of DuckDB’s components is revolutionary in its own regard. Instead, we combined methods and algorithms from the state of the art that were best suited for our use cases.

Component            Implementation
API                  C/C++/SQLite
SQL Parser           libpg_query [2]
Optimizer            Cost-Based [7, 9]
Execution Engine     Vectorized [1]
Concurrency Control  Serializable MVCC [10]
Storage              DataBlocks [5]

Table 1: DuckDB: Component Overview

¹ https://github.com/cwida/duckdb
Being an embedded database, DuckDB does not have a
client protocol interface or a server process, but instead is
accessed using a C/C++ API. In addition, DuckDB provides
a SQLite compatibility layer, allowing applications that pre-
viously used SQLite to use DuckDB through re-linking or
library overloading. As with MonetDBLite, we have also im-
plemented the database APIs for R (DBI) and Python (PEP
249).
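Because the Python client follows PEP 249, code written against any DB-API driver carries over unchanged. The sketch below shows that interface shape (connect/cursor/execute/fetchall); it uses the standard library's sqlite3 module purely as a stand-in driver, since the interface shape, not the specific module, is the point here.

```python
# Minimal PEP 249 (Python DB-API) usage pattern of the kind DuckDB's
# Python client supports. sqlite3 from the standard library is used
# here only as a stand-in driver exposing the same interface shape.
import sqlite3

def run_dbapi_roundtrip(connect):
    """Create a table, insert rows, and read them back through the
    connect/cursor/execute/fetchall interface defined by PEP 249."""
    con = connect(":memory:")          # in-process, no server to contact
    cur = con.cursor()
    cur.execute("CREATE TABLE t (i INTEGER, s TEXT)")
    cur.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
    cur.execute("SELECT i, s FROM t ORDER BY i")
    rows = cur.fetchall()
    con.close()
    return rows

print(run_dbapi_roundtrip(sqlite3.connect))  # [(1, 'a'), (2, 'b')]
```

Because the database runs in-process, `connect(":memory:")` involves no network or server setup, which is exactly the embedded deployment model described above.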
The SQL parser is derived from Postgres’ SQL parser that
has been stripped down as much as possible [2]. This has
the advantage of providing DuckDB with a full-featured and
stable parser to handle one of the most volatile forms of its
input, SQL queries. The parser takes a SQL query string as
input and returns a parse tree of C structures. This parse tree
is then immediately transformed into our own parse tree of
C++ classes to limit the reach of Postgres’ data structures.
This parse tree consists of statements (e.g. SELECT, INSERT, etc.) and expressions (e.g. SUM(a)+1).
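To illustrate the shape of such a tree, a toy sketch follows; the class names are ours, not DuckDB's. The expression SUM(a)+1 becomes a function node over an aggregate call over a column reference:

```python
# Toy parse-tree classes illustrating the statement/expression split
# described above. The class names are illustrative, not DuckDB's own.
from dataclasses import dataclass
from typing import List

@dataclass
class ColumnRef:            # a column reference, e.g. "a"
    name: str

@dataclass
class Constant:             # a literal, e.g. 1
    value: int

@dataclass
class FunctionCall:         # SUM(a), or the binary "+" operator
    name: str
    args: List[object]

@dataclass
class SelectStatement:      # e.g. SELECT SUM(a)+1 FROM t
    select_list: List[object]
    from_table: str

# SUM(a)+1 as an expression tree: "+"(SUM(a), 1)
expr = FunctionCall("+", [FunctionCall("SUM", [ColumnRef("a")]), Constant(1)])
stmt = SelectStatement(select_list=[expr], from_table="t")
print(stmt.select_list[0].name)  # +
```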
The logical planner consists of two parts, the binder and
the plan generator. The binder resolves all expressions re-
ferring to schema objects such as tables or views with their
column names and types. The logical plan generator then
transforms the parse tree into a tree of basic logical query
operators such as scan, filter, project, etc. After the planning phase, we have a fully type-resolved logical query plan.
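The two-part split can be sketched as follows for a query like SELECT a FROM t WHERE b > 1; this is a toy illustration under our own names, not DuckDB's internals: the binder resolves names against a catalog and attaches types, and the plan generator emits a project-filter-scan operator tree.

```python
# Toy binder + logical plan generator, following the two-part planner
# described above. Catalog contents and names are illustrative.
CATALOG = {"t": {"a": "INTEGER", "b": "INTEGER"}}  # schema: table -> columns

def bind(table, columns):
    """Binder: resolve column references against the catalog, attaching
    types; unknown names fail here rather than during execution."""
    schema = CATALOG[table]
    return [(c, schema[c]) for c in columns]

def plan(table, projected, filtered):
    """Plan generator: emit a Project -> Filter -> Scan operator tree,
    encoded as nested tuples (operator name, payload, child)."""
    scan = ("scan", table, None)
    filt = ("filter", bind(table, [filtered])[0], scan)
    return ("project", bind(table, [projected]), filt)

p = plan("t", "a", "b")
print(p)
# ('project', [('a', 'INTEGER')], ('filter', ('b', 'INTEGER'), ('scan', 't', None)))
```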
DuckDB keeps statistics on the stored data, and these are
propagated through the different expression trees as part of the planning process. These statistics are used in the optimizer itself, and are also used for integer overflow prevention
by upgrading types when required.
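The overflow-prevention idea can be sketched as follows; the exact type names and promotion rules are our assumptions for illustration, not DuckDB's actual rules. Min/max statistics are propagated through an addition, and the result type is widened when the propagated range no longer fits:

```python
# Sketch of statistics propagation for overflow prevention: [min, max]
# statistics flow through an addition, and the result type is upgraded
# when the propagated range exceeds the narrow type. Illustrative only.
INT32 = (-2**31, 2**31 - 1)
INT64 = (-2**63, 2**63 - 1)

def add_stats(a, b):
    """Propagate [min, max] statistics through a + b."""
    return (a[0] + b[0], a[1] + b[1])

def result_type(stats):
    """Pick the narrowest integer type whose range covers the stats."""
    if INT32[0] <= stats[0] and stats[1] <= INT32[1]:
        return "INT32"
    if INT64[0] <= stats[0] and stats[1] <= INT64[1]:
        return "INT64"
    return "HUGEINT"

a = (0, 2_000_000_000)            # column statistics: values in [0, 2e9]
b = (0, 2_000_000_000)
s = add_stats(a, b)               # (0, 4_000_000_000): exceeds INT32 max
print(result_type(s))             # INT64
```

Because the check happens at planning time using statistics, the executor never needs per-row overflow checks on the upgraded path.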
DuckDB’s optimizer performs join order optimization using dynamic programming [7] with a greedy fallback for complex join graphs [11]. It performs flattening of arbitrary subqueries as described in Neumann et al. [9]. In addition,
there are a set of rewrite rules that simplify the expression
tree, by performing e.g. common subexpression elimination