is often used in mission-critical applications such as ight
software [36].
• Fast.
SQLite can support tens of thousands of transactions
per second. In some cases, SQLite reads and writes blob
data 35% faster and uses 20% less storage space than the
lesystem [
16
]. SQLite’s query planner produces ecient
plans for complex analytical queries [28].
While distinct in many ways, SQLite shares several characteris-
tics with traditional database systems designed for online transac-
tion processing (OLTP). SQLite uses a row-oriented storage format,
where all the columns of a given record are stored in a contiguous
memory region [
23
]. SQLite’s operators act on individual rows,
rather than batches of rows as in vectorized query execution [
32
].
Finally, SQLite provides full ACID guarantees: transactions are
atomic, consistent, isolated, and durable [34].
However, SQLite is used in scenarios that are well outside the
boundaries of conventional OLTP. Well-known uses of SQLite in-
clude processing data on embedded devices and the internet of
things, managing application state as an application le format,
serving website requests, analyzing large datasets, caching enter-
prise data, and transferring data from one system to another [
19
].
A previous study, which traced SQLite activity on mobile phones,
found that SQLite was used for a diverse range of tasks [
39
]. During
the study, the majority of operations performed by SQLite were
single-table scans and key-value lookups. However, the trace in-
cluded much more complex analytical queries that involved joins
between several tables. In addition, a small but signicant portion
of the workloads consisted of OLTP operations.
Recently, the explosive growth of edge computing and interac-
tive data analysis has created a need for ecient in-process online
analytical processing (OLAP). Several database systems have al-
ready been created or adapted for OLAP, including MonetDB [
37
],
Oracle OLAP [
10
], and SAP HANA [
11
]. However, these eorts have
largely focused on client-server architectures. Despite this focus,
ecient OLAP is often desired in situations where a client-server
database system is unwieldy or even impossible to use.
In-process OLAP is an important component of edge computing.
In the internet of things, data analysis is increasingly being pushed
to the edge in order to reduce network trac and server load [
48
]. In
addition, there are often privacy concerns associated with sending
sensitive data across a network. While OLAP database systems
are well-suited for the task of data analysis, edge devices may not
possess the energy or computational resources necessary to host a
client-server database system.
As another example, data scientists frequently perform in-process
OLAP to interactively explore a dataset prior to building a model.
Signicant portions of data science workows involve relational
algebraic operations, such as selection, projection, join, and aggrega-
tion. These operations can be concisely scripted using a dataframe
library such as pandas [
52
]. However, despite their ease of use,
dataframe libraries provide limited query optimization and often
materialize large intermediate results. Furthermore, datasets often
exceed the capacity of memory, which may force the data scientist
to implement hand-rolled buer management and use inecient
storage representations, such as CSV or JSON. An embeddable data-
base engine is better equipped to handle these workloads. For this
reason, SQLite is already a popular tool in data science. Due to its
stability, portability, and space-eciency, the SQLite database le
format is commonly used for sharing datasets. For example, SQLite
is one of the primary le formats used by the popular Kaggle data
science platform [
38
]. The Python
sqlite3
module [
15
] is often
used to carry out SQL operations in data science notebooks. While
SQLite produces ecient query plans and handles datasets much
larger than memory, it is less optimized for analytics compared to
OLAP-specic database systems such as those mentioned above.
These areas are prime targets for a powerful, embeddable OLAP
database engine. DuckDB [
47
] recently emerged to meet this de-
mand. Nicknamed “the SQLite for analytics”, DuckDB is built from
the ground up for in-process OLAP, employing columnar storage,
parallel and vectorized query processing, and multi-version con-
currency control optimized for extract-transform-load (ETL) opera-
tions. While still in a pre-release development phase, DuckDB has
already produced competitive performance on OLAP benchmarks
[
8
]. We believe that DuckDB lls a much needed gap in embeddable
data processing.
While SQLite and DuckDB have both been evaluated separately,
a well-rounded comparison of the two systems is missing from
the literature. As described above, SQLite shares many design el-
ements with OLTP database systems, so one might expect it to
excel on OLTP benchmarks. However, SQLite aims to be as general-
purpose as possible, so competitive performance could reasonably
be expected on a variety of workloads. In contrast, DuckDB is
purpose-built for analytics, so one might expect it to outperform
SQLite on OLAP benchmarks.
In this paper, we provide experimental support for these expec-
tations and quantify their magnitude. For OLAP, we go one level
deeper—we identify specic characteristics of SQLite responsible
for its OLAP performance and then present optimizations that sub-
stantially increase its speed.
The key contributions of this paper are:
• We present a historical perspective of SQLite.
We also
present a concise description of its architecture.
• We provide a thorough evaluation of SQLite on char-
acteristic workloads, using DuckDB as a baseline.
Our
evaluation includes benchmarks that represent diverse a-
vors of in-process data management.
• We optimize SQLite for analytical data processing.
We
identify key bottlenecks in SQLite and discuss the advantages
and disadvantages of potential solutions. We integrate our
optimizations into SQLite, resulting in overall 4.2X speedup
on SSB.
• We identify several performance measures specic to
embeddable database engines
, including library footprint
and blob processing performance.
• We provide a high-level description of potential fu-
ture directions for further performance improvement
in SQLite.
The rest of this paper is organized as follows. Section 2 provides
an architectural overview of SQLite and rationale for its design.
Section 3 discusses SQLite in the context of evolving workloads
and hardware. Section 4 evaluates SQLite and DuckDB on a variety
3536
评论