
[Figure 2: Time breakdown of an ML pipeline built on the DDoS dataset.]
data scientists need only change a single line of client code to enjoy the benefits (see Figure 1). We propose ConnectorX, a fast and memory-efficient data loading library that moves data from many DBMSs (e.g., PostgreSQL, SQLite, MySQL, SQL Server, Oracle) into client dataframes (e.g., Pandas, PyArrow, Modin, Dask, and Polars). We made our contributions while seeking to address four major questions.
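For example, switching an existing Pandas workload to ConnectorX amounts to swapping the read_sql call; in the sketch below, the connection string and query are illustrative placeholders:

    import pandas as pd
    import connectorx as cx

    conn = "postgresql://user:pass@localhost:5432/tpch"  # placeholder DSN
    query = "SELECT * FROM lineitem"

    # Before: Pandas issues the query and builds the dataframe itself.
    df = pd.read_sql(query, conn)

    # After: the single-line change; ConnectorX fetches and converts instead.
    df = cx.read_sql(conn, query)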
First, where are the actual data loading bottlenecks? We profile the Pandas.read_sql implementation in Section 3 (due to its popularity). We find that the runtime can be split into two parts: the server-side runtime includes query execution, serialization, and network transfer, while the client side includes deserialization and conversion into a dataframe. We were surprised to find that >85% of the time is spent in the client, and that the conversion materializes all intermediate transformation results in memory. These findings suggest that client-side optimizations are sufficient to dramatically reduce data loading costs.
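A crude way to observe this imbalance (this is an illustration, not the paper's profiling harness; the DSN and table are placeholders) is to time fetching separately from dataframe conversion. Note that fetchall() already includes client-side deserialization, so attributing time within the fetch requires an actual profiler:

    import time
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://user:pass@localhost:5432/tpch")  # placeholder

    with engine.connect() as db:
        t0 = time.perf_counter()
        result = db.execute(text("SELECT * FROM lineitem"))
        rows = result.fetchall()  # query execution + network transfer + deserialization
        t1 = time.perf_counter()
        df = pd.DataFrame(rows, columns=list(result.keys()))  # conversion into a dataframe
        t2 = time.perf_counter()

    print(f"fetch: {t1 - t0:.1f}s  convert: {t2 - t1:.1f}s")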
Second, how do we reduce both the runtime and the memory usage, while also making the system extensible to new DBMSs? In Section 4, we design a succinct domain-specific language (DSL) for mapping DBMS result representations into dataframes; this reduces the lines of code by 1-2 orders of magnitude compared to not using our DSL. Under the covers, ConnectorX compiles the DSL to execute over a streaming workflow system that efficiently translates bytes received from the network into objects in memory. The workflow executes in a pipelined fashion, and seamlessly combines optimization techniques including parallel execution, string allocation optimizations, and an efficient data representation.
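To give a flavor of such a mapping specification, the following is a hypothetical Python-style illustration, not ConnectorX's actual DSL syntax: each entry declares how one DBMS wire type becomes a dataframe column type, and the framework generates the per-cell parsing and writing code from these declarations.

    # Hypothetical type-mapping table (illustration only): supporting a new
    # DBMS reduces to declaring entries like these, instead of hand-writing
    # conversion code for every (source type, destination type) pair.
    POSTGRES_TO_PANDAS = {
        # source type    (destination dtype,   converter)
        "int8":          ("int64",             int),
        "float8":        ("float64",           float),
        "text":          ("object",            bytes.decode),
        "bool":          ("bool",              lambda b: b == b"t"),
    }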
Third, are widely used client-side query partitioning schemes good enough? Parallelization via query partitioning is the dominant way to reduce execution and loading time. Existing libraries [60, 66, 72], including ConnectorX, partition the query on the client, which is popular since it does not require modification to the server. Unfortunately, we find that it introduces extra user burden, load imbalances, wasted server resources, and data inconsistencies. Thus, we study server-side result partitioning in Section 5, where the DBMS directly partitions the query result and sends the partitions back in parallel without changing the protocol or access method. We prototype and demonstrate the efficacy of this approach using PostgreSQL, and advocate that DBMS vendors add this support to benefit data science applications.
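As a concrete example of the client-side scheme, ConnectorX asks the user to name a partitioning column and a partition count, then rewrites the query into range subqueries fetched over parallel connections (the connection string, table, and column below are placeholders):

    import connectorx as cx

    # ConnectorX splits the value range of l_orderkey into 4 subqueries such as
    #   SELECT * FROM lineitem WHERE l_orderkey >= lo AND l_orderkey < hi
    # and fetches them concurrently. The user must pick a suitable numeric
    # column, and skewed values can cause the load imbalance described above.
    df = cx.read_sql(
        "postgresql://user:pass@localhost:5432/tpch",  # placeholder DSN
        "SELECT * FROM lineitem",
        partition_on="l_orderkey",
        partition_num=4,
    )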
Fourth, does a new data loading library matter? Since its first release in April 2021, ConnectorX has been widely adopted by real users, with a total of 300K+ downloads and 640+ GitHub stars in a year. It has been applied to extracting data in ETL [11] and loading ML data from DBMSs [12]. It is also integrated into popular open source projects such as Polars [17], Modin [66], and DataPrep [14]. For example, Polars is the most popular dataframe library in Rust, and it uses ConnectorX as the default way to read data from various databases [13]. Our experiments in Section 6 show that ConnectorX significantly outperforms existing libraries (Pandas, Dask, Modin, Turbodbc) when loading large query results. Compared to Pandas, it reduces runtime by 13× and memory utilization by 3×.
In summary, our paper makes the following contributions:
(1) We perform an in-depth empirical analysis of the read_sql function in Pandas. We surprisingly find that the main overhead of read_sql comes from the client side instead of query execution and data transfer.
(2) We design and implement ConnectorX, which greatly reduces the overhead of read_sql without requiring modifications to existing database servers or client protocols.
(3) We propose a carefully designed architecture and interface to make ConnectorX easy to extend, and design a DSL to simplify the type mapping from databases to dataframes.
(4) We identify the issues of client-side query partitioning, propose server-side result partitioning, and implement prototype systems to address these issues.
(5) We conduct extensive experiments to evaluate ConnectorX and compare it with popular libraries. The results show that ConnectorX significantly outperforms existing solutions.
The remainder of this paper is organized as follows. We review
related work in Section 2. We perform an in-depth empirical analysis of read_sql in Section 3, and propose ConnectorX in Section 4.
Section 5 dives into the topic of query partitioning. We present
evaluation results in Section 6, and conclude our paper in Section 7.
2 RELATED WORK
Bridging the gap between DBMS and ML has become a hot topic in the database community. ConnectorX fits into the big picture by supporting efficient data loading from DBMSs to dataframes.
Accelerating Data Access From DBMS. There has been abundant work on speeding up moving data out of database systems.
(1) Server-Side Enhancement. Accessing data from database systems through tuple-level protocols is notoriously slow [57, 70]. Previous work [70] shows that existing wire protocols suffer from redundant information and expensive (de)serialization, and proposes a new protocol to tackle these issues. More approaches tend to leverage existing open data formats (e.g., Parquet [25], ORC [24], Avro [23], Arrow [6], Dremel [59], Pickle [1]) to speed up the process by avoiding tuple-level access. Li et al. [57] adopt Flight [46] to enable zero-copy data export in the Arrow IPC format. Data warehouses such as Redshift [20], BigQuery [16], and Snowflake [43] support directly unloading data into cloud stores (e.g., Amazon S3 [22], Azure Blob Storage [26], Google Cloud Storage [30]) in formats like CSV and Parquet [21, 29, 35]. Data lake and lakehouse solutions [27, 74, 77] also advocate direct accessibility of open formats.
Parallelism is another effective way to speed up data transfer, supported by many tools [2, 10, 28] that move data between two file systems (e.g., HDFS, S3) or between a file system and a DBMS. Databricks points out the single-SQL-endpoint bottleneck and proposes to tackle it with a cloud fetch architecture [39], in which a query's result is exported to cloud storage in multiple partitions, enabling parallel downloading from the client. Similar support is also provided by other cloud-native data warehouses [21, 29, 35].
However, all these server-side attempts require users to modify the source code of a database server or switch to a new one, which is often not feasible in practice.