connector-x | Load data from a database into a DataFrame at high speed

大邓和他的Python (Da Deng and His Python), 2021-09-15

ConnectorX lets you load data from a database into Python in the fastest and most memory-efficient way.

All you need is one line of code:

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Alternatively, you can speed up loading with parallelism by specifying a **partition column**.

import connectorx as cx

cx.read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem", partition_on="l_orderkey", partition_num=10)

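The function partitions the query by splitting the **specified column** evenly into the given number of partitions. ConnectorX assigns one thread to each partition to load and write data in parallel. Currently, partitioning is supported on integer columns for SPJA (select-project-join-aggregate) queries.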
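To make "splitting a column evenly" concrete, here is a minimal sketch of how an integer key range could be cut into near-equal chunks. The helper name `split_range` is hypothetical and the arithmetic is only an illustration; it is not taken from ConnectorX's source, which may compute its partition bounds differently.

```python
from typing import List, Tuple


def split_range(lo: int, hi: int, num: int) -> List[Tuple[int, int]]:
    """Split the inclusive integer range [lo, hi] into `num` half-open chunks [start, end).

    Hypothetical helper for illustration only, not part of ConnectorX.
    """
    if num < 1:
        raise ValueError("num must be at least 1")
    if hi < lo:
        raise ValueError("hi must not be smaller than lo")
    total = hi - lo + 1
    # Every chunk gets `step` values; the first `extra` chunks get one more.
    step, extra = divmod(total, num)
    bounds = []
    start = lo
    for i in range(num):
        end = start + step + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


# Keys 1..10 split into 3 partitions.
bounds = split_range(1, 10, 3)
print(bounds)
```

Each (start, end) pair then corresponds to one partition, and therefore to one loading thread.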

Installation

Run on the command line:

pip install connectorx

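Performance

We compared the read_sql functions of modin, pandas, dask, and ConnectorX. The test data was 8.6 GB, read in parallel on a 4-core machine. Read speed and memory usage are shown in the figure below.

On both counts ConnectorX is far ahead of the other libraries in this test, with just under one third of the memory usage and 21 times the read speed.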

Supported databases

[x] Postgres
[x] MySQL
[x] SQLite
[x] Redshift (via the Postgres protocol)
[x] ClickHouse (via the MySQL protocol)
[x] SQL Server
[ ] Oracle
[ ] ...

API parameters

connectorx.read_sql(conn: str, 
                    query: Union[List[str], str], 
                    *, 
                    return_type: str = "pandas", 
                    protocol: str = "binary", 
                    partition_on: Optional[str] = None, 
                    partition_num: Optional[int] = None)

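conn: str: the connection URI. Supported format: (postgres|postgresql|mysql|mssql|sqlite)://username:password@addr:port/dbname.
query: Union[str, List[str]]: the SQL used to fetch the data; either a single SQL query or a list of SQL queries.
return_type: str = "pandas": the type of object that connectorx.read_sql returns. The default is pandas; supported values are pandas, arrow, modin, dask, and polars.
protocol: str = "binary": the protocol used to transfer data from the database; the default is binary.
partition_on: Optional[str]: optional; the column to partition the data on.
partition_num: Optional[int]: optional; the number of partitions. ConnectorX assigns one thread to each partition, so this is also the number of threads.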
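Because conn is a URI, a username or password containing characters such as @, : or / can break the string unless it is percent-encoded. The sketch below assembles a connection string with the standard library. The helper name `build_conn_uri` and the sample credentials are made up for illustration, and it assumes the target driver follows ordinary URI decoding rules.

```python
from urllib.parse import quote


def build_conn_uri(scheme: str, username: str, password: str,
                   host: str, port: int, dbname: str) -> str:
    """Assemble scheme://username:password@addr:port/dbname.

    Hypothetical helper, not part of ConnectorX. Credentials are
    percent-encoded so reserved characters cannot be mistaken for
    URI delimiters.
    """
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}/{dbname}"


# Made-up credentials: the password contains '@', ':' and '/'.
uri = build_conn_uri("postgresql", "alice", "p@ss:w/rd", "localhost", 5432, "tpch")
print(uri)
```

The resulting string can be passed as the first argument of cx.read_sql.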

Usage examples

Run a SQL query with a single thread and get the result back as a DataFrame:

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

cx.read_sql(postgres_url, query)

Partition automatically on the l_orderkey column, read with 10 threads, and get the result back as a DataFrame:

import connectorx as cx

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

cx.read_sql(postgres_url, 
            query, 
            partition_on="l_orderkey", 
            partition_num=10)
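Since query also accepts a list of SQL queries, the partitions can be written by hand instead of derived from partition_on. The following is a rough sketch under stated assumptions: the helper names `queries_for_bounds` and `load_partitions` are hypothetical, the key bounds are made-up values, and it assumes every query in the list returns the same columns. connectorx is imported lazily so the query-building part runs without a database.

```python
from typing import List, Tuple


def queries_for_bounds(table: str, column: str,
                       bounds: List[Tuple[int, int]]) -> List[str]:
    """Build one query per half-open key range [start, end).

    Hypothetical helper, not part of ConnectorX.
    """
    return [
        f"SELECT * FROM {table} WHERE {column} >= {start} AND {column} < {end}"
        for start, end in bounds
    ]


def load_partitions(conn: str, queries: List[str]):
    """Pass the list of queries to cx.read_sql (requires a reachable database)."""
    import connectorx as cx  # imported here so the rest of the sketch has no dependency
    return cx.read_sql(conn, queries)


# Made-up bounds that cut l_orderkey into two ranges.
queries = queries_for_bounds(
    "lineitem", "l_orderkey", [(1, 3_000_001), (3_000_001, 6_000_001)]
)
for q in queries:
    print(q)

# With a real database available:
# df = load_partitions("postgresql://username:password@server:port/database", queries)
```

Writing the queries by hand is useful when the split should follow something other than an even cut of an integer column.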

For more details, see https://github.com/sfu-db/connector-x