暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PolarDB 开源版通过 duckdb_fdw 支持 parquet 列存数据文件以及高效OLAP

alitrack 2023-12-29
3111

标签

PostgreSQL , PolarDB , parquet , duckdb_fdw


背景

PolarDB 的云原生存算分离架构, 具备低廉的数据存储、高效扩展弹性、高速多机并行计算能力、高速数据搜索和处理; PolarDB与计算算法结合, 将实现双剑合璧, 推动业务数据的价值产出, 将数据变成生产力.

本文将介绍PolarDB 开源版通过duckdb_fdw 支持 parquet 列存数据文件以及高效OLAP.

测试环境为macos+docker, polardb部署请参考:

  • 《如何用 PolarDB 证明巴菲特的投资理念 - 包括PolarDB简单部署》

duckdb_fdw for PolarDB

参考

《用duckdb_fdw加速PostgreSQL分析计算, 提速40倍, 真香.》

1、部署

需要一个 高版本 cmake .

https://cmake.org/download  

cd ~
wget https://github.com/Kitware/CMake/releases/download/v3.25.1/cmake-3.25.1.tar.gz
tar -zxvf cmake-3.25.1.tar.gz
cd cmake-3.25.1
./configure --prefix=/usr/local
make -j 4
sudo make install


vi ~/.bash_profile
# add
export PATH=/usr/local/bin:$PATH

. ~/.bash_profile
[postgres@67e1eed1b4b6 duckdb]$ which cmake
/usr/local/bin/cmake

编译duckdb依赖

cd ~  
git clone -b v0.6.1 --depth 1 https://github.com/duckdb/duckdb
cd duckdb
make -j 4

将duckdb动态库拷贝到PolarDB软件目录中

cd ~/duckdb  

cp build/release/tools/sqlite3_api_wrapper/libsqlite3_api_wrapper.so ~/tmp_basedir_polardb_pg_1100_bld/lib/
cp tools/sqlite3_api_wrapper/include/sqlite3.h ~/tmp_basedir_polardb_pg_1100_bld/include/server/
cp src/include/duckdb.hpp ~/tmp_basedir_polardb_pg_1100_bld/include/server/
cp -r src/include/duckdb ~/tmp_basedir_polardb_pg_1100_bld/include/server/
cp build/release/src/libduckdb.so ~/tmp_basedir_polardb_pg_1100_bld/lib/

安装duckdb_fdw for PolarDB

cd ~  
git clone --depth 1 https://github.com/alitrack/duckdb_fdw

cd duckdb_fdw
USE_PGXS=1 make
USE_PGXS=1 sudo make install

LOG如下:

[postgres@aa25c5be9681 duckdb_fdw]$ USE_PGXS=1 make  
g++ -Wall -Wpointer-arith -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -c -o sqlite3_api_wrapper.o sqlite3_api_wrapper.cpp
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -shared -o duckdb_fdw.so connection.o option.o deparse.o sqlite_query.o duckdb_fdw.o sqlite3_api_wrapper.o -L/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib -Wl,-rpath,'$ORIGIN/../lib' -L/opt/rh/llvm-toolset-7.0/root/usr/lib64 -Wl,--as-needed -Wl,-rpath,'/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib',--enable-new-dtags -lduckdb -lstdc++
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o connection.bc connection.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o option.bc option.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o deparse.bc deparse.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o sqlite_query.bc sqlite_query.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o duckdb_fdw.bc duckdb_fdw.c
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -xc++ -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2 -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal -D_GNU_SOURCE -I/usr/include/libxml2 -flto=thin -emit-llvm -c -o sqlite3_api_wrapper.bc sqlite3_api_wrapper.cpp


[postgres@aa25c5be9681 duckdb_fdw]$ USE_PGXS=1 sudo make install
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib'
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension'
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension'
/usr/bin/install -c -m 755 duckdb_fdw.so '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/duckdb_fdw.so'
/usr/bin/install -c -m 644 .//duckdb_fdw.control '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension/'
/usr/bin/install -c -m 644 .//duckdb_fdw--1.0.sql .//duckdb_fdw--1.0--1.1.sql '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension/'
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode/duckdb_fdw'
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/
/usr/bin/install -c -m 644 connection.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 option.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 deparse.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 sqlite_query.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 duckdb_fdw.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
/usr/bin/install -c -m 644 sqlite3_api_wrapper.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./
cd '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode' && opt/rh/llvm-toolset-7.0/root/usr/bin/llvm-lto -thinlto -thinlto-action=thinlink -o duckdb_fdw.index.bc duckdb_fdw/connection.bc duckdb_fdw/option.bc duckdb_fdw/deparse.bc duckdb_fdw/sqlite_query.bc duckdb_fdw/duckdb_fdw.bc duckdb_fdw/sqlite3_api_wrapper.bc

2、加载duckdb_fdw插件

create extension duckdb_fdw;

3、测试使用

cd ~/duckdb/build/release  
./duckdb home/postgres/db

COPY (select generate_series as id, md5(random()::text) as info, now()::timestamp+(generate_series||' second')::interval as crt_time from generate_series(1,100)) TO '/home/postgres/t1.parquet' (FORMAT 'PARQUET');

COPY (select generate_series as cid, md5(random()::text) as info, now()::timestamp+(generate_series||' second')::interval as crt_time from generate_series(1,100)) TO '/home/postgres/t2.parquet' (FORMAT 'PARQUET');

COPY (select (floor(random()*100)+1)::int as gid, (floor(random()*100)+1)::int as cid, (random()*10)::int as c1, (random()*100)::int as c2, (random()*1000)::int as c3, (random()*10000)::int as c4, (random()*100000)::int as c5 from generate_series(1,1000000)) TO '/home/postgres/t3.parquet' (FORMAT 'PARQUET');


create view t1 as select * from read_parquet('/home/postgres/t1.parquet');
create view t2 as select * from read_parquet('/home/postgres/t2.parquet');
create view t3 as select * from read_parquet('/home/postgres/t3.parquet');

checkpoint;
.quit

4、性能怎么样

psql  

CREATE SERVER DuckDB_server FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database '/home/postgres/db');

IMPORT FOREIGN SCHEMA public FROM SERVER DuckDB_server INTO public;

postgres=# \det
List of foreign tables
Schema | Table | Server
--------+-------+---------------
public | t1 | duckdb_server
public | t2 | duckdb_server
public | t3 | duckdb_server
(3 rows)

postgres=# explain verbose select count(distinct gid) from t3;
QUERY PLAN
-------------------------------------------------------------
Foreign Scan (cost=1.00..1.00 rows=1 width=8)
Output: (count(DISTINCT gid))
SQLite query: SELECT count(DISTINCT "gid") FROM main."t3"
(3 rows)

postgres=# \timing
Timing is on.

select count(distinct gid) from t3;
count
-------
100
(1 row)
Time: 13.556 ms

select count(distinct gid),count(*) from t3;
count | count
-------+---------
100 | 1000000
(1 row)
Time: 21.836 ms

explain verbose select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Foreign Scan (cost=1.00..1.00 rows=1 width=16)
Output: (count(DISTINCT t3.gid)), (count(*))
SQLite query: SELECT count(DISTINCT r1."gid"), count(*) FROM (main."t3" r1 INNER JOIN main."t1" r2 ON (((r1."gid" = r2."id"))))
(3 rows)

select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.id;
count | count
-------+---------
100 | 1000000
(1 row)
Time: 15.536 ms

select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.id join t2 on t3.cid=t2.cid;
count | count
-------+---------
100 | 1000000
(1 row)
Time: 18.570 ms

explain verbose select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.id join t2 on t3.cid=t2.cid;
Foreign Scan (cost=1.00..1.00 rows=1 width=16)
Output: (count(DISTINCT t3.gid)), (count(*))
SQLite query: SELECT count(DISTINCT r1."gid"), count(*) FROM ((main."t3" r1 INNER JOIN main."t1" r2 ON (((r1."gid" = r2."id")))) INNER JOIN main."t2" r4 ON (((r1."cid" = r4."cid"))))
(3 rows)

5、将数据导入PolarDB本地, 做同样的查询看一下时间.

postgresql 本地计算

postgres=# create unlogged table lt1 as select * from t1;  
SELECT 100
postgres=# create unlogged table lt2 as select * from t2;
SELECT 100
postgres=# create unlogged table lt3 as select * from t3;
SELECT 10000000

postgres=# \timing
Timing is on.
postgres=# select count(distinct gid) from lt3;
count
-------
100
(1 row)
Time: 1142.800 ms (00:01.143)

duckdb_fdw+Parquet 比本地行存储表快了近100倍.

参考

https://github.com/alitrack/duckdb_fdw

202211/20221128_01.md 《DuckDB 轻量级压缩解读》
202211/20221124_08.md 《DuckDB 0.6.0 CLI 支持新的结果集显示模式 duckbox, 展示头部和尾部的少量记录, 避免结果集占满屏幕》
202211/20221124_07.md 《DuckDB 0.6.0 CLI 支持 tab键自动补齐》
202211/20221124_06.md 《DuckDB 0.6.0 的SQL执行进度是个败笔? 速度太快了, 执行进度反而成了干扰.》
202211/20221124_04.md 《DuckDB 0.6.0 支持并行 COUNT(DISTINCT)》
202211/20221124_03.md 《DuckDB 0.6.0 支持并行创建索引, 提升create index性能》
202211/20221124_02.md 《DuckDB 0.6.0 支持 csv 并行读取功能, 提升查询性能》
202211/20221124_01.md 《DuckDB 0.6.0 压缩算法增加: FSST, Chimp, Patas 提升字符串和浮点数压缩能力》
202211/20221123_03.md 《DuckDB 0.6.0 数据批量写入性能优化》
202211/20221123_02.md 《为什么看好 DuckDB 的发展前景?》
202211/20221122_04.md 《DuckDB 0.6.0 内存管理增强, 提升超出内存大小的大数据集hashjoin,sort,window function性能》
202211/20221122_03.md 《DuckDB 0.6.0 语法糖: Add Python-style list-comprehension syntax support to SQL》
202211/20221122_02.md 《DuckDB 0.6.0 语法糖: 支持columns 语法 - 兼容clickhouse columns 语法》
202211/20221122_01.md 《DuckDB 0.6.0 新增UNION数据类型支持 - 单列支持多类型》
202211/20221116_02.md 《DuckDB 0.6.0 发布》
202211/20221111_01.md 《DuckDB extension(插件) - 扩展DuckDB功能》
202211/20221101_01.md 《DuckDB 数值类型性能 : hugeint (16字节整型), decimal(内部整型存储)》
202210/20221027_04.md 《DuckDB select 语法糖: sample 采样查询》
202210/20221027_03.md 《DuckDB select 语法糖: exclude, replace columns》
202210/20221027_02.md 《DuckDB 备份与恢复数据库》
202210/20221026_05.md 《DuckDB 对比 PostgreSQL 48张表JOIN 的优化器》
202210/20221026_04.md 《DuckDB 对比 PostgreSQL join 优化器推理优化又一例》
202210/20221026_03.md 《DuckDB COPY 数据导入导出 - 支持csv, parquet格式, 支持CODEC设置压缩》
202210/20221026_02.md 《DuckDB 虚拟列 - GENERATED ALWAYS AS () VIRTUAL
202210/20221026_01.md 《DuckDB DataLake 场景使用举例 - aliyun OSS对象存储parquet》
202210/20221025_05.md 《DuckDB lambda 函数使用 - list/array数据处理 - 元素的转换与过滤 (list_filter, list_transform)》
202210/20221025_04.md 《DuckDB 列位置别名 #n
 的使用》
202210/20221025_03.md 《DuckDB 宏(MACRO) 的使用》
202210/20221025_02.md 《DuckDB update from,delete using - 采用JOIN 批量更新和批量删除》
202210/20221025_01.md 《DuckDB rowid 的使用》
202210/20221024_07.md 《DuckDB 对比 PostgreSQL left outer join 优化器推理优化一例》
202210/20221024_06.md 《DuckDB insert query 数据导入场景 优化和使用注意》
202210/20221024_04.md 《DuckDB 鸟瞰数据的方法: SUMMARIZE. 数据柱状图、统计信息》
202210/20221024_03.md 《DuckDB explain analye 的结果解释 - Profile Queries》
202210/20221024_02.md 《查看 DuckDB 优化器 物理和逻辑执行计划 - explain_output》
202210/20221001_02.md 《duckdb postgres_scan 插件 - 不落地数据, 加速PostgreSQL数据分析》
202209/20220924_01.md 《用duckdb_fdw加速PostgreSQL分析计算, 提速40倍, 真香.》
202209/20220913_02.md 《德说-第140期, duckdb+容器+parquet+对象存储, 实现SaaS场景, 低代码拖拉拽多维度实时分析 降本提效》
202209/20220909_02.md 《德说-第135期, duckdb的产品形态如何盈利? 未来数据库产品的商业形态》
202209/20220905_01.md 《DuckDB parquet 分区表 / Delta Lake(数据湖) 应用》
202209/20220902_01.md 《DuckDB 线性回归预测股价的例子》
202209/20220901_06.md 《DuckDB 数据文件水位问题观察》
202209/20220901_05.md 《DuckDB 采用外部 parquet 格式存储 - tpch 测试 - in_memory VS in_parquet》
202209/20220901_04.md 《DuckDB 完整的PRAGMA, setting, 系统表, 系统视图, 内置函数, 内置类型 在哪里?》
202209/20220901_03.md 《DuckDB 数据库的数据能不能超出内存限制? 以及推荐的使用方法 - parquet》
202209/20220901_02.md 《编译安装 DuckDB 最新版本 in MacOS》
202209/20220901_01.md 《DuckDB 读写 Parquet 文件 - 同时支持远程s3, oss, http等parquet文件读写》
202208/20220831_02.md 《DuckDB 聚合函数用法举例》
202208/20220831_01.md 《DuckDB的字符串 collate用法 - 大小写、口音(西方各国字符集)、地域属性、排序 - (icu, noCASE, noACCENT, noNFC)》
202208/20220829_04.md 《DuckDB 快速生成海量数据的方法》
202208/20220829_03.md 《DuckDB:list,struct,map 类型很强大(支持lambda计算) - PostgreSQL:数组、row|record、json字典》
202208/20220829_02.md 《DuckDB 字符串相似性计算函数》
202208/20220829_01.md 《DuckDB vs PostgreSQL TPC-H 测试》
202208/20220828_01.md 《DuckDB TPC-H, TPC-DS 测试》
202208/20220827_01.md 《DuckDB Window 窗口函数语法糖 - QUALIFY - window filter》
202208/20220826_02.md 《DuckDB 定位OLAP方向的SQLite, 适合嵌入式数据分析 - tpch测试与简单试用》
202010/20201022_01.md 《PostgreSQL 牛逼的分析型功能 - 列存储、向量计算 FDW - DuckDB_fdw - 无数据库服务式本地lib库+本地存储》

欢迎关注我的github (https://github.com/digoal/blog) , 学习数据库不迷路.  

近期正在写公开课材料, 未来将通过视频号推出. 


文章转载自alitrack,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论