Whoami:5年+金融、政府、医疗领域工作经验的DBACertificate:PGCM、OCP、YCPSkill:Oracle、Mysql、PostgreSQL、国产数据库Platform:CSDN、墨天轮、公众号(呆呆的私房菜)
PostgreSQL在OLTP领域表现出色,但在大规模数据分析方面可能稍显不足;
而DuckDB作为一款高效的列式数据库,以其轻量级和快速查询性能在数据分析领域受到关注,特别适合数据湖和湖屋架构;
那有没有一款产品把duckdb的数据分析能力集成到PostgreSQL数据啊!!!答案是:有!数仓公司 MotherDuck 与 Hydra 、DuckDB Labs 合作推出了PostgreSQL扩展——pg_duckdb,允许将duckdb的分析引擎集成到PostgreSQL中;

如此一来,不仅能在不改变PostgreSQL基础设施的情况下,与传统事务工作负载一起运行快速分析查询,而且能为某些分析查询提供指数级别的性能提升!
学习地址:https://github.com/duckdb/pg_duckdb
说了那么多,下面来个实际案例体验一下吧!
1. 安装数据库和pg_duckdb
# 演示环境操作系统:ubuntu 24.04# 服务器配置:4C 8Groot@chen:~# uname -aLinux chen 6.8.0-49-generic #49-Ubuntu SMP PREEMPT_DYNAMIC Mon Nov 4 02:06:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linuxroot@chen:~# cat proc/cpuinfo | grep process | wc -l4root@chen:~# cat proc/meminfo | grep MemMemTotal: 4009388 kBMemFree: 3386428 kBMemAvailable: 3497688 kB1. 配置在线源sudo vi etc/apt/sources.list# 默认注释了源码镜像以提高 apt update 速度,如有需要可自行取消注释deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ noble main restricted universe multiversedeb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ noble-updates main restricted universe multiversedeb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ noble-backports main restricted universe multiverseapt-get update2. 安装postgresql 15apt install -y postgresql-common gnupg2/usr/share/postgresql-common/pgdg/apt.postgresql.org.shapt install postgresql-153. 安装pgxmancurl -sfL https://install.pgx.sh | sh -4. 安装pg_duckdb扩展pgxman install pg_duckdb5. 启动数据库,并加载pg_duckdb插件systemctl start postgresqlpsql -c "alter system set shared_preload_libraries ='pg_duckdb';"systemctl restart postgresqlpsql -c "create extension pg_duckdb;"
2. 测试数据准备
ps:tpc-ds是tpc组织定义的一个决策支持基准测试;tpc-ds关注数仓系统和决策支持系统的性能测试,模拟了一个大型的星型市场篮子分析场景,包括数据生成、数据加载和一系列复杂的查询操作。
1. 安装duckdbwget https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zipunzip duckdb_cli-linux-amd64.zip./duckdb2. 安装tpcds扩展install tpcds;load tpcds;-- 生成规模因子为1的测试数据,即1GB左右数据call dsdgen(sf=1);-- 导出数据为csv格式,并且用“|”分隔export database 'public' (format csv, delimiter '|');3. 导入元数据到postgresql数据库中psql -f public/schema.sql4. 调整load.sql格式,并导入实际数据到表中sed -i 's/COPY/\\copy/' public/load.sqlpsql -f public/load.sql
3. 执行tpc-ds用例1
测试用例1
https://github.com/duckdb/duckdb/blob/main/extension/tpcds/dsdgen/queries/01.sql
1. 在postgresql内核中运行postgres=# set duckdb.enable_external_access = false;postgres=# \timing onTiming is on.postgres=#postgres=# WITH customer_total_return AS(SELECT sr_customer_sk AS ctr_customer_sk,sr_store_sk AS ctr_store_sk,sum(sr_return_amt) AS ctr_total_returnFROM store_returns,date_dimWHERE sr_returned_date_sk = d_date_skAND d_year = 2000GROUP BY sr_customer_sk,sr_store_sk)SELECT c_customer_idFROM customer_total_return ctr1,store,customerWHERE ctr1.ctr_total_return >(SELECT avg(ctr_total_return)*1.2FROM customer_total_return ctr2WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)AND s_store_sk = ctr1.ctr_store_skAND s_state = 'TN'AND ctr1.ctr_customer_sk = c_customer_skORDER BY c_customer_idLIMIT 100;Time: 134320.570 ms (02:14.321)2. 在pg_duckdb内核中执行postgres=# SET duckdb.force_execution = true;SETTime: 0.155 mspostgres=# WITH customer_total_return AS(SELECT sr_customer_sk AS ctr_customer_sk,sr_store_sk AS ctr_store_sk,sum(sr_return_amt) AS ctr_total_returnFROM store_returns,date_dimWHERE sr_returned_date_sk = d_date_skAND d_year = 2000GROUP BY sr_customer_sk,sr_store_sk)SELECT c_customer_idFROM customer_total_return ctr1,store,customerWHERE ctr1.ctr_total_return >(SELECT avg(ctr_total_return)*1.2FROM customer_total_return ctr2WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)AND s_store_sk = ctr1.ctr_store_skAND s_state = 'TN'AND ctr1.ctr_customer_sk = c_customer_skORDER BY c_customer_idLIMIT 100;Time: 121.312 ms
结论:肉眼可见的这个SQL查询效率,性能上足足提升1107倍!
1. pg_duckdb 扩展的面世,将会推动着Postgresql数据库掀起在olap领域中的狂潮!届时,同时具备超强的tp和ap能力的数据库,舍Postgresql还有谁?
2. pg_duckdb 目前还处于测试阶段,功能还比较有限,期待未来能够带给我们更多的惊喜!
本文内容就到这啦,阅读完本篇,相信你对PostgreSQL的olap能力又多了一分期待了吧!我们下篇再见!

文章转载自呆呆的私房菜,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




