暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

又起飞了!让PostgreSQL性能足足提升1000多倍的扩展!

呆呆的私房菜 2024-11-28
1373
    Whoami:5年+金融、政府、医疗领域工作经验的DBA
    Certificate:PGCM、OCP、YCP
    Skill:Oracle、Mysql、PostgreSQL、国产数据库
    Platform:CSDN、墨天轮、公众号(呆呆的私房菜)


    阅读本文可以了解提高PostgreSQL数据库数据分析能力的扩展pg_duckdb!


    01

    概述
    • PostgreSQL在OLTP领域表现出色,但在大规模数据分析方面可能稍显不足;

    • DuckDB作为一款高效的列式数据库,以其轻量级和快速查询性能在数据分析领域受到关注,特别适合数据湖和湖屋架构;


    • 那有没有一款产品把duckdb的数据分析能力集成到PostgreSQL数据啊!!!答案是:有!数仓公司 MotherDuck 与 Hydra 、DuckDB Labs 合作推出了PostgreSQL扩展——pg_duckdb,允许将duckdb的分析引擎集成到PostgreSQL中;


    • 如此一来,不仅能在不改变PostgreSQL基础设施的情况下,与传统事务工作负载一起运行快速分析查询,而且能为某些分析查询提供指数级别的性能提升

    • 学习地址:https://github.com/duckdb/pg_duckdb


    • 说了那么多,下面来个实际案例体验一下吧!

    02

    案例
    • 1. 安装数据库和pg_duckdb

      # 演示环境操作系统:ubuntu 24.04
      # 服务器配置:4C 8G
      root@chen:~# uname -a
      Linux chen 6.8.0-49-generic #49-Ubuntu SMP PREEMPT_DYNAMIC Mon Nov 4 02:06:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
      root@chen:~# cat proc/cpuinfo | grep process | wc -l
      4
      root@chen:~# cat proc/meminfo | grep Mem
      MemTotal: 4009388 kB
      MemFree: 3386428 kB
      MemAvailable: 3497688 kB


      1. 配置在线源
      sudo vi etc/apt/sources.list
      # 默认注释了源码镜像以提高 apt update 速度,如有需要可自行取消注释
      deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ noble main restricted universe multiverse
      deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ noble-updates main restricted universe multiverse
      deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ noble-backports main restricted universe multiverse
      apt-get update


      2. 安装postgresql 15
      apt install -y postgresql-common gnupg2
      /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh
      apt install postgresql-15


      3. 安装pgxman
      curl -sfL https://install.pgx.sh | sh -


      4. 安装pg_duckdb扩展
      pgxman install pg_duckdb


      5. 启动数据库,并加载pg_duckdb插件
      systemctl start postgresql
      psql -c "alter system set shared_preload_libraries ='pg_duckdb';"
      systemctl restart postgresql
      psql -c "create extension pg_duckdb;"
      • 2. 测试数据准备

      ps:tpc-ds是tpc组织定义的一个决策支持基准测试;tpc-ds关注数仓系统和决策支持系统的性能测试,模拟了一个大型的星型市场篮子分析场景,包括数据生成、数据加载和一系列复杂的查询操作。

        1. 安装duckdb
        wget https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip
        unzip duckdb_cli-linux-amd64.zip
        ./duckdb


        2. 安装tpcds扩展
        install tpcds;
        load tpcds;
        -- 生成规模因子为1的测试数据,即1GB左右数据
        call dsdgen(sf=1);
        -- 导出数据为csv格式,并且用“|”分隔
        export database 'public' (format csv, delimiter '|');


        3. 导入元数据到postgresql数据库中
        psql -f public/schema.sql


        4. 调整load.sql格式,并导入实际数据到表中
        sed -i 's/COPY/\\copy/' public/load.sql
        psql -f public/load.sql
        • 3. 执行tpc-ds用例1

        测试用例1

        https://github.com/duckdb/duckdb/blob/main/extension/tpcds/dsdgen/queries/01.sql
          1. 在postgresql内核中运行
          postgres=# set duckdb.enable_external_access = false;
          postgres=# \timing on
          Timing is on.
          postgres=#
          postgres=# WITH customer_total_return AS
          (SELECT sr_customer_sk AS ctr_customer_sk,
          sr_store_sk AS ctr_store_sk,
          sum(sr_return_amt) AS ctr_total_return
          FROM store_returns,
          date_dim
          WHERE sr_returned_date_sk = d_date_sk
          AND d_year = 2000
          GROUP BY sr_customer_sk,
          sr_store_sk)
          SELECT c_customer_id
          FROM customer_total_return ctr1,
          store,
          customer
          WHERE ctr1.ctr_total_return >
          (SELECT avg(ctr_total_return)*1.2
          FROM customer_total_return ctr2
          WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
          AND s_store_sk = ctr1.ctr_store_sk
          AND s_state = 'TN'
          AND ctr1.ctr_customer_sk = c_customer_sk
          ORDER BY c_customer_id
          LIMIT 100;
          Time: 134320.570 ms (02:14.321)


          2. 在pg_duckdb内核中执行
          postgres=# SET duckdb.force_execution = true;
          SET
          Time: 0.155 ms
          postgres=# WITH customer_total_return AS
          (SELECT sr_customer_sk AS ctr_customer_sk,
          sr_store_sk AS ctr_store_sk,
          sum(sr_return_amt) AS ctr_total_return
          FROM store_returns,
          date_dim
          WHERE sr_returned_date_sk = d_date_sk
          AND d_year = 2000
          GROUP BY sr_customer_sk,
          sr_store_sk)
          SELECT c_customer_id
          FROM customer_total_return ctr1,
          store,
          customer
          WHERE ctr1.ctr_total_return >
          (SELECT avg(ctr_total_return)*1.2
          FROM customer_total_return ctr2
          WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
          AND s_store_sk = ctr1.ctr_store_sk
          AND s_state = 'TN'
          AND ctr1.ctr_customer_sk = c_customer_sk
          ORDER BY c_customer_id
          LIMIT 100;
          Time: 121.312 ms

          结论:肉眼可见的这个SQL查询效率,性能上足足提升1107倍!


          03

          小结
          • 1. pg_duckdb 扩展的面世,将会推动着Postgresql数据库掀起在olap领域中的狂潮!届时,同时具备超强的tp和ap能力的数据库,舍Postgresql还有谁?

          • 2. pg_duckdb 目前还处于测试阶段,功能还比较有限,期待未来能够带给我们更多的惊喜!



          本文内容就到这啦,阅读完本篇,相信你对PostgreSQL的olap能力又多了一分期待了吧!我们下篇再见!

          点击上方公众号,关注我吧!

          文章转载自呆呆的私房菜,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

          评论