PostgreSQL数据库扩展hll

原创 chirpyli 2025-08-11

197

hll是一个PostgreSQL扩展，它新增了对hll数据类型的支持，即HyperLogLog数据结构。HyperLogLog是一种用于估算集合中不同元素数量（基数）的算法，它以极低的内存占用为代价，提供了对基数的近似计算。例如，仅用 1280 字节的 hll 结构，就能以几个百分点的误差估算数百亿不同值的数量统计。这对处理大数据集时非常有用，尤其是在需要快速获取近似结果的情况下。

使用方法

数据类型

hll

hll数据类型主要用于存储和操作HyperLogLog数据结构。可以直接在表中定义字段为hll类型来存储HyperLogLog数据。

CREATE TABLE user_visits (
    day DATE,
    visitors HLL
);

hll_hashval
表示一个经过哈希处理的数据值。其底层由一个64位整数（int8in）支持。通常仅由 hll_hash_*函数输出。bigint和integer都可以直接转换为该类型，如果你希望跳过使用典型的123::hll_hashval对这些值进行哈希处理。请注意，被转换的integer 也会被转换为64位整数。

基本操作函数

hll_empty
语法：hll_empty([log2m[, regwidth[, expthresh[, sparseon]]]])
返回一个空的HyperLogLog数据结构。任何一个参数为空时，将使用默认值。

hll_add
语法：hll_add(hll, hll_hashval)
将hll_hashval 添加到 hll 并返回hll的新表示形式。操作符 || 可用作简写，如hll || hll_hashval或hll_hashval || hll。

hll_union
语法：hll_union(hll, hll)
返回两个hll的并集（作为hll）。操作符||可用作简写。

hll_cardinality
语法：hll_cardinality(hll)
如果hll 的类型是UNDEFINED，则返回NULL。否则返回一个double precision浮点数值。操作符# 可用作简写。

hll_eq
语法：hll_eq(hll, hll)
返回一个boolean，指示当比较两个hll的二进制表示时它们是否匹配。操作符 =可用作简写。

hll_ne
语法：hll_ne(hll, hll)
返回一个 boolean，指示当比较两个hll的二进制表示时它们是否不匹配。操作符 <> 可用作简写。

hll_union_agg
语法：hll_union_agg(hll)
对hll进行聚合的函数，将输入集中的hll取并集并返回代表其并集的hll。

hll_add_agg
语法：hll_add_agg(hll_hashval, [log2m[, regwidth[, expthresh[, sparseon]]]])
对hll_hashval进行聚合的函数，将输入集中的每个元素插入到一个hll中，该 hll的参数由四个可选参数指定。如果未指定四个可选参数中的任何一个，则使用 hll_set_defaults()设置的默认值。返回代表输入集的hll。

hash函数

插入到hll中的所有值都应该进行哈希处理，因此hll_add和hll_add_agg仅接受hll_hashval。我们不建议直接对浮点数值进行原始哈希，因为它们的位表示不适合哈希。考虑在哈希之前将它们转换为可重现的、可比较的二进制表示（例如 IEEE 754-2008 交换格式）。

以下所有hll_hash_*函数都接受一个种子值，默认为0。我们不鼓励使用负种子，以保持与Google Guava实现的128位Murmur3的哈希值兼容性。使用负哈希种子时会产生警告。

hll_hash_boolean(boolean) : 将boolean值哈希为hll_hashval。
hll_hash_smallint(smallint) : 将smallint值哈希为hll_hashval。
hll_hash_integer(integer) : 将integer值哈希为hll_hashval。
hll_hash_bigint(bigint) : 将bigint值哈希为hll_hashval。
hll_hash_bytea(bytea) : 将bytea值哈希为hll_hashval。
hll_hash_text(text) : 将text值哈希为hll_hashval。
hll_hash_any(scalar) : 通过动态解析类型并分派到该类型的正确函数来哈希任何PG数据类型。这比类型特定的哈希函数慢得多，只有在事先不知道输入类型时才应使用。

性能对比

安装hll扩展：

CREATE EXTENSION hll;

创建测试表：

CREATE TABLE user_events (
    event_id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    event_date DATE NOT NULL,
    event_type TEXT
);

插入测试数据：
接下来，我们将向 user_events 表中插入一些模拟数据。这里我们使用 PostgreSQL 的 generate_series 函数来生成大量数据。假设我们要模拟 5000 万条记录，每天有 100 万个不同的用户 ID（UV），持续 50 天。

INSERT INTO user_events (user_id, event_date, event_type)
SELECT 
    (random() * 1000000)::BIGINT + 1 AS user_id, -- 假设最多有 100 万个不同的用户
    ('2025-01-01'::DATE + (n % 50)) AS event_date, -- 模拟 50 天的数据
    'event_' || (n % 3) AS event_type -- 三种类型的事件
FROM generate_series(1, 50000000) AS n; -- 总共生成 5000 万条记录

使用 COUNT(DISTINCT) 查询
现在，我们可以比较两种方法的性能。首先是传统的 COUNT(DISTINCT) 方法：

postgres=# EXPLAIN ANALYZE
SELECT 
    event_date,
    COUNT(DISTINCT user_id) AS uv
FROM user_events
GROUP BY event_date
ORDER BY event_date;
      QUERY PLAN                                                                
-----------------------------------------------------------------------
 GroupAggregate  (cost=8762996.06..9137997.52 rows=50 width=12) (actual time=62922.591..104006.423 rows=50 loops=1)
   Group Key: event_date
   ->  Sort  (cost=8762996.06..8887996.38 rows=50000128 width=12) (actual time=62027.212..73629.218 rows=50000000 loops=1)
         Sort Key: event_date
         Sort Method: external merge  Disk: 1272064kB
         ->  Seq Scan on user_events  (cost=0.00..867649.28 rows=50000128 width=12) (actual time=0.051..17922.949 rows=50000000 loops=1)
 Planning Time: 2.036 ms
 Execution Time: 104129.509 ms
(8 rows)

Time: 104137.228 ms (01:44.137)

使用hll进行查询
首先，我们需要计算每个日期的hll值，并将其存储在一个新的表中以便快速查询。
创建预计算表

CREATE TABLE daily_hll_uv AS
SELECT 
    event_date,
    hll_add_agg(hll_hash_bigint(user_id)) AS hll_uv
FROM user_events
GROUP BY event_date;

CREATE INDEX idx_daily_hll_uv_date ON daily_hll_uv(event_date);

查询预计算结果：

postgres=# EXPLAIN ANALYZE
SELECT 
    event_date,
    hll_cardinality(hll_uv) AS uv
FROM daily_hll_uv
ORDER BY event_date;
                  QUERY PLAN                                                   
--------------------------------------------------------------------
 Sort  (cost=11.04..11.16 rows=50 width=12) (actual time=1.170..1.174 rows=50 loops=1)
   Sort Key: event_date
   Sort Method: quicksort  Memory: 27kB
   ->  Seq Scan on daily_hll_uv  (cost=0.00..9.62 rows=50 width=12) (actual time=0.034..1.114 rows=50 loops=1)
 Planning Time: 0.421 ms
 Execution Time: 1.230 ms
(6 rows)

Time: 2.651 ms

如果你想直接从原始表中计算hll结果，可以跳过创建预计算表的步骤，直接运行以下查询：

postgres=# EXPLAIN ANALYZE
SELECT 
    event_date,
    hll_cardinality(hll_add_agg(hll_hash_bigint(user_id))) AS uv
FROM user_events
GROUP BY event_date
ORDER BY event_date;
     QUERY PLAN                     
--------------------------------------------------
 Finalize GroupAggregate  (cost=2074771.54..2074772.79 rows=50 width=12) (actual time=22239.039..22243.029 rows=50 loops=1)
   Group Key: event_date
   ->  Sort  (cost=2074771.54..2074771.66 rows=50 width=36) (actual time=22238.956..22239.271 rows=100 loops=1)
         Sort Key: event_date
         Sort Method: quicksort  Memory: 427kB
         ->  Gather  (cost=1787539.50..2074770.13 rows=50 width=36) (actual time=18537.455..22238.001 rows=100 loops=1)
               Workers Planned: 1
               Workers Launched: 1
               ->  Partial HashAggregate  (cost=1786539.50..2073765.13 rows=50 width=36) (actual time=18533.479..22103.231 rows=50 loops=2)
                     Group Key: event_date
                     Planned Partitions: 4  Batches: 5  Memory Usage: 4276kB  Disk Usage: 299976kB
                     Worker 0:  Batches: 5  Memory Usage: 4276kB  Disk Usage: 296984kB
                     ->  Parallel Seq Scan on user_events  (cost=0.00..661766.40 rows=29411840 width=12) (actual time=0.041..6354.229 rows=25000000 loops=2)
 Planning Time: 0.166 ms
 Execution Time: 22277.333 ms
(15 rows)

Time: 22278.391 ms (00:22.278)

分析结果
hll方法明显性能更好。
对比分析误差
我们知道hll是概率估计基数，其与准确值存在偏差，那我们看一下实际偏差有多少，是否足够精确。可以看到，其偏差还是非常小的。

postgres=# SELECT event_date, 
    count(distinct user_id) as uv,
    hll_cardinality(hll_add_agg(hll_hash_bigint(user_id))) AS hll_uv,
    abs(count(distinct user_id) - hll_cardinality(hll_add_agg(hll_hash_bigint(user_id)))) / count(distinct user_id) as deviation
FROM user_events
GROUP BY event_date
ORDER BY event_date;
 event_date |   uv   |      hll_uv       |       deviation       
------------+--------+-------------------+-----------------------
 2025-01-01 | 632707 | 623183.6817002947 |  0.015051703710730684
 2025-01-02 | 631862 | 629808.5538699663 | 0.0032498332389567642
 2025-01-03 | 632243 | 636860.4180884082 |  0.007303233232172103
 2025-01-04 | 632183 | 620651.4563098092 |    0.0182408316740418
 2025-01-05 | 631976 | 635325.4937948387 |  0.005300033220942987
 2025-01-06 | 632338 |  640999.870852167 |  0.013698165936836027
 2025-01-07 | 632182 | 624286.2235228459 |  0.012489720487381974
 2025-01-08 | 631916 | 614456.3376329968 |   0.02762972035365959
 2025-01-09 | 632335 | 633364.3850611767 | 0.0016279109351478671
 2025-01-10 | 631702 | 627404.0187465709 |  0.006803811375346419
 2025-01-11 | 631859 |  623713.944111848 |  0.012890622572681537
 2025-01-12 | 632147 | 646820.1957640239 |   0.02321168298516631
 2025-01-13 | 632065 | 634287.4288388079 | 0.0035161396989358005
 2025-01-14 | 632399 | 643846.7042994013 |  0.018102027832746826
 2025-01-15 | 631917 | 629118.4882267824 |  0.004428606562598547
 2025-01-16 | 632482 | 633330.5426842811 |   0.00134160764145245
 2025-01-17 | 631815 | 614506.2953207952 |  0.027395210115626963
 2025-01-18 | 631911 | 623367.7094465303 |  0.013519768691270886
 2025-01-19 | 632013 | 634521.2452002096 |  0.003968660771549877
 2025-01-20 | 632575 | 644099.6603179295 |  0.018218646512950184
 2025-01-21 | 632620 | 620626.4299164931 |  0.018958569257226884
 2025-01-22 | 632053 | 629743.0063604935 |  0.003654746737230131
 2025-01-23 | 632100 |  640160.068094309 |   0.01275125469753038
 2025-01-24 | 632305 | 614464.4351094165 |  0.028215125438804868
 2025-01-25 | 632707 | 616490.0084499269 |  0.025631123964288596
 2025-01-26 | 631898 | 629491.9505699311 | 0.0038076547640108556
 2025-01-27 | 632101 | 636123.9346817475 |  0.006364385884134747
 2025-01-28 | 632469 | 625686.4592655643 |   0.01072391015913138
 2025-01-29 | 631597 | 618719.6260143474 |  0.020388592703341812
 2025-01-30 | 632735 | 617032.4707943607 |  0.024816912618456803
 2025-01-31 | 632044 | 636591.0666736438 |  0.007194224885678482
 2025-02-01 | 632365 | 644570.5906038798 |  0.019301496135744025
 2025-02-02 | 631472 | 615665.3987265908 |   0.02503135732607179
 2025-02-03 | 632121 | 626834.9340623614 |  0.008362427347989755
 2025-02-04 | 632551 | 626421.5447820029 |  0.009690056956667649
 2025-02-05 | 632488 | 614633.5782454007 |  0.028228870357381194
 2025-02-06 | 632102 | 650059.6271673761 |   0.02840938197850364
 2025-02-07 | 632023 | 605892.4764183772 |  0.041344260543718744
 2025-02-08 | 631728 | 645486.4629302825 |  0.021779093106973985
 2025-02-09 | 631834 | 623618.0120326917 |  0.013003396409987848
 2025-02-10 | 631935 | 616340.6884696558 |  0.024677081551653553
 2025-02-11 | 631489 | 612929.7370034965 |  0.029389685325482313
 2025-02-12 | 631837 | 635126.2602673196 |  0.005205868392195493
 2025-02-13 | 632184 | 615632.2259304954 |  0.026181893356213664
 2025-02-14 | 632195 | 617552.8537319473 |  0.023160806820763765
 2025-02-15 | 632059 | 616633.6443754714 |  0.024404929958324464
 2025-02-16 | 632088 |  642494.335830703 |  0.016463428874939928
 2025-02-17 | 632082 |  608454.422331834 |   0.03738055769372648
 2025-02-18 | 632182 | 612352.7700967655 |   0.03136633106167926
 2025-02-19 | 631738 | 627382.2311346314 | 0.0068948976717698824
(50 rows)

使用限制

精度限制：HyperLogLog是一种近似算法，其提供的基数估计值并不是精确的，其误差通常在一定范围内。如果应用场景要求绝对准确的结果，则不适用HyperLogLog。
特定数据类型支持：主要针对integer、text等基础数据类型，对于复杂数据结构或自定义对象，需要先将之转换为适当的哈希值形式才能被HyperLogLog处理。
hll不支持排序、过滤等类型的SQL操作。

实现原理

这里说一下其实现的主要思想，在数据量超大的情况下，通过哈希表或者集合等方式去统计基数往往是不现实的，主要原因是其耗费的资源过多，那么，在不需要绝对准确结果的情况下，有什么方式可以统计基数呢？可以借助概率的思想。将发生一定概率（概率p）的事件转换为事件的次数（次数N）。大致的关系为 N = 1/p。那么，可以根据统计数据发生的某个事件概率，来推算基数。HyperLogLog算法就是基于这个思想去实现的。具体算法实现原理可参考论文HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm以及文章
一文理解 HyperLogLog(HLL) 算法

参考文档：
HyperLogLog — Cornerstone of a Big Data Infrastructure