暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

ClickHouse - LowCardinality 数据类型的神秘之旅

DataFlow范式 2019-11-26
875

对于许多用户来说,ClickHouse 中存在很多神秘而陌生的功能,这些功能给 ClickHouse 带来了非常高的查询性能。

LowCardinality 数据类型就是 ClickHouse 中的一种特殊的功能,目前可以应用在生产环境中。在本文中,笔者将带领大家一起学习 LowCardinality 的工作原理以及如何使用 LowCardinality。

LowCardinality 实战

LowCardinality 是一种数据类型,或者换句话说,LowCardinality 是数据类型函数。我们可以使用它来修改任何 ClickHouse 数据类型,最常用于 String 数据类型。我们将以 官方提供的 ontime 数据集为例,该表包含 1.72 亿行数据,描述了多年来美国航班信息。

ontime 创建表语句:

  1. CREATE TABLE `ontime` (

  2. `Year` UInt16,

  3. `Quarter` UInt8,

  4. `Month` UInt8,

  5. `DayofMonth` UInt8,

  6. `DayOfWeek` UInt8,

  7. `FlightDate` Date,

  8. `UniqueCarrier` FixedString(7),

  9. `AirlineID` Int32,

  10. `Carrier` FixedString(2),

  11. `TailNum` String,

  12. `FlightNum` String,

  13. `OriginAirportID` Int32,

  14. `OriginAirportSeqID` Int32,

  15. `OriginCityMarketID` Int32,

  16. `Origin` FixedString(5),

  17. `OriginCityName` String,

  18. `OriginState` FixedString(2),

  19. `OriginStateFips` String,

  20. `OriginStateName` String,

  21. `OriginWac` Int32,

  22. `DestAirportID` Int32,

  23. `DestAirportSeqID` Int32,

  24. `DestCityMarketID` Int32,

  25. `Dest` FixedString(5),

  26. `DestCityName` String,

  27. `DestState` FixedString(2),

  28. `DestStateFips` String,

  29. `DestStateName` String,

  30. `DestWac` Int32,

  31. `CRSDepTime` Int32,

  32. `DepTime` Int32,

  33. `DepDelay` Int32,

  34. `DepDelayMinutes` Int32,

  35. `DepDel15` Int32,

  36. `DepartureDelayGroups` String,

  37. `DepTimeBlk` String,

  38. `TaxiOut` Int32,

  39. `WheelsOff` Int32,

  40. `WheelsOn` Int32,

  41. `TaxiIn` Int32,

  42. `CRSArrTime` Int32,

  43. `ArrTime` Int32,

  44. `ArrDelay` Int32,

  45. `ArrDelayMinutes` Int32,

  46. `ArrDel15` Int32,

  47. `ArrivalDelayGroups` Int32,

  48. `ArrTimeBlk` String,

  49. `Cancelled` UInt8,

  50. `CancellationCode` FixedString(1),

  51. `Diverted` UInt8,

  52. `CRSElapsedTime` Int32,

  53. `ActualElapsedTime` Int32,

  54. `AirTime` Int32,

  55. `Flights` Int32,

  56. `Distance` Int32,

  57. `DistanceGroup` UInt8,

  58. `CarrierDelay` Int32,

  59. `WeatherDelay` Int32,

  60. `NASDelay` Int32,

  61. `SecurityDelay` Int32,

  62. `LateAircraftDelay` Int32,

  63. `FirstDepTime` String,

  64. `TotalAddGTime` String,

  65. `LongestAddGTime` String,

  66. `DivAirportLandings` String,

  67. `DivReachedDest` String,

  68. `DivActualElapsedTime` String,

  69. `DivArrDelay` String,

  70. `DivDistance` String,

  71. `Div1Airport` String,

  72. `Div1AirportID` Int32,

  73. `Div1AirportSeqID` Int32,

  74. `Div1WheelsOn` String,

  75. `Div1TotalGTime` String,

  76. `Div1LongestGTime` String,

  77. `Div1WheelsOff` String,

  78. `Div1TailNum` String,

  79. `Div2Airport` String,

  80. `Div2AirportID` Int32,

  81. `Div2AirportSeqID` Int32,

  82. `Div2WheelsOn` String,

  83. `Div2TotalGTime` String,

  84. `Div2LongestGTime` String,

  85. `Div2WheelsOff` String,

  86. `Div2TailNum` String,

  87. `Div3Airport` String,

  88. `Div3AirportID` Int32,

  89. `Div3AirportSeqID` Int32,

  90. `Div3WheelsOn` String,

  91. `Div3TotalGTime` String,

  92. `Div3LongestGTime` String,

  93. `Div3WheelsOff` String,

  94. `Div3TailNum` String,

  95. `Div4Airport` String,

  96. `Div4AirportID` Int32,

  97. `Div4AirportSeqID` Int32,

  98. `Div4WheelsOn` String,

  99. `Div4TotalGTime` String,

  100. `Div4LongestGTime` String,

  101. `Div4WheelsOff` String,

  102. `Div4TailNum` String,

  103. `Div5Airport` String,

  104. `Div5AirportID` Int32,

  105. `Div5AirportSeqID` Int32,

  106. `Div5WheelsOn` String,

  107. `Div5TotalGTime` String,

  108. `Div5LongestGTime` String,

  109. `Div5WheelsOff` String,

  110. `Div5TailNum` String

  111. ) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)

如果要求查询与航班的热门城市相关的统计信息,如下查询: 

Query 1:

  1. :) select OriginCityName, count() from ontime group by OriginCityName order by count() desc limit 10;


  2. ┌─OriginCityName────────┬──count()─┐

  3. Chicago, IL 10872578

  4. Atlanta, GA 9279569

  5. Dallas/Fort Worth, TX 7760200

  6. Houston, TX 5898651

  7. Los Angeles, CA 5804789

  8. New York, NY 5283856

  9. Denver, CO 5199842

  10. Phoenix, AZ 4891313

  11. Washington, DC 4252095

  12. San Francisco, CA 4027466

  13. └───────────────────────┴──────────┘

  14. 10 rows in set. Elapsed: 2.089 sec. Processed 172.34 million rows, 3.82 GB (82.51 million rows/s., 1.83 GB/s.)

再下钻到城市 Chicago 以获取更详细的信息: 

Query 2:

  1. :) select OriginCityName, count(), uniq(FlightNum), sum(Distance)

  2. from ontime where OriginCityName = 'Chicago, IL'

  3. group by OriginCityName order by count() desc;


  4. ┌─OriginCityName─┬──count()─┬─uniq(FlightNum)─┬─sum(Distance)─┐

  5. Chicago, IL 10872578 7765 8143093140

  6. └────────────────┴──────────┴─────────────────┴───────────────┘


  7. 1 rows in set. Elapsed: 2.197 sec. Processed 172.34 million rows, 4.00 GB (78.45 million rows/s., 1.82 GB/s.)

接下来,我们可以使用 LowCardinality 对表结构进行修改。在修改表结构之前,让我们先查看表的列 DestCityName 和 OriginCityName 的元数据信息:

  1. :) SELECT column, any(type),

  2. sum(column_data_compressed_bytes) compressed,

  3. sum(column_data_uncompressed_bytes) uncompressed,

  4. sum(rows)

  5. FROM system.parts_columns

  6. WHERE (table = 'ontime') AND active AND (column LIKE '%CityName')

  7. GROUP BY column

  8. ORDER BY column ASC


  9. ┌─column─────────┬─any(type)─┬─compressed─┬─uncompressed─┬─sum(rows)─┐

  10. DestCityName String 421979321 2440948285 172338036

  11. OriginCityName String 427003910 2441007783 172338036

  12. └────────────────┴───────────┴────────────┴──────────────┴───────────┘

现在,我们将 OriginCityName 列的类型更改为 LowCardinality:

  1. :) ALTER TABLE ontime MODIFY COLUMN OriginCityName LowCardinality(String);


  2. 0 rows in set. Elapsed: 19.258 sec.

该 Alter Table 命令是在线执行的,修改完成后,再次查询元数据信息:

  1. ┌─column─────────┬─any(type)─┬─compressed─┬─uncompressed─┬─sum(rows)─┐

  2. DestCityName String 421979321 2440948285 172338036

  3. OriginCityName LowCardinality(String) 161295620 264243767 172338036

  4. └────────────────┴───────────┴────────────┴──────────────┴───────────┘

根据上面的查询结果,我们将 OriginCityName 与 DestCityName 进行比较,会发现 OriginCityName 列的存储(压缩情况下)减少了2.6倍,而未压缩的大小几乎减少了10倍。

那么查询性能如何呢?我们重新执行上面的 Query 1 和 Query 2: 

Query 1:

  1. :) select OriginCityName, count() from ontime group by OriginCityName order by count() desc limit 10;


  2. ┌─OriginCityName────────┬──count()─┐

  3. Chicago, IL 10872578

  4. Atlanta, GA 9279569

  5. Dallas/Fort Worth, TX 7760200

  6. Houston, TX 5898651

  7. Los Angeles, CA 5804789

  8. New York, NY 5283856

  9. Denver, CO 5199842

  10. Phoenix, AZ 4891313

  11. Washington, DC 4252095

  12. San Francisco, CA 4027466

  13. └───────────────────────┴──────────┘


  14. 10 rows in set. Elapsed: 0.595 sec. Processed 172.34 million rows, 281.33 MB (289.75 million rows/s., 472.99 MB/s.)

Query 2:

  1. :) select OriginCityName, count(), uniq(FlightNum), sum(Distance)

  2. from ontime where OriginCityName = 'Chicago, IL'

  3. group by OriginCityName order by count() desc;


  4. ┌─OriginCityName─┬──count()─┬─uniq(FlightNum)─┬─sum(Distance)─┐

  5. Chicago, IL 10872578 7765 8143093140

  6. └────────────────┴──────────┴─────────────────┴───────────────┘


  7. 1 rows in set. Elapsed: 1.475 sec. Processed 172.34 million rows, 460.89 MB (116.87 million rows/s., 312.54 MB/s.)

Query 1 的性能提高了 3.5 倍,但是它仅仅处理了列 OriginCityColumn。 而第二个查询也得到了改进,只提升了 33%,这是因为修改后的列 OriginCityName 只是用于过滤,其他列仍和之前一样读取和处理。所以我们需要进一步优化,将相同的方法应用于 FlightNum 列。

FlightNum 之前的元数据信息:

  1. ┌─column────┬─any(type)──────┬─compressed─┬─uncompressed─┬─sum(rows)─┐

  2. FlightNum String 537637866 773085928 172338036

  3. └───────────┴────────────────┴────────────┴──────────────┴───────────┘

FlightNum 修改之后的元数据信息:

  1. ┌─column────┬─any(type)──────┬─compressed─┬─uncompressed─┬─sum(rows)─┐

  2. FlightNum LowCardinality(String) 330646531 362920578 172338036

  3. └───────────┴────────────────┴────────────┴──────────────┴───────────┘

接着,我们再次查询 Query 2:

  1. :) select OriginCityName, count(), uniq(FlightNum), sum(Distance)

  2. from ontime where OriginCityName = 'Chicago, IL'

  3. group by OriginCityName order by count() desc;


  4. ┌─OriginCityName─┬──count()─┬─uniq(FlightNum)─┬─sum(Distance)─┐

  5. Chicago, IL 10872578 7765 8143093140

  6. └────────────────┴──────────┴─────────────────┴───────────────┘


  7. 1 rows in set. Elapsed: 1.064 sec. Processed 172.34 million rows, 549.77 MB (161.98 million rows/s., 516.74 MB/s.)

可以直观地看到查询性能又增加了 30%。

总结一下,上面的查询结果,如下表:


String 

1 个 LowCardinality列

2 个 LowCardinality 列

Query 1(sec)

2.0890.595(x3.5)

Query 2(sec)

2.1971.475(x1.5)1.064(x2)

因此,通过简单快速的更改表的字段数据类型,就可以显着提高查询性能。正如我们上面所提到的,ontime 数据集并不是 LowCardinality 使用的最佳选择,因为 OriginCityName 列的数据长度相对较短,FlightNum 列的数据更短。如果将 LowCardinality 应用在较长的字符串上,带来的性能提升会更加显著。

着手练一把

1. 使用普通数据类型创建表和加载数据

  1. :) CREATE TABLE Dict

  2. (

  3. d2 UInt32,

  4. d1 UInt32,

  5. uint UInt64,

  6. flt Float64,

  7. str String

  8. )

  9. ENGINE = MergeTree()

  10. PARTITION BY d2

  11. ORDER BY (d2, d1)


  12. :) INSERT INTO Dict SELECT

  13. intDiv(number, 100000) AS d2,

  14. number AS d1,

  15. (rand64() % 7000 +1)*10000 AS uint,

  16. uint * pi() as flt,

  17. ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'][rand()%10+1] AS str

  18. FROM numbers(10000000);


  19. Ok.


  20. 0 rows in set. Elapsed: 2.097 sec. Processed 10.03 million rows, 80.22 MB (4.78 million rows/s., 38.26 MB/s.)


  21. :) OPTIMIZE TABLE Dict FINAL

  22. 0 rows in set. Elapsed: 1.312 sec.

2. 使用 LowCardinality 数据类型创建表和加载数据

  1. :) CREATE TABLE LCDict

  2. (

  3. d2 UInt32,

  4. d1 UInt32,

  5. uintlc LowCardinality(UInt64),

  6. fltlc LowCardinality(Float64),

  7. strlc LowCardinality(String)

  8. )

  9. ENGINE = MergeTree()

  10. PARTITION BY d2

  11. ORDER BY (d2, d1);


  12. :) INSERT INTO LCDict SELECT

  13. intDiv(number, 100000) AS d2,

  14. number AS d1,

  15. (rand64() % 7000 +1)*10000 AS uint,

  16. uint * pi() as flt,

  17. ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'][rand()%10+1] AS str

  18. FROM numbers(10000000);


  19. Ok.


  20. 0 rows in set. Elapsed: 3.291 sec. Processed 10.03 million rows, 80.22 MB (3.05 million rows/s., 24.38 MB/s.)


  21. :) OPTIMIZE TABLE LCDict FINAL


  22. 0 rows in set. Elapsed: 1.483 sec.

如果创建表时提示如下错误:

  1. DB::Exception: Creating columns of type LowCardinality(UInt64) is prohibited by default due to expected negative impact on performance. It can be enabled with the "allow_suspicious_low_cardinality_types" setting..

可以设置如下参数:

  1. set allow_suspicious_low_cardinality_types = 1;

通过插入数据可以发现,LowCardinality 数据类型的表,插入数据比较耗时。

3. 查看新创建的两张表元数据

  1. :) SELECT

  2. table,

  3. sum(bytes) / 1048576 AS size,

  4. sum(rows) AS rows

  5. FROM system.parts

  6. WHERE active AND (database = 'default') AND (table LIKE '%Dict')

  7. GROUP BY table

  8. ORDER BY size DESC;


  9. ┌─table──┬──────────────size─┬─────rows─┐

  10. Dict 157.7021312713623 10000000

  11. LCDict 93.76335144042969 10000000

  12. └────────┴───────────────────┴──────────┘

LowCardinality 数据类型的表占用存储空间较小。

下面进行查询,每个查询取三次结果的平均值作为比较的依据。

4. 查询对比

笔者在这里只比较 String 数据类型,对于 Float64 和 UInt64 数据类型,大家自行查询比较。

4.1 group by 查询

  1. SELECT str, count(str) AS cnt FROM Dict

  2. GROUP BY str;

  3. 三次查询耗时(sec):0.017+0.017+0.017


  4. SELECT uintlc, count(uintlc) AS cnt FROM LCDict

  5. GROUP BY uintlc;

  6. 三次查询耗时(sec):0.010+0.011+0.011


4.2 like 匹配查询 

如果不加 count 的话,返回数据比较多,打印结果是需要花费时间,影响结果。所以使用 count 统计结果数。

  1. select count(1) from (SELECT d2, d1, str FROM Dict

  2. WHERE str LIKE '%en');

  3. 数据量:1998880

  4. 三次查询耗时(sec):0.082+0.076+0.077


  5. select count(1) from (SELECT d2, d1, strlc FROM LCDict WHERE strlc LIKE '%en');

  6. 数据量:2000744

  7. 三次查询耗时(sec):0.015+0.013+0.015


4.3 点条件查询 

如果不加 count 的话,返回数据比较多,打印结果是需要花费时间,影响结果。所以使用 count 统计结果数。

  1. select count(1) from (SELECT d2, d1, str FROM Dict

  2. WHERE str = 'ten' ORDER BY d2, d1);

  3. 数据量:999928

  4. 三次查询耗时(sec):0.049+0.049+0.049



  5. select count(1) from (SELECT d2, d1, strlc FROM LCDict

  6. WHERE strlc = 'ten' ORDER BY d2, d1);

  7. 数据量:1001054

  8. 三次查询耗时(sec):0.031+0.032+0.032

LowCardinality 理解

ClickHouse 的高性能给我们留下了深刻的印象,高性能是靠非常谨慎和智慧的工程来实现的。 LowCardinality 数据类型就是其中的一个示例。其实 LowCardinality 是字符串字典编码实现的,其中字符串被编码为 Position
(positions,可以理解为索引),并通过 position-to-string
 的映射引用字典。当源字符串很长且去重后值的数量不是很大时,它的效果最佳。ClickHouse 没有硬性限制具体去重后值的大小,如果去重后值的数量低于 1000 万,效果通常会很好。对于具有多个 partition 和 part 的 ClickHouse 大表,如果在 part 级别保留 1000 万限制,则去重后值的总数甚至可能更高。

LowCardinality 支持 String、Number、Date、DateTime、Nullable数据类型。

在内部,ClickHouse 创建一个或多个文件以存储 LowCardinality 字典数据。如果所有 LowCardinality 列都符合 8192 个不同的值,那么每个表可以是一个单独的文件,如果去重值的数量更多,则每个 LowCardinality 列就使用一个文件。

ClickHouse LowCardinality 优化不仅限于存储,它还使用字典 position 进行过滤、分组和加速某些查询功能(例如 length())等。这就是为什么我们在 Query 1 中看到的改进要比纯粹从存储效率提升的效果更大的原因。在分布式查询中,ClickHouse 还将尝试在大多数查询处理中对词典 position 进行操作。 

LowCardinality 与 Enum

值得一提的是,还有一种用字典编码字符串的可能性,那就是枚举类型:Enum。

ClickHouse 完全支持枚举。从存储的角度来看,它可能甚至更高效,因为枚举值存储在表定义上而不是存储在单独的数据文件中。枚举适用于静态字典。但是,如果插入了原始枚举之外的值,ClickHouse 将抛出异常。枚举值中的每个更改都需要 ALTER TABLE,这可能会带来很多麻烦。LowCardinality 在这方面要灵活得多。

总结

ClickHouse 是一个功能丰富的DBMS。它具有针对最佳性能的许多精心设计的技术决策, LowCardinalty 是其中之一。如果使用得当,它将有助于减少存储并显着提高查询性能。虽然 LowCardinality 支持好几种数据类型,但是笔者建议对数字列使用 set index,对低基数字符串列使用 LowCardinality。

参考

  • https://www.altinity.com/blog/2019/3/27/low-cardinality

  • https://github.com/yandex/clickhouse-presentations/blob/master/meetup19/string_optimization.pdf


文章转载自DataFlow范式,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论