对数据库而言,合适的字符集和规则能够大大提升使用者运维和分析的效率。TiDB 从 v4.0 开始支持新 collation 规则,并于 TiDB 6.0 版本再次进行了更新。本文将深入解读 Collation 规则在 TiDB 6.0 中的变更和应用。
Ps. 本文已入选 TiDB 6.0 Book Rush,如果你也想分享 6.0 的使用体验,并获得丰厚奖励,欢迎加入我们!
引
默认启用新 Collation 规则
TiDB 从 v4.0 开始支持新 collation 规则,在大小写不敏感、口音不敏感、padding 规则上与 MySQL 行为保持一致。
TiDB [(none)] 18:45:27> select * from mysql.tidb where variable_name = 'new_collation_enabled';+-----------------------+----------------+----------------------------------------------------+| VARIABLE_NAME | VARIABLE_VALUE | COMMENT |+-----------------------+----------------+----------------------------------------------------+| new_collation_enabled | True | If the new collations are enabled. Do not edit it. |+-----------------------+----------------+----------------------------------------------------+1 row in set (0.003 sec)
TiDB [(none)] 18:45:51> select version()\G*************************** 1. row ***************************version(): 5.7.25-TiDB-v6.0.01 row in set (0.001 sec)TiDB [(none)] 18:46:00> SELECT * FROM information_schema.collations;+--------------------+--------------------+------+------------+-------------+---------+| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |+--------------------+--------------------+------+------------+-------------+---------+| ascii_bin | ascii | 65 | Yes | Yes | 1 || binary | binary | 63 | Yes | Yes | 1 || gbk_bin | gbk | 87 | | Yes | 1 | (#28645)| gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | (#28645)| latin1_bin | latin1 | 47 | Yes | Yes | 1 || utf8_bin | utf8 | 83 | Yes | Yes | 1 || utf8_general_ci | utf8 | 33 | | Yes | 1 || utf8_unicode_ci | utf8 | 192 | | Yes | 1 | (#18678)| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 || utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 || utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 1 | (#18678)+--------------------+--------------------+------+------------+-------------+---------+11 rows in set (0.001 sec)TiDB [test] 19:05:14> SHOW CHARACTER SET;+---------+-------------------------------------+-------------------+--------+| Charset | Description | Default collation | Maxlen |+---------+-------------------------------------+-------------------+--------+| ascii | US ASCII | ascii_bin | 1 || binary | binary | binary | 1 || gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 || latin1 | Latin1 | latin1_bin | 1 || utf8 | UTF-8 Unicode | utf8_bin | 3 || utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 |+---------+-------------------------------------+-------------------+--------+6 rows in set (0.001 sec)
TiDB-v5.4 [test] 10:17:22> select version()\G*************************** 1. row ***************************version(): 5.7.25-TiDB-v5.4.01 row in set (0.001 sec)TiDB-v5.4 [test] 10:19:39> SELECT * FROM information_schema.collations;+----------------+--------------------+------+------------+-------------+---------+| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |+----------------+--------------------+------+------------+-------------+---------+| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 || latin1_bin | latin1 | 47 | Yes | Yes | 1 || binary | binary | 63 | Yes | Yes | 1 || ascii_bin | ascii | 65 | Yes | Yes | 1 || utf8_bin | utf8 | 83 | Yes | Yes | 1 || gbk_bin | gbk | 87 | Yes | Yes | 1 |+----------------+--------------------+------+------------+-------------+---------+6 rows in set (0.001 sec)
新 Collation 注意事项
service_configs: tidb: new_collations_enabled_on_first_bootstrap: true
Collation Bug 修复
DROP TABLE IF EXISTS t1;CREATE TABLE t1 (c1 char(20) CHARACTER SET utf8 COLLATE utf8_bin,c2 char(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin);INSERT INTO t1 VALUES ('UUtJeaV','snRXXCZHBPW');SET names utf8mb4 collate utf8mb4_bin;SELECT greatest( c1, c2 ) as expr1 FROM t1;SELECT least( c1, c2 ) as expr1 FROM t1;SET names utf8mb4 collate utf8mb4_general_ci;SELECT greatest( c1, c2 ) as expr1 FROM t1;SELECT least( c1, c2 ) as expr1 FROM t1;

// The collation of JSON is always utf8mb4_bin in builtin-func which is same as MySQL// see details https://github.com/pingcap/tidb/issues/31320#issuecomment-1010599311if isJSON { dstCharset, dstCollation = charset.CharsetUTF8MB4, charset.CollationUTF8MB4}
DROP TABLE IF EXISTS t2;CREATE TABLE t2 (c1 json);INSERT INTO t2 VALUES ('{\"测试\": \"你好\"}');SELECT collation(c1), collation(upper(c1)), collation(elt(1, c1, 0x12)) FROM t2;


新增内置函数 CHARSET()
TiDB [test] 23:54:42> select version()\G*************************** 1. row ***************************version(): 5.7.25-TiDB-v6.0.01 row in set (0.001 sec)TiDB [test] 00:03:51> set names utf8mb4;Query OK, 0 rows affected (0.000 sec)TiDB [test] 00:03:58> select charset(1);+------------+| charset(1) |+------------+| binary |+------------+1 row in set (0.001 sec)TiDB [test] 00:04:03> select charset('1');+--------------+| charset('1') |+--------------+| utf8mb4 |+--------------+1 row in set (0.001 sec)
文档拾遗

小结
本文对 Collation 特性在 TiDB 6.0 中的变更进行了汇总阐释及举例说明,当前 TiDB 所提供的几种 Collation 已经可以支撑大部分业务场景和大部分系统迁移需求。建议在项目设计之初,就选用普适类型的字符集和规则,毕竟效率是提升生产力的重要因素之一。
活动预告
新发布的 TiDB 6.0 大幅增强了 TiDB 作为企业级产品的可管理性,具备更成熟的 HTAP 与容灾能力,加入多项云原生数据库所需的基础特性。抢先试用 TiDB 6.0、TiFlash、TiEM 和 PingCAP Clinic 等多款新品与服务,参与 Book Rush 贡献,将有机会享受“TiDB 荣誉体验官”在内的多项权益!
扫描下方二维码了解详情


文章转载自PingCAP,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




