
前言
今天简单讲讲排序规则与索引的相关性问题,因为这涉及到索引是否会失效,还蛮重要的。
让我们开始今天的学习叭!!
MySQL 的字符集与排序规则
字符集
看看默认的设置:
root@localhost [(none)] 12:03:37 > show variables like '%char%';+--------------------------+-------------------------------------------+| Variable_name | Value |+--------------------------+-------------------------------------------+| character_set_client | utf8mb4 || character_set_connection | utf8mb4 || character_set_database | utf8mb4 || character_set_filesystem | binary || character_set_results | utf8mb4 || character_set_server | utf8mb4 || character_set_system | utf8mb3 || character_sets_dir | /application/mysql_8_3306/share/charsets/ |+--------------------------+-------------------------------------------+8 rows in set (0.12 sec)
官网老长了,我们简单描述一下:
character_set_client:从客户端发送的语句所使用的字符集
character_set_connection:某些数字或文字没有特定的字符集引导符转换成字符串时就会使用默认的字符集。(啥是引导符?类似于这种 SELECT _utf8mb4'abc123' COLLATE utf8mb4_danish_ci)
character_set_database:默认数据库所使用的字符集。每当默认数据库更改(use datbase;)时,服务器会设置该变量。如果没有默认数据库,该变量的值与 character_set_server 相同。
character_set_filesystem:文件系统使用的字符集。
character_set_results:将查询结果返回给客户端的字符集。
character_set_server:服务器默认的字符集。
character_set_system:服务器用于存储标识符的字符集。该值始终为 utf8mb3。
character_sets_dir:安装字符集的目录。
排序规则
再看看排序规则叭,再简单描述一下叭。
root@localhost [(none)] 12:26:49 > show global variables like '%collation%';+-------------------------------+--------------------+| Variable_name | Value |+-------------------------------+--------------------+| collation_connection | utf8mb4_0900_ai_ci || collation_database | utf8mb4_0900_ai_ci || collation_server | utf8mb4_0900_ai_ci || default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |+-------------------------------+--------------------+4 rows in set (0.00 sec)
collation_connection:character_set_connection 的排序规则。
collation_database:character_set_database 使用的排序规则。
collation_server:character_set_server 使用的排序规则。
default_collation_for_utf8mb4:MySQL 复制内部使用。
影响面
先创建 1 个 database
root@localhost [(none)] 13:28:32 > create database t_char charset utf8mb4;Query OK, 1 row affected, 1 warning (0.03 sec)
创建 3 个 utf8mb3(5.7 为 utf8) 的表
create table t1(id int primary key,info varchar(20))charset utf8mb3;insert into t1 values(1,'abc');create table t2(id varchar(20) primary key,info varchar(20))charset utf8mb3;insert into t2 values('aa','abc');create table t3(id varchar(20) primary key)charset utf8mb3;insert into t3 values('aa');
看一下有啥问题不
root@localhost [t_char] 16:20:24 > explain select * from t1 where id=1;+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+| 1 | SIMPLE | t1 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+1 row in set, 1 warning (0.00 sec)root@localhost [t_char] 16:20:33 > explain select * from t2 where id='aa';+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+| 1 | SIMPLE | t2 | NULL | const | PRIMARY | PRIMARY | 62 | const | 1 | 100.00 | NULL |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+1 row in set, 1 warning (0.00 sec)root@localhost [t_char] 16:22:35 > explain select * from t3 where id='aa';+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+| 1 | SIMPLE | t3 | NULL | const | PRIMARY | PRIMARY | 62 | const | 1 | 100.00 | Using index |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+1 row in set, 1 warning (0.00 sec)
正常状态下,肯定是都能走索引的,没啥问题
现在我们把它放到存储过程里面去执行
delimiter ;;create procedure test_p1(a int)beginexplain select * from t1 where id =a;end;;delimiter ;delimiter ;;create procedure test_p2(a varchar(20))beginexplain select * from t2 where id=a;end;;delimiter ;delimiter ;;create procedure test_p3(a varchar(20))beginexplain select * from t3 where id=a;end;;delimiter ;
看看输出结果叭
root@localhost [t_char] 16:20:39 > call test_p1(1);+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+| 1 | SIMPLE | t1 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+1 row in set (0.00 sec)Query OK, 0 rows affected, 1 warning (0.00 sec)
可以看到 t1 传数字没有问题,数字其实在字符集的排序方式没什么变化。
root@localhost [t_char] 16:25:56 > call test_p2('aa');+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+| 1 | SIMPLE | t2 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 1 | 100.00 | Using where |+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+1 row in set (0.00 sec)Query OK, 0 rows affected, 3 warnings (0.00 sec)root@localhost [t_char] 16:26:17 > show warnings;+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------+| Level | Code | Message |+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------+| Warning | 1739 | Cannot use ref access on index 'PRIMARY' due to type or collation conversion on field 'id' || Warning | 1739 | Cannot use range access on index 'PRIMARY' due to type or collation conversion on field 'id' || Note | 1003 | /* select#1 */ select `t_char`.`t2`.`id` AS `id`,`t_char`.`t2`.`info` AS `info` from `t_char`.`t2` where (`t_char`.`t2`.`id` = <cache>(a@0)) |+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------+3 rows in set (0.00 sec)
t2 传字符,走 ALL 出现 collation conversion 转换
root@localhost [t_char] 16:25:38 > call test_p3('aa');+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+| 1 | SIMPLE | t3 | NULL | index | PRIMARY | PRIMARY | 62 | NULL | 1 | 100.00 | Using where; Using index |+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+1 row in set (0.00 sec)Query OK, 0 rows affected, 3 warnings (0.00 sec)root@localhost [t_char] 16:25:47 > show warnings;+---------+------+---------------------------------------------------------------------------------------------------------------+| Level | Code | Message |+---------+------+---------------------------------------------------------------------------------------------------------------+| Warning | 1739 | Cannot use ref access on index 'PRIMARY' due to type or collation conversion on field 'id' || Warning | 1739 | Cannot use range access on index 'PRIMARY' due to type or collation conversion on field 'id' || Note | 1003 | /* select#1 */ select `t_char`.`t3`.`id` AS `id` from `t_char`.`t3` where (`t_char`.`t3`.`id` = <cache>(a@0)) |+---------+------+---------------------------------------------------------------------------------------------------------------+3 rows in set (0.00 sec)
t3 传字符,走 index 出现 collation conversion 转换,为啥走 index,index 是完整扫描索引,t3 只有一个字段即主键,所以全部都在 Btree里面,当然就是覆盖扫描索引了
看看他们的字符集和排序规则,可以看到,存储过程是 utf8mb4 的
root@localhost [t_char] 16:33:18 > select SPECIFIC_NAME,CHARACTER_SET_CLIENT,COLLATION_CONNECTION,DATABASE_COLLATION from INFORMATION_SCHEMA.ROUTINES where ROUTINE_SCHEMA='t_char';+---------------+----------------------+----------------------+--------------------+| SPECIFIC_NAME | CHARACTER_SET_CLIENT | COLLATION_CONNECTION | DATABASE_COLLATION |+---------------+----------------------+----------------------+--------------------+| test_p1 | utf8mb4 | utf8mb4_0900_ai_ci | utf8mb4_0900_ai_ci || test_p2 | utf8mb4 | utf8mb4_0900_ai_ci | utf8mb4_0900_ai_ci || test_p3 | utf8mb4 | utf8mb4_0900_ai_ci | utf8mb4_0900_ai_ci |+---------------+----------------------+----------------------+--------------------+3 rows in set (0.00 sec)
需要转换为 utf8 的字符集/排序规则去扫描主键,所以走不了
使用 mysqldump 导出来看看定义
mysqldump --single-transaction -R t_char >t_char.sql
让我们看一个 test_p1叭
---- Dumping routines for database 't_char'--/*!50003 DROP PROCEDURE IF EXISTS `test_p1` */;/*!50003 SET @saved_cs_client = @@character_set_client */ ;/*!50003 SET @saved_cs_results = @@character_set_results */ ;/*!50003 SET @saved_col_connection = @@collation_connection */ ;/*!50003 SET character_set_client = utf8mb4 */ ;/*!50003 SET character_set_results = utf8mb4 */ ;/*!50003 SET collation_connection = utf8mb4_0900_ai_ci */ ;/*!50003 SET @saved_sql_mode = @@sql_mode */ ;/*!50003 SET sql_mode = 'ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION' */ ;DELIMITER ;;CREATE DEFINER=`root`@`localhost` PROCEDURE `test_p1`(a int)beginexplain select * from t1 where id =a;end ;;DELIMITER ;
注意!以下符号不是不是注释,是mysql特殊的版本标识符,只有满足这个版本及以上才会执行符号内的操作
/*!50003 */ ;
5 是主要版本,代表5.0
00 是次要版本
03 是发行号
这个表示为 5.00.03 版本,即 5.0.3 ,只有达到这个版本才会执行符号内的操作。
可以看到
这里先从 session variables 保存一下内容在变量里面
/*!50003 SET @saved_cs_client = @@character_set_client */ ;/*!50003 SET @saved_cs_results = @@character_set_results */ ;/*!50003 SET @saved_col_connection = @@collation_connection */ ;
然后设置从会话里面获取的值
/*!50003 SET character_set_client = utf8mb4 */ ;/*!50003 SET character_set_results = utf8mb4 */ ;/*!50003 SET collation_connection = utf8mb4_0900_ai_ci */ ;
也就是说以上语句等效于
root@localhost [t_char] 16:51:47 > explain select * from t1 where id=1 collate utf8mb4_0900_ai_ci;+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+| 1 | SIMPLE | t1 | NULL | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL |+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+1 row in set, 1 warning (0.00 sec)root@localhost [t_char] 16:51:47 > explain select * from t2 where id='aa'collate utf8mb4_0900_ai_ci;+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+| 1 | SIMPLE | t2 | NULL | ALL | PRIMARY | NULL | NULL | NULL | 1 | 100.00 | Using where |+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+1 row in set, 3 warnings (0.00 sec)root@localhost [t_char] 16:51:47 > explain select * from t3 where id='aa' collate utf8mb4_0900_ai_ci;+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+| 1 | SIMPLE | t3 | NULL | index | PRIMARY | PRIMARY | 62 | NULL | 1 | 100.00 | Using where; Using index |+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+1 row in set, 3 warnings (0.00 sec)
看,是不是一样的
PG 的 Encoding 与 Collate
让我们先看 PG 与 mysql 的对等关系吧。
•在 PostgreSQL (PG) 中,databases 确实等同于 MySQL 中的实例。每个 PG database 是独立的,类似于 MySQL 的实例。
•PG 的 schema 相当于 MySQL 的 database,PG 中的 schema 可以互通,就像 MySQL 中的表可以跨数据库查询一样。
•每个 PG database 默认有一个 public schema,并且默认情况下所有用户都可以访问这个 schema。
在字符集和排序规则方面,PG 与 MySQL 也不太一样。
•PG 的 Encoding 等效于 mysql 的 charset
•PG 的 Collate 等效于 mysql 的 collate
•PG 的 Collate 是在 databases 级别和列级别设置的,不在表级别设置
排序规则
先看看 test 数据库的 Encoding 为 UTF8 ,Collate 为 C
testdb=# \l testdbList of databasesName | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges--------+----------+----------+-----------------+---------+-------+------------+-----------+-------------------testdb | postgres | UTF8 | libc | C | C | | |(1 row)
所以如果在这个数据库里面建表也会默认使用这个。
我们先创建一个表生成一万行数据
create table test_col(id int primary key,info text);insert into test_col select n,md5(n::text) from generate_series(1,10000) n;
看一下表结构
testdb=# \d test_colTable "public.test_col"Column | Type | Collation | Nullable | Default--------+---------+-----------+----------+---------id | integer | | not null |info | text | | |Indexes:"test_col_pkey" PRIMARY KEY, btree (id)
我们先创建一个普通索引
create index on test_col(info);
这是正常的,肯定是可以走索引的
testdb=# explain select * from test_col where info='c4ca4238a0b923820dcc509a6f75849b';QUERY PLAN-----------------------------------------------------------------------------------Index Scan using test_col_info_idx on test_col (cost=0.29..8.30 rows=1 width=37)Index Cond: (info = 'c4ca4238a0b923820dcc509a6f75849b'::text)(2 rows)
但是如果我们显示指定,就走不了索引了
testdb=# explain select * from test_col where info collate "C" ='c4ca4238a0b923820dcc509a6f75849b';QUERY PLAN---------------------------------------------------------------------Seq Scan on test_col (cost=0.00..209.00 rows=1 width=37)Filter: ((info)::text = 'c4ca4238a0b923820dcc509a6f75849b'::text)(2 rows)
现在我们改一下
testdb=# drop index test_col_info_idx;DROP INDEXtestdb=# create index on test_col(info collate "C");CREATE INDEXtestdb=# \d test_colTable "public.test_col"Column | Type | Collation | Nullable | Default--------+---------+-----------+----------+---------id | integer | | not null |info | text | | |Indexes:"test_col_pkey" PRIMARY KEY, btree (id)"test_col_info_idx" btree (info COLLATE "C")
再次测试刚才的 SQL
testdb=# explain select * from test_col where info collate "C" ='c4ca4238a0b923820dcc509a6f75849b';QUERY PLAN-----------------------------------------------------------------------------------Index Scan using test_col_info_idx on test_col (cost=0.29..8.30 rows=1 width=37)Index Cond: ((info)::text = 'c4ca4238a0b923820dcc509a6f75849b'::text)(2 rows)testdb=# explain select * from test_col where info ='c4ca4238a0b923820dcc509a6f75849b';QUERY PLAN-------------------------------------------------------------Seq Scan on test_col (cost=0.00..209.00 rows=1 width=37)Filter: (info = 'c4ca4238a0b923820dcc509a6f75849b'::text)(2 rows)
显示指定可以走,不显示指定就走不了索引了
我们再看看
like 模糊匹配
testdb=# explain select * from test_col where info like 'c4ca4238a0b923820%';QUERY PLAN--------------------------------------------------------------------------------------------Index Scan using test_col_info_idx on test_col (cost=0.29..8.31 rows=1 width=37)Index Cond: ((info >= 'c4ca4238a0b923820'::text) AND (info < 'c4ca4238a0b923821'::text))Filter: (info ~~ 'c4ca4238a0b923820%'::text)(3 rows)testdb=# explain select * from test_col where info collate "C" like 'c4ca4238a0b923820%';QUERY PLAN------------------------------------------------------------------------------------------------------------Index Scan using test_col_info_idx on test_col (cost=0.29..8.31 rows=1 width=37)Index Cond: (((info)::text >= 'c4ca4238a0b923820'::text) AND ((info)::text < 'c4ca4238a0b923821'::text))Filter: ((info)::text ~~ 'c4ca4238a0b923820%'::text)(3 rows)
可以看到 C 这种排序规则不管是显示还是不显示指定都可以走索引
操作符类
刚才看了排序规则,现在看看操作符类叭
在排序规则为非 C 的数据库中,常规索引不支持LIKE操作:
我们现在使用一下,zh_CN 看看结果
test=# create table test_col_2(id int,info text collate "zh_CN");CREATE TABLEtest=# insert into test_col_2 select * from test_col;INSERT 0 10000test=# create index on test_col_2(info);CREATE INDEX
可以看到不管是显示还是不显示指定都无法走索引
test=# explain select * from test_col_2 where info like 'c4ca4238a0b923820%';QUERY PLAN------------------------------------------------------------------Seq Scan on test_op_class_2 (cost=0.00..219.00 rows=1 width=41)Filter: (name ~~ 'c4ca4238%'::text)(2 rows)test=# explain select * from test_col_2 where info collate "zh_CN" like 'c4ca4238a0b923820%';QUERY PLAN------------------------------------------------------------------Seq Scan on test_op_class_2 (cost=0.00..219.00 rows=1 width=41)Filter: (name ~~ 'c4ca4238%'::text)
(2 rows)
这里就需要使用操作符类了
testdb=# \d+ test_col_2Table "public.test_col_2"Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------id | integer | | | | plain | | |info | text | zh_CN | | | extended | | |Indexes:"test_col_2_info_idx" btree (info)Access method: heaptestdb=# drop index test_col_2_info_idx;DROP INDEXtestdb=# create index on test_col_2(info text_pattern_ops);CREATE INDEX
再次测试一下刚才的 SQL
testdb=# explain select * from test_col_2 where info like 'c4ca4238a0b923820%';QUERY PLAN------------------------------------------------------------------------------------------------Index Scan using test_col_2_info_idx1 on test_col_2 (cost=0.29..8.31 rows=1 width=37)Index Cond: ((info ~>=~ 'c4ca4238a0b923820'::text) AND (info ~<~ 'c4ca4238a0b923821'::text))Filter: (info ~~ 'c4ca4238a0b923820%'::text)(3 rows)testdb=# explain select * from test_col_2 where info collate "zh_CN" like 'c4ca4238a0b923820%';QUERY PLAN------------------------------------------------------------------------------------------------Index Scan using test_col_2_info_idx1 on test_col_2 (cost=0.29..8.31 rows=1 width=37)Index Cond: ((info ~>=~ 'c4ca4238a0b923820'::text) AND (info ~<~ 'c4ca4238a0b923821'::text))Filter: (info ~~ 'c4ca4238a0b923820%'::text)(3 rows)
可以看到不管是加不加 collate "zh_CN" 都可以使用索引。
为什么?
因为非 C 的 btree 的默认操作符类只支持以下操作符
index_method | opclass_name | indexed_type | is_default | amopopr--------------+--------------+--------------+------------+---------------btree | text_ops | text | t | <(text,name)btree | text_ops | text | t | <=(text,name)btree | text_ops | text | t | =(text,name)btree | text_ops | text | t | >=(text,name)btree | text_ops | text | t | >(text,name)btree | text_ops | text | t | <(text,text)btree | text_ops | text | t | <=(text,text)btree | text_ops | text | t | =(text,text)btree | text_ops | text | t | >=(text,text)btree | text_ops | text | t | >(text,text)(10 rows)
参考
https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_character_set_client




