暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PostgreSQL 字符集,encoding,character sets,collate,ctype,本土化 - libc,icu

digoal 2019-05-28
1696

作者

digoal

日期

2019-05-28

标签

PostgreSQL , 字符集 , encoding , character sets , collate , ctype , 本土化


背景

创建数据库时,允许指定字符集,以及本土化相关的collate, ctype信息。

```
postgres=# \h create database
Command: CREATE DATABASE
Description: create a new database
Syntax:
CREATE DATABASE name
[ [ WITH ] [ OWNER [=] user_name ]
[ TEMPLATE [=] template ]
[ ENCODING [=] encoding ]
[ LC_COLLATE [=] lc_collate ]
[ LC_CTYPE [=] lc_ctype ]
[ TABLESPACE [=] tablespace_name ]
[ ALLOW_CONNECTIONS [=] allowconn ]
[ CONNECTION LIMIT [=] connlimit ]
[ IS_TEMPLATE [=] istemplate ] ]

URL: https://www.postgresql.org/docs/12/sql-createdatabase.html
```

这些值应该怎么填?

PG 支持哪些字符集

对应的代码是

src/backend/utils/mb/encnames.c

1、数据库支持的ENCODING

https://www.postgresql.org/docs/12/multibyte.html

Server = 'YES' 表示是数据库端支持的字符集,NO表示这是客户端字符集。

PG 支持哪些本土化设置,每种字符集有对应的本土化设置

1、每一个encoding支持哪些collate, cypte

```
select pg_encoding_to_char(collencoding),* from pg_collation
where (pg_encoding_to_char(collencoding)='' and collname<>'default' and collprovider<>'i')
or
pg_encoding_to_char(collencoding)='EUC_CN';

pg_encoding_to_char | oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | collversion
---------------------+-------+--------------+---------------+-----------+--------------+---------------------+--------------+--------------+--------------+-------------
| 950 | C | 11 | 10 | c | t | -1 | C | C |
| 951 | POSIX | 11 | 10 | c | t | -1 | POSIX | POSIX |
EUC_CN | 13077 | zh_CN | 11 | 10 | c | t | 2 | zh_CN | zh_CN |
EUC_CN | 13078 | zh_CN.gb2312 | 11 | 10 | c | t | 2 | zh_CN.gb2312 | zh_CN.gb2312 |
EUC_CN | 13081 | zh_SG | 11 | 10 | c | t | 2 | zh_SG | zh_SG |
EUC_CN | 13082 | zh_SG.gb2312 | 11 | 10 | c | t | 2 | zh_SG.gb2312 | zh_SG.gb2312 |
(6 rows)
```

euc_cn

```
select pg_encoding_to_char(collencoding),
collcollate,
collctype
from pg_collation
where (pg_encoding_to_char(collencoding)='' and collname<>'default' and collprovider<>'i')
or
pg_encoding_to_char(collencoding)='EUC_CN';

pg_encoding_to_char | collcollate | collctype
---------------------+--------------+--------------
| C | C
| POSIX | POSIX
EUC_CN | zh_CN | zh_CN
EUC_CN | zh_CN.gb2312 | zh_CN.gb2312
EUC_CN | zh_SG | zh_SG
EUC_CN | zh_SG.gb2312 | zh_SG.gb2312
(6 rows)
```

sql_ascii

postgres=# select pg_encoding_to_char(collencoding), collcollate, collctype from pg_collation where (pg_encoding_to_char(collencoding)='' and collname<>'default' and collprovider<>'i') or pg_encoding_to_char(collencoding)='SQL_ASCII'; pg_encoding_to_char | collcollate | collctype ---------------------+-------------+----------- | C | C | POSIX | POSIX (2 rows)

例子

```
postgres=# select pg_encoding_to_char(collencoding),
collcollate,
collctype
from pg_collation
where (pg_encoding_to_char(collencoding)='' and collcollate<>'' and collprovider<>'i')
or
pg_encoding_to_char(collencoding)='EUC_JP';
pg_encoding_to_char | collcollate | collctype
---------------------+--------------+--------------
| C | C
| POSIX | POSIX
EUC_JP | ja_JP | ja_JP
EUC_JP | ja_JP.eucjp | ja_JP.eucjp
EUC_JP | ja_JP.ujis | ja_JP.ujis
EUC_JP | japanese | japanese
EUC_JP | japanese.euc | japanese.euc
(7 rows)

postgres=# create database db123 with template template0 encoding 'EUC_JP' lc_collate 'ja_JP' lc_ctype 'japanese';
CREATE DATABASE
```

当客户端与数据库端字符集不相同时,如何转换字符集

```
postgres=# \h create conver
Command: CREATE CONVERSION
Description: define a new encoding conversion
Syntax:
CREATE [ DEFAULT ] CONVERSION name
FOR source_encoding TO dest_encoding FROM function_name

URL: https://www.postgresql.org/docs/12/sql-createconversion.html
```

CREATE CONVERSION defines a new conversion between character set encodings. Also, conversions that are marked DEFAULT can be used for automatic encoding conversion between client and server. For this purpose, two conversions, from encoding A to B and from encoding B to A, must be defined.

To be able to create a conversion, you must have EXECUTE privilege on the function and CREATE privilege on the destination schema.

查看pg_conversion系统表

```
postgres=# select
pg_encoding_to_char(conforencoding),
pg_encoding_to_char(contoencoding),conproc,*
from pg_conversion;

pg_encoding_to_char | pg_encoding_to_char | conproc | oid | conname | connamespace | conowner | conforencoding | contoencoding | conproc | condefault
---------------------+---------------------+--------------------------------+------+--------------------------------+--------------+----------+----------------+---------------+--------------------------------+------------
SQL_ASCII | MULE_INTERNAL | ascii_to_mic | 4400 | ascii_to_mic | 11 | 10 | 0 | 7 | ascii_to_mic | t
MULE_INTERNAL | SQL_ASCII | mic_to_ascii | 4401 | mic_to_ascii | 11 | 10 | 7 | 0 | mic_to_ascii | t
KOI8R | MULE_INTERNAL | koi8r_to_mic | 4402 | koi8_r_to_mic | 11 | 10 | 22 | 7 | koi8r_to_mic | t
MULE_INTERNAL | KOI8R | mic_to_koi8r | 4403 | mic_to_koi8_r | 11 | 10 | 7 | 22 | mic_to_koi8r | t
ISO_8859_5 | MULE_INTERNAL | iso_to_mic | 4404 | iso_8859_5_to_mic | 11 | 10 | 25 | 7 | iso_to_mic | t
MULE_INTERNAL | ISO_8859_5 | mic_to_iso | 4405 | mic_to_iso_8859_5 | 11 | 10 | 7 | 25 | mic_to_iso | t
WIN1251 | MULE_INTERNAL | win1251_to_mic | 4406 | windows_1251_to_mic | 11 | 10 | 23 | 7 | win1251_to_mic | t
MULE_INTERNAL | WIN1251 | mic_to_win1251 | 4407 | mic_to_windows_1251 | 11 | 10 | 7 | 23 | mic_to_win1251 | t
WIN866 | MULE_INTERNAL | win866_to_mic | 4408 | windows_866_to_mic | 11 | 10 | 20 | 7 | win866_to_mic | t
```

如何使用UDF接口,转换字符集

https://www.postgresql.org/docs/12/functions-string.html

postgres=# \df *.*conv* List of functions Schema | Name | Result data type | Argument data types | Type ------------+--------------------------+------------------+---------------------+------ pg_catalog | convert | bytea | bytea, name, name | func pg_catalog | convert_from | text | bytea, name | func pg_catalog | convert_to | bytea | text, name | func pg_catalog | pg_conversion_is_visible | boolean | oid | func (4 rows)

convert_from Convert string to the database encoding. The original encoding is specified by src_encoding. The string must be valid in this encoding.

convert_to Convert string to dest_encoding.

convert Convert string to dest_encoding. The original encoding is specified by src_encoding. The string must be valid in this encoding. Conversions can be defined by CREATE CONVERSION. Also there are some predefined conversions. See Table 9.11 for available conversions.

如何根据encodign id查询字符集

查询当前数据库字符集

指定字符集名,查询字符集ID

查询客户端字符集

查询指定字符集的最大宽度

根据字符集ID,查询对应的字符集名称

postgres=# \df *.*encod* List of functions Schema | Name | Result data type | Argument data types | Type ------------+------------------------+------------------+---------------------+------ pg_catalog | getdatabaseencoding | name | | func pg_catalog | pg_char_to_encoding | integer | name | func pg_catalog | pg_client_encoding | name | | func pg_catalog | pg_encoding_max_length | integer | integer | func pg_catalog | pg_encoding_to_char | name | integer | func

导出PG支持的libc collate, ctype

```
create table t13 (c name);
copy t13 from stdin;

EUC_CN
EUC_JP
EUC_JIS_2004
EUC_KR
EUC_TW
ISO_8859_5
ISO_8859_6
ISO_8859_7
ISO_8859_8
KOI8R
KOI8U
LATIN1
LATIN2
LATIN3
LATIN4
LATIN5
LATIN6
LATIN7
LATIN8
LATIN9
LATIN10
MULE_INTERNAL
SQL_ASCII
UTF8
WIN866
WIN874
WIN1250
WIN1251
WIN1252
WIN1253
WIN1254
WIN1255
WIN1256
WIN1257
WIN1258
.

-- 以上取值来自手册,支持哪些server端encoding, 不包括icu

create or replace function get_coll_ctype (in c name, out c1 name, out c2 name) returns setof record as $$
select
collcollate,
collctype
from pg_collation
where (pg_encoding_to_char(collencoding)='' and collcollate<>'' and collprovider<>'i')
or
pg_encoding_to_char(collencoding)=c;
$$ language sql strict;

select rtrim(c,' '), (get_coll_ctype(rtrim(c,' '))).* from t13 where rtrim(c,' ')<>''; ```

如何扩展libc collate

如果PG运行操作系统的libc有更新,支持了更多的collate,如何导入呢?

https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-ADMIN-COLLATION

Table 9.91. Collation Management Functions

Name | Return Type| Description
---|---|---
pg_collation_actual_version(oid)| text| Return actual version of collation from operating system
pg_import_system_collations(schema regnamespace)| integer| Import operating system collations

pg_collation_actual_version returns the actual version of the collation object as it is currently installed in the operating system. If this is different from the value in pg_collation.collversion, then objects depending on the collation might need to be rebuilt. See also ALTER COLLATION.

pg_import_system_collations adds collations to the system catalog pg_collation based on all the locales it finds in the operating system. This is what initdb uses; see Section 23.2.2 for more details. If additional locales are installed into the operating system later on, this function can be run again to add collations for the new locales. Locales that match existing entries in pg_collation will be skipped. (But collation objects based on locales that are no longer present in the operating system are not removed by this function.) The schema parameter would typically be pg_catalog, but that is not a requirement; the collations could be installed into some other schema as well. The function returns the number of new collation objects it created.

```
postgres=# select pg_import_system_collations('pg_catalog');
pg_import_system_collations


                       0

(1 row)
```

注意

以上例子没有考虑ICU。(PG支持libc , icu的collate. )

支持哪些libc collate和操作系统的libc库有关, 详见手册。

https://www.postgresql.org/docs/current/collation.html

Managing Collations
A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system. A collation definition has a provider that specifies which library supplies the locale data. One standard provider name is libc, which uses the locales provided by the operating system C library. These are the locales that most tools provided by the operating system use. Another provider is icu, which uses the external ICU library. ICU locales can only be used if support for ICU was configured when PostgreSQL was built.

A collation object provided by libc maps to a combination of LC_COLLATE and LC_CTYPE settings, as accepted by the setlocale() system library call. (As the name would suggest, the main purpose of a collation is to set LC_COLLATE, which controls the sort order. But it is rarely necessary in practice to have an LC_CTYPE setting that is different from LC_COLLATE, so it is more convenient to collect these under one concept than to create another infrastructure for setting LC_CTYPE per expression.) Also, a libc collation is tied to a character set encoding (see Section 23.3). The same collation name may exist for different encodings.

A collation object provided by icu maps to a named collator provided by the ICU library. ICU does not support separate “collate” and “ctype” settings, so they are always the same. Also, ICU collations are independent of the encoding, so there is always only one ICU collation of a given name in a database.

参考

src/backend/utils/mb/encnames.c

https://www.postgresql.org/docs/12/multibyte.html

https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-ADMIN-COLLATION

https://www.postgresql.org/docs/current/collation.html

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论