PostgreSQL中bit, boolean类型以及字符串类型的大小写

数据库杂记 2023-04-09

138

前言

这些年来，一直喜欢PG，有一大原因就是总能为各类问题找到或者接近找到相应的解决方案。随着用户范围的不断壮大，其易用性和可用性会越来越好。

1、bit以及boolean类型

最早起缘于2017年就有人报的一个BUG/issue, 见：https://github.com/pgjdbc/pgjdbc/issues/892，但是直到去年2022年才最终得到解决。这期间，估计经历过好几轮的折腾。这个bug的问题主要就是bit类型的字段，getObject()的调用会报异常。

还有一个类似的jdbc的问题（我在2020年报的)：

https://github.com/pgjdbc/pgjdbc/issues/1754 是关于boolean类型的SQL类型，得到的Java Type是Types.BIT but not Types.Boolean。他们也不打算fix 了，担心会影响很多已有功能。这在某种程度上还是有影响的。

这样一来，大家在处理这两种类型的元信息的时候，需要特别小心。一不小心，可能就会出错。

如果实在怕出错，可以尽量少去用它。

再来看看一些O-R Mapping的JPA库，对这些特殊字段类型的支持（映射）是什么样的？

这里只引用EclipseLink里头关于PostgreSQL Platform/Database的一段代码：(详见：https://github.com/eclipse-ee4j/eclipselink/blob/73f23ffed443cf5bb740e2c22dbd7b2ea07aca86/foundation/org.eclipse.persistence.core/src/main/java/org/eclipse/persistence/platform/database/PostgreSQLPlatform.java)

   @Override
    protected Hashtable<Class<?>, FieldTypeDefinition> buildFieldTypes() {
        Hashtable<Class<?>, FieldTypeDefinition> fieldTypeMapping = new Hashtable<>();

        fieldTypeMapping.put(Boolean.class, new FieldTypeDefinition("BOOLEAN", false));

        fieldTypeMapping.put(Integer.class, new FieldTypeDefinition("INTEGER", false));
        fieldTypeMapping.put(Long.class, new FieldTypeDefinition("BIGINT", false));
        fieldTypeMapping.put(Float.class, new FieldTypeDefinition("FLOAT", false));
        fieldTypeMapping.put(Double.class, new FieldTypeDefinition("FLOAT", false));
        fieldTypeMapping.put(Short.class, new FieldTypeDefinition("SMALLINT", false));
        fieldTypeMapping.put(Byte.class, new FieldTypeDefinition("SMALLINT", false));
        fieldTypeMapping.put(java.math.BigInteger.class, new FieldTypeDefinition("BIGINT", false));
        fieldTypeMapping.put(java.math.BigDecimal.class, new FieldTypeDefinition("DECIMAL", 38));
        fieldTypeMapping.put(Number.class, new FieldTypeDefinition("DECIMAL", 38));

        fieldTypeMapping.put(String.class, new FieldTypeDefinition("VARCHAR", DEFAULT_VARCHAR_SIZE));
        fieldTypeMapping.put(Character.class, new FieldTypeDefinition("CHAR", 1));

        fieldTypeMapping.put(Byte[].class, new FieldTypeDefinition("BYTEA", false));
        fieldTypeMapping.put(Character[].class, new FieldTypeDefinition("TEXT", false));
        fieldTypeMapping.put(byte[].class, new FieldTypeDefinition("BYTEA", false));
        fieldTypeMapping.put(char[].class, new FieldTypeDefinition("TEXT", false));
        fieldTypeMapping.put(java.sql.Blob.class, new FieldTypeDefinition("BYTEA"));
        fieldTypeMapping.put(java.sql.Clob.class, new FieldTypeDefinition("TEXT", false));

        fieldTypeMapping.put(java.sql.Date.class, new FieldTypeDefinition("DATE", false));
        fieldTypeMapping.put(java.sql.Time.class, new FieldTypeDefinition("TIME", false));
        fieldTypeMapping.put(java.sql.Timestamp.class, new FieldTypeDefinition("TIMESTAMP", false));

        fieldTypeMapping.put(java.time.LocalDate.class, new FieldTypeDefinition("DATE", false));
        fieldTypeMapping.put(java.time.LocalDateTime.class, new FieldTypeDefinition("TIMESTAMP", false));
        fieldTypeMapping.put(java.time.LocalTime.class, new FieldTypeDefinition("TIME", false));
        fieldTypeMapping.put(java.time.OffsetDateTime.class, new FieldTypeDefinition("TIMESTAMP", false));
        fieldTypeMapping.put(java.time.OffsetTime.class, new FieldTypeDefinition("TIME", false));

        return fieldTypeMapping;
    }

我们也顺便看看HANA数据库的对应的映射：(see: https://github.com/eclipse-ee4j/eclipselink/blob/73f23ffed443cf5bb740e2c22dbd7b2ea07aca86/foundation/org.eclipse.persistence.core/src/main/java/org/eclipse/persistence/platform/database/HANAPlatform.java)

@Override
   protected Hashtable<Class<?>, FieldTypeDefinition> buildFieldTypes() {
       final Hashtable<Class<?>, FieldTypeDefinition> fieldTypeMapping = new Hashtable<>();
       fieldTypeMapping.put(Boolean.class, new FieldTypeDefinition("SMALLINT", false)); // TODO
                                                                                        // boolean
       fieldTypeMapping.put(Number.class, new FieldTypeDefinition("DOUBLE", false));
       fieldTypeMapping.put(Short.class, new FieldTypeDefinition("SMALLINT", false));
       fieldTypeMapping.put(Integer.class, new FieldTypeDefinition("INTEGER", false));
       fieldTypeMapping.put(Long.class, new FieldTypeDefinition("BIGINT", false));
       fieldTypeMapping.put(Float.class, new FieldTypeDefinition("FLOAT", false));
       fieldTypeMapping.put(Double.class, new FieldTypeDefinition("DOUBLE", false));

       fieldTypeMapping.put(BigInteger.class, new FieldTypeDefinition("DECIMAL", 34));
       fieldTypeMapping.put(BigDecimal.class,
               new FieldTypeDefinition("DECIMAL", 34).setLimits(34, -34, 34));

       fieldTypeMapping.put(Character.class, new FieldTypeDefinition("NCHAR", 1));
       fieldTypeMapping.put(Character[].class, new FieldTypeDefinition("NVARCHAR", 255));
       fieldTypeMapping.put(char[].class, new FieldTypeDefinition("NVARCHAR", 255));
       fieldTypeMapping.put(String.class, new FieldTypeDefinition("NVARCHAR", 255));

       fieldTypeMapping.put(Byte.class, new FieldTypeDefinition("SMALLINT", false));
       fieldTypeMapping.put(Byte[].class, new FieldTypeDefinition("VARBINARY", 255));
       fieldTypeMapping.put(byte[].class, new FieldTypeDefinition("VARBINARY", 255));

       fieldTypeMapping.put(Blob.class, new FieldTypeDefinition("BLOB", false));
       fieldTypeMapping.put(Clob.class, new FieldTypeDefinition("NCLOB", false));

       fieldTypeMapping.put(Date.class, new FieldTypeDefinition("DATE", false));
       fieldTypeMapping.put(Time.class, new FieldTypeDefinition("TIME", false));
       fieldTypeMapping.put(Timestamp.class, new FieldTypeDefinition("TIMESTAMP", false));
       return fieldTypeMapping;
   }

从上述映射中，我们也能看到一些不相同的地方。

针对boolean的java类型，PG中还是可以用boolean的字段类型来完成映射的。看来一些框架在内部应该做了相应处理。但是针对HANA的实现，仍然是使用SMALLINT字段类型来完成映射，就算是HANA现在已经有了boolean字段类型也没用。因为在你使用eclipse link时，使用的Object Query会直接翻译成SMALLINT字段类型的处理方式。这在使用这些框架进行开发的时候尤其需要注意。基本上使用它们的时候，HANA里就没办法使用BOOLEAN字段类型了。

同样，针对PostgreSQL，因为找不到"BIT"中"BIT[]"字段类型的对应的映射，所以这个字段类型在使用相应的JPA的时候，也极容易出错。除非自己去控制。

一个有意思的现象是，PG没有针对byte[]的专门的VARBINARY类型，而是使用bytea通吃。了解到这一点就好。

2、字符串类型及大小写问题

2.1 先看看PostgreSQL数据库

先看下边一则实例：

mydb=# create table users(id int primary key, name varchar(32));
CREATE TABLE
mydb=# insert into users values(1, 'Wang jun'), (2, 'jia Li'), (3, 'Hu Lai');
INSERT 0 3
mydb=# select id from users where name = 'wang jun';
 id
----
(0 rows)

mydb=# select id from users where lower(name) = 'wang jun';
 id
----
  1
(1 row)

mydb=# CREATE OR REPLACE FUNCTION random_string( int ) RETURNS TEXT as $$
mydb$#     SELECT string_agg(substring('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', round(random() * 52 + 0.5)::integer, 1), '')
mydb$#         FROM generate_series(1, $1);
mydb$# $$ language sql;
CREATE FUNCTION
mydb=# insert into users select n, random_string(12) from generate_series(4, 200000) as n;
INSERT 0 199997
mydb=# select * from users where id between 100001 and 100004;
   id   |     name
--------+--------------
 100001 | lAanFDGkzpmD
 100002 | PGKewAuAEUok
 100003 | pKEFWfqcytUP
 100004 | VcXrvhtyASIB
(4 rows)

建了一张表users, 如果针对name字段，要求不区分大小写并可以精确匹配。

2.1.1 函数索引

数据量大的话，PG里头，一种常见的方案是直接建一个函数索引即可。如下所示：

-- 建索引前
mydb=# explain (analyze, buffers, verbose) select * from users where lower(name) = lower('PGKewAuAEUok');
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..4138.71 rows=1000 width=17) (actual time=35.433..37.053 rows=1 loops=1)
   Output: id, name
   Workers Planned: 1
   Workers Launched: 1
   Buffers: shared hit=1274
   ->  Parallel Seq Scan on public.users  (cost=0.00..3038.71 rows=588 width=17) (actual time=25.160..33.527 rows=0 loops=2)
         Output: id, name
         Filter: (lower((users.name)::text) = 'pgkewauaeuok'::text)
         Rows Removed by Filter: 100000
         Buffers: shared hit=1274
         Worker 0:  actual time=15.260..31.994 rows=1 loops=1
           Buffers: shared hit=587
 Planning Time: 0.065 ms
 Execution Time: 37.068 ms
(14 rows)

-- 建索引之后
mydb=# create index idx_users_lower_name on users(lower(name));
CREATE INDEX
mydb=# explain (analyze, buffers, verbose) select * from users where lower(name) = lower('PGKewAuAEUok');
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.users  (cost=24.17..1294.74 rows=1000 width=17) (actual time=0.036..0.037 rows=1 loops=1)
   Output: id, name
   Recheck Cond: (lower((users.name)::text) = 'pgkewauaeuok'::text)
   Heap Blocks: exact=1
   Buffers: shared hit=1 read=3
   ->  Bitmap Index Scan on idx_users_lower_name  (cost=0.00..23.92 rows=1000 width=0) (actual time=0.033..0.033 rows=1 loops=1)
         Index Cond: (lower((users.name)::text) = 'pgkewauaeuok'::text)
         Buffers: shared read=3
 Planning:
   Buffers: shared hit=18 read=1
 Planning Time: 0.187 ms
 Execution Time: 0.050 ms
(12 rows)

我们发现，建完索引之后，查询时间立刻从37ms降到0.05ms。效果是立杆见影的。

2.1.2 使用特有插件(citext)

我们可以看下文档：https://www.postgresql.org/docs/15/citext.html

mydb=# create extension citext;
CREATE EXTENSION
mydb=# select 'abcD'::citext;
 citext
--------
 abcD
(1 row)

mydb=# select 'abcD'::citext = 'Abcd'::citext;
 ?column?
----------
 t
(1 row)

我们看到，它确实不区分大小写。

针对前边的users表，我们做一下整改。

mydb=# drop index idx_users_lower_name;
DROP INDEX
mydb=# alter table users alter name type citext;
ALTER TABLE
mydb=# select * from users where name = 'PGKewAuAEUok';
   id   |     name
--------+--------------
 100002 | PGKewAuAEUok
(1 row)
mydb=# explain (analyze, buffers, verbose) select * from users where name = 'PGKewAuAEUok';
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Seq Scan on public.users  (cost=0.00..3774.00 rows=1000 width=36) (actual time=52.640..103.731 rows=1 loops=1)
   Output: id, name
   Filter: (users.name = 'PGKewAuAEUok'::citext)
   Rows Removed by Filter: 199999
   Buffers: shared hit=1274
 Planning Time: 0.050 ms
 Execution Time: 103.749 ms
(7 rows)

建个索引试试：

mydb=# create index idx_users_citext_name on users(name);
CREATE INDEX
mydb=# explain (analyze, buffers, verbose) select * from users where name = 'PGKewAuAEUok';
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.users  (cost=24.17..1292.24 rows=1000 width=36) (actual time=0.052..0.052 rows=1 loops=1)
   Output: id, name
   Recheck Cond: (users.name = 'PGKewAuAEUok'::citext)
   Heap Blocks: exact=1
   Buffers: shared hit=1 read=3
   ->  Bitmap Index Scan on idx_users_citext_name  (cost=0.00..23.92 rows=1000 width=0) (actual time=0.048..0.049 rows=1 loops=1)
         Index Cond: (users.name = 'PGKewAuAEUok'::citext)
         Buffers: shared read=3
 Planning:
   Buffers: shared hit=16 read=1
 Planning Time: 0.166 ms
 Execution Time: 0.067 ms
(12 rows)

mydb=# \timing
Timing is on.
mydb=# select * from users where name = 'PGKewAuAEUok';
   id   |     name
--------+--------------
 100002 | PGKewAuAEUok
(1 row)

Time: 0.341 ms

看起来两种方案都是可以的。citext类型，可能消耗空间要大一些。有兴趣可以继续挖一挖。

2.2 HANA之类的数据库怎么解决

截至目前，HANA还不支持基于函数的索引，否则PG的第一个方案可以直接照搬。要想实现大数据量下的查询，还要保持效率，昨办？有同学会问，HANA是列存储，不建索引，一样高效。话是这么说，HANA同时也支持行存储啊。或者换个别的不支持函数索引的数据库，怎么弄？

CREATE FUNCTION random_string (IN max_count int)
RETURNS val CLOB
AS
BEGIN
 SELECT string_agg(SUBSTRING('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', round((rand() * 52 + 0.5)), 1), '') INTO val  FROM SERIES_GENERATE_DECIMAL(1, 0, max_count);
END;

create table users(id int primary key, name varchar(32), name2 varchar(32) GENERATED ALWAYS AS lower(name));

INSERT INTO  USERS(id, name) SELECT element_number, random_string(12)  FROM  SERIES_GENERATE_DECIMAL(1,0,100000);

SET 'CASE_SENSITIVE' = 'TRUE' ;
SELECT 1 FROM users WHERE 'abc' = 'ABC';

从上边我们也能看出来，也有两种方案：

2.2.1 直接设置session一级的选项：

SET 'CASE_SENSITIVE' = 'TRUE' ;

2.2.2 采用增加一个自动生成列的方式

如这里name2，它里边存储的是name的小写值，然后基于它去创建索引，查询之类的，使用这个冗余列去作为条件列。不过，这样的改动，对于开发人员而言，可能会引起代码不一致的地方（如果考虑同时支持多种不同的DB）。

当然HANA还有别的方式，如正则表达式，Text search来解决这类问题，但那是另一个层面上的事了。

postgresql bit 字符串函数大小写时间戳

文章转载自数据库杂记，如果涉嫌侵权，请发送邮件至：contact@modb.pro进行举报，并提供相关证据，一经查实，墨天轮将立刻删除相关内容。