
Preface

One of the main reasons I have liked PostgreSQL all these years is that it almost always offers a solution, or something close to one, for every kind of problem. And as its user base keeps growing, its usability and availability keep improving.

1. bit and boolean types

This story begins with a bug/issue first reported back in 2017 (see https://github.com/pgjdbc/pgjdbc/issues/892), which was not finally resolved until last year, 2022. It presumably went through several rounds of back and forth in between. The core problem: calling getObject() on a column of type bit threw an exception.
There is also a similar JDBC issue (which I reported in 2020):
https://github.com/pgjdbc/pgjdbc/issues/1754. It concerns the boolean SQL type: the Java type reported for it is Types.BIT rather than Types.BOOLEAN. The maintainers do not plan to fix this one either, for fear of breaking a lot of existing functionality, so to some degree the impact remains.
As a result, everyone needs to be especially careful when handling the metadata of these two types; a moment of carelessness can lead to errors. If you are really worried about getting it wrong, use them as little as possible.
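To make the pitfall concrete, here is a minimal, hedged JDBC sketch. The flags table, its columns, and the connection details are invented for illustration, and the exact behavior depends on your pgjdbc version; the point is simply to check both type constants and to prefer getBoolean() over getObject():

// Sketch: defensively reading a PostgreSQL boolean column over JDBC.
// Assumes a hypothetical table: create table flags(id int, active boolean);
import java.sql.*;

public class BooleanBitDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("select active from flags")) {
            ResultSetMetaData md = rs.getMetaData();
            // pgjdbc reports boolean columns as Types.BIT (issue #1754),
            // so check both constants instead of relying on Types.BOOLEAN.
            int t = md.getColumnType(1);
            boolean looksBoolean = (t == Types.BOOLEAN || t == Types.BIT);
            while (rs.next()) {
                // getBoolean() is the safer accessor here: getObject() on
                // bit columns used to throw (issue #892, fixed in 2022).
                boolean active = rs.getBoolean(1);
                System.out.println(looksBoolean + " " + active);
            }
        }
    }
}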
Now let's look at how some O-R mapping (JPA) libraries support, i.e. map, these special column types.
Here is a snippet from EclipseLink's PostgreSQL platform/database class (see https://github.com/eclipse-ee4j/eclipselink/blob/73f23ffed443cf5bb740e2c22dbd7b2ea07aca86/foundation/org.eclipse.persistence.core/src/main/java/org/eclipse/persistence/platform/database/PostgreSQLPlatform.java):
@Override
protected Hashtable<Class<?>, FieldTypeDefinition> buildFieldTypes() {
    Hashtable<Class<?>, FieldTypeDefinition> fieldTypeMapping = new Hashtable<>();
    fieldTypeMapping.put(Boolean.class, new FieldTypeDefinition("BOOLEAN", false));
    fieldTypeMapping.put(Integer.class, new FieldTypeDefinition("INTEGER", false));
    fieldTypeMapping.put(Long.class, new FieldTypeDefinition("BIGINT", false));
    fieldTypeMapping.put(Float.class, new FieldTypeDefinition("FLOAT", false));
    fieldTypeMapping.put(Double.class, new FieldTypeDefinition("FLOAT", false));
    fieldTypeMapping.put(Short.class, new FieldTypeDefinition("SMALLINT", false));
    fieldTypeMapping.put(Byte.class, new FieldTypeDefinition("SMALLINT", false));
    fieldTypeMapping.put(java.math.BigInteger.class, new FieldTypeDefinition("BIGINT", false));
    fieldTypeMapping.put(java.math.BigDecimal.class, new FieldTypeDefinition("DECIMAL", 38));
    fieldTypeMapping.put(Number.class, new FieldTypeDefinition("DECIMAL", 38));
    fieldTypeMapping.put(String.class, new FieldTypeDefinition("VARCHAR", DEFAULT_VARCHAR_SIZE));
    fieldTypeMapping.put(Character.class, new FieldTypeDefinition("CHAR", 1));
    fieldTypeMapping.put(Byte[].class, new FieldTypeDefinition("BYTEA", false));
    fieldTypeMapping.put(Character[].class, new FieldTypeDefinition("TEXT", false));
    fieldTypeMapping.put(byte[].class, new FieldTypeDefinition("BYTEA", false));
    fieldTypeMapping.put(char[].class, new FieldTypeDefinition("TEXT", false));
    fieldTypeMapping.put(java.sql.Blob.class, new FieldTypeDefinition("BYTEA"));
    fieldTypeMapping.put(java.sql.Clob.class, new FieldTypeDefinition("TEXT", false));
    fieldTypeMapping.put(java.sql.Date.class, new FieldTypeDefinition("DATE", false));
    fieldTypeMapping.put(java.sql.Time.class, new FieldTypeDefinition("TIME", false));
    fieldTypeMapping.put(java.sql.Timestamp.class, new FieldTypeDefinition("TIMESTAMP", false));
    fieldTypeMapping.put(java.time.LocalDate.class, new FieldTypeDefinition("DATE", false));
    fieldTypeMapping.put(java.time.LocalDateTime.class, new FieldTypeDefinition("TIMESTAMP", false));
    fieldTypeMapping.put(java.time.LocalTime.class, new FieldTypeDefinition("TIME", false));
    fieldTypeMapping.put(java.time.OffsetDateTime.class, new FieldTypeDefinition("TIMESTAMP", false));
    fieldTypeMapping.put(java.time.OffsetTime.class, new FieldTypeDefinition("TIME", false));
    return fieldTypeMapping;
}
While we are at it, let's also look at the corresponding mapping for the HANA database (see https://github.com/eclipse-ee4j/eclipselink/blob/73f23ffed443cf5bb740e2c22dbd7b2ea07aca86/foundation/org.eclipse.persistence.core/src/main/java/org/eclipse/persistence/platform/database/HANAPlatform.java):
@Override
protected Hashtable<Class<?>, FieldTypeDefinition> buildFieldTypes() {
    final Hashtable<Class<?>, FieldTypeDefinition> fieldTypeMapping = new Hashtable<>();
    fieldTypeMapping.put(Boolean.class, new FieldTypeDefinition("SMALLINT", false)); // TODO boolean
    fieldTypeMapping.put(Number.class, new FieldTypeDefinition("DOUBLE", false));
    fieldTypeMapping.put(Short.class, new FieldTypeDefinition("SMALLINT", false));
    fieldTypeMapping.put(Integer.class, new FieldTypeDefinition("INTEGER", false));
    fieldTypeMapping.put(Long.class, new FieldTypeDefinition("BIGINT", false));
    fieldTypeMapping.put(Float.class, new FieldTypeDefinition("FLOAT", false));
    fieldTypeMapping.put(Double.class, new FieldTypeDefinition("DOUBLE", false));
    fieldTypeMapping.put(BigInteger.class, new FieldTypeDefinition("DECIMAL", 34));
    fieldTypeMapping.put(BigDecimal.class, new FieldTypeDefinition("DECIMAL", 34).setLimits(34, -34, 34));
    fieldTypeMapping.put(Character.class, new FieldTypeDefinition("NCHAR", 1));
    fieldTypeMapping.put(Character[].class, new FieldTypeDefinition("NVARCHAR", 255));
    fieldTypeMapping.put(char[].class, new FieldTypeDefinition("NVARCHAR", 255));
    fieldTypeMapping.put(String.class, new FieldTypeDefinition("NVARCHAR", 255));
    fieldTypeMapping.put(Byte.class, new FieldTypeDefinition("SMALLINT", false));
    fieldTypeMapping.put(Byte[].class, new FieldTypeDefinition("VARBINARY", 255));
    fieldTypeMapping.put(byte[].class, new FieldTypeDefinition("VARBINARY", 255));
    fieldTypeMapping.put(Blob.class, new FieldTypeDefinition("BLOB", false));
    fieldTypeMapping.put(Clob.class, new FieldTypeDefinition("NCLOB", false));
    fieldTypeMapping.put(Date.class, new FieldTypeDefinition("DATE", false));
    fieldTypeMapping.put(Time.class, new FieldTypeDefinition("TIME", false));
    fieldTypeMapping.put(Timestamp.class, new FieldTypeDefinition("TIMESTAMP", false));
    return fieldTypeMapping;
}
The two mappings reveal several differences.
For the Java Boolean type, PostgreSQL can still be mapped to a BOOLEAN column, so the framework evidently does the necessary handling internally. The HANA implementation, however, still maps it to SMALLINT, and the fact that HANA now has a BOOLEAN column type does not help: when you use EclipseLink, object queries are translated using SMALLINT column semantics. This deserves special attention when developing with these frameworks; in practice, with them you essentially cannot use the BOOLEAN column type in HANA.
Likewise, for PostgreSQL there is no mapping at all for the BIT family of column types (BIT / BIT(n)), so these types are very error-prone under the corresponding JPA frameworks, unless you take control of the mapping yourself.
One interesting observation: PostgreSQL has no dedicated VARBINARY type for byte[]; bytea covers every binary case. This is simply worth knowing.
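To illustrate that last point, here is a minimal JDBC sketch. The blobs table and connection details are made up for this example; it shows that pgjdbc moves byte[] in and out of a bytea column transparently via the standard setBytes()/getBytes() calls:

// Sketch: byte[] round-trip against a bytea column (hypothetical table).
import java.sql.*;

public class ByteaDemo {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret")) {
            try (Statement st = conn.createStatement()) {
                st.execute("create table if not exists blobs(id int primary key, data bytea)");
            }
            byte[] payload = {1, 2, 3, 4};
            try (PreparedStatement ps = conn.prepareStatement(
                    "insert into blobs values (?, ?) on conflict (id) do update set data = excluded.data")) {
                ps.setInt(1, 1);
                ps.setBytes(2, payload); // byte[] goes straight into bytea
                ps.executeUpdate();
            }
            try (PreparedStatement ps = conn.prepareStatement("select data from blobs where id = ?")) {
                ps.setInt(1, 1);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    System.out.println(rs.getBytes(1).length); // prints 4
                }
            }
        }
    }
}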
2. String types and case sensitivity
2.1 PostgreSQL first
Start with the following example:
mydb=# create table users(id int primary key, name varchar(32));
CREATE TABLE
mydb=# insert into users values(1, 'Wang jun'), (2, 'jia Li'), (3, 'Hu Lai');
INSERT 0 3
mydb=# select id from users where name = 'wang jun';
id
----
(0 rows)
mydb=# select id from users where lower(name) = 'wang jun';
id
----
1
(1 row)
mydb=# CREATE OR REPLACE FUNCTION random_string( int ) RETURNS TEXT as $$
mydb$# SELECT string_agg(substring('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', round(random() * 52 + 0.5)::integer, 1), '')
mydb$# FROM generate_series(1, $1);
mydb$# $$ language sql;
CREATE FUNCTION
mydb=# insert into users select n, random_string(12) from generate_series(4, 200000) as n;
INSERT 0 199997
mydb=# select * from users where id between 100001 and 100004;
id | name
--------+--------------
100001 | lAanFDGkzpmD
100002 | PGKewAuAEUok
100003 | pKEFWfqcytUP
100004 | VcXrvhtyASIB
(4 rows)
So we have a users table, and the requirement is exact matching on the name column that ignores case.
2.1.1 Expression (function) index
With large data volumes, a common approach in PostgreSQL is simply to create an expression index, as shown below:
-- Before creating the index
mydb=# explain (analyze, buffers, verbose) select * from users where lower(name) = lower('PGKewAuAEUok');
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..4138.71 rows=1000 width=17) (actual time=35.433..37.053 rows=1 loops=1)
Output: id, name
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=1274
-> Parallel Seq Scan on public.users (cost=0.00..3038.71 rows=588 width=17) (actual time=25.160..33.527 rows=0 loops=2)
Output: id, name
Filter: (lower((users.name)::text) = 'pgkewauaeuok'::text)
Rows Removed by Filter: 100000
Buffers: shared hit=1274
Worker 0: actual time=15.260..31.994 rows=1 loops=1
Buffers: shared hit=587
Planning Time: 0.065 ms
Execution Time: 37.068 ms
(14 rows)
-- After creating the index
mydb=# create index idx_users_lower_name on users(lower(name));
CREATE INDEX
mydb=# explain (analyze, buffers, verbose) select * from users where lower(name) = lower('PGKewAuAEUok');
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.users (cost=24.17..1294.74 rows=1000 width=17) (actual time=0.036..0.037 rows=1 loops=1)
Output: id, name
Recheck Cond: (lower((users.name)::text) = 'pgkewauaeuok'::text)
Heap Blocks: exact=1
Buffers: shared hit=1 read=3
-> Bitmap Index Scan on idx_users_lower_name (cost=0.00..23.92 rows=1000 width=0) (actual time=0.033..0.033 rows=1 loops=1)
Index Cond: (lower((users.name)::text) = 'pgkewauaeuok'::text)
Buffers: shared read=3
Planning:
Buffers: shared hit=18 read=1
Planning Time: 0.187 ms
Execution Time: 0.050 ms
(12 rows)
As we can see, once the index is in place the query time drops from 37 ms to 0.05 ms. The effect is immediate.
2.1.2 Using a dedicated extension (citext)
See the documentation: https://www.postgresql.org/docs/15/citext.html
mydb=# create extension citext;
CREATE EXTENSION
mydb=# select 'abcD'::citext;
citext
--------
abcD
(1 row)
mydb=# select 'abcD'::citext = 'Abcd'::citext;
?column?
----------
t
(1 row)
As we can see, it is indeed case-insensitive.
Now let's rework the earlier users table accordingly.
mydb=# drop index idx_users_lower_name;
DROP INDEX
mydb=# alter table users alter name type citext;
ALTER TABLE
mydb=# select * from users where name = 'PGKewAuAEUok';
id | name
--------+--------------
100002 | PGKewAuAEUok
(1 row)
mydb=# explain (analyze, buffers, verbose) select * from users where name = 'PGKewAuAEUok';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan on public.users (cost=0.00..3774.00 rows=1000 width=36) (actual time=52.640..103.731 rows=1 loops=1)
Output: id, name
Filter: (users.name = 'PGKewAuAEUok'::citext)
Rows Removed by Filter: 199999
Buffers: shared hit=1274
Planning Time: 0.050 ms
Execution Time: 103.749 ms
(7 rows)
Let's try adding an index:
mydb=# create index idx_users_citext_name on users(name);
CREATE INDEX
mydb=# explain (analyze, buffers, verbose) select * from users where name = 'PGKewAuAEUok';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.users (cost=24.17..1292.24 rows=1000 width=36) (actual time=0.052..0.052 rows=1 loops=1)
Output: id, name
Recheck Cond: (users.name = 'PGKewAuAEUok'::citext)
Heap Blocks: exact=1
Buffers: shared hit=1 read=3
-> Bitmap Index Scan on idx_users_citext_name (cost=0.00..23.92 rows=1000 width=0) (actual time=0.048..0.049 rows=1 loops=1)
Index Cond: (users.name = 'PGKewAuAEUok'::citext)
Buffers: shared read=3
Planning:
Buffers: shared hit=16 read=1
Planning Time: 0.166 ms
Execution Time: 0.067 ms
(12 rows)
mydb=# \timing
Timing is on.
mydb=# select * from users where name = 'PGKewAuAEUok';
id | name
--------+--------------
100002 | PGKewAuAEUok
(1 row)
Time: 0.341 ms
It looks like both approaches work. The citext type probably uses somewhat more space; if you are interested, it is worth digging into further.
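If you do want to dig, one quick, hedged probe (reusing the users table and the idx_users_citext_name index from above) is to compare relation sizes before and after the citext conversion, for example over JDBC:

// Sketch: measure table + index footprint with standard PG size functions.
import java.sql.*;

public class CitextSizeProbe {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "postgres", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "select pg_size_pretty(pg_total_relation_size('users')), " +
                 "       pg_size_pretty(pg_relation_size('idx_users_citext_name'))")) {
            while (rs.next()) {
                System.out.println("table+indexes: " + rs.getString(1)
                        + ", citext index: " + rs.getString(2));
            }
        }
    }
}

Running the same probe against the lower(name) expression-index variant would give you the actual comparison.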
2.2 How do databases like HANA solve this?
As of today, HANA still does not support function-based indexes, otherwise PostgreSQL's first approach could be copied verbatim. So how do we keep queries efficient over large data volumes? Some will say: HANA is a column store, it is fast even without indexes. True, but HANA also supports row storage. And what if you switch to some other database that lacks function-based indexes?
CREATE FUNCTION random_string (IN max_count INT)
RETURNS val CLOB
AS
BEGIN
    SELECT string_agg(SUBSTRING('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', round((rand() * 52 + 0.5)), 1), '')
      INTO val
      FROM SERIES_GENERATE_DECIMAL(1, 0, max_count);
END;

create table users(id int primary key, name varchar(32), name2 varchar(32) GENERATED ALWAYS AS lower(name));

INSERT INTO users(id, name) SELECT element_number, random_string(12) FROM SERIES_GENERATE_DECIMAL(1, 0, 100000);

SET 'CASE_SENSITIVE' = 'TRUE';

SELECT 1 FROM users WHERE 'abc' = 'ABC';
As the statements above show, there are two approaches here as well:
2.2.1 Set a session-level option directly:
SET 'CASE_SENSITIVE' = 'TRUE';
2.2.2 Add an automatically generated column
That is the role of name2 here: it stores the lowercase value of name, you build your index on it, and queries use this redundant column as the filter column, as in the sketch below. Note that for developers such a change can introduce inconsistencies in the code base if you have to support several different databases at the same time.
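A minimal sketch of the query side over JDBC. The connection details and the literal value are invented; it assumes the users table with the generated name2 column from above, plus an ordinary index on name2:

// Sketch: case-insensitive exact match via the lowercase shadow column.
import java.sql.*;

public class GeneratedColumnLookup {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sap://localhost:39015/", "SYSTEM", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 // Filter on name2, not on name itself, so the plain
                 // index on name2 can be used by the optimizer.
                 "select id, name from users where name2 = lower(?)")) {
            ps.setString(1, "AbCdEfGhIjKl");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " " + rs.getString(2));
                }
            }
        }
    }
}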
Of course, HANA offers other mechanisms as well, such as regular expressions and text search, to tackle this class of problem, but that belongs to a different level of the discussion.
