暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

PostgreSQL batch insert - jdbc reWriteBatchedInserts 批量写入 (copy,insert,begin commit,group commit)

digoal 2019-10-30
3428

作者

digoal

日期

2019-10-30

标签

PostgreSQL , batch , jdbc , reWriteBatchedInserts , values (),(),()... , prepare statement


背景

如何快速将数据写入PG数据库?除了本身数据库的优化(例如:1、异步提交。2、延迟analyze。3、延迟创建索引。4、拉长检查点。5、增加shared buffer。等等),在写入方式上也有讲究。

写入方式如下:

1、copy 性能最好

例如,

```
db2=> \h copy
Command: COPY
Description: copy data between a file and a table
Syntax:
COPY table_name [ ( column_name [, ...] ) ]
FROM { 'filename' | PROGRAM 'command' | STDIN }
[ [ WITH ] ( option [, ...] ) ]
[ WHERE condition ]

COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
TO { 'filename' | PROGRAM 'command' | STDOUT }
[ [ WITH ] ( option [, ...] ) ]

where option can be one of:

FORMAT format_name  
FREEZE [ boolean ]  
DELIMITER 'delimiter_character'  
NULL 'null_string'  
HEADER [ boolean ]  
QUOTE 'quote_character'  
ESCAPE 'escape_character'  
FORCE_QUOTE { ( column_name [, ...] ) | * }  
FORCE_NOT_NULL ( column_name [, ...] )  
FORCE_NULL ( column_name [, ...] )  
ENCODING 'encoding_name'

URL: https://www.postgresql.org/docs/12/sql-copy.html

db2=> create table t(id int);
CREATE TABLE

db2=> copy t from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.

1
2
3
4
5
6
.
COPY 6
```

2、insert into .. values (),(),...; 性能与copy差不多

```
db2=> insert into t values (1),(2),(3);
INSERT 0 3

db2=> prepare a (int,int,int) as insert into t values ($1),($2),($3);
PREPARE
db2=> execute a(1,2,3);
INSERT 0 3
db2=> select * from t;
id


1
2
3
(3 rows)
```

3、事务,多条语句放在一个事务中,减少wal flush io同步等待。

begin ... commit;

db2=> begin; BEGIN db2=> insert into t values (4); INSERT 0 1 db2=> insert into t values (5); INSERT 0 1 db2=> insert into t values (6); INSERT 0 1 db2=> commit; COMMIT

4、分组提交 group commit

当配置了commit_delay时,如果同时进入提交状态的事务达到commit_siblings,他们只需要flush一次wal日志,从而减少wal io。

仅对高并发,autocommit有提升效果,低并发没有太大效果。

  • ommit_delay (integer)

commit_delay adds a time delay, measured in microseconds, before a WAL flush is initiated. This can improve group commit throughput by allowing a larger number of transactions to commit via a single WAL flush, if system load is high enough that additional transactions become ready to commit within the given interval. However, it also increases latency by up to commit_delay microseconds for each WAL flush. Because the delay is just wasted if no other transactions become ready to commit, a delay is only performed if at least commit_siblings other transactions are active when a flush is about to be initiated. Also, no delays are performed if fsync is disabled. The default commit_delay is zero (no delay). Only superusers can change this setting.

In PostgreSQL releases prior to 9.3, commit_delay behaved differently and was much less effective: it affected only commits, rather than all WAL flushes, and waited for the entire configured delay even if the WAL flush was completed sooner. Beginning in PostgreSQL 9.3, the first process that becomes ready to flush waits for the configured interval, while subsequent processes wait only until the leader completes the flush operation.

  • ommit_siblings (integer)

Minimum number of concurrent open transactions to require before performing the commit_delay delay. A larger value makes it more probable that at least one other transaction will become ready to commit during the delay interval. The default is five transactions.

查询参数

```
db2=> show commit_delay ;
commit_delay


0
(1 row)

db2=> show commit_siblings ;
commit_siblings


5
(1 row)
```

jdbc 支持 insert into .. values (),(),...;

https://stackoverflow.com/questions/47664889/jdbc-batch-operations-understanding/48349524#48349524

https://vladmihalcea.com/postgresql-multi-row-insert-rewritebatchedinserts-property/

https://jdbc.postgresql.org/documentation/head/connect.html

  • reWriteBatchedInserts = boolean

This will change batch inserts from insert into foo (col1, col2, col3) values (1,2,3) into insert into foo (col1, col2, col3) values (1,2,3), (4,5,6) this provides 2-3x performance improvement

例子1,假的batch

```
@Entity(name = "Post")
@Table(name = "post")
public class Post {

@Id  
@GeneratedValue(  
    strategy = GenerationType.SEQUENCE  
)  
private Long id;

private String title;

public Post() {}

public Post(String title) {  
    this.title = title;  
}

//Getters and setters omitted for brevity

}
```

<property name="hibernate.jdbc.batch_size" value="10" />

for (int i = 0; i < 10; i++) { entityManager.persist( new Post( String.format("Post no. %d", i + 1) ) ); }

Query: ["insert into post (title, id) values (?, ?)"], Params:[(Post no. 1, 1), (Post no. 2, 2), (Post no. 3, 3), (Post no. 4, 4), (Post no. 5, 5), (Post no. 6, 6), (Post no. 7, 7), (Post no. 8, 8), (Post no. 9, 9), (Post no. 10, 10) ]

实际上在数据库中拆分成了10条insert

```
log_statement = 'all'

LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 1', $2 = '1'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 2', $2 = '2'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 3', $2 = '3'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 4', $2 = '4'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 5', $2 = '5'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 6', $2 = '6'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 7', $2 = '7'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 8', $2 = '8'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 9', $2 = '9'
LOG: execute S_2: insert into post (title, id) values ($1, $2)
DETAIL: parameters: $1 = 'Post no. 10', $2 = '10'
```

例子2,真的batch

```
PGSimpleDataSource dataSource =
(PGSimpleDataSource) super.dataSource();

dataSource.setReWriteBatchedInserts(true);
```

JDBC batch API. That is ···PreparedStatement#addBatch()··· enables driver to send multiple "query executions" in a single network roundtrip. Current implementation, however would still split large batches into smaller ones to avoid TCP deadlock.

为了防止tcp deadlock会拆分成多条,可以看到启用setReWriteBatchedInserts后,已经支持了insert into () values (),(),()...

LOG: execute <unnamed>: insert into post (title, id) values ($1, $2),($3, $4),($5, $6),($7, $8),($9, $10),($11, $12),($13, $14),($15, $16) DETAIL: parameters: $1 = 'Post no. 1', $2 = '1', $3 = 'Post no. 2', $4 = '2', $5 = 'Post no. 3', $6 = '3', $7 = 'Post no. 4', $8 = '4', $9 = 'Post no. 5', $10 = '5', $11 = 'Post no. 6', $12 = '6', $13 = 'Post no. 7', $14 = '7', $15 = 'Post no. 8', $16 = '8' LOG: execute <unnamed>: insert into post (title, id) values ($1, $2),($3, $4) DETAIL: parameters: $1 = 'Post no. 9', $2 = '9', $3 = 'Post no. 10', $4 = '10'

参考

https://stackoverflow.com/questions/47664889/jdbc-batch-operations-understanding/48349524#48349524

https://vladmihalcea.com/postgresql-multi-row-insert-rewritebatchedinserts-property/

https://jdbc.postgresql.org/documentation/head/connect.html

PostgreSQL 许愿链接

您的愿望将传达给PG kernel hacker、数据库厂商等, 帮助提高数据库产品质量和功能, 说不定下一个PG版本就有您提出的功能点. 针对非常好的提议,奖励限量版PG文化衫、纪念品、贴纸、PG热门书籍等,奖品丰富,快来许愿。开不开森.

9.9元购买3个月阿里云RDS PostgreSQL实例

PostgreSQL 解决方案集合

德哥 / digoal's github - 公益是一辈子的事.

digoal's wechat

文章转载自digoal,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论