PGSync - Syncing PostgreSQL to Elasticsearch via Logical Replication

digoal 2021-01-05


Author

digoal

Date

2021-05-14

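Background

pgsync is an open-source tool that syncs incremental PostgreSQL data to Elasticsearch via logical replication.

https://github.com/toluaina/pgsync

Note that PostgreSQL does not currently support slot failover. If your PostgreSQL HA is built on streaming replication with a primary/standby topology, the logical replication slot is lost as soon as a failover happens, which is very unfriendly.
To address this, Alibaba Cloud RDS PG has added kernel-level support for slot failover: when an HA switchover happens, the slot fails over as well, so logical replication is not disrupted.

If you run the community version in the cloud, it is advisable to store the data on cloud disks and rely on them for multi-replica reliability, and also to set up a standby (for disaster recovery only, with no automatic switchover).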

PGSync

PostgreSQL to Elasticsearch sync

PGSync is a middleware for syncing data from Postgres to Elasticsearch effortlessly.
It allows you to keep Postgres as your source of truth and
expose structured denormalized documents in Elasticsearch.

Changes to nested entities are propagated to Elasticsearch.
PGSync's advanced query builder then generates optimized SQL queries
on the fly based on your schema.
PGSync's advisory model allows you to move and transform large volumes of data quickly whilst maintaining relational integrity.

Simply describe your document structure or schema in JSON and PGSync will
continuously capture changes in your data and load it into Elasticsearch
without writing any code.
PGSync transforms your relational data into a structured document format.

It allows you to take advantage of the expressive power and scalability of
Elasticsearch directly from Postgres.
You don't have to write complex queries and transformation pipelines.
PGSync is lightweight, flexible and fast.

Elasticsearch is better suited as a secondary denormalised search engine that accompanies a more traditional normalized datastore.
Moreover, you shouldn't store your primary data in Elasticsearch.

So how do you then get your data into Elasticsearch in the first place?
Tools like Logstash and Kafka can aid this task but they still require a bit
of engineering and development.

Extract-Transform-Load (ETL) and change data capture (CDC) tools can be complex and require expensive engineering effort.

Other benefits of PGSync include:
- Real-time analytics
- Reliable primary datastore/source of truth
- Scale on-demand
- Easily join multiple nested tables

Why?

At a high level, you have data in a Postgres database and you want to mirror it in Elasticsearch.
This means every change to your data (Insert, Update, Delete and Truncate statements) needs to be replicated to Elasticsearch.
At first this seems easy: simply add some code to copy the data to Elasticsearch after updating the database (so-called dual writes). In practice it is not.
SQL queries that span multiple tables and involve multiple relationships are hard to write.
Detecting changes within a nested document can also be quite hard.
Of course, if your data never changed, then you could just take a snapshot in time and load it into Elasticsearch as a one-off operation.
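
To make the dual-write problem concrete, here is a minimal sketch (not PGSync code). Plain dictionaries stand in for Postgres and Elasticsearch, and the hypothetical `flaky_index` function simulates a search cluster that is temporarily unreachable. It shows how the two stores can silently diverge once the database write has succeeded but the index write has not.

```python
# Illustration only: dicts stand in for Postgres and Elasticsearch.
database = {}
search_index = {}


class IndexUnavailable(Exception):
    """Raised by the simulated search index when it is unreachable."""


def flaky_index(doc_id, doc, available):
    # Hypothetical stand-in for a call to a search client.
    if not available:
        raise IndexUnavailable("search cluster unreachable")
    search_index[doc_id] = doc


def dual_write(doc_id, doc, index_available=True):
    """Write to the database first, then to the search index."""
    database[doc_id] = doc  # the "commit" succeeds
    try:
        flaky_index(doc_id, doc, index_available)
    except IndexUnavailable:
        # The database change is already committed, so the two stores
        # diverge unless the application adds retry or repair logic.
        return False
    return True


dual_write("9781471331435", {"title": "1984"})
ok = dual_write(
    "9781471331435", {"title": "Nineteen Eighty-Four"}, index_available=False
)

print(ok)                                       # False
print(database["9781471331435"]["title"])       # Nineteen Eighty-Four
print(search_index["9781471331435"]["title"])   # 1984 (stale)
```

Capturing changes from the database itself, rather than from application code, avoids this class of divergence.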

PGSync is appropriate for you if:
- Postgres is your read/write source of truth whilst Elasticsearch is your
read-only search layer.
- You need to denormalize relational data into a NoSQL data source.
- Your data is constantly changing.
- You have existing data in a relational database such as Postgres and you need
a secondary NoSQL database like Elasticsearch for text-based queries or autocomplete queries to mirror the existing data without having your application perform dual writes.
- You want to keep your existing data untouched whilst taking advantage of
the search capabilities of Elasticsearch by exposing a view of your data without compromising the security of your relational data.
- Or you simply want to expose a view of your relational data for search purposes.

How it works

PGSync is written in Python (supporting version 3.6 onwards) and the stack is composed of: Redis, Elasticsearch, Postgres, and SQLAlchemy.

PGSync leverages the logical decoding feature of Postgres (introduced in PostgreSQL 9.4) to capture a continuous stream of change events.
This feature needs to be enabled by setting the following in the postgresql.conf file:
```
wal_level = logical
```

You can select any pivot table to be the root of your document.

PGSync's query builder builds advanced queries dynamically against your schema.

PGSync operates in an event-driven model by creating triggers for tables in your database to handle notification events.

This is the only time PGSync will ever make any changes to your database.
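
As a rough illustration of the event-driven model, the sketch below consumes a queue of change events and applies them, in order, to an in-memory stand-in for a search index. It is not PGSync's implementation: the event fields (`op`, `table`, `pk`, `row`) and the `apply_event` helper are assumptions made for this example.

```python
from collections import deque

# Hypothetical change events; the field names are assumptions for this sketch.
events = deque([
    {"op": "INSERT", "table": "book", "pk": "9781471331435",
     "row": {"title": "1984"}},
    {"op": "UPDATE", "table": "book", "pk": "9781471331435",
     "row": {"title": "Nineteen Eighty-Four"}},
    {"op": "INSERT", "table": "book", "pk": "9788374950978",
     "row": {"title": "Kafka on the Shore"}},
    {"op": "DELETE", "table": "book", "pk": "9788374950978", "row": None},
])


def apply_event(index, event):
    """Apply one change event to a dict standing in for a single-table index."""
    if event["op"] in ("INSERT", "UPDATE"):
        index[event["pk"]] = event["row"]
    elif event["op"] == "DELETE":
        index.pop(event["pk"], None)
    elif event["op"] == "TRUNCATE":
        index.clear()
    else:
        raise ValueError(f"unknown operation: {event['op']}")


index = {}
while events:
    apply_event(index, events.popleft())  # consume strictly in arrival order

print(index)  # {'9781471331435': {'title': 'Nineteen Eighty-Four'}}
```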

NOTE: If you change the structure of your PGSync schema config, you need to rebuild your Elasticsearch indices.
There are plans to support zero-downtime migrations to streamline this process.

Quickstart

There are several ways of installing and trying PGSync:
- Running in Docker is the easiest way to get up and running.
- Manual configuration

Running in Docker

To start up all services with Docker, run:
$ docker-compose up

Show the content in Elasticsearch:
$ curl -X GET http://[elasticsearch host]:9201/reservations/_search?pretty=true

Manual configuration

Setup
- Ensure the database user is a superuser.
- Enable logical decoding. You also need to set at least these two parameters in postgresql.conf:

wal_level = logical

max_replication_slots = 1

Installation
- Install the package:
$ pip install pgsync
- Create a schema.json for your document representation.
- Bootstrap the database (one time only):
$ bootstrap --config schema.json
- Run the program:
$ pgsync --config schema.json
- Or run it as a daemon:
$ pgsync --config schema.json -d
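
Before bootstrapping, it can help to confirm that the two settings above are in place. The following is a minimal sketch, not part of PGSync: `parse_conf` and `check_logical_decoding` are hypothetical helpers that read `name = value` lines from postgresql.conf-style text. It deliberately ignores other syntax the real file allows (for example, settings written without `=`, or include directives).

```python
def parse_conf(text):
    """Parse 'name = value' lines from postgresql.conf-style text."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip("'")
    return settings


def check_logical_decoding(settings):
    """Return a list of problems that would prevent logical decoding."""
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")
    if int(settings.get("max_replication_slots", 0)) < 1:
        problems.append("max_replication_slots must be at least 1")
    return problems


bad = """
wal_level = replica          # not sufficient for logical decoding
max_replication_slots = 0
"""
good = """
wal_level = logical
max_replication_slots = 1
"""

print(check_logical_decoding(parse_conf(bad)))
print(check_logical_decoding(parse_conf(good)))  # []
```

On a running server, the effective values can also be inspected with `SHOW wal_level;` and `SHOW max_replication_slots;`.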

Features

Key features of PGSync are:

- Easily denormalize relational data.
- Works with any PostgreSQL database (version 9.6 or later).
- Negligible impact on database performance.
- Transactionally consistent output in Elasticsearch. This means: writes appear only when they are committed to the database, and insert, update and delete operations appear in the same order as they were committed (as opposed to eventual consistency).
- Fault-tolerant: does not lose data, even if processes crash or a network interruption occurs. The process can be recovered from the last checkpoint.
- Returns the data directly as Postgres JSON from the database for speed.
- Supports composite primary and foreign keys.
- Supports an arbitrary depth of nested entities, i.e. tables with long chains of relationship dependencies.
- Supports Postgres JSON data fields. This means: we can extract JSON fields in a database table as a separate field in the resulting document.
- Customizable document structure.
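
The idea of recovering from the last checkpoint can be sketched as follows. This is an illustration under stated assumptions, not PGSync's actual mechanism: events carry a monotonically increasing integer position (standing in for a position in the change stream), and the hypothetical `replay` function only advances the checkpoint after an event has been applied.

```python
def replay(events, index, checkpoint, crash_after=None):
    """Apply events positioned beyond the checkpoint, in order.

    Returns the new checkpoint. `crash_after` simulates a process crash
    once that many events have been applied during this run.
    """
    applied = 0
    for position, key, value in events:
        if position <= checkpoint:
            continue  # already applied before the last checkpoint
        if crash_after is not None and applied == crash_after:
            return checkpoint  # simulated crash: earlier progress is kept
        index[key] = value
        checkpoint = position  # advance only after a successful apply
        applied += 1
    return checkpoint


events = [(1, "a", "v1"), (2, "b", "v1"), (3, "a", "v2"), (4, "c", "v1")]
index = {}

checkpoint = replay(events, index, checkpoint=0, crash_after=2)  # stops at 2
checkpoint = replay(events, index, checkpoint=checkpoint)        # resumes at 3

print(checkpoint, index)  # 4 {'a': 'v2', 'b': 'v1', 'c': 'v1'}
```

Because the second run skips everything at or before the saved position, no event is lost and none is applied out of order.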

Requirements

- Python 3.6+
- Postgres 9.6+
- Redis 3.1.0
- Elasticsearch 6.3.1+
- SQLAlchemy 1.3.4+

Example

Consider this example of a Book library database.

Book

| isbn (PK) | title | description |
| ------------- | ------------- | ------------- |
| 9785811243570 | Charlie and the chocolate factory | Willy Wonka’s famous chocolate factory is opening at last! |
| 9788374950978 | Kafka on the Shore | Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami. |
| 9781471331435 | 1984 | 1984 was George Orwell’s chilling prophecy about the dystopian future. |

Author

| id (PK) | name |
| ------------- | ------------- |
| 1 | Roald Dahl |
| 2 | Haruki Murakami |
| 3 | Philip Gabriel |
| 4 | George Orwell |

BookAuthor

| id (PK) | book_isbn | author_id |
| -- | ------------- | ---------- |
| 1 | 9785811243570 | 1 |
| 2 | 9788374950978 | 2 |
| 3 | 9788374950978 | 3 |
| 4 | 9781471331435 | 4 |

With PGSync, we can simply define this JSON schema where the book table is the pivot.
A pivot table indicates the root of your document.

```json
{
    "table": "book",
    "columns": [
        "isbn",
        "title",
        "description"
    ],
    "children": [
        {
            "table": "author",
            "columns": [
                "name"
            ]
        }
    ]
}
```

To get this document structure in Elasticsearch:

```json
[
    {
        "isbn": "9785811243570",
        "title": "Charlie and the chocolate factory",
        "description": "Willy Wonka’s famous chocolate factory is opening at last!",
        "authors": ["Roald Dahl"]
    },
    {
        "isbn": "9788374950978",
        "title": "Kafka on the Shore",
        "description": "Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami",
        "authors": ["Haruki Murakami", "Philip Gabriel"]
    },
    {
        "isbn": "9781471331435",
        "title": "1984",
        "description": "1984 was George Orwell’s chilling prophecy about the dystopian future",
        "authors": ["George Orwell"]
    }
]
```
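
To show what this denormalization amounts to, here is a small Python sketch that builds the same document shape from the sample rows in the Book, Author and BookAuthor tables. It is only an illustration: PGSync does this work in SQL inside Postgres, and `build_documents` is a hypothetical helper.

```python
# Sample rows copied from the Book, Author and BookAuthor tables.
books = [
    {"isbn": "9785811243570", "title": "Charlie and the chocolate factory",
     "description": "Willy Wonka’s famous chocolate factory is opening at last!"},
    {"isbn": "9788374950978", "title": "Kafka on the Shore",
     "description": "Kafka on the Shore is a 2002 novel by Japanese author Haruki Murakami."},
    {"isbn": "9781471331435", "title": "1984",
     "description": "1984 was George Orwell’s chilling prophecy about the dystopian future."},
]
authors = {1: "Roald Dahl", 2: "Haruki Murakami", 3: "Philip Gabriel", 4: "George Orwell"}
book_author = [  # (id, book_isbn, author_id)
    (1, "9785811243570", 1),
    (2, "9788374950978", 2),
    (3, "9788374950978", 3),
    (4, "9781471331435", 4),
]


def build_documents(books, authors, book_author):
    """Nest author names under each book, with book as the pivot table."""
    names_by_isbn = {}
    for _id, isbn, author_id in book_author:
        names_by_isbn.setdefault(isbn, []).append(authors[author_id])
    return [
        {**book, "authors": names_by_isbn.get(book["isbn"], [])}
        for book in books
    ]


documents = build_documents(books, authors, book_author)
for doc in documents:
    print(doc["title"], "->", doc["authors"])
```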

Behind the scenes, PGSync generates advanced queries for you, such as:

```sql
SELECT
       JSON_BUILD_OBJECT(
          'isbn', book_1.isbn,
          'title', book_1.title,
          'description', book_1.description,
          'authors', anon_1.authors
       ) AS "JSON_BUILD_OBJECT_1",
       book_1.id
FROM book AS book_1
LEFT OUTER JOIN
  (SELECT
          JSON_AGG(anon_2.anon) AS authors,
          book_author_1.book_isbn AS book_isbn
   FROM book_author AS book_author_1
   LEFT OUTER JOIN
     (SELECT
             author_1.name AS anon,
             author_1.id AS id
      FROM author AS author_1) AS anon_2 ON anon_2.id = book_author_1.author_id
   GROUP BY book_author_1.book_isbn) AS anon_1 ON anon_1.book_isbn = book_1.isbn
```

You can also configure PGSync to rename attributes via the schema config, e.g.:

```json
{
    "isbn": "9781471331435",
    "this_is_a_custom_title": "1984",
    "desc": "1984 was George Orwell’s chilling prophecy about the dystopian future",
    "contributors": ["George Orwell"]
}
```
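
Conceptually, renaming is a mapping from source field names to the names you want in the document. The sketch below shows that idea in plain Python. It does not reproduce PGSync's schema syntax (see the project documentation for that): `rename_fields` and the `renames` mapping are hypothetical.

```python
def rename_fields(document, renames):
    """Return a copy of `document` with keys renamed according to `renames`."""
    return {renames.get(key, key): value for key, value in document.items()}


# Hypothetical mapping from source field names to the desired document names.
renames = {
    "title": "this_is_a_custom_title",
    "description": "desc",
    "authors": "contributors",
}

document = {
    "isbn": "9781471331435",
    "title": "1984",
    "description": "1984 was George Orwell’s chilling prophecy about the dystopian future",
    "authors": ["George Orwell"],
}

renamed = rename_fields(document, renames)
print(renamed)
```

Fields without an entry in the mapping, such as `isbn`, keep their original names.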

PGSync addresses the following challenges:
- What if we update the author's name in the database?
- What if we wanted to add another author for an existing book?
- What if many existing documents share the same author and we want to change that author's name?
- What if we delete or update an author?
- What if we truncate an entire table?
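
The first few questions come down to working out which parent documents embed a changed child row. As a minimal sketch using the BookAuthor sample rows (the `affected_books` helper is hypothetical and is not how PGSync resolves this internally), a change to an author can be traced back to the books that must be re-indexed:

```python
book_author = [  # (id, book_isbn, author_id), from the BookAuthor table
    (1, "9785811243570", 1),
    (2, "9788374950978", 2),
    (3, "9788374950978", 3),
    (4, "9781471331435", 4),
]


def affected_books(author_id, book_author):
    """Return the ISBNs of the documents that embed the given author."""
    return sorted({isbn for _id, isbn, a_id in book_author if a_id == author_id})


# Renaming author 2 (Haruki Murakami) touches one book document.
print(affected_books(2, book_author))   # ['9788374950978']
# An author who is not linked to any book affects no documents.
print(affected_books(99, book_author))  # []
```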

Benefits

- PGSync is a simple-to-use, out-of-the-box solution for change data capture.
- PGSync handles data deletions.
- PGSync requires little development effort. You simply define a schema config describing your data.
- PGSync generates advanced queries matching your schema directly.
- PGSync allows you to easily rebuild your indexes in case of a schema change.
- You can expose only the data you require in Elasticsearch.
- Supports multiple Postgres schemas for multi-tenant applications.

Contributing

Contributions are very welcome! Check out the Contribution Guidelines for instructions.

Credits

This package was created with Cookiecutter
Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.

License

This code is released under the GNU Lesser General Public License, version 3.0 (LGPL-3.0).
Please see LICENSE for more details.

You should have received a copy of the GNU Lesser General Public License along with PGSync.
If not, see https://www.gnu.org/licenses/.
