TiDB: A Raft-based HTAP Database

盖国强

970

13页

15次

2020-10-12

免费下载

TiDB: A Raft-based HTAP Database

Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang

∗

, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang,

Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu,

Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, Xin Tang

PingCAP

{huang, liuqi, cuiqiu, fangzhuhe, maxiaoyu, xufei, shenli, tl, z, menglong,

weiwan, liucong, zhangjian, jay, wuxuelian, songlingyu, sunruoxi, yusp,

zhaolei, nick, liquanpei, tangxin}@pingcap.com

ABSTRACT

Hybrid Transactional and Analytical Processing (HTAP) databases

require processing transactional and analytical queries in isolation

to remove the interference between them. To achieve this, it is nec-

essary to maintain different replicas of data speciﬁed for the two

types of queries. However, it is challenging to provide a consistent

view for distributed replicas within a storage system, where ana-

lytical requests can efﬁciently read consistent and fresh data from

transactional workloads at scale and with high availability.

To meet this challenge, we propose extending replicated state

machine-based consensus algorithms to provide consistent replicas

for HTAP workloads. Based on this novel idea, we present a Raft-

based HTAP database: TiDB. In the database, we design a multi-

Raft storage system which consists of a row store and a column

store. The row store is built based on the Raft algorithm. It is scal-

able to materialize updates from transactional requests with high

availability. In particular, it asynchronously replicates Raft logs to

learners which transform row format to column format for tuples,

forming a real-time updatable column store. This column store al-

lows analytical queries to efﬁciently read fresh and consistent data

with strong isolation from transactions on the row store. Based on

this storage system, we build an SQL engine to process large-scale

distributed transactions and expensive analytical queries. The SQL

engine optimally accesses row-format and column-format replicas

of data. We also include a powerful analysis engine, TiSpark, to

help TiDB connect to the Hadoop ecosystem. Comprehensive ex-

periments show that TiDB achieves isolated high performance un-

der CH-benCHmark, a benchmark focusing on HTAP workloads.

PVLDB Reference Format:

Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li

Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian

Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu,

Lei Zhao, Nicholas Cameron, Liquan Pei, Xin Tang. TiDB: A Raft-based

HTAP Database. PVLDB, 13(12): 3072-3084, 2020.

DOI: https://doi.org/10.14778/3415478.3415535

1. INTRODUCTION

Relational database management systems (RDBMS) are popu-

lar with their relational model, strong transactional guarantees, and

∗

Zhuhe Fang is the corresponding author.

This work is licensed under the Creative Commons Attribution-

NonCommercial-NoDerivatives 4.0 International License. To view a copy

of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For

any use beyond those covered by this license, obtain permission by emailing

info@vldb.org. Copyright is held by the owner/author(s). Publication rights

licensed to the VLDB Endowment.

Proceedings of the VLDB Endowment, Vol. 13, No. 12

ISSN 2150-8097.

DOI: https://doi.org/10.14778/3415478.3415535

SQL interface. They are widely adopted in traditional applica-

tions, like business systems. However, old RDBMSs do not pro-

vide scalability and high availability. Therefore, at the beginning

of the 2000s [11], internet applications preferred NoSQL systems

like Google Bigtable [12] and DynamoDB [36]. NoSQL systems

loosen the consistency requirements and provide high scalability

and alternative data models, like key-value pairs, graphs, and doc-

uments. However, many applications also need strong transac-

tions, data consistency, and an SQL interface, so NewSQL systems

appeared. NewSQL systems like CockroachDB [38] and Google

Spanner [14] provide the high scalability of NoSQL for Online

Transactional Processing (OLTP) read/write workloads and still ma-

intain ACID guarantees for transactions [32]. In addition, SQL-

based Online Analytical Processing (OLAP) systems are being de-

veloped quickly, like many SQL-on-Hadoop systems [16].

These systems follow the “one size does not ﬁt all” paradigm

[37], using different data models and technologies for the different

purposes of OLAP and OLTP. However, multiple systems are very

expensive to develop, deploy, and maintain. In addition, analyzing

the latest version of data in real time is compelling. This has given

rise to hybrid OLTP and OLAP (HTAP) systems in industry and

academia [30]. HTAP systems should implement scalability, high

availability, and transnational consistency like NewSQL systems.

Besides, HTAP systems need to efﬁciently read the latest data to

guarantee the throughput and latency for OLTP and OLAP requests

under two additional requirements: freshness and isolation.

Freshness means how recent data is processed by the analytical

queries [34]. Analyzing the latest data in real time has great busi-

ness value. But it is not guaranteed in some HTAP solutions, such

as those based on an Extraction-Transformation-Loading (ETL) pro-

cessing. Through the ETL process, OLTP systems periodically re-

fresh a batch of the latest data to OLAP systems. The ETL costs

several hours or days, so it cannot offer real-time analysis. The ETL

phase can be replaced by streaming the latest updates to OLAP sys-

tems to reduce synchronization time. However, because these two

approaches lack a global data governance model, it is more com-

plex to consider consistency semantics. Interfacing with multiple

systems introduces additional overhead.

Isolation refers to guaranteeing isolated performance for seper-

ate OLTP and OLAP queries. Some in-memory databases (such as

HyPer [18]) enable analytical queries to read the latest version of

data from transactional processing on the same server. Although

this approach provides fresh data, it cannot achieve high perfor-

mance for both OLTP and OLAP. This is due to data synchroniza-

tion penalties and workload interference. This effect is studied in

[34] by running CH-benCHmark [13], an HTAP benchmark on Hy-

Per and SAP HANA. The study found that when a system co-runs

analytical queries, its maximum attainable OLTP throughput is sig-

3072

niﬁcantly reduced. SAP HANA [22] throughput was reduced by

at least three times, and HyPer by at least ﬁve times. Similar re-

sults are conﬁrmed in MemSQL [24]. Furthermore, in-memory

databases cannot provide high availability and scalability if they

are only deployed on a single server.

To guarantee isolated performance, it is necessary to run OLTP

and OLAP requests on different hardware resources. The essen-

tial difﬁculty is to maintain up-to-date replicas for OLAP requests

from OLTP workloads within a single system. Besides, the system

needs to maintain data consistency among more replicates. Note

that maintaining consistent replicas is also required for availabil-

ity [29]. High availability can be achieved using well-known con-

sensus algorithms, such as Paxos [20] and Raft [29]. They are

based on replicated state machines to synchronize replicas. It is

possible to extend these consensus algorithms to provide consis-

tent replicas for HTAP workloads. To the best of our knowledge,

this idea has not been studied before.

Following this idea, we propose a Raft-based HTAP database:

TiDB. It introduces dedicated nodes (called learners) to the Raft

consensus algorithm. The learners asynchronously replicate trans-

actional logs from leader nodes to construct new replicas for OLAP

queries. In particular, the learners transform the row-format tuples

in the logs into column format so that the replicas are better-suited

to analytical queries. Such log replication incurs little overhead on

transactional queries running on leader nodes. Moreover, the la-

tency of such replication is so short that it can guarantee data fresh-

ness for OLAP. We use different data replicas to separately process

OLAP and OLTP requests to avoid interference between them. We

can also optimize HTAP requests based on both row-format and

column-format data replicas. Based on the Raft protocol, TiDB

provides high availability, scalability, and data consistency.

TiDB presents an innovative solution that helps consensus alg-

orithms-based NewSQL systems evolve into HTAP systems. New-

SQL systems ensure high availability, scalability, and data dura-

bility for OLTP requests by replicating their database like Google

Spanner and CockroachDB. They synchronize data across data repli-

cas via replication mechanisms typically from consensus algorithms.

Based on the log replication, NewSQL systems can provide a colum-

nar replica dedicated to OLAP requests so that they can support

HTAP requests in isolation like TiDB.

We conclude our contributions as follows.

• We propose building an HTAP system based on consensus al-

gorithms and have implemented a Raft-based HTAP database,

TiDB. It is an open-source project [7] that provides high avail-

ability, consistency, scalability, data freshness, and isolation for

HTAP workloads.

• We introduce the learner role to the Raft algorithm to generate a

columnar store for real-time OLAP queries.

• We implement a multi-Raft storage system and optimize its reads

and writes so that the system offers high performance when scal-

ing to more nodes.

• We tailor an SQL engine for large-scale HTAP queries. The en-

gine can optimally choose to use a row-based store and a colum-

nar store.

• We conduct comprehensive experiments to evaluate TiDB’s per-

formance about OLTP, OLAP, and HTAP using CH-benCHmark,

an HTAP benchmark.

The remainder of this paper is organized as follows. We describe

the main idea, Raft-based HTAP, in Section 2, and illustrate the ar-

chitecture of TiDB in Section 3. TiDB’s multi-Raft storage and

HTAP engines are elaborated upon in Sections 4 and 5. Experi-

mental evaluation is presented in Section 6. We summarize related

work in Section 7. Finally, we conclude our paper in Section 8.

Client

Write/Read

Read

Leader Learner

Raft

module

State machine

Log

Raft

module

State machine

Log

Raft

module

State machine

Log

Raft

module

State machine

Log

FollowerFollower

Asynchronous replication

Quorum replication

Raft group

Figure 1: Adding columnar learners to a Raft group

2. RAFT-BASED HTAP

Consensus algorithms such as Raft and Paxos are the foundation

of building consistent, scalable, and highly-available distributed

systems. Their strength is that data is reliably replicated among

servers in real time using the replicated state machine. We adapt

this function to replicate data to different servers for different HTAP

workloads. In this way, we guarantee that OLTP and OLAP work-

loads are isolated from each other, but also that OLAP requests

have a fresh and consistent view of the data. To the best of our

knowledge, there is no previous work to use these consensus algo-

rithms to build an HTAP database.

Since the Raft algorithm is designed to be easy to understand

and implement, we focus on our Raft extension on implementing

a production-ready HTAP database. As illustrated in Figure 1, at

a high level, our ideas are as follows: Data is stored in multi-

ple Raft groups using row format to serve transactional queries.

Each group is composed of a leader and followers. We add a

learner role for each group to asynchronously replicate data from

the leader. This approach is low-overhead and maintains data con-

sistency. Data replicated to learners are transformed to column-

based format. Query optimizer is extended to explore physical

plans accessing both the row-based and column-based replicas.

In a standard Raft group, each follower can become the leader to

serve read and write requests. Simply adding more followers, there-

fore, will not isolate resources. Moreover, adding more followers

will impact the performance of the group because the leader must

wait for responses from a larger quorum of nodes before respond-

ing to clients. Therefore, we introduced a learner role to the Raft

consensus algorithm. A learner does not participate in leader elec-

tions, nor is it part of a quorum for log replication. Log replication

from the leader to a learner is asynchronous; the leader does not

need to wait for success before responding to the client. The strong

consistency between the leader and the learner is enforced during

the read time. By design, the log replication lag between the leader

and learners is low, as demonstrated in the evaluation section.

Transactional queries require efﬁcient data updates, while ana-

lytical queries such as join or aggregation require reading a sub-

set of columns, but a large number of rows for those columns.

Row-based format can leverage indexes to efﬁciently serve trans-

actional queries. Column-based format can leverage data compres-

sion and vectorized processing efﬁciently. Therefore, when repli-

cating to Raft learners, data is transformed from row-based format

to column-based format. Moreover, learners can be deployed in

separate physical resources. As a result, transaction queries and

analytical queries are processed in isolated resources.

Our design also provides new optimization opportunities. Be-

cause data is kept consistent between both the row-based format

and column-based format, our query optimizer can produce physi-

cal plans which access either or both stores.

We have presented our ideas of extending Raft to satisfy the

freshness and isolation requirements of an HTAP database. To

make an HTAP database production ready, we have overcome many

engineering challenges, mainly including:

3073

of 13

免费下载

tidb htap

关注

评论