TiDB: A Raft-based HTAP Database
Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang,
Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu,
Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, Xin Tang
PingCAP
{huang, liuqi, cuiqiu, fangzhuhe, maxiaoyu, xufei, shenli, tl, z, menglong,
weiwan, liucong, zhangjian, jay, wuxuelian, songlingyu, sunruoxi, yusp,
zhaolei, nick, liquanpei, tangxin}@pingcap.com
ABSTRACT
Hybrid Transactional and Analytical Processing (HTAP) databases
require processing transactional and analytical queries in isolation
to remove the interference between them. To achieve this, it is nec-
essary to maintain different replicas of data specified for the two
types of queries. However, it is challenging to provide a consistent
view for distributed replicas within a storage system, where ana-
lytical requests can efficiently read consistent and fresh data from
transactional workloads at scale and with high availability.
To meet this challenge, we propose extending replicated state
machine-based consensus algorithms to provide consistent replicas
for HTAP workloads. Based on this novel idea, we present a Raft-
based HTAP database: TiDB. In the database, we design a multi-
Raft storage system which consists of a row store and a column
store. The row store is built based on the Raft algorithm. It is scal-
able to materialize updates from transactional requests with high
availability. In particular, it asynchronously replicates Raft logs to
learners which transform row format to column format for tuples,
forming a real-time updatable column store. This column store al-
lows analytical queries to efficiently read fresh and consistent data
with strong isolation from transactions on the row store. Based on
this storage system, we build an SQL engine to process large-scale
distributed transactions and expensive analytical queries. The SQL
engine optimally accesses row-format and column-format replicas
of data. We also include a powerful analysis engine, TiSpark, to
help TiDB connect to the Hadoop ecosystem. Comprehensive ex-
periments show that TiDB achieves isolated high performance un-
der CH-benCHmark, a benchmark focusing on HTAP workloads.
PVLDB Reference Format:
Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li
Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian
Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu,
Lei Zhao, Nicholas Cameron, Liquan Pei, Xin Tang. TiDB: A Raft-based
HTAP Database. PVLDB, 13(12): 3072-3084, 2020.
DOI: https://doi.org/10.14778/3415478.3415535
Zhuhe Fang is the corresponding author.
This work is licensed under the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For
any use beyond those covered by this license, obtain permission by emailing
info@vldb.org. Copyright is held by the owner/author(s). Publication rights
licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 12
ISSN 2150-8097.
DOI: https://doi.org/10.14778/3415478.3415535
1. INTRODUCTION
Relational database management systems (RDBMS) are popular
for their relational model, strong transactional guarantees, and
SQL interface. They are widely adopted in traditional applications,
like business systems. However, old RDBMSs do not provide
scalability and high availability. Therefore, at the beginning
of the 2000s [11], internet applications preferred NoSQL systems
like Google Bigtable [12] and DynamoDB [36]. NoSQL systems
loosen the consistency requirements and provide high scalability
and alternative data models, like key-value pairs, graphs, and doc-
uments. However, many applications also need strong transac-
tions, data consistency, and an SQL interface, so NewSQL systems
appeared. NewSQL systems like CockroachDB [38] and Google
Spanner [14] provide the high scalability of NoSQL for Online
Transactional Processing (OLTP) read/write workloads and still ma-
intain ACID guarantees for transactions [32]. In addition, SQL-
based Online Analytical Processing (OLAP) systems are being de-
veloped quickly, like many SQL-on-Hadoop systems [16].
These systems follow the “one size does not fit all” paradigm
[37], using different data models and technologies for the different
purposes of OLAP and OLTP. However, multiple systems are very
expensive to develop, deploy, and maintain. In addition, analyzing
the latest version of data in real time is compelling. This has given
rise to hybrid OLTP and OLAP (HTAP) systems in industry and
academia [30]. HTAP systems should implement scalability, high
availability, and transactional consistency like NewSQL systems.
Besides, HTAP systems need to efficiently read the latest data to
guarantee the throughput and latency for OLTP and OLAP requests
under two additional requirements: freshness and isolation.
Freshness refers to how recent the data processed by analytical
queries is [34]. Analyzing the latest data in real time has great business
value. However, freshness is not guaranteed in some HTAP solutions, such
as those based on an Extraction-Transformation-Loading (ETL) process.
Through the ETL process, OLTP systems periodically refresh OLAP
systems with a batch of the latest data. ETL takes
several hours or days, so it cannot offer real-time analysis. The ETL
phase can be replaced by streaming the latest updates to OLAP systems
to reduce synchronization time. However, because these two
approaches lack a global data governance model, reasoning about
consistency semantics becomes more complex. Interfacing with multiple
systems also introduces additional overhead.
Isolation refers to guaranteeing isolated performance for separate
OLTP and OLAP queries. Some in-memory databases (such as
HyPer [18]) enable analytical queries to read the latest version of
data from transactional processing on the same server. Although
this approach provides fresh data, it cannot achieve high perfor-
mance for both OLTP and OLAP. This is due to data synchroniza-
tion penalties and workload interference. This effect is studied in
[34] by running CH-benCHmark [13], an HTAP benchmark, on HyPer
and SAP HANA. The study found that when a system co-runs
analytical queries, its maximum attainable OLTP throughput is significantly
reduced. SAP HANA [22] throughput dropped by a factor of at least three,
and HyPer's by a factor of at least five. Similar results
are confirmed in MemSQL [24]. Furthermore, in-memory
databases cannot provide high availability and scalability if they
are only deployed on a single server.
To guarantee isolated performance, it is necessary to run OLTP
and OLAP requests on different hardware resources. The essen-
tial difficulty is to maintain up-to-date replicas for OLAP requests
from OLTP workloads within a single system. Besides, the system
needs to maintain data consistency among more replicas. Note
that maintaining consistent replicas is also required for availabil-
ity [29]. High availability can be achieved using well-known con-
sensus algorithms, such as Paxos [20] and Raft [29]. They are
based on replicated state machines to synchronize replicas. It is
possible to extend these consensus algorithms to provide consis-
tent replicas for HTAP workloads. To the best of our knowledge,
this idea has not been studied before.
Following this idea, we propose a Raft-based HTAP database:
TiDB. It introduces dedicated nodes (called learners) to the Raft
consensus algorithm. The learners asynchronously replicate trans-
actional logs from leader nodes to construct new replicas for OLAP
queries. In particular, the learners transform the row-format tuples
in the logs into column format so that the replicas are better-suited
to analytical queries. Such log replication incurs little overhead on
transactional queries running on leader nodes. Moreover, the la-
tency of such replication is so short that it can guarantee data fresh-
ness for OLAP. We use different data replicas to separately process
OLAP and OLTP requests to avoid interference between them. We
can also optimize HTAP requests based on both row-format and
column-format data replicas. Based on the Raft protocol, TiDB
provides high availability, scalability, and data consistency.
TiDB presents an innovative solution that helps consensus-algorithm-based
NewSQL systems evolve into HTAP systems. NewSQL systems such as Google
Spanner and CockroachDB ensure high availability, scalability, and data
durability for OLTP requests by replicating their databases. They
synchronize data across replicas via replication mechanisms that typically
come from consensus algorithms. Based on this log replication, NewSQL
systems can provide a columnar replica dedicated to OLAP requests so that,
like TiDB, they can support HTAP requests in isolation.
We summarize our contributions as follows.
• We propose building an HTAP system based on consensus algorithms
  and have implemented a Raft-based HTAP database,
  TiDB. It is an open-source project [7] that provides high availability,
  consistency, scalability, data freshness, and isolation for
  HTAP workloads.
• We introduce the learner role to the Raft algorithm to generate a
  columnar store for real-time OLAP queries.
• We implement a multi-Raft storage system and optimize its reads
  and writes so that the system offers high performance when scaling
  to more nodes.
• We tailor an SQL engine for large-scale HTAP queries. The engine
  can optimally choose between a row-based store and a columnar
  store.
• We conduct comprehensive experiments to evaluate TiDB's performance
  on OLTP, OLAP, and HTAP workloads using CH-benCHmark,
  an HTAP benchmark.
The remainder of this paper is organized as follows. We describe
the main idea, Raft-based HTAP, in Section 2, and illustrate the ar-
chitecture of TiDB in Section 3. TiDB’s multi-Raft storage and
HTAP engines are elaborated upon in Sections 4 and 5. Experi-
mental evaluation is presented in Section 6. We summarize related
work in Section 7. Finally, we conclude our paper in Section 8.
[Figure 1 diagram labels: Client (Write/Read to the Leader, Read to the Learner); Leader, two Followers, and Learner, each with a Raft module, Log, and State machine; Quorum replication; Asynchronous replication; Raft group.]
Figure 1: Adding columnar learners to a Raft group
2. RAFT-BASED HTAP
Consensus algorithms such as Raft and Paxos are the foundation
of building consistent, scalable, and highly-available distributed
systems. Their strength is that data is reliably replicated among
servers in real time using the replicated state machine. We adapt
this function to replicate data to different servers for different HTAP
workloads. In this way, we guarantee that OLTP and OLAP work-
loads are isolated from each other, but also that OLAP requests
have a fresh and consistent view of the data. To the best of our
knowledge, there is no previous work to use these consensus algo-
rithms to build an HTAP database.
Since the Raft algorithm is designed to be easy to understand
and implement, we focus our Raft extension on implementing
a production-ready HTAP database. As illustrated in Figure 1, at
a high level, our ideas are as follows: Data is stored in multiple
Raft groups using row format to serve transactional queries.
Each group is composed of a leader and followers. We add a
learner role for each group to asynchronously replicate data from
the leader. This approach is low-overhead and maintains data consistency.
Data replicated to learners is transformed to column-based
format. The query optimizer is extended to explore physical
plans accessing both the row-based and column-based replicas.
In a standard Raft group, each follower can become the leader to
serve read and write requests. Simply adding more followers, there-
fore, will not isolate resources. Moreover, adding more followers
will impact the performance of the group because the leader must
wait for responses from a larger quorum of nodes before respond-
ing to clients. Therefore, we introduced a learner role to the Raft
consensus algorithm. A learner does not participate in leader elec-
tions, nor is it part of a quorum for log replication. Log replication
from the leader to a learner is asynchronous; the leader does not
need to wait for success before responding to the client. Strong
consistency between the leader and the learner is enforced at
read time. By design, the log replication lag between the leader
and learners is low, as demonstrated in the evaluation section.
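
The learner role can be illustrated with a minimal Python sketch under stated assumptions. The `Replica` and `RaftGroup` classes and all method names are hypothetical and are not TiDB's implementation. The sketch models only two properties from the text: learners do not count toward the quorum, and the leader acknowledges a write without waiting for learners. The freshness check in `learner_is_fresh` (the learner holds every entry up to the leader's commit index) is an assumed, plausible way to enforce consistency at read time.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    log: list = field(default_factory=list)


class RaftGroup:
    """Toy model of one Raft group: a leader, voting followers, and learners."""

    def __init__(self, leader, followers, learners):
        self.leader = leader
        self.followers = followers
        self.learners = learners
        self.commit_index = 0  # number of committed log entries

    def quorum_size(self):
        # Only the leader and followers are voters; learners are excluded.
        voters = 1 + len(self.followers)
        return voters // 2 + 1

    def write(self, entry, reachable_followers):
        """Append on the leader and commit once a quorum of voters has the entry."""
        self.leader.log.append(entry)
        acks = 1  # the leader's own copy
        for follower in reachable_followers:
            follower.log = list(self.leader.log)  # simplified catch-up
            acks += 1
        if acks >= self.quorum_size():
            self.commit_index = len(self.leader.log)
            return True  # acknowledged without contacting any learner
        return False

    def sync_learner(self, learner):
        """Asynchronous step: ship committed entries to a learner later."""
        learner.log = self.leader.log[: self.commit_index]

    def learner_is_fresh(self, learner):
        """Assumed read-time check: the learner has all committed entries."""
        return len(learner.log) >= self.commit_index


leader, f1, f2 = Replica("leader"), Replica("f1"), Replica("f2")
learner = Replica("learner")
group = RaftGroup(leader, [f1, f2], [learner])

committed = group.write("x=1", reachable_followers=[f1])
stale_before_sync = not group.learner_is_fresh(learner)
group.sync_learner(learner)
fresh_after_sync = group.learner_is_fresh(learner)
```

In this toy model, adding learners leaves `quorum_size()` unchanged, which is why they do not slow down commits, whereas adding followers enlarges the quorum.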
Transactional queries require efficient data updates, while ana-
lytical queries such as join or aggregation require reading a sub-
set of columns, but a large number of rows for those columns.
Row-based format can leverage indexes to efficiently serve trans-
actional queries. Column-based format can leverage data compres-
sion and vectorized processing efficiently. Therefore, when repli-
cating to Raft learners, data is transformed from row-based format
to column-based format. Moreover, learners can be deployed on
separate physical resources. As a result, transactional queries and
analytical queries are processed in isolated resources.
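
The row-to-column transformation can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `rows_to_columns` and `run_length_encode`, the sample table, and the assumption that every row carries the same set of columns are ours, not the format TiDB's learners use. It shows why a columnar layout suits analytical queries: an aggregate touches one contiguous list, and repeated values in a column compress well.

```python
from collections import defaultdict
from itertools import groupby


def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same keys (hypothetical fixed schema).
    """
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def run_length_encode(values):
    """Compress consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row format: convenient for updating or fetching a single tuple.
rows = [
    {"id": 1, "region": "cn", "amount": 10},
    {"id": 2, "region": "cn", "amount": 5},
    {"id": 3, "region": "us", "amount": 7},
]

# Column format: an aggregate reads only the column it needs.
columns = rows_to_columns(rows)
total_amount = sum(columns["amount"])
encoded_region = run_length_encode(columns["region"])
```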
Our design also provides new optimization opportunities. Be-
cause data is kept consistent between both the row-based format
and column-based format, our query optimizer can produce physi-
cal plans which access either or both stores.
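
As a rough illustration of how an optimizer might weigh the two stores, the following Python sketch picks the cheapest access path from an invented cost model. The function `choose_access_path`, its parameters, and the cost formulas (cells read per path) are assumptions made for this example and do not describe TiDB's actual optimizer.

```python
def choose_access_path(total_rows, selected_rows, columns_needed,
                       total_columns, has_index):
    """Return the cheapest access path under a toy cost model.

    Cost is approximated as the number of cells read (an assumption):
    - column_scan: every row, but only the needed columns
    - row_scan: every row with all of its columns
    - row_index_lookup: only the selected rows, with all of their columns
    """
    costs = {
        "column_scan": total_rows * columns_needed,
        "row_scan": total_rows * total_columns,
    }
    if has_index:
        costs["row_index_lookup"] = selected_rows * total_columns
    return min(costs, key=costs.get)


# A point lookup through an index favors the row store.
point_lookup = choose_access_path(1_000_000, 1, 10, 10, has_index=True)

# An aggregate over two columns of every row favors the column store.
full_aggregate = choose_access_path(1_000_000, 1_000_000, 2, 10, has_index=True)
```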
We have presented our ideas of extending Raft to satisfy the
freshness and isolation requirements of an HTAP database. To
make an HTAP database production ready, we have overcome many
engineering challenges, mainly including: