OceanBase: A 707 Million tpmC Distributed Relational Database System
Zhenkun Yang, Chuanhui Yang, Fusheng Han, Mingqiang Zhuang, Bing Yang, Zhifeng Yang,
Xiaojun Cheng, Yuzhong Zhao, Wenhui Shi, Huafeng Xi, Huang Yu, Bin Liu, Yi Pan, Boxue Yin,
Junquan Chen, Quanqing Xu
OceanBase
OceanBaseLabs@list.alibaba-inc.com
ABSTRACT
We have designed and developed OceanBase, a distributed relational database system built from the very basics over the past decade. Being a scale-out multi-tenant system based on the shared-nothing architecture, OceanBase is cross-region fault tolerant. Besides sharing many goals with alternative distributed DBMS, such as horizontal scalability and fault tolerance, our design has been driven by the demands of typical RDBMS compatibility as well as both on-premise and off-premise deployments. OceanBase has fulfilled its design goal. It implements the salient features of certain mainstream classical RDBMS, and most applications on them can run on OceanBase with no or only a few minor modifications. Tens of thousands of OceanBase servers have been deployed in Alipay.com as well as many other commercial organizations. It has also successfully passed the TPC-C benchmark test and seized the first place with more than 707 million tpmC. This paper presents the goals, design criteria, infrastructure, and key components of OceanBase, including its engines for storage and transaction processing. Further, it details how OceanBase achieves the above leading TPC-C benchmark result in a distributed cluster with more than 1,500 servers in 3 zones. It also describes lessons we have learnt in building OceanBase over more than a decade.
PVLDB Reference Format:
Zhenkun Yang, Chuanhui Yang, Fusheng Han, Mingqiang Zhuang, Bing
Yang, Zhifeng Yang, Xiaojun Cheng, Yuzhong Zhao, Wenhui Shi, Huafeng
Xi, Huang Yu, Bin Liu, Yi Pan, Boxue Yin, Junquan Chen, Quanqing Xu.
OceanBase: A 707 Million tpmC Distributed Relational Database System.
PVLDB, 15(12): 3385 - 3397, 2022.
doi:10.14778/3554821.3554830
PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at
https://github.com/oceanbase/obdeploy.
1 INTRODUCTION
Strong transaction guarantees, the relational model, and the highly expressive Structured Query Language (SQL) make the Relational Database Management System (RDBMS) the crucial information infrastructure of the majority of business systems. For the last three decades, the development of Internet platforms has facilitated flourishing global businesses; the likes of Alipay.com, Amazon.com, and Taobao.com serve the general populace instead of a single organization. Classical centralized RDBMS are not capable of meeting the scalability, cross-region fault tolerance, and cost-effectiveness requirements of these businesses.
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 15, No. 12 ISSN 2150-8097.
doi:10.14778/3554821.3554830
We launched the design and development of OceanBase [6, 7], a commodity-hardware-based distributed relational database system built from the very basics, in May 2010. OceanBase was first used in the Favorites of Taobao.com [3] in 2011, a service similar to the Wish List of Amazon.com [11]. Thereafter, it was used by Alipay.com in 2014, by Zhejiang E-Commerce Bank in 2015, and by many other commercial banks, insurance companies, and organizations with communication and energy applications.
This paper rst presents the detailed design goals and criteria,
system architecture, SQL engine and multi-tenancy of OceanBase in
§2. Second, it presents an LSM-tree-based [
35
] storage engine, and
discusses the asymmetric read and write design, daily incremental
major compaction, and replica type in §3. Third, in §4, it proposes
the transaction processing engine including the timestamp ser-
vice, transaction processing, isolation level, and replicated table
in OceanBase. Fourth, in §5, we performed the TPC-C benchmark
test of OceanBase in 2020. §6 presents lessons learnt in building
OceanBase. §7 provides a brief review of the related work. Finally,
we conclude our work in §8. We briey list our contributions in the
following items.
We have built OceanBase, a distributed relational database, from the very basics since 2010. As a scale-out multi-tenant system based on the shared-nothing architecture, OceanBase is cross-region fault tolerant. In case of the failure of a minority of the nodes, its RPO (Recovery Point Objective) is zero, and its RTO (Recovery Time Objective) is less than 30 seconds.
We present an LSM-tree-based storage engine that, after multiple optimizations, achieves performance close to that of an in-memory database. An asymmetric read and write data block storage system as well as a daily incremental major compaction have been designed and implemented.
We propose a Paxos-based 2PC, named OceanBase 2PC, to improve distributed transaction processing capability and reduce transaction latency. It introduces the Paxos protocol into 2PC, giving distributed transactions automatic fault tolerance. Compared with traditional 2PC, the coordinator's state is not persisted in OceanBase 2PC, thereby reducing the number of Paxos synchronizations from three to two and further truncating the transaction latency to only one Paxos synchronization.
We performed the TPC-C benchmark test of OceanBase in 2020, reaching 707 million tpmC, which is the best result worldwide hitherto.
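The contribution on OceanBase 2PC above can be illustrated with a toy model. This is a hypothetical sketch, not OceanBase's implementation: a "Paxos synchronization" is reduced to appending a record to a participant-side list standing in for a majority-replicated log, and the names (`Participant`, `paxos_2pc`) are invented for illustration.

```python
# Toy model: in a Paxos-backed 2PC, the coordinator keeps no durable
# state; durability lives in the participants' Paxos-replicated logs.

class Participant:
    def __init__(self, name):
        self.name = name
        self.paxos_log = []  # stands in for a majority-replicated Paxos log

    def prepare(self, txn_id):
        # First (and only latency-critical) Paxos synchronization:
        # the prepare record becomes durable before this participant votes yes.
        self.paxos_log.append(("prepare", txn_id))
        return True

    def commit(self, txn_id):
        # Second Paxos synchronization, off the client's critical path.
        self.paxos_log.append(("commit", txn_id))


def paxos_2pc(participants, txn_id):
    """Commit txn_id across participants; acknowledge after one Paxos sync."""
    if not all(p.prepare(txn_id) for p in participants):
        return False  # some participant voted no; the transaction aborts
    # Every prepare record is now durable, so the outcome is already decided:
    # the client can be answered here, after a single Paxos round.
    for p in participants:
        p.commit(txn_id)  # would run asynchronously in practice
    return True
```

Because the decision can be reconstructed from the participants' durable prepare records, a crashed coordinator needs no persisted state of its own, which is what removes one of the three synchronizations of a naively Paxos-replicated 2PC.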
OceanBase is an open source project under the Mulan Public License 2.0 [5], and the source code is available on both Gitee [6] and GitHub [7].
2 DESIGN OVERVIEW
We designed OceanBase to support fast scale-out (and scale-in) on commodity hardware, and to achieve high performance, low total cost of ownership (TCO), cross-region deployment, and fault tolerance. It is compatible with certain mainstream classical RDBMS. In this section, we introduce our design goals and criteria, and discuss the system infrastructure and deployment of OceanBase.
2.1 Goals
Goals of OceanBase include the following.
1) Fast scale-out (scale-in) on commodity hardware to achieve high performance and low TCO.
Classical centralized RDBMS, like the high-end servers and SAN storage they run on, are highly expensive and difficult to expand, and sharding and resharding can be extremely taxing in human resources and time [19]. OceanBase should be much more cost-effective than classical RDBMS and, e.g., be able to scale out and scale in quickly, before and after a business promotion event, respectively.
2) Cross-region deployment and fault tolerance.
High availability of a classical RDBMS is solely based on the high availability of the hardware, e.g., the high-end server and SAN storage. Master/backup database mirroring cannot guarantee both service availability and data integrity following the failure of the master database. OceanBase should be cross-region fault tolerant, guaranteeing data integrity even in case of the failure of an entire region.
3) Compatibility with some mainstream classical RDBMS.
Hundreds of thousands of legacy applications are running on various classical RDBMS. The cost, time, and risk of migrating these legacy applications from the classical RDBMS to OceanBase should be minimized. This requires the compatibility of OceanBase with these classical RDBMS, which is discussed in detail in §2.2.
2.2 Criteria of Design
In the past several decades, classical RDBMS have been adopted by multiple independent software vendors and solution providers, and used by numerous organizations. Various complex SQL statements and stored procedures, consisting of a few to tens of thousands of SQL statements, have been running on these RDBMS to support various types of businesses. Each new relational database faces the following challenges:
The cost, time, and risk of migrating the businesses from the old database to the new database.
The cost and time, for independent software vendors, solution providers, etc., of learning the new database and migrating their solutions from the old database to the new database.
The cost and time of learning the new database by third-party database service providers or by users themselves.
As a general-purpose relational database system, the design and implementation of OceanBase comply with the following criterion:
Criterion 2.1. Native compatibility with certain mainstream classical RDBMS, taking into account the needs of large, medium, and small organizations.
Native compatibility with certain mainstream classical RDBMS means implementing the salient features of these classical RDBMS, including the data types, secondary indexes, views, triggers, cursors, constraints, functions, and stored procedures. Applications on these RDBMS should be able to run on OceanBase with no or only a few minor modifications.
Suitability for large, medium, and small organizations: a large online shopping and payment organization may need tens of thousands of high-profile database servers, whereas a small organization may need only a few low-profile database servers. Hence, one OceanBase cluster may consist of tens of thousands of high-profile servers to meet the requirements of a large organization, or of a few low-profile servers to meet the cost and performance requirements of a small organization.
2.3 Infrastructure
OceanBase adopts the shared-nothing architecture, and its overall architecture is shown in Figure 1. Multiple servers in a distributed OceanBase cluster concurrently provide database services with high availability. In Figure 1, the application layer sends a request to the proxy layer (i.e., OBProxy); after routing by the proxy service, the request is sent to the database node (OBServer) that serves the actual data, and the execution result follows the reverse path back to the application layer. Different components in the whole process achieve high availability in different ways.
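The request path just described can be modeled in a few lines. This is an illustrative sketch only; the class names and methods are invented and bear no relation to OceanBase's actual interfaces.

```python
# Toy model of Figure 1's request path: application -> OBProxy -> OBServer,
# with the result returning along the reverse path.

class OBServer:
    def __init__(self, name, partitions):
        self.name = name
        self.partitions = set(partitions)

    def execute(self, sql):
        # A real OBServer would compile and run a plan; we echo instead.
        return f"{self.name}: {sql}"


class OBProxy:
    def __init__(self, servers):
        self.servers = servers

    def route(self, sql, partition):
        # Forward the statement to the node that hosts the target partition.
        for server in self.servers:
            if partition in server.partitions:
                return server.execute(sql)
        raise LookupError(f"no server hosts partition {partition}")
```

The proxy's only job in this sketch is partition-aware routing; high availability of each layer is handled separately, as the text notes.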
Each OceanBase cluster consists of several zones, viz., 1, 3, or 5 zones. These zones can be restricted to one region or spread over multiple regions. In each zone, OceanBase can be deployed as shared-nothing. Transactions are replicated among the zones using Paxos [27]. OceanBase supports cross-region disaster tolerance for multiple regions and zero data loss (RPO = 0, RTO <= 30 seconds) [10].
Database tables, especially large ones, are partitioned explicitly by the user, and these partitions are the basic units of data distribution and load balancing. For the convenience of discussion, a non-partitioned table is considered a partitioned table with one partition. For every partition, there is a replica in each zone, and these replicas form a Paxos group.
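The placement rule above (one replica per zone for every partition, with the replicas forming a Paxos group) can be sketched as follows. The function and variable names are hypothetical, chosen only for illustration.

```python
# Toy layout: every partition gets one replica in each zone, and those
# replicas form a Paxos group, which commits with a majority of zones.

def place_partitions(partitions, zones):
    """Map each partition to its Paxos group: one (partition, zone) replica per zone."""
    return {part: [(part, zone) for zone in zones] for part in partitions}

def majority(group):
    # A Paxos group of n replicas tolerates the failure of a minority.
    return len(group) // 2 + 1

groups = place_partitions(["orders_p0", "orders_p1"], ["z1", "z2", "z3"])
# Each group has 3 replicas; a write commits once a majority of 2 zones
# have it durably, so losing any single zone loses no committed data.
```

This is consistent with the RPO = 0 claim in the text: a committed transaction is already durable in a majority of zones before it is acknowledged.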
In each node, OceanBase is similar to a classical RDBMS. Subsequent to an OceanBase node receiving a SQL statement or a group of SQL statements (e.g., a stored procedure), it compiles the SQL statement(s) to produce a SQL execution plan. If it is a local plan, it