Native Cloud Object Storage in Db2 Warehouse:
Implementing a Fast and Cost-Efficient Cloud Storage Architecture
David Kalmuk
IBM Canada
Markham, ON, Canada
dckalmuk@ca.ibm.com
Kostas Rakopoulos
IBM Canada
Markham, ON, Canada
kostasr@ca.ibm.com
Robert C. Hooper
IBM Canada
Markham, ON, Canada
robert.christopher.hooper@ibm.com
Patrick Perez
IBM Canada
Markham, ON, Canada
pperez@ca.ibm.com
Daniel C. Zilio
IBM Canada
Markham, ON, Canada
zilio@ca.ibm.com
Christian Garcia-Arellano
IBM Canada
Markham, ON, Canada
cmgarcia@ca.ibm.com
Hamdi Roumani
IBM Canada
Markham, ON, Canada
roumani@ca.ibm.com
Mahew Emmerton
IBM Canada
Markham, ON, Canada
memmerto@ca.ibm.com
Aleksandrs Santars
IBM Canada
Markham, ON, Canada
santars@ibm.com
Imran Sayyid
IBM Canada
Markham, ON, Canada
isayyid@ca.ibm.com
Krishna K. Ramachandran
IBM
Durham, NC, USA
krishnakumar@ibm.com
Ronald Barber
IBM Research
San Jose, CA, USA
rjbarber@us.ibm.com
William Minor
IBM Canada
Markham, ON, Canada
bminor@ca.ibm.com
Zach Hoggard
IBM Canada
Markham, ON, Canada
zacharyh@ca.ibm.com
Michael Chen
IBM Canada
Markham, ON, Canada
michael.chen@ca.ibm.com
Humphrey Li
IBM Canada
Markham, ON, Canada
xuhui.li@ibm.com
Yiren Shen
IBM Canada
Markham, ON, Canada
yiren@ibm.com
Richard Sidle
IBM Research
Oawa, ON, Canada
ricsidle@ca.ibm.com
Alexander Cheung
IBM Canada
Markham, ON, Canada
alex.cheung@ca.ibm.com
Sco Walkty
IBM Canada
Markham, ON, Canada
swalktky@ca.ibm.com
Mahew Olan
IBM Canada
Markham, ON, Canada
olan@ca.ibm.com
Ketan Rampurkar
IBM Canada
Markham, ON, Canada
ketan.rampurkar@ibm.com
ABSTRACT
Database systems built on traditional storage subsystems typically
store their data in small blocks referred to as data pages
(commonly sized in a multiple of 4KB for historical reasons).
These traditional storage subsystems, for example network
attached block storage, were designed for efficient random-access
I/O patterns at the block level, and the block size is usually
configurable by the application based on its needs. For large scale
analytic databases in cloud environments, these traditional
storage subsystems are not cost effective when compared to cloud
object storage, and database systems that exploit them risk
becoming uncompetitive. This paper describes the modernization
of the storage architecture of Db2 Warehouse, a traditional,
full-featured, high-performance database system with three
decades of development behind it, to exploit this new paradigm of
cost-effective cloud storage. We discuss a solution based on the
integration of LSM trees into the storage subsystem that enables
Db2 Warehouse to efficiently store data pages within object
storage and, through the application of special techniques that
minimize read and write latencies as well as all of the
amplification factors (write, read, and storage), achieve not only
storage cost savings but also higher performance. Further, by
retaining the traditional data page format, we are able to avoid
significantly re-architecting the database kernel and thereby
retain the substantial capabilities and optimizations of the existing
system.
CCS CONCEPTS
Information systems → Database management system engines.
KEYWORDS
Cloud Data Warehouse, Data Lake, Cloud Object Storage, LSM
tree, OLAP, Analytics, RocksDB
ACM Reference format:
David Kalmuk, Christian Garcia-Arellano, Ronald Barber, Richard Sidle,
Kostas Rakopoulos, Hamdi Roumani, William Minor, Alexander Cheung,
Robert C. Hooper, Matthew Emmerton, Zach Hoggard, Scott Walkty,
This work is licensed under a Creative Commons
Attribution-NoDerivs International 4.0 License.
SIGMOD-Companion '24, June 9–15, 2024, Santiago, AA, Chile
© 2024 Copyright is held by the owner/author(s).
ACM ISBN 979-8-4007-0422-2/24/06.
https://doi.org/10.1145/3626246.3653393
Patrick Perez, Aleksandrs Santars, Michael Chen, Matthew Olan, Daniel
C. Zilio, Imran Sayyid, Humphrey Li, Ketan Rampurkar, Krishna K.
Ramachandran, Yiren Shen. 2024. Native Cloud Object Storage in Db2
Warehouse: Implementing a Fast and Cost-Ecient Storage Architecture.
In Companion of the 2024 International Conference on Management of Data
(SIGMOD-Companion '24), June 9–15, 2024, Santiago, AA, Chile, 12 pages.
https://doi.org/10.1145/3626246.3653393
1 INTRODUCTION
The advent of hyperscale cloud infrastructure from public cloud
vendors such as AWS, Azure, and Google has fundamentally
altered the economics of analytic data systems, with the most
prominent impacts seen in the area of storage. In the domain of
SQL Data Warehousing, the availability of low-cost, effectively
unlimited-capacity cloud object storage (COS) [1] [2] has given
rise to modern cloud architectures that exploit this independently
scalable storage in combination with compute servers with
ephemeral local NVMe caches to deliver the high performance and
advanced SQL capabilities of traditional data warehousing
systems with the low storage cost and scalability more typically
associated with data lakes. The price/performance advantages of
these cloud architectures have made it difficult for more
traditional data warehouse architectures built on higher
performance, higher cost storage that is tightly coupled to the
compute to compete in the public cloud sphere.
While this shift has seen the rise of new "born in the cloud"
data warehousing systems (most notably Snowflake [3]), for
vendors with established non-cloud data warehousing
technologies, significant advantages can be achieved if they can
capitalize on the depth of innovation in their existing systems by
modernizing them to a cloud native storage architecture. The
promise is the substantial savings in time and effort by avoiding
reinventing sophisticated, high performance, battle tested
capabilities. The challenge is how (and even if) such an uplift can
be performed in a manner that is both cost effective and doesn’t
compromise the capabilities and performance of the original
technology in the process.
In this paper we describe the effort we undertook to reinvent
IBM’s Db2 Warehouse [4] to adopt a modern storage architecture
and achieve cost and performance competitiveness on the public
cloud, while retaining all its substantial SQL and transactional
capabilities.
1.1 Challenges of Cloud Object Storage
When moving a database system whose storage was designed for a
traditional storage subsystem, like network-attached block
storage [5] [6], to COS, a key challenge is the differences in I/O
characteristics. These differences are both in terms of throughput
and latency: COS is seen as a storage solution that is throughput
optimized whereas block storage is a storage solution that is
balanced for both throughput and latency [7]. In the case of
throughput, assuming adequate server-side resources for the COS
implementation, the limit for COS is imposed by the network
bandwidth available to the compute nodes performing the I/O,
together with the parallelism used to saturate that bandwidth.
For block storage, the limit is instead imposed by the bandwidth
each device attached to a compute node can accept, and throughput
is scaled by attaching more devices to each compute node,
increasing per-device bandwidth, or accessing those devices in
parallel.
In the case of latency, COS throughout the industry is known to
offer a significantly higher fixed latency per request when
compared to block storage (~100-300ms vs ~10-30ms, a 10X
difference), resulting in the need to perform I/O operations in
significantly larger block sizes (order of 10s of MBs for individual
objects vs order of KBs for blocks in block storage access) to better
amortize the higher latency cost per operation, as we will show in
Section 4 of this paper. Further complicating the picture is the
fact that COS writes operate at the object granularity, meaning
that modifying an existing object requires rewriting it in its
entirety.
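The amortization argument above can be sketched with simple arithmetic. This is an illustrative model, not a measurement from the paper: the ~150 ms per-request latency is taken from the rough range quoted above, and the 1 GB/s per-node network bandwidth is an assumed figure.

```python
# Illustrative sketch: how a fixed per-request latency penalizes small
# I/Os and is amortized by larger requests. The latency is the rough
# figure quoted in the text; the per-node bandwidth is an assumption.

FIXED_LATENCY_S = 0.150         # assumed per-request COS latency (s)
BANDWIDTH_BPS = 1_000_000_000   # assumed per-node network bandwidth (B/s)

def effective_throughput(request_size_bytes: int) -> float:
    """Bytes/s achieved by a single stream of sequential requests."""
    transfer_time = request_size_bytes / BANDWIDTH_BPS
    return request_size_bytes / (FIXED_LATENCY_S + transfer_time)

for size_kb in (32, 1024, 32 * 1024):
    mbps = effective_throughput(size_kb * 1024) / 1_000_000
    print(f"{size_kb:>6} KB requests -> {mbps:7.1f} MB/s")
```

Under these assumptions, 32 KB requests achieve well under 1 MB/s per stream, while 32 MB requests approach the bandwidth limit, which is why object sizes in the tens of megabytes are needed to hide the fixed latency.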
Due to these I/O and performance differences, a direct
adaptation of the existing block storage optimized page storage to
target COS would result in very poor performance due to the
latency impact on small page I/O. A naive approach to improve
over this would be to group adjacent data pages into larger blocks
to accommodate the larger object sizes that are ideal for COS. In
Db2 for example, data pages within a table space are organized
into logical blocks of contiguous data pages called “extents”,
which could be configured as the object size to store in object
store. This would also align with existing data organization
optimizations such as Db2’s BLU Acceleration (Db2’s column
store engine) where pages from individual column groups are
grouped into extents to improve data locality and maximize read
performance for analytic workloads [8]. With such an approach,
each extent would be stored in COS as an independent object and
would retain the existing data clustering characteristics. This
approach, however, suffers from multiple issues. First, the larger
object sizes required for COS I/O efficiency would require a
significant increase in the size of these extents, making it difficult
to maintain data locality and space efficiency across varied insert-
update-delete (IUD) patterns (in Db2, this would result in the need
to enlarge extents from the current default size of 128KB to at least
32MB, that is, move from storing 4 32KB data pages per extent to
storing 1024 32KB data pages per extent, potentially incurring
substantial space wastage from partially filled column extents).
The space wastage incurred would manifest not only in added
write amplification, but also read amplification and a significant
loss of read cache efficiency as a result. Furthermore, any random
data page modification patterns would result in the need to
synchronously rewrite the entire MB sized objects, resulting in
high write amplification. For these reasons it was evident from the
early stages that in order to retain the performance characteristics
of Db2 Warehouse we would need a fundamentally different page
storage model, one that could efficiently translate the pattern of
small random page I/Os inherent in the existing Db2 page model
into the larger sequential object writes required for efficient usage
of COS storage.
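The write-amplification cost of the naive extent-per-object approach follows directly from the sizes given above. The following is a minimal sketch of that arithmetic, using the figures from the text (32 KB pages, extents enlarged from the 128 KB default to 32 MB) and assuming an object-per-extent layout that must rewrite a whole extent to persist any modified page within it.

```python
# Sketch of the extent arithmetic above: 32 KB data pages, extents
# enlarged from the 128 KB default to 32 MB for COS efficiency.
# Assumes a whole-extent rewrite is needed to persist one dirty page.

PAGE_KB = 32

def pages_per_extent(extent_kb: int) -> int:
    """Number of 32 KB data pages packed into one extent."""
    return extent_kb // PAGE_KB

def write_amplification(extent_kb: int, dirty_pages: int = 1) -> float:
    """Bytes written to COS per byte of modified page data."""
    return extent_kb / (dirty_pages * PAGE_KB)

print(pages_per_extent(128))           # current default: 4 pages/extent
print(pages_per_extent(32 * 1024))     # enlarged: 1024 pages/extent
print(write_amplification(32 * 1024))  # 1024x for a single-page update
```

A single random page modification thus inflates into a 32 MB object rewrite, a 1024x write amplification in the worst case, which is the core reason the naive grouping approach was rejected.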
1.2 LSM Tree Based Page Storage Model
One of the main challenges for the development of a new storage
layer for Db2 Warehouse was the need to preserve the existing
MPP engine capabilities, optimizations, and behaviors, which