
HTAP Databases: What is New and What is Next

Guoliang Li

Department of Computer Science, Tsinghua University

liguoliang@tsinghua.edu.cn

Chao Zhang

Department of Computer Science, Tsinghua University

cycchao@mail.tsinghua.edu.cn

ABSTRACT

Processing the mixed workloads of transactions and analytical queries in a single database system can eliminate the ETL process and enable real-time data analysis on the transaction data. However, there is no free lunch. Such systems must balance the trade-off between workload isolation and data freshness due to interweaving workloads of OLTP and OLAP. Since Gartner coined the term Hybrid Transactional/Analytical Processing (HTAP), we have witnessed the emergence of various database systems to support HTAP. One common feature is that they leverage the best of row store and column store to achieve high quality of HTAP. As they have disparate storage strategies and processing techniques to satisfy the requirements of various HTAP applications, it is essential to understand, compare, and evaluate their key techniques. In this tutorial, we offer a comprehensive survey of HTAP databases. We introduce a taxonomy of state-of-the-art HTAP databases according to their storage strategies and architectures. We then take a deep dive into their key techniques regarding transaction processing, analytical processing, data synchronization, query optimization, and resource scheduling. We also introduce existing HTAP benchmarks. Finally, we discuss the research challenges and open problems for HTAP.

CCS CONCEPTS

• Information systems → Database transaction processing; Database query processing.

KEYWORDS

HTAP Databases; Transaction Processing; Query Processing

ACM Reference Format:

Guoliang Li and Chao Zhang. 2022. HTAP Databases: What is New and What is Next. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD ’22), June 12–17, 2022, Philadelphia, PA, USA. ACM, Philadelphia, PA, USA, 6 pages. https://doi.org/10.1145/3514221.3522565

1 INTRODUCTION

Background. Organizations have more data at their disposal than ever, and data keeps arriving with high velocity, volume, and variety [ ]. For businesses with data-intensive applications, it is beneficial to have a single HTAP system that can not only handle on-line transactional processing (OLTP) efficiently, but also perform on-line analytical processing (OLAP) for prompt decision-making. For instance, when equipped with an HTAP system, entrepreneurs in retail applications can analyze the latest transaction data in real time, identify the sales trend, and then take timely actions, e.g., roll out advertising campaigns for promising products [ ]. In finance applications, vendors can leverage an HTAP system to process customer transactions efficiently while simultaneously detecting fraudulent transactions [16, 36, 47].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SIGMOD ’22, June 12–17, 2022, Philadelphia, PA, USA.

ACM ISBN 978-1-4503-9249-5/22/06...$15.00

https://doi.org/10.1145/3514221.3522565

HTAP Definition. Hybrid Transactional/Analytical Processing (HTAP) is an application architecture proposed in a Gartner report [ ] in 2014, which utilizes in-memory computing technologies to enable concurrent analytical and transaction processing on the same in-memory data store. Such an architecture should eliminate the need for an Extract-Transform-Load (ETL) process, thereby accelerating data analytics and bringing dramatic business innovation. In 2018, Gartner extended the HTAP concept to "In-Process HTAP" [ ], an application architecture that supports weaving analytical and transaction processing techniques together as needed to accomplish the business task. This new definition indicates that HTAP is no longer limited to in-memory computing techniques.

Motivation. Over the last few years, numerous database systems [ – ] have been developed to enable HTAP. One common feature is that they utilize the best of row store and column store to achieve high quality of HTAP. Nevertheless, despite this shared dual-store feature, they have disparate storage strategies and processing techniques. The main reason for such diversity is that different classes of HTAP systems target different applications. For instance, it depends on whether OLTP or OLAP is the first-class citizen of the application, or whether both are equally important. It also depends on the requirements of availability, scalability, system performance, and data freshness [ ] specified in the service level agreements (SLAs) [ ]. Consequently, HTAP systems must balance the trade-off between workload isolation and data freshness due to interweaving workloads of OLTP and OLAP. To better harness these HTAP systems for various applications, it is of paramount importance to study, understand, and compare their key techniques. In this tutorial, we study HTAP databases that utilize row store and column store together to efficiently handle the mixed workloads of OLTP and OLAP in a single database system.

Tutorial Overview. We will provide a comprehensive tutorial on HTAP databases. The intended length of the tutorial is 1.5 hours. The tutorial consists of four sections as follows.

(1) HTAP Databases (30 min). This section starts with an introduction to the background of HTAP databases. It provides a classification according to their storage architectures, then introduces the main approaches in each category. As shown in Figure 1, it classifies HTAP databases into four categories: (a) Primary Row Store + In-Memory Column Store; (b) Distributed Row Store + Column Store Replica; (c) Disk Row Store + Distributed Column Store; and (d) Primary Column Store + Delta Row Store. Then, it presents the main HTAP techniques and representatives for each architecture.


[Figure 1 panels: (a) Primary Row Store + In-Memory Column Store; (b) Distributed Row Store + Column Store Replica; (c) Disk Row Store + Distributed Column Store; (d) Primary Column Store + Delta Row Store. Labeled components include Client, Master, Nodes 1–3, Partitions 1–3, Row Store, Column Store, Delta, Log, Memory, Disk, and Persistent Storage, connected by Merge and Transform steps.]

Figure 1: Storage Architectures of State-Of-The-Art HTAP Databases

Table 1: A Classification of State-Of-The-Art HTAP Databases based on the Storage Architecture

| Category | HTAP databases | TP Throughput | AP Throughput | TP Scalability | AP Scalability | Isolation | Freshness |
|---|---|---|---|---|---|---|---|
| Primary Row Store + In-Memory Column Store | Oracle Dual-Format[19], SQL Server[20], DB2 BLU[39] | High | High | Medium | Low | Low | High |
| Distributed Row Store + Column Store Replica | TiDB[18], SingleStore[44] | Medium | Medium | High | High | High | Low |
| Disk Row Store + Distributed Column Store | MySQL Heatwave[31] | Medium | Medium | Medium | High | High | Medium |
| Primary Column Store + Delta Row Store | SAP HANA[43] | Medium | High | Low | Medium | Low | High |

In particular, it summarizes the pros and cons of different HTAP solutions regarding performance, scalability, workload isolation, and data freshness (see Table 1).
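As a small illustration of how the ratings in Table 1 might be used, the following Python sketch encodes the table as a dictionary and filters architectures by required ratings. The field names and the `candidates` helper are hypothetical and introduced only for this example; the ratings themselves are copied from Table 1.

```python
# Ratings copied from Table 1; key names are illustrative assumptions.
TABLE_1 = {
    "Primary Row Store + In-Memory Column Store": {
        "tp_throughput": "High", "ap_throughput": "High",
        "tp_scalability": "Medium", "ap_scalability": "Low",
        "isolation": "Low", "freshness": "High",
    },
    "Distributed Row Store + Column Store Replica": {
        "tp_throughput": "Medium", "ap_throughput": "Medium",
        "tp_scalability": "High", "ap_scalability": "High",
        "isolation": "High", "freshness": "Low",
    },
    "Disk Row Store + Distributed Column Store": {
        "tp_throughput": "Medium", "ap_throughput": "Medium",
        "tp_scalability": "Medium", "ap_scalability": "High",
        "isolation": "High", "freshness": "Medium",
    },
    "Primary Column Store + Delta Row Store": {
        "tp_throughput": "Medium", "ap_throughput": "High",
        "tp_scalability": "Low", "ap_scalability": "Medium",
        "isolation": "Low", "freshness": "High",
    },
}


def candidates(**required: str) -> list[str]:
    """Return the categories whose ratings match every required value."""
    return [
        name
        for name, ratings in TABLE_1.items()
        if all(ratings[key] == value for key, value in required.items())
    ]


# Architectures rated High on freshness.
print(candidates(freshness="High"))
# No category in Table 1 is rated High on both isolation and freshness.
print(candidates(isolation="High", freshness="High"))
```

The second query returns an empty list, which mirrors the trade-off between workload isolation and data freshness discussed above.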

(2) HTAP Techniques (40 min). This section takes a deep dive into the key techniques of HTAP databases, paying particular attention to transaction processing, analytical processing, data synchronization, query optimization, and resource scheduling. The detailed key techniques in each module are shown in Table 2. Overall, it focuses on five task types for HTAP as follows.

– Transaction processing (TP) techniques. This part will introduce two types of TP techniques, including (i) MVCC + logging [ ] that relies on multi-version concurrency control (MVCC) protocols and logging techniques for transaction processing; and (ii) 2PC + Raft + logging [ ] that processes the transactions in a distributed architecture based on a two-phase commit (2PC) protocol, a Raft-based consensus algorithm, and logging techniques.

– Analytical processing (AP) techniques. This part will introduce three kinds of AP techniques. The first type is (i) in-memory delta and column scan [ ] that responds to an analytical query by scanning the in-memory columnar data together with the visible delta tuples that have not yet been merged. The second type is (ii) disk-based delta and column scan [ ] that scans the log-based delta files and the column store together for an incoming query. The third type is (iii) column scan [ ] that performs the query purely in the column store.

– Data synchronization (DS) techniques. This part will introduce three types of DS techniques for synchronizing data between OLTP and OLAP, including (i) in-memory delta merge [ ] that merges the newly inserted in-memory delta data into the main column store; (ii) disk-based delta merge [ ] that periodically merges the disk-based delta files into the main column store; and (iii) rebuild from primary row store [ ] that rebuilds the in-memory column store from the primary row store.

– Query optimization techniques. This part will introduce three aspects of query optimization techniques, including (i) column selection for HTAP [ ] that automatically selects the columns to load from the primary store into main memory based on the historical workload; (ii) hybrid row/column scan [ ] that relies on cost-based functions to determine whether to perform a query over the row store or over the column store; and (iii) CPU/GPU acceleration for HTAP [ ] that leverages heterogeneous hardware, i.e., a CPU/GPU architecture, to accelerate HTAP workloads.

– Resource scheduling techniques. This part will introduce the resource scheduling techniques that aim to improve resource utilization by dynamically allocating resources, e.g., CPU and memory, for HTAP. It mainly introduces two types of techniques. The first one is workload-driven scheduling [ ] that adaptively adjusts the resources of OLTP and OLAP workloads based on the execution status of the workload. The second one is freshness-driven scheduling [ ] that controls the execution modes of HTAP workloads based on the freshness metric.
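To make the interplay between the in-memory delta and column scan and the in-memory delta merge described above more concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not the implementation of any system in Table 1: the class `DeltaColumnTable`, its single integer column, and the integer commit timestamps are all hypothetical, and visibility is reduced to a simple timestamp comparison rather than a full MVCC protocol.

```python
from dataclasses import dataclass, field


@dataclass
class DeltaColumnTable:
    """Toy single-column table: a main column store plus an in-memory delta."""

    main: list[int] = field(default_factory=list)               # merged columnar values
    delta: list[tuple[int, int]] = field(default_factory=list)  # (commit_ts, value)
    merged_ts: int = 0  # every commit with ts <= merged_ts lives in main

    def insert(self, commit_ts: int, value: int) -> None:
        """OLTP path: append a committed tuple to the delta."""
        self.delta.append((commit_ts, value))

    def scan_sum(self, snapshot_ts: int) -> int:
        """OLAP path: scan main plus the delta tuples visible at snapshot_ts."""
        if snapshot_ts < self.merged_ts:
            # Simplification: main keeps no versions, so older snapshots are rejected.
            raise ValueError("snapshot is older than the last merge")
        visible = (value for ts, value in self.delta if ts <= snapshot_ts)
        return sum(self.main) + sum(visible)

    def merge(self, up_to_ts: int) -> None:
        """Data synchronization: move delta tuples with ts <= up_to_ts into main."""
        remaining = []
        for ts, value in self.delta:
            if ts <= up_to_ts:
                self.main.append(value)
            else:
                remaining.append((ts, value))
        self.delta = remaining
        self.merged_ts = max(self.merged_ts, up_to_ts)


table = DeltaColumnTable()
table.insert(1, 10)
table.insert(2, 20)
table.insert(3, 30)
before = table.scan_sum(snapshot_ts=2)  # reads two visible delta tuples
table.merge(up_to_ts=2)                 # moves them into the main column store
after = table.scan_sum(snapshot_ts=2)   # same answer, now served from main
print(before, after, table.scan_sum(snapshot_ts=3))
```

In this sketch a query returns the same result before and after a merge for the same snapshot, which is the property that lets analytical queries see fresh data without waiting for synchronization. The price is that each scan also has to read the unmerged delta.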

(3) HTAP Benchmarks (10 min). This section introduces the existing benchmarks and evaluation practices on HTAP databases. It will introduce several end-to-end HTAP benchmarks including TPC-C [ ], TPC-H [ ], HTAPbench [ ], and CH-benchmark [ ]. Specifically, it will walk through the key aspects of the benchmarks, including data generation, execution rules, and performance metrics. In addition, it will introduce two HTAP micro-benchmarks: ADAPT [ ] and HAP [ ]. After that, it summarizes the key insights from existing evaluation practices [13, 38, 40, 42, 45].

(4) Challenges and Open Problems (10 min). The final section concludes the tutorial and discusses the research challenges and open problems for HTAP techniques. It summarizes the tutorial topics, then presents several challenges and open problems. First, it presents the limitations of existing methods for column selection under HTAP workloads, then discusses the possibility of learning-based methods for this task. Second, it discusses the challenges of HTAP query optimization and calls for a learned query optimizer for HTAP. Third, it discusses the limitations of current approaches to HTAP resource scheduling, then calls for new adaptive methods. Finally, it discusses the limitations of existing benchmarks and envisions a new HTAP benchmark suite.
