Hybrid Transactional/Analytical Processing: A Survey

Fatma Özcan

IBM Research - Almaden

fozcan@us.ibm.com

Yuanyuan Tian

IBM Research - Almaden

ytian@us.ibm.com

Pınar Tözün

IBM Research - Almaden

ptozun@us.ibm.com

ABSTRACT

The popularity of large-scale real-time analytics applications (real-time inventory/pricing, recommendations from mobile apps, fraud detection, risk analysis, IoT, etc.) keeps rising. These applications require distributed data management systems that can handle fast concurrent transactions (OLTP) and analytics on the recent data. Some of them even need to run analytical queries (OLAP) as part of transactions. Efficient processing of individual transactional and analytical requests, however, leads to different optimizations and architectural decisions when building a data management system.

For the kind of data processing that requires both analytics and transactions, Gartner recently coined the term Hybrid Transactional/Analytical Processing (HTAP). Many HTAP solutions that target these new applications are emerging from both industry and academia. While some of these are single-system solutions, others are a looser coupling of OLTP databases or NoSQL systems with analytical big data platforms, like Spark. The goal of this tutorial is to (1) quickly review the historical progression of OLTP and OLAP systems, (2) discuss the driving factors for HTAP, and finally (3) provide a deep technical analysis of existing and emerging HTAP solutions, detailing their key architectural differences and trade-offs.

1. INTRODUCTION

In this tutorial, we plan to survey existing and emerging HTAP (Hybrid Transactional/Analytical Processing) solutions. HTAP is a term created by Gartner to describe systems that can support both OLTP (online transaction processing) and OLAP (online analytical processing) within a single transaction. However, the term HTAP is currently used more broadly, even for solutions that support insertions (not necessarily ACID transactions) as well as OLAP queries. Some of these systems can run analytical queries over very recent data, while others need some delay before the queries see the latest data.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SIGMOD ’17, May 14–19, 2017, Chicago, IL, USA.

© 2017 ACM. ISBN 978-1-4503-4197-4/17/05...$15.00

DOI: http://dx.doi.org/sigt013

To understand HTAP, we first need to look at OLTP and OLAP systems and how they have progressed over the years. Relational databases have been used for both transaction processing and analytics. However, OLTP and OLAP systems have very different characteristics. OLTP systems are characterized by individual record insert/delete/update statements, as well as point queries that benefit from indexes. One cannot think about OLTP systems without indexing support. OLAP systems, on the other hand, are updated in batches and usually require scans of the tables. Batch insertion into OLAP systems is an artifact of ETL (extract, transform, load) systems that consolidate and transform transactional data from OLTP systems into an OLAP environment for analysis.
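The contrast between the two access patterns can be illustrated with a minimal sketch. It uses Python's standard `sqlite3` module purely for convenience; the `orders` table, its columns, and its data are hypothetical and are not taken from any system discussed in this tutorial.

```python
import sqlite3

# Hypothetical schema and data, used only to contrast the two access patterns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# OLTP-style work: individual record inserts and updates in short transactions.
with conn:
    conn.execute("INSERT INTO orders VALUES (1, 'alice', 10.0)")
    conn.execute("INSERT INTO orders VALUES (2, 'bob', 25.0)")
    conn.execute("INSERT INTO orders VALUES (3, 'alice', 5.0)")
with conn:
    conn.execute("UPDATE orders SET amount = 12.0 WHERE id = 1")

# OLTP-style point query: fetches one record, located through the primary key.
point = conn.execute("SELECT amount FROM orders WHERE id = ?", (1,)).fetchone()

# OLAP-style query: reads the whole table and aggregates over all records.
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
```

Here a single row-oriented engine serves both kinds of request. At this scale the difference is invisible, but the point query touches one record while the aggregate has to read every record in the table.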

After the seminal paper of Stonebraker [34] arguing for multiple specialized systems, the database field has seen an influx of specialized column-oriented OLAP systems, such as BLU [30], Vertica [23], ParAccel, GreenPlumDB, and Vectorwise, as well as many in-memory OLTP systems, including VoltDB [35], Hekaton [13], and MemSQL [24], among others. The main driver for this resurgence in database engines is the advances in modern hardware. This second generation of OLAP and OLTP systems takes better advantage of multi-core processors, various levels of memory caches, and large memories.

At the same time, the last decade has seen an explosion of big data technologies, driven by a new generation of applications. NoSQL or key-value stores, such as Voldemort [32], Cassandra [8], and RocksDB [31], offer fast inserts and lookups and very high scale-out, but are limited in their query capabilities and offer only loose transactional guarantees (see Mohan’s tutorial [25]). There have also been many SQL-on-Hadoop [10] offerings, including Hive [36], Big SQL [15], Impala [20], and Spark SQL [3], that provide analytics capabilities over large data sets, focusing on OLAP queries only and lacking transaction support. Although all these systems support queries over text and CSV files, their focus has been on columnar storage formats, like ORCFile [27] and Parquet [1].

Recent years have seen a growing need for more real-time analytics. In addition, mobile and Internet of Things (IoT) technologies have given rise to a new generation of applications that are characterized by heavy ingest rates, i.e., they produce large amounts of data in a short time, and by their need for more real-time analysis. Enterprises are pushing for more real-time analysis of their data to drive competitive advantage, and as such they need the ability to run analytics on their operational data as soon as possible.

With these developments, there is now a lot of interest and research focus on providing HTAP solutions over big data sets. In this tutorial, we plan to provide a quick historical perspective on the progression of different technologies and discuss the current set of HTAP solutions. We will examine the different architectural aspects of the current solutions, identifying their strengths and weaknesses. We will categorize existing systems along many technological dimensions and provide deep dives into a few representative systems. Finally, we will discuss the open research challenges in realizing true HTAP, where a single transaction can contain both insert/update/delete statements and complex OLAP queries.

2. HTAP SOLUTIONS: DESIGN OPTIONS

HTAP solutions today follow a variety of design practices. This part of the tutorial goes over them to highlight their main trade-offs while giving examples from industrial offerings and academic solutions.

One of the major design decisions HTAP systems have to make is whether or not to use the same engine for both OLTP and OLAP requests.

2.1 Single System for OLTP and OLAP

Traditional relational databases (e.g., DB2, Oracle, Microsoft SQL Server) can support OLTP and OLAP in one engine using a single type of data organization (mainly row-stores). However, they are not very efficient for either of these workloads.

Therefore, following the "one size doesn’t fit all" rule [34], the past decade has seen the rise of specialized engines for OLTP and OLAP exploiting the advances in modern hardware (larger main memory, multicores, etc.). Various vendors and academic groups have built in-memory optimized row-stores (e.g., VoltDB [35], Hekaton [13], MemSQL [24], Silo [37], ...) and column-stores (e.g., MonetDB [7], Vertica [23], BLU [30], SAP HANA [14], ...) specialized for transactional and analytical processing, respectively. These systems departed from the traditional code bases of relational databases and built leaner engines from scratch, in part to avoid the large instruction footprint of the traditional engines.

However, many of the systems optimized for one type of processing later started adding support for the other type in order to support HTAP. These systems mainly differ in the data organization they use for their transactional and analytical requests.

2.1.1 Using Separate Data Organization for OLTP and OLAP

SAP HANA [14] and Oracle’s TimesTen [22] have engines that are mainly optimized for in-memory columnar processing, which is more beneficial for OLAP workloads. These systems also support ACID transactions. However, they use different data organizations for data ingestion (row-wise) and analytics (columnar).

Conversely, MemSQL [24] has an engine that was primarily designed for scalable in-memory OLTP, but today it supports fast analytical queries as well. It ingests data in row format and keeps the in-memory portion of the data in row format. When data is written to disk, it is converted to columnar format for faster analytics. Similarly, IBM dashDB [11] is the evolution of a traditional row store into an HTAP system with hybrid row-wise and columnar data organizations for OLTP and OLAP workloads, respectively.

On the other hand, HyPer [19] was designed from the beginning to support both fast transactions and analytics using one engine. Even though it initially used row-wise processing of data for both OLTP and OLAP, today it also provides the option of choosing a columnar format to run analytical requests more efficiently.

Finally, the recent academic project Peloton [28] aims to build an autonomous in-memory HTAP system. It provides adaptive data organization [4], which changes the data format at run-time based on the type of requests.

All these systems require converting the data between row and columnar formats for transactions and analytics. Due to these conversions, the latest committed data might not be immediately available to analytical queries in these types of systems.
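The freshness gap caused by format conversion can be sketched as follows. This is a toy model under stated assumptions, not the mechanism of any system named above: the class `DeltaMainTable`, its method names, and the explicit `merge` step are all hypothetical, chosen only to show a row-wise write buffer that is periodically converted into a columnar store.

```python
class DeltaMainTable:
    """Toy table: row-wise delta for writes, columnar main store for scans."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.delta_rows = []                       # row-wise, write-optimized
        self.main = {c: [] for c in self.columns}  # columnar, scan-optimized

    def insert(self, row):
        # A committed insert lands in the row-wise delta only.
        self.delta_rows.append(tuple(row[c] for c in self.columns))

    def merge(self):
        # Convert buffered rows to columnar form and append them to main.
        for row in self.delta_rows:
            for name, value in zip(self.columns, row):
                self.main[name].append(value)
        self.delta_rows.clear()

    def sum_main_only(self, column):
        # Analytics over the columnar store alone: cheap, but possibly stale.
        return sum(self.main[column])

    def sum_fresh(self, column):
        # Analytics that also reads the delta: fresh, at extra per-query cost.
        idx = self.columns.index(column)
        return sum(self.main[column]) + sum(r[idx] for r in self.delta_rows)


t = DeltaMainTable(["id", "amount"])
t.insert({"id": 1, "amount": 10})
t.insert({"id": 2, "amount": 5})
stale = t.sum_main_only("amount")   # the delta has not been merged yet
fresh = t.sum_fresh("amount")       # main plus delta
t.merge()
merged = t.sum_main_only("amount")  # visible in main only after the merge
```

Until `merge` runs, a query that reads only the columnar store misses the two committed inserts. A query that also consults the row-wise delta sees them, but pays for reading two organizations on every scan.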

2.1.2 Same Data Organization for both OLTP and OLAP

TAP [2] is an academic project that aims to build an HTAP system focusing mainly on the hardware utilization of a single node when running on heterogeneous hardware. It falls under this category since the system is designed as a row-store.

Among the SQL-on-Hadoop systems, there have also been HTAP solutions that extend existing OLAP systems with the ability to update data. Hive, since version 0.13, has provided transaction support (insert, update, and delete) at the row level [18] for ORCFile, its columnar data format. However, the primary use cases are updating dimension tables and streaming data ingest. The integration of Impala [20] with the storage manager Kudu [21] also allows the SQL-on-Hadoop engine to handle updates and deletes. The same Kudu storage is also used for running analytical queries.

Since these systems do not require conversion from one data organization to another in order to perform transactional and analytical requests, the OLAP queries can read the latest committed data. However, they might face the same shortcomings that traditional relational engines faced: they do not have a data organization that is optimal for both types of processing. Therefore, they may rely on batching of requests for fast transactions, due to the overheads of processing data in a non-row-wise format, or perform sub-optimally for analytics due to a non-columnar format.
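Why neither organization suits both workloads can be seen in a small sketch. The table, the function names, and the counts of "fields touched" and "locations written" are hypothetical and serve only as a crude proxy for data movement; real engines differ in many other respects (compression, caching, vectorized execution).

```python
# Hypothetical six-record table, held once per layout.
rows = [{"id": i, "customer": f"c{i % 3}", "amount": float(i)} for i in range(6)]
columns = {name: [r[name] for r in rows] for name in ("id", "customer", "amount")}


def sum_amount_rows(rows):
    """Aggregate over a row layout: every record is visited with all its fields."""
    total, fields_touched = 0.0, 0
    for record in rows:
        fields_touched += len(record)
        total += record["amount"]
    return total, fields_touched


def sum_amount_columns(columns):
    """Aggregate over a column layout: only the 'amount' array is read."""
    values = columns["amount"]
    return sum(values), len(values)


def update_record_rows(rows, position, new_record):
    """Overwrite one record in a row layout: a single location is written."""
    rows[position] = dict(new_record)
    return 1


def update_record_columns(columns, position, new_record):
    """Overwrite one record in a column layout: one write per column array."""
    for name, value in new_record.items():
        columns[name][position] = value
    return len(new_record)


row_scan = sum_amount_rows(rows)        # same total, three fields per record
col_scan = sum_amount_columns(columns)  # same total, one field per record
replacement = {"id": 2, "customer": "c2", "amount": 20.0}
row_writes = update_record_rows(rows, 2, replacement)
col_writes = update_record_columns(columns, 2, replacement)
```

Both layouts return the same aggregate, but the row layout drags every field of every record through the scan, while the column layout spreads a single record update across one array per column. A system that commits to one layout accepts one of these two costs.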

2.2 Separate OLTP and OLAP Systems

The systems under this category can be further distinguished by the way they handle the underlying storage, i.e., whether they use the same storage for OLTP and OLAP.

2.2.1 Decoupling the Storage for OLTP and OLAP

Many applications loosely couple an OLTP system and an OLAP system together for HTAP. It is up to the applications to maintain the hybrid architecture. The operational data in the OLTP system are aged to the OLAP system using a standard ETL process. In fact, this is very common in the big data world, where applications use a fast key-value store like Cassandra for transactional workloads, and the operational data are groomed into Parquet or ORC files on HDFS for a SQL-on-Hadoop system for queries. As a result, there is
