Hybrid Transactional/Analytical Processing: A Survey
Fatma Özcan
IBM Research - Almaden
fozcan@us.ibm.com
Yuanyuan Tian
IBM Research - Almaden
ytian@us.ibm.com
Pınar Tözün
IBM Research - Almaden
ptozun@us.ibm.com
ABSTRACT
The popularity of large-scale real-time analytics applications
(real-time inventory/pricing, recommendations from mobile
apps, fraud detection, risk analysis, IoT, etc.) keeps ris-
ing. These applications require distributed data manage-
ment systems that can handle fast concurrent transactions
(OLTP) and analytics on the recent data. Some of them
even need to run analytical queries (OLAP) as part of
transactions. Efficient processing of individual transactional
and analytical requests, however, leads to different optimiza-
tions and architectural decisions while building a data man-
agement system.
For the kind of data processing that requires both ana-
lytics and transactions, Gartner recently coined the term
Hybrid Transactional/Analytical Processing (HTAP). Many
HTAP solutions that target these new applications are emerging
from both industry and academia. While some
of these are single system solutions, others are a looser cou-
pling of OLTP databases or NoSQL systems with analytical
big data platforms, like Spark. The goal of this tutorial is
to 1) quickly review the historical progression of OLTP and
OLAP systems, 2) discuss the driving factors for HTAP,
and finally 3) provide a deep technical analysis of existing
and emerging HTAP solutions, detailing their key architec-
tural differences and trade-offs.
1. INTRODUCTION
In this tutorial, we plan to survey existing and emerging
HTAP (Hybrid Transactional/Analytical Processing) solutions.
HTAP is a term created by Gartner to describe
systems that can support both OLTP (on-line transaction
processing) and OLAP (on-line analytical processing)
within a single transaction. However, the term HTAP is
currently used more broadly, even for solutions that sup-
port insertions (not necessarily ACID transactions) as well
as OLAP queries. Some of these systems have the ability to
run analytical queries over the very recent data, while others
need some delay before the queries see the latest data.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
SIGMOD ’17, May 14–19, 2017, Chicago, IL, USA.
© 2017 ACM. ISBN 978-1-4503-4197-4/17/05...$15.00
DOI: http://dx.doi.org/sigt013
To understand HTAP, we first need to look into OLTP
and OLAP systems and how they progressed over the years.
Relational databases have been used for both transaction
processing and analytics. However, OLTP and OLAP
systems have very different characteristics. OLTP systems
are identified by their individual record insert/delete/up-
date statements, as well as point queries that benefit from
indexes. One cannot think about OLTP systems without
indexing support. OLAP systems, on the other hand, are
updated in batches and usually require scans of the tables.
Batch insertion into OLAP systems is an artifact of ETL
(extract, transform, load) systems that consolidate and
transform transactional data from OLTP systems into an OLAP
environment for analysis.
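The contrast between the two access patterns can be sketched in a few
lines of Python. This is a minimal illustration and is not taken from
any of the surveyed systems: the `orders` table, its fields, and the
helper names are hypothetical, and a plain dictionary stands in for an
index.

```python
# Hypothetical toy table: each record is (order_id, customer, amount).
orders = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "alice", 7.5),
]

# OLTP-style access: an index (here a dict keyed on order_id) lets a point
# query reach one record without touching the rest of the table.
index_by_id = {row[0]: row for row in orders}


def point_lookup(order_id):
    """Return the single record with the given key, or None if absent."""
    return index_by_id.get(order_id)


def insert_order(row):
    """Individual record insert: append the record and maintain the index."""
    orders.append(row)
    index_by_id[row[0]] = row


def total_amount_per_customer():
    """OLAP-style access: the aggregate scans every record in the table."""
    totals = {}
    for _, customer, amount in orders:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


insert_order((4, "bob", 10.0))
print(point_lookup(2))
print(total_amount_per_customer())
```

The point lookup costs the same regardless of table size, whereas the
aggregate grows with the number of records, which is why the two
workloads pull the storage design in different directions.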
After the seminal paper of Stonebraker [34] arguing for
multiple specialized systems, the database field has seen an
influx of specialized column-oriented OLAP systems, such as
BLU [30], Vertica [23], ParAccel, GreenPlumDB, and Vectorwise,
as well as many in-memory OLTP systems, including VoltDB [35],
Hekaton [13], and MemSQL [24], among others. The main driver
for this resurgence in database engines is the advances in
modern hardware. This second generation of OLAP and
OLTP systems takes better advantage of multiple cores, various
levels of memory caches, and large memories.
At the same time, the last decade has seen an explosion of
big data technologies, driven by a new generation of
applications. NoSQL or key-value stores, such as Voldemort [32],
Cassandra [8], and RocksDB [31], offer fast inserts and lookups
and very high scale-out, but are limited in their query capabilities
and offer only loose transactional guarantees (see Mohan's
tutorial [25]). There have also been many SQL-on-Hadoop
[10] offerings, including Hive [36], Big SQL [15], Impala [20],
and Spark SQL [3], that provide analytics capabilities over
large data sets, focusing on OLAP queries only and lacking
transaction support. Although all these systems support
queries over text and CSV files, their focus has been on
columnar storage formats, like ORCFile [27] and Parquet
[1].
Recent years have seen a growing need for real-time
analytics. Mobile devices and the Internet of Things have given
rise to a new generation of applications that are characterized
by heavy ingest rates, i.e., they produce large amounts
of data in a short time, and by the need to analyze that data
in real time. Enterprises are also pushing for more real-time
analysis of their data to drive competitive advantage, and
as such they need the ability to run analytics on their
operational data as soon as possible.
With these developments, there is now a lot of interest in,
and research focus on, providing HTAP solutions over big
data sets. In this tutorial, we plan to provide a quick histor-
ical perspective into the progression of different technologies
and discuss the current set of HTAP solutions. We will ex-
amine the different architectural aspects of the current so-
lutions, identifying their strengths and weaknesses. We will
categorize existing systems along many technological dimen-
sions, and provide deep dives into a few representative sys-
tems. Finally, we will discuss existing research challenges to
realize true HTAP, where a single transaction can contain
both insert/update/delete statements, as well as complex
OLAP queries.
2. HTAP SOLUTIONS: DESIGN OPTIONS
HTAP solutions today follow a variety of design practices.
This part of the tutorial goes over them to highlight their
main trade-offs while giving examples from industrial offer-
ings and academic solutions.
One of the major design decisions HTAP systems have
to make is whether or not to use the same engine for both
OLTP and OLAP requests.
2.1 Single System for OLTP and OLAP
The traditional relational databases (e.g., DB2, Oracle,
Microsoft SQL Server) have the ability to support OLTP and
OLAP in one engine using a single type of data organization
(mainly row-stores). However, they are not very efficient for
either of these workloads.
Therefore, following the "one size doesn't fit all" rule [34],
the past decade has seen the rise of specialized engines for
OLTP and OLAP that exploit the advances in modern hardware
(larger main memory, multicores, etc.). Various vendors
and academic groups have built in-memory optimized
row-stores (e.g., VoltDB [35], Hekaton [13], MemSQL [24],
Silo [37], ...) and column-stores (e.g., MonetDB [7],
Vertica [23], BLU [30], SAP HANA [14], ...) specialized for
transactional and analytical processing, respectively. These
systems departed from the traditional code bases of relational
databases and built leaner engines from scratch, also avoiding
the large instruction footprint of the traditional engines.
However, many of the systems optimized for one type of
processing later started adding support for the other type in
order to support HTAP. These systems mainly differ in
the data organization they use for their transactional and
analytical requests.
2.1.1 Using Separate Data Organization for OLTP and OLAP
SAP HANA [14] and Oracle's TimesTen [22] have engines
that are mainly optimized for in-memory columnar processing,
which is more beneficial for OLAP workloads. These
systems also support ACID transactions. However, they use
different data organizations for data ingestion (row-wise)
and analytics (columnar).
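To make the two data organizations concrete, the following Python
sketch pivots a row-wise table into a columnar one. It is only an
illustration under assumed names: the `(id, region, revenue)` schema
and the `rows_to_columns` helper are hypothetical and do not describe
the internals of any system mentioned here.

```python
# Hypothetical schema: (id, region, revenue), stored row-wise as tuples.
rows = [
    (1, "EU", 100),
    (2, "US", 250),
    (3, "EU", 75),
]

COLUMN_NAMES = ("id", "region", "revenue")


def rows_to_columns(rows):
    """Pivot row-wise tuples into one list per attribute (columnar layout)."""
    columns = {name: [] for name in COLUMN_NAMES}
    for row in rows:
        for name, value in zip(COLUMN_NAMES, row):
            columns[name].append(value)
    return columns


columns = rows_to_columns(rows)

# Ingestion favours the row layout: one append writes the whole record.
rows.append((4, "US", 50))

# An analytical aggregate favours the columnar layout: it reads only the
# one attribute it needs. Note that `columns` was built before the append
# above, so it does not contain record 4.
total_revenue = sum(columns["revenue"])
print(total_revenue)
```

In the row layout a single append captures a full record, while in the
columnar layout the aggregate touches one contiguous list and skips the
other attributes entirely.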
Conversely, MemSQL [24] has an engine that was primarily
designed for scalable in-memory OLTP, but today it
supports fast analytical queries as well. It ingests data
in row format and also keeps the in-memory portion of
the data in row format. When data is written to disk, it is
converted to columnar format for faster analytics. Similarly,
IBM dashDB [11] is the evolution of a traditional row store
into an HTAP system with hybrid row-wise and columnar
data organizations for OLTP and OLAP workloads,
respectively.
On the other hand, HyPer [19] was designed from the beginning
to support both fast transactions and analytics using
one engine. Even though it initially used row-wise
processing of data for both OLTP and OLAP, today it also
provides the option of choosing a columnar format in order
to run analytical requests more efficiently.
Finally, the recent academic project Peloton [28] aims to
build an autonomous in-memory HTAP system. It provides
adaptive data organization [4], which changes the data
format at run-time based on the type of requests.
All these systems require converting the data between row
and columnar formats for transactions and analytics. Due
to these conversions, the latest committed data might not
be available to the analytical queries right away in these
systems.
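The freshness gap caused by format conversion can be sketched as
follows. This is a deliberately simplified model, not the design of any
system above: the `HybridTable` class and its methods are hypothetical,
and we assume that analytical scans read only the columnar store, which
a real system need not do.

```python
class HybridTable:
    """Toy table: row-wise delta for writes, columnar main store for scans."""

    def __init__(self, column_names):
        self.column_names = list(column_names)
        # Row-wise buffer that absorbs newly committed records.
        self.delta_rows = []
        # Columnar store: one list per attribute, read by analytics.
        self.main = {name: [] for name in self.column_names}

    def commit_insert(self, row):
        """Transactional path: append the committed record to the delta."""
        self.delta_rows.append(row)

    def merge_delta(self):
        """Conversion step: move delta rows into the columnar main store."""
        for row in self.delta_rows:
            for name, value in zip(self.column_names, row):
                self.main[name].append(value)
        self.delta_rows = []

    def analytic_sum(self, column):
        """Analytical path (assumption: scans only the columnar store)."""
        return sum(self.main[column])


table = HybridTable(["id", "amount"])
table.commit_insert((1, 40))
table.commit_insert((2, 60))

stale = table.analytic_sum("amount")  # delta not converted yet
table.merge_delta()
fresh = table.analytic_sum("amount")  # conversion done, commits visible
print(stale, fresh)
```

Until `merge_delta` runs, the committed records are invisible to the
analytical path in this sketch, which is the delay described above. How
often the conversion runs is therefore a trade-off between ingest cost
and data freshness.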
2.1.2 Same Data Organization for both OLTP and OLAP
H²TAP [2] is an academic project that aims to build an
HTAP system focusing mainly on the hardware utilization
of a single node when running on heterogeneous hardware.
It falls under this category since the system is designed as a
row-store.
Among the SQL-on-Hadoop systems, there have also been
HTAP solutions that extend existing OLAP systems with
the ability to update data. Hive, since version 0.13, has
supported transactions (insert, update, and delete)
at the row level [18] for ORCFile, its columnar data
format. However, the primary use cases are updating
dimension tables and streaming data ingest. The integration
of Impala [20] with the storage manager Kudu [21] also
allows the SQL-on-Hadoop engine to handle updates and
deletes. The same Kudu storage is also used for running
analytical queries.
Since these systems do not require conversion from one
data organization to another in order to perform transac-
tional and analytical requests, the OLAP queries can read
the latest committed data. However, they might face the
same shortcomings that the traditional relational engines faced:
they do not have a data organization that is optimal for
both types of processing. Therefore, they may rely on batching
of requests for fast transactions due to the overheads of
processing data over a non-row-wise format, or perform
sub-optimally for analytics due to a non-columnar format.
2.2 Separate OLTP and OLAP Systems
The systems under this category can be further distin-
guished in the way they handle the underlying storage, i.e.,
whether they use the same storage for OLTP and OLAP.
2.2.1 Decoupling the Storage for OLTP and OLAP
Many applications loosely couple an OLTP system and an OLAP
system together for HTAP. It is up to the applications to
maintain the hybrid architecture. The operational data in
the OLTP system are aged to the OLAP system using a standard
ETL process. In fact, this is very common in the big
data world, where applications use a fast key-value store like
Cassandra for transactional workloads, and the operational
data are groomed into Parquet or ORC files on HDFS for
a SQL-on-Hadoop system to query. As a result, there is