
With these developments, there is now considerable interest and research focus on providing HTAP solutions over big data sets. In this tutorial, we plan to provide a quick historical perspective on the progression of different technologies and discuss the current set of HTAP solutions. We will examine the different architectural aspects of the current solutions, identifying their strengths and weaknesses. We will categorize existing systems along many technological dimensions, and provide deep dives into a few representative systems. Finally, we will discuss the research challenges that remain in realizing true HTAP, where a single transaction can contain both insert/update/delete statements and complex OLAP queries.
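To make the notion of a "true HTAP" transaction concrete, the following sketch runs insert/update/delete statements and an analytical aggregation inside a single transaction. SQLite is used here only because it ships with Python; it is not an HTAP engine, and the schema and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 45.5)])
conn.commit()

with conn:  # one transaction: inserts, an update, a delete, and an OLAP query
    conn.execute("INSERT INTO orders VALUES (4, 'US', 200.0)")
    conn.execute("UPDATE orders SET amount = amount * 1.1 WHERE region = 'EU'")
    conn.execute("DELETE FROM orders WHERE amount < 50")
    # The analytical query sees the not-yet-committed changes of its own transaction.
    for region, total in conn.execute(
            "SELECT region, SUM(amount) FROM orders GROUP BY region"):
        print(region, round(total, 2))
```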
2. HTAP SOLUTIONS: DESIGN OPTIONS
HTAP solutions today follow a variety of design practices. This part of the tutorial surveys these practices to highlight their main trade-offs, giving examples from both industrial offerings and academic solutions.
One of the major design decisions HTAP systems have
to make is whether or not to use the same engine for both
OLTP and OLAP requests.
2.1 Single System for OLTP and OLAP
Traditional relational databases (e.g., DB2, Oracle, Microsoft SQL Server) have the ability to support OLTP and OLAP in one engine using a single type of data organization (mainly row-stores). However, they are not very efficient for either of these workloads.
Therefore, following the "one size doesn't fit all" maxim [34], the past decade has seen the rise of specialized engines for OLTP and OLAP that exploit the advances in modern hardware (larger main memory, multicores, etc.). Various vendors and academic groups have built in-memory-optimized row-stores (e.g., VoltDB [35], Hekaton [13], MemSQL [24], Silo [37], ...) and column-stores (e.g., MonetDB [7], Vertica [23], BLU [30], SAP HANA [14], ...) specialized for transactional and analytical processing, respectively. These systems departed from the traditional code bases of relational databases and were built as leaner engines from scratch, in part to avoid the large instruction footprint of the traditional engines.
However, many of the systems optimized for one type of processing later started adding support for the other type in order to support HTAP. These systems mainly differ in the data organization they use for their transactional and analytical requests.
2.1.1 Using Separate Data Organization for OLTP and OLAP
SAP HANA [14] and Oracle's TimesTen [22] have engines that are mainly optimized for in-memory columnar processing, which is more beneficial for OLAP workloads. These systems also support ACID transactions. However, they use a different data organization for data ingestion (row-wise) than for analytics (columnar).
Conversely, MemSQL [24] has an engine that was primarily designed for scalable in-memory OLTP, but today it supports fast analytical queries as well. It ingests data in row format and keeps the in-memory portion of the data in that format; when data is written to disk, it is converted to columnar format for faster analytics. Similarly, IBM dashDB [11] is the evolution of a traditional row-store into an HTAP system, with hybrid row-wise and columnar data organizations for OLTP and OLAP workloads, respectively.
In contrast, HyPer [19] was designed from the beginning to support both fast transactions and analytics in one engine. Although it initially used row-wise processing of data for both OLTP and OLAP, today it also provides the option of a columnar format to run analytical requests more efficiently.
Finally, the recent academic project Peloton [28] aims to build an autonomous in-memory HTAP system. It provides adaptive data organization [4], which changes the data format at run-time based on the type of requests.
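As a rough intuition for adaptive data organization, the toy class below tracks how a table is accessed and flips its layout when the workload shifts. The counters and thresholds are invented for illustration and do not reflect Peloton's actual algorithm [4].

```python
class AdaptiveTable:
    """Toy table that adapts its storage layout to the observed workload."""
    def __init__(self):
        self.layout = "row"        # start optimized for OLTP
        self.point_ops = 0         # single-tuple reads and writes
        self.scan_ops = 0          # full-column scans

    def record_access(self, kind):
        if kind == "point":
            self.point_ops += 1
        else:
            self.scan_ops += 1
        self._maybe_reorganize()

    def _maybe_reorganize(self):
        total = self.point_ops + self.scan_ops
        if total < 1000:           # wait for enough evidence
            return
        scan_ratio = self.scan_ops / total
        if scan_ratio > 0.7 and self.layout == "row":
            self.layout = "column"           # workload turned analytical
        elif scan_ratio < 0.3 and self.layout == "column":
            self.layout = "row"              # workload turned transactional
        self.point_ops = self.scan_ops = 0   # start a fresh observation window

table = AdaptiveTable()
for _ in range(1000):
    table.record_access("scan")
print(table.layout)                # 'column' after a scan-heavy window
```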
All these systems require converting data between row and columnar formats for transactions and analytics. Because of these conversions, the latest committed data might not be immediately visible to analytical queries in these systems.
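The sketch below captures the common pattern behind these designs, under assumed names (HybridStore, convert_delta): fresh rows accumulate in a row-format delta, a background step converts them into columnar form, and a column scan sees the latest commits only if it pays the extra cost of merging in the delta.

```python
class HybridStore:
    def __init__(self, columns):
        self.columns = columns
        self.delta = []                              # recent rows, row format
        self.main = {c: [] for c in columns}         # converted data, column format

    def insert(self, row):                           # OLTP path: append one tuple
        self.delta.append(row)

    def convert_delta(self):                         # background row-to-column step
        for row in self.delta:
            for c in self.columns:
                self.main[c].append(row[c])
        self.delta = []

    def scan(self, column, include_delta=True):      # OLAP path: column scan
        values = list(self.main[column])
        if include_delta:                            # pay extra to see fresh data
            values += [row[column] for row in self.delta]
        return values

store = HybridStore(["region", "amount"])
store.insert({"region": "EU", "amount": 120.0})
print(store.scan("amount", include_delta=False))     # [] -- not yet converted
print(store.scan("amount"))                          # [120.0] -- merged with delta
```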
2.1.2 Same Data Organization for both OLTP and OLAP
H2TAP [2] is an academic project that aims to build an HTAP system focusing mainly on the hardware utilization of a single node when running on heterogeneous hardware. It falls under this category since the system is designed as a row-store.
Among the SQL-on-Hadoop systems, there have also been HTAP solutions that extend existing OLAP systems with the ability to update data. Hive, since version 0.13, has introduced transaction support (insert, update, and delete) at the row level [18] for ORCFile, its columnar data format. However, the primary use cases are updating dimension tables and streaming data ingest. The integration of Impala [20] with the storage manager Kudu [21] also allows the SQL-on-Hadoop engine to handle updates and deletes; the same Kudu storage is used for running analytical queries as well.
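In the spirit of Hive's row-level ACID support, the sketch below shows a merge-on-read over an immutable base and a log of delta records keyed by row id. The record format is a simplification invented here; Hive's actual ORC delta files are considerably more involved.

```python
base = {1: ("EU", 120.0), 2: ("US", 80.0)}           # row_id -> row, immutable

deltas = [                                           # row-level ops, commit order
    ("insert", 3, ("EU", 45.5)),
    ("update", 2, ("US", 90.0)),
    ("delete", 1, None),
]

def snapshot(base, deltas):
    """Reconstruct the current table state at read time."""
    rows = dict(base)
    for op, row_id, row in deltas:
        if op == "delete":
            rows.pop(row_id, None)
        else:                                        # insert or update
            rows[row_id] = row
    return rows

print(snapshot(base, deltas))   # {2: ('US', 90.0), 3: ('EU', 45.5)}
```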
Since these systems do not require conversion from one data organization to another in order to perform transactional and analytical requests, OLAP queries can read the latest committed data. However, they might face the same shortcoming as the traditional relational engines: they do not have a data organization that is optimal for both types of processing. Therefore, they may have to rely on batching of requests to achieve fast transactions, due to the overhead of processing data in a non-row-wise format, or perform suboptimally for analytics due to a non-columnar format.
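A small experiment illustrates why one layout cannot serve both sides well: summing a single attribute must step over every field of every tuple in a row layout, but touches only one dense array in a column layout. The harness below is merely illustrative; real engines see the gap mainly through cache behavior and vectorized execution.

```python
import time

N = 1_000_000
row_store = [(i, i % 7, float(i)) for i in range(N)]   # tuples of 3 fields
col_store = {"id": list(range(N)),                     # one array per column
             "grp": [i % 7 for i in range(N)],
             "amount": [float(i) for i in range(N)]}

t0 = time.perf_counter()
total_rows = sum(row[2] for row in row_store)          # skip over other fields
t1 = time.perf_counter()
total_cols = sum(col_store["amount"])                  # dense single-array scan
t2 = time.perf_counter()

assert total_rows == total_cols
print(f"row-wise scan: {t1 - t0:.3f}s")
print(f"columnar scan: {t2 - t1:.3f}s")
```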
2.2 Separate OLTP and OLAP Systems
The systems in this category can be further distinguished by the way they handle the underlying storage, i.e., whether or not they use the same storage for OLTP and OLAP.
2.2.1 Decoupling the Storage for OLTP and OLAP
Many applications loosely couple an OLTP system and an OLAP system together for HTAP, and it is up to the applications to maintain this hybrid architecture. The operational data in the OLTP system are aged into the OLAP system using a standard ETL process. In fact, this is very common in the big data world, where applications use a fast key-value store like Cassandra for transactional workloads, while the operational data are groomed into Parquet or ORC files on HDFS for a SQL-on-Hadoop system to query. As a result, there is an inherent lag before the latest operational data becomes visible to analytical queries.
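A bare-bones sketch of such a grooming pipeline follows, with in-memory lists standing in for, say, Cassandra and Parquet files on HDFS (all names are invented): rows committed after the last watermark are drained into a new immutable batch, so analytical queries lag behind until the next ETL cycle runs.

```python
import time

operational = []        # (commit_ts, row) pairs written by the OLTP application
analytic_batches = []   # immutable batches visible to the OLAP engine
last_groomed_ts = 0.0

def groom():
    """One ETL cycle: move everything newer than the last watermark."""
    global last_groomed_ts
    fresh = [row for ts, row in operational if ts > last_groomed_ts]
    if fresh:
        analytic_batches.append(fresh)   # e.g. write one new Parquet file
        last_groomed_ts = max(ts for ts, _ in operational)

operational.append((time.time(), {"user": "a", "amount": 10}))
groom()                 # until groom() runs, OLAP queries cannot see the new row
print(analytic_batches)
```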