What Goes Around Comes Around... And Around...
Michael Stonebraker, Massachusetts Institute of Technology, stonebraker@csail.mit.edu
Andrew Pavlo, Carnegie Mellon University, pavlo@cs.cmu.edu
ABSTRACT
Two decades ago, one of us co-authored a paper com-
menting on the previous 40 years of data modelling re-
search and development [188]. That paper demonstrated
that the relational model (RM) and SQL are the prevail-
ing choice for database management systems (DBMSs),
despite efforts to replace either of them. Instead, SQL
absorbed the best ideas from these alternative approaches.
We revisit this issue and argue that this same evolu-
tion has continued since 2005. Once again there have
been repeated efforts to replace either SQL or the RM.
But the RM continues to be the dominant data model
and SQL has been extended to capture the good ideas
from others. As such, we expect more of the same in
the future, namely the continued evolution of SQL and
relational DBMSs (RDBMSs). We also discuss DBMS
implementations and argue that the major advancements
have been in the RM systems, primarily driven by chang-
ing hardware characteristics.
1 Introduction
In 2005, one of the authors participated in writing a
chapter for the Red Book titled “What Goes Around
Comes Around” [188]. That paper examined the major
data modelling movements since the 1960s:
Hierarchical (e.g., IMS): late 1960s and 1970s
Network (e.g., CODASYL): 1970s
Relational: 1970s and early 1980s
Entity-Relationship: 1970s
Extended Relational: 1980s
Semantic: late 1970s and 1980s
Object-Oriented: late 1980s and early 1990s
Object-Relational: late 1980s and early 1990s
Semi-structured (e.g., XML): late 1990s and 2000s
Our conclusion was that the relational model with an
extendable type system (i.e., object-relational) has dom-
inated all comers, and nothing else has succeeded in
the marketplace. Although many of the non-relational
DBMSs covered in 2005 still exist today, their vendors
have relegated them to legacy maintenance mode and
nobody is building new applications on them. This per-
sistence is more of a testament to the “stickiness” of data
rather than the lasting power of these systems. In other
words, there still are many IBM IMS databases running
today because it is expensive and risky to switch them
to use a modern DBMS. But no start-up would willingly
choose to build a new application on IMS.
A lot has happened in the world of databases since our
2005 survey. During this time, DBMSs have expanded
from their roots in business data processing and are now
used for almost every kind of data. This led to the “Big
Data” era of the early 2010s and the current trend of inte-
grating machine learning (ML) with DBMS technology.
In this paper, we analyze the last 20 years of data
model and query language activity in databases. We
structure our commentary into the following areas: (1)
MapReduce Systems, (2) Key-value Stores, (3) Docu-
ment Databases, (4) Column Family / Wide-Column,
(5) Text Search Engines, (6) Array Databases, (7)
Vector Databases, and (8) Graph Databases.
We contend that most systems that deviated from
SQL or the RM have not dominated the DBMS land-
scape and often only serve niche markets. Many sys-
tems that started out rejecting the RM with much fanfare
(think NoSQL) now expose a SQL-like interface for RM
databases. Such systems are now on a path to conver-
gence with RDBMSs. Meanwhile, SQL incorporated
the best query language ideas to expand its support for
modern applications and remain relevant.
Although there has not been much change in RM
fundamentals, there were dramatic changes in RM sys-
tem implementations. The second part of this paper
discusses advancements in DBMS architectures that ad-
dress modern applications and hardware: (1) Columnar
Systems, (2) Cloud Databases, (3) Data Lakes / Lake-
houses, (4) NewSQL Systems, (5) Hardware Acceler-
ators, and (6) Blockchain Databases. Some of these
are profound changes to DBMS implementations, while
others are merely trends based on faulty premises.
We finish with a discussion of important considera-
tions for the next generation of DBMSs and provide part-
ing comments on our hope for the future of databases in
both research and commercial settings.
SIGMOD Record, June 2024 (Vol. 53, No. 2) 21
2 Data Models & Query Languages
For our discussion here, we group the research and de-
velopment thrusts in data models and query languages
for databases into eight categories.
2.1 MapReduce Systems
Google constructed their MapReduce (MR) framework
in 2003 as a “point solution” for processing its periodic
crawl of the internet [122]. At the time, Google had
little expertise in DBMS technology, and they built MR
to meet their crawl needs. In database terms, Map is a
user-defined function (UDF) that performs computation
and/or filtering while Reduce is a GROUP BY operation.
To a first approximation, MR runs a single query:
SELECT map() FROM crawl_table GROUP BY reduce()
Google’s MR approach did not prescribe a specific
data model or query language. Rather, it was up to the
Map and Reduce functions written in a procedural MR
program to parse and decipher the contents of data files.
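The correspondence between MR and a single GROUP BY query can be illustrated with a minimal in-memory sketch. This is not Google's or Hadoop's implementation: the record layout (URL, tab, page text), the word-count task, and the names `CRAWL_LINES`, `map_fn`, `reduce_fn`, and `run_mapreduce` are all assumptions made for illustration. It shows only the logical model, in which Map is a UDF that parses schema-less records and Reduce aggregates the values grouped under each key.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw crawl records. MR prescribes no schema, so each record
# is an opaque string that the Map function must parse itself.
CRAWL_LINES = [
    "http://a.example/1\thello world",
    "http://a.example/2\thello again",
    "http://b.example/1\tworld news",
]


def map_fn(record: str) -> Iterator[Tuple[str, int]]:
    """Map: a UDF that parses one raw record and emits (key, value) pairs."""
    _url, text = record.split("\t", 1)
    for word in text.split():
        yield word, 1


def reduce_fn(key: str, values: List[int]) -> int:
    """Reduce: an aggregate over all values that share the same key."""
    return sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Single-process stand-in for the map, group-by-key, reduce pipeline."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the GROUP BY step
    return {key: reducer(key, values) for key, values in groups.items()}


word_counts = run_mapreduce(CRAWL_LINES, map_fn, reduce_fn)
```

A real MR deployment partitions the records and the grouped keys across many machines and handles failures. The sketch leaves that out on purpose, since it is an implementation concern rather than part of the programming model.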
There was a lot of interest in MR-based systems at
other companies in the late 2000s. Yahoo! developed
an open-source version of MR in 2005, called Hadoop.
It ran on top of a distributed file system HDFS that was
a clone of the Google File System [134]. Several start-
ups were formed to support Hadoop in the commercial
marketplace. We will use MR to refer to the Google
implementation and Hadoop to refer to the open-source
version. They are functionally similar.
There was a controversy about the value of Hadoop
compared to RDBMSs designed for OLAP workloads.
This culminated in a 2009 study that showed that data
warehouse DBMSs outperformed Hadoop [172]. This
generated dueling articles from Google and the DBMS
community [123, 190]. Google argued that with care-
ful engineering, a MR system will beat DBMSs, and a
user does not have to load data with a schema before
running queries on it. Thus, MR is better for “one shot”
tasks, such as text processing and ETL operations. The
DBMS community argued that MR incurs performance
problems due to its design that existing parallel DBMSs
already solved. Furthermore, the use of higher-level
languages (SQL) operating over partitioned tables has
proven to be a good programming model [127].
A lot of the discussion in the two papers was on imple-
mentation issues (e.g., indexing, parsing, push vs. pull
query processing, failure recovery). From reading both
papers, a reasonable conclusion would be that there is a
place for both kinds of systems. However, two changes
in the technology world rendered the debate moot.
The first event was that the Hadoop technology and
services market cratered in the 2010s. Many enterprises
spent a lot of money on Hadoop clusters, only to find
there was little interest in this functionality. Developers
found it difficult to shoehorn their application into the
restricted MR/Hadoop paradigm. There were considerable
efforts to provide a SQL and RM interface on top of
Hadoop, the most notable of which was Meta’s Hive [30, 197].
The next event occurred eight months after the CACM
article when Google announced that they were moving
their crawl processing from MR to BigTable [164]. The
reason was that Google needed to interactively update
its crawl database in real time but MR was a batch sys-
tem. Google finally announced in 2014 that MR had no
place in their technology stack and killed it off [194].
The first event left the three leading Hadoop vendors
(Cloudera, Hortonworks, MapR) without a viable prod-
uct to sell. Cloudera rebranded Hadoop to mean the
whole stack (application, Hadoop, HDFS). In a further
sleight-of-hand, Cloudera built an RDBMS, Impala [150],
on top of HDFS but not using Hadoop. They realized
that Hadoop had no place as an internal interface in a
SQL DBMS, and they configured it out of their stack
with software built directly on HDFS. In a similar vein,
MapR built Drill [22] directly on HDFS, and Meta cre-
ated Presto [185] to replace Hive.
Discussion: MR’s deficiencies were so significant that
it could not be saved despite the adoption and enthu-
siasm from the developer community. Hadoop died
about a decade ago, leaving a legacy of HDFS clusters
in enterprises and a collection of companies dedicated
to making money from them. At present, HDFS has
lost its luster, as enterprises realize that there are better
distributed storage alternatives [124]. Meanwhile, dis-
tributed RDBMSs are thriving, especially in the cloud.
Some aspects of MR system implementations related
to scalability, elasticity, and fault tolerance are carried
over into distributed RDBMSs. MR also brought about
the revival of shared-disk architectures with disaggre-
gated storage, subsequently giving rise to open-source
file formats and data lakes (see Sec. 3.3). Hadoop’s lim-
itations opened the door for other data processing plat-
forms, namely Spark [201] and Flink [109]. Both sys-
tems started as better implementations of MR with pro-
cedural APIs but have since added support for SQL [105].
2.2 Key/Value Stores
The key/value (KV) data model is the simplest model
possible. It represents the following binary relation:
(key,value)
A KV DBMS represents a collection of data as an as-
sociative array that maps a key to a value. The value is
typically an untyped array of bytes (i.e., a blob), and the
DBMS is unaware of its contents. It is up to the appli-
cation to maintain the schema and parse the value into
its corresponding parts. Most KV DBMSs only provide
get/set/delete operations on a single value.
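The division of labor described above can be sketched as follows. This is a toy, single-process illustration and not the API of any particular KV DBMS: the `KVStore` class, its method names, and the JSON encoding of the value are all assumptions. The store sees only opaque bytes, while the application owns the schema and does the parsing.

```python
import json
from typing import Dict, Optional


class KVStore:
    """Toy KV store: maps a byte-string key to an opaque byte-string value."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def set(self, key: bytes, value: bytes) -> None:
        # The store never inspects the value; it is just a blob.
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key: bytes) -> bool:
        # Returns True if the key existed and was removed.
        return self._data.pop(key, None) is not None


store = KVStore()

# The application, not the store, decides that this value is a JSON
# document with "name" and "visits" fields (a hypothetical schema).
user = {"name": "alice", "visits": 3}
store.set(b"user:1", json.dumps(user).encode("utf-8"))

# Reading the value back requires the application to decode and parse it.
raw = store.get(b"user:1")
decoded = json.loads(raw.decode("utf-8")) if raw is not None else None

deleted = store.delete(b"user:1")
missing = store.get(b"user:1")
```

Because the store cannot see inside the blob, anything beyond single-key access, such as filtering on `visits`, would have to be written in application code.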
In the 2000s, several new Internet companies built
their own shared-nothing, distributed KV stores for nar-