2 Data Models & Query Languages
For our discussion here, we group the research and development thrusts in data models and query languages for databases into eight categories.
2.1 MapReduce Systems
Google constructed their MapReduce (MR) framework
in 2003 as a “point solution” for processing its periodic
crawl of the internet [122]. At the time, Google had
little expertise in DBMS technology, and they built MR
to meet their crawl needs. In database terms, Map is a
user-defined function (UDF) that performs computation
and/or filtering while Reduce is a GROUP BY operation.
To a first approximation, MR runs a single query:

SELECT map() FROM crawl_table GROUP BY reduce()
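To make this analogy concrete, the following sketch (in Python, rather than Google's C++ API) shows how a word-count job maps onto these two UDFs; the record format, function names, and the in-memory shuffle are illustrative assumptions, not part of the original framework.

from collections import defaultdict

# Hypothetical input standing in for the crawl table: (doc_id, text) records.
crawl_table = [
    (1, "the quick brown fox"),
    (2, "the lazy dog jumps over the fox"),
]

def map_fn(doc_id, text):
    """Map: a user-defined function that transforms/filters each input record,
    emitting intermediate (key, value) pairs."""
    for word in text.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: aggregates all values that share a key (the GROUP BY step)."""
    return (key, sum(values))

# The framework's shuffle phase, simulated in memory: group intermediate
# pairs by key before handing each group to Reduce.
groups = defaultdict(list)
for doc_id, text in crawl_table:
    for key, value in map_fn(doc_id, text):
        groups[key].append(value)

results = [reduce_fn(key, values) for key, values in groups.items()]
print(results)  # e.g., [('the', 3), ('quick', 1), ...]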
Google’s MR approach did not prescribe a specific
data model or query language. Rather, it was up to the
Map and Reduce functions written in a procedural MR
program to parse and decipher the contents of data files.
There was a lot of interest in MR-based systems at
other companies in the late 2000s. Yahoo! developed
an open-source version of MR in 2005, called Hadoop.
It ran on top of a distributed file system HDFS that was
a clone of the Google File System [134]. Several start-
ups were formed to support Hadoop in the commercial
marketplace. We will use MR to refer to the Google
implementation and Hadoop to refer to the open-source
version. They are functionally similar.
There was a controversy about the value of Hadoop
compared to RDBMSs designed for OLAP workloads.
This culminated in a 2009 study that showed that data
warehouse DBMSs outperformed Hadoop [172]. This
generated dueling articles from Google and the DBMS
community [123, 190]. Google argued that, with careful engineering, an MR system can beat DBMSs and that a user does not have to load data with a schema before running queries on it. Thus, MR is better for "one-shot"
tasks, such as text processing and ETL operations. The
DBMS community argued that MR's design incurs performance problems that existing parallel DBMSs had already solved. Furthermore, the use of higher-level
languages (SQL) operating over partitioned tables has
proven to be a good programming model [127].
A lot of the discussion in the two papers was on imple-
mentation issues (e.g., indexing, parsing, push vs. pull
query processing, failure recovery). From reading both papers, a reasonable conclusion would be that there is a
place for both kinds of systems. However, two changes
in the technology world rendered the debate moot.
The first event was that the Hadoop technology and
services market cratered in the 2010s. Many enterprises
spent a lot of money on Hadoop clusters, only to find
there was little interest in this functionality. Developers
found it difficult to shoehorn their applications into the restricted MR/Hadoop paradigm. There were considerable efforts to provide a SQL and RM interface on top of Hadoop, the most notable being Meta's Hive [30, 197].
The second event occurred eight months after the CACM article when Google announced that they were moving
their crawl processing from MR to BigTable [164]. The
reason was that Google needed to interactively update
its crawl database in real time but MR was a batch sys-
tem. Google finally announced in 2014 that MR had no
place in their technology stack and killed it off [194].
The first event left the three leading Hadoop vendors
(Cloudera, Hortonworks, MapR) without a viable prod-
uct to sell. Cloudera rebranded Hadoop to mean the
whole stack (application, Hadoop, HDFS). In a further
sleight-of-hand, Cloudera built an RDBMS, Impala [150],
on top of HDFS but not using Hadoop. They realized
that Hadoop had no place as an internal interface in a
SQL DBMS, and they configured it out of their stack
with software built directly on HDFS. In a similar vein,
MapR built Drill [22] directly on HDFS, and Meta cre-
ated Presto [185] to replace Hive.
Discussion: MR’s deficiencies were so significant that
it could not be saved despite the adoption and enthu-
siasm from the developer community. Hadoop died
about a decade ago, leaving a legacy of HDFS clusters
in enterprises and a collection of companies dedicated
to making money from them. At present, HDFS has
lost its luster, as enterprises realize that there are better
distributed storage alternatives [124]. Meanwhile, dis-
tributed RDBMSs are thriving, especially in the cloud.
Some aspects of MR system implementations related to scalability, elasticity, and fault tolerance have been carried over into distributed RDBMSs. MR also brought about
the revival of shared-disk architectures with disaggre-
gated storage, subsequently giving rise to open-source
file formats and data lakes (see Sec. 3.3). Hadoop’s lim-
itations opened the door for other data processing plat-
forms, namely Spark [201] and Flink [109]. Both sys-
tems started as better implementations of MR with pro-
cedural APIs but have since added support for SQL [105].
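As a rough illustration of that evolution, the sketch below (assuming a local PySpark installation; the dataset and names are ours) expresses the same word count first through Spark's procedural RDD API and then through its later SQL layer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = ["the quick brown fox", "the lazy dog"]

# Procedural API: MR-style map and reduce over a distributed collection.
counts_rdd = (spark.sparkContext.parallelize(lines)
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# Declarative API: the same aggregation via Spark SQL over a registered view.
spark.createDataFrame([(w,) for line in lines for w in line.split()], ["word"]) \
     .createOrReplaceTempView("words")
counts_sql = spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")

print(counts_rdd.collect())
counts_sql.show()
spark.stop()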
2.2 Key/Value Stores
The key/value (KV) data model is the simplest model
possible. It represents the following binary relation:
(key, value)
A KV DBMS represents a collection of data as an as-
sociative array that maps a key to a value. The value is
typically an untyped array of bytes (i.e., a blob), and the
DBMS is unaware of its contents. It is up to the appli-
cation to maintain the schema and parse the value into
its corresponding parts. Most KV DBMSs only provide
get/set/delete operations on a single value.
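To illustrate how little the DBMS knows about the stored data, the following sketch implements this three-operation interface over an in-memory dictionary; the class and method names are our own, and a real KV store would add persistence, partitioning, and replication.

import json

class KVStore:
    """A minimal single-node key/value store. The value is an opaque byte
    string; the store never inspects or validates its contents."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

# The application, not the DBMS, owns the schema: it serializes a record
# into a blob on the way in and parses it back out on the way out.
store = KVStore()
store.set("user:42", json.dumps({"name": "Ada", "logins": 7}).encode())
record = json.loads(store.get("user:42"))
print(record["logins"])  # 7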
In the 2000s, several new Internet companies built
their own shared-nothing, distributed KV stores for nar-