Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
{fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruber}@google.com
Google, Inc.
Abstract
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
1 Introduction
Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data.
In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Section 2 describes the data model in more detail, and Section 3 provides an overview of the client API. Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends. Section 5 describes the fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements that we made to improve Bigtable’s performance. Section 7 provides measurements of Bigtable’s performance. We describe several examples of how Bigtable is used at Google in Section 8, and discuss some lessons we learned in designing and supporting Bigtable in Section 9. Finally, Section 10 describes related work, and Section 11 presents our conclusions.
2 Data Model
A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) → string
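For concreteness, the mapping above can be sketched in a few lines of code. The following C++ fragment is a toy in-memory model of ours (it reflects nothing of Bigtable’s actual implementation, and all names are hypothetical); it shows the three-part key, lexicographic ordering by row key, and the cells of the Webtable row from Figure 1:

#include <cstdint>
#include <map>
#include <string>

// Toy model: a sorted map from (row, column, timestamp)
// to an uninterpreted string of bytes.
struct CellKey {
  std::string row;
  std::string column;  // "family:qualifier"
  int64_t timestamp;
  bool operator<(const CellKey& o) const {
    if (row != o.row) return row < o.row;              // lexicographic by row key
    if (column != o.column) return column < o.column;
    return timestamp > o.timestamp;                    // newest version first
  }
};

using Table = std::map<CellKey, std::string>;

int main() {
  Table webtable;
  // The row from Figure 1: reversed-URL row key, one version per
  // anchor cell, three versions of the contents cell.
  webtable[{"com.cnn.www", "anchor:cnnsi.com", 9}] = "CNN";
  webtable[{"com.cnn.www", "anchor:my.look.ca", 8}] = "CNN.com";
  webtable[{"com.cnn.www", "contents:", 6}] = "<html>...";
  webtable[{"com.cnn.www", "contents:", 5}] = "<html>...";
  webtable[{"com.cnn.www", "contents:", 3}] = "<html>...";
}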
"CNN.com"
"CNN"
"<html>..."
"<html>..."
"<html>..."
t
9
t
6
t
3
t
5
8
t
"anchor:cnnsi.com"
"com.cnn.www"
"anchor:my.look.ca""contents:"
Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family con-
tains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home page
is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com
and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t
3
, t
5
, and t
6
.
We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1.
Rows
The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system’s behavior in the presence of concurrent updates to the same row.
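The guarantee can be illustrated with a toy of ours (this is not how Bigtable actually implements it): a batch of column writes is applied to one row inside a single critical section, so a concurrent reader observes either all of the batch or none of it.

#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Toy sketch of per-row atomicity: each mutation batch for a row is
// applied under one lock, so readers never see a partial batch.
class Row {
 public:
  void ApplyAtomically(
      const std::vector<std::pair<std::string, std::string>>& writes) {
    std::lock_guard<std::mutex> lock(mu_);
    for (const auto& w : writes) cells_[w.first] = w.second;
  }
  std::map<std::string, std::string> Snapshot() const {
    std::lock_guard<std::mutex> lock(mu_);  // all-or-nothing view
    return cells_;
  }
 private:
  mutable std::mutex mu_;
  std::map<std::string, std::string> cells_;  // column key -> value
};

int main() {
  Row row;
  row.ApplyAtomically({{"anchor:cnnsi.com", "CNN"},
                       {"contents:", "<html>..."}});
}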
Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For instance, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
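A small helper makes the row-key scheme concrete. This sketch is our illustration (the function name is hypothetical): it reverses the hostname components of a URL, so maps.google.com/index.html becomes com.google.maps/index.html.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Reverse the hostname components of a URL so that pages from the
// same domain sort into contiguous rows.
std::string ReverseHostname(const std::string& url) {
  const size_t slash = url.find('/');
  const std::string host = url.substr(0, slash);
  const std::string path =
      (slash == std::string::npos) ? "" : url.substr(slash);

  // Split the hostname on '.' and reassemble in reverse order.
  std::vector<std::string> parts;
  std::istringstream in(host);
  for (std::string part; std::getline(in, part, '.');) parts.push_back(part);

  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;
  }
  return key + path;
}

int main() {
  std::cout << ReverseHostname("maps.google.com/index.html") << "\n";
  // Prints: com.google.maps/index.html
}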
Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.
A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page’s language ID. Another useful column family for this table is anchor; each column key in this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the cell contents is the link text.
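Since a column key is just a string of the form family:qualifier, splitting it back into its parts is a one-liner. The sketch below is our illustration (hypothetical name); it treats the first colon as the end of the family name, on the assumption that printable family names never contain ':' while qualifiers, being arbitrary strings, may.

#include <cassert>
#include <string>
#include <utility>

// Split "family:qualifier" at the first ':' (assumes family names
// never contain ':'; qualifiers may).
std::pair<std::string, std::string> SplitColumnKey(const std::string& key) {
  const size_t colon = key.find(':');
  assert(colon != std::string::npos);
  return {key.substr(0, colon), key.substr(colon + 1)};
}

int main() {
  const auto anchor = SplitColumnKey("anchor:cnnsi.com");
  assert(anchor.first == "anchor");
  assert(anchor.second == "cnnsi.com");
}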
Access control and both disk and memory accounting are performed at the column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even to view all of the existing families for privacy reasons).
Timestamps
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent “real time” in microseconds, or be explicitly assigned by client applications.
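For scale: a client that wants its explicitly assigned timestamps to line up with Bigtable-assigned ones would generate 64-bit microsecond values, as in this sketch of ours (we assume microseconds measured since the Unix epoch):

#include <chrono>
#include <cstdint>
#include <iostream>

// A 64-bit "real time" timestamp in microseconds (assumption:
// measured since the Unix epoch).
int64_t NowMicros() {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             std::chrono::system_clock::now().time_since_epoch())
      .count();
}

int main() { std::cout << NowMicros() << "\n"; }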