The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google
ABSTRACT
We have designed and implemented the Google File Sys-
tem, a scalable distributed file system for large distributed
data-intensive applications. It provides fault tolerance while
running on inexpensive commodity hardware, and it delivers
high aggregate performance to a large number of clients.
While sharing many of the same goals as previous dis-
tributed file systems, our design has been driven by obser-
vations of our application workloads and technological envi-
ronment, both current and anticipated, that reflect a marked
departure from some earlier file system assumptions. This
has led us to reexamine traditional choices and explore rad-
ically different design points.
The file system has successfully met our storage needs.
It is widely deployed within Google as the storage platform
for the generation and processing of data used by our ser-
vice as well as research and development efforts that require
large data sets. The largest cluster to date provides hun-
dreds of terabytes of storage across thousands of disks on
over a thousand machines, and it is concurrently accessed
by hundreds of clients.
In this paper, we present file system interface extensions
designed to support distributed applications, discuss many
aspects of our design, and report measurements from both
micro-benchmarks and real world use.
Categories and Subject Descriptors
D[4]: 3—Distributed file systems
General Terms
Design, reliability, performance, measurement
Keywords
Fault tolerance, scalability, data storage, clustered storage
The authors can be reached at the following addresses:
{sanjay,hgobioff,shuntak}@google.com.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00.
1. INTRODUCTION
We have designed and implemented the Google File Sys-
tem (GFS) to meet the rapidly growing demands of Google’s
data processing needs. GFS shares many of the same goals
as previous distributed file systems such as performance,
scalability, reliability, and availability. However, its design
has been driven by key observations of our application work-
loads and technological environment, both current and an-
ticipated, that reflect a marked departure from some earlier
file system design assumptions. We have reexamined tradi-
tional choices and explored radically different points in the
design space.
First, component failures are the norm rather than the
exception. The file system consists of hundreds or even
thousands of storage machines built from inexpensive com-
modity parts and is accessed by a comparable number of
client machines. The quantity and quality of the compo-
nents virtually guarantee that some are not functional at
any given time and some will not recover from their cur-
rent failures. We have seen problems caused by application
bugs, operating system bugs, human errors, and the failures
of disks, memory, connectors, networking, and power sup-
plies. Therefore, constant monitoring, error detection, fault
tolerance, and automatic recovery must be integral to the
system.
Second, files are huge by traditional standards. Multi-GB
files are common. Each file typically contains many applica-
tion objects such as web documents. When we are regularly
working with fast growing data sets of many TBs comprising
billions of objects, it is unwieldy to manage billions of ap-
proximately KB-sized files even when the file system could
support it. As a result, design assumptions and parameters
such as I/O operation and block sizes have to be revisited.
Third, most files are mutated by appending new data
rather than overwriting existing data. Random writes within
a file are practically non-existent. Once written, the files
are only read, and often only sequentially. A variety of
data share these characteristics. Some may constitute large
repositories that data analysis programs scan through. Some
may be data streams continuously generated by running ap-
plications. Some may be archival data. Some may be in-
termediate results produced on one machine and processed
on another, whether simultaneously or later in time. Given
this access pattern on huge files, appending becomes the fo-
cus of performance optimization and atomicity guarantees,
while caching data blocks in the client loses its appeal.
Fourth, co-designing the applications and the file system
API benefits the overall system by increasing our flexibility.
For example, we have relaxed GFS’s consistency model to
vastly simplify the file system without imposing an onerous
burden on the applications. We have also introduced an
atomic append operation so that multiple clients can append
concurrently to a file without extra synchronization between
them. These will be discussed in more detail later in the paper.
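To make this concrete, here is a rough Go sketch of what the atomic append buys applications: many producers append records to one shared file with no client-side coordination. The appendFn signature and the in-memory stand-in are assumptions for illustration only, not GFS's actual API; the point is that any serialization needed to order concurrent appends lives in the system, not in the producers.

package main

import (
    "fmt"
    "sync"
)

// appendFn stands in for a GFS-style atomic record append: the system
// serializes concurrent appends itself and returns the offset it chose.
type appendFn func(path string, record []byte) (offset int64, err error)

// produce starts several producers that append records independently.
// Note there is no mutex or channel coordinating the producers themselves.
func produce(appendRecord appendFn, path string, producers, recordsEach int) {
    var wg sync.WaitGroup
    for p := 0; p < producers; p++ {
        wg.Add(1)
        go func(p int) {
            defer wg.Done()
            for i := 0; i < recordsEach; i++ {
                record := []byte(fmt.Sprintf("producer-%d record-%d", p, i))
                if _, err := appendRecord(path, record); err != nil {
                    fmt.Println("append failed:", err)
                    return
                }
            }
        }(p)
    }
    wg.Wait()
}

func main() {
    // In-memory stand-in for the file system; its lock plays the role of the
    // server-side serialization of appends, which the producers never see.
    var mu sync.Mutex
    var records [][]byte
    fake := func(path string, record []byte) (int64, error) {
        mu.Lock()
        defer mu.Unlock()
        records = append(records, record)
        return int64(len(records) - 1), nil
    }
    produce(fake, "/queues/q1", 4, 3)
    fmt.Println("records appended:", len(records))
}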
Multiple GFS clusters are currently deployed for different
purposes. The largest ones have over 1000 storage nodes,
over 300 TB of disk storage, and are heavily accessed by
hundreds of clients on distinct machines on a continuous
basis.
2. DESIGN OVERVIEW
2.1 Assumptions
In designing a file system for our needs, we have been
guided by assumptions that offer both challenges and op-
portunities. We alluded to some key observations earlier
and now lay out our assumptions in more detail.
• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth, as illustrated in the sketch after this list.
• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
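The small-read batching mentioned in the third assumption can be sketched as follows. The ReadRequest type and the use of Go's io.ReaderAt are illustrative assumptions rather than anything GFS provides; the sketch only shows that sorting a batch of reads by offset lets the client advance through the file in one direction.

package main

import (
    "fmt"
    "io"
    "sort"
    "strings"
)

// ReadRequest is a hypothetical small random read: a byte range in one file.
type ReadRequest struct {
    Offset int64
    Length int
}

// batchReads sorts the requests by offset and issues them in order, so the
// reader moves steadily forward instead of seeking back and forth.
func batchReads(r io.ReaderAt, reqs []ReadRequest) ([][]byte, error) {
    sort.Slice(reqs, func(i, j int) bool { return reqs[i].Offset < reqs[j].Offset })
    results := make([][]byte, 0, len(reqs))
    for _, req := range reqs {
        buf := make([]byte, req.Length)
        if _, err := r.ReadAt(buf, req.Offset); err != nil && err != io.EOF {
            return nil, err
        }
        results = append(results, buf)
    }
    return results, nil
}

func main() {
    file := strings.NewReader("abcdefghijklmnopqrstuvwxyz")
    out, err := batchReads(file, []ReadRequest{{Offset: 20, Length: 3}, {Offset: 2, Length: 4}, {Offset: 10, Length: 2}})
    if err != nil {
        panic(err)
    }
    for _, chunk := range out {
        fmt.Printf("%s ", chunk)
    }
    fmt.Println()
}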
2.2 Interface
GFS provides a familiar file system interface, though it
does not implement a standard API such as POSIX. Files are
organized hierarchically in directories and identified by path-
names. We support the usual operations to create, delete,
open, close, read, and write files.
Moreover, GFS has snapshot and record append opera-
tions. Snapshot creates a copy of a file or a directory tree
at low cost. Record append allows multiple clients to ap-
pend data to the same file concurrently while guaranteeing
the atomicity of each individual client’s append. It is use-
ful for implementing multi-way merge results and producer-
consumer queues that many clients can simultaneously ap-
pend to without additional locking. We have found these
types of files to be invaluable in building large distributed
applications. Snapshot and record append are discussed fur-
ther in Sections 3.4 and 3.3 respectively.
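To make the interface concrete, the sketch below models these operations as Go interfaces. All names and signatures are hypothetical illustrations rather than GFS's actual client API; record append is modeled as returning the offset at which the record landed, since the client does not pick it, which is what lets many clients append concurrently with no locking of their own.

package gfsclient

// FileSystem is a hypothetical client-side view of the namespace operations
// described in Section 2.2; names and signatures are illustrative only.
type FileSystem interface {
    Create(path string) error
    Delete(path string) error
    Open(path string) (File, error)
    // Snapshot copies a file or directory tree at low cost.
    Snapshot(srcPath, dstPath string) error
}

// File is a handle to an open file.
type File interface {
    // Read fills buf starting at the given byte offset and reports bytes read.
    Read(buf []byte, offset int64) (int, error)
    // Write places data at the given byte offset; such random writes are rare.
    Write(data []byte, offset int64) (int, error)
    // RecordAppend atomically appends one record at an offset the system
    // chooses and returns that offset to the caller.
    RecordAppend(record []byte) (offset int64, err error)
    Close() error
}

Under this model, a producer-consumer queue or a many-way merge reduces to many clients calling RecordAppend on the same file while consumers read through it.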
2.3 Architecture
A GFS cluster consists of a single master and multiple
chunkservers and is accessed by multiple clients, as shown
in Figure 1. Each of these is typically a commodity Linux
machine running a user-level server process. It is easy to run
both a chunkserver and a client on the same machine, as long
as machine resources permit and the lower reliability caused
by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is
identified by an immutable and globally unique 64 bit chunk
handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and
read or write chunk data specified by a chunk handle and
byte range. For reliability, each chunk is replicated on multi-
ple chunkservers. By default, we store three replicas, though
users can designate different replication levels for different
regions of the file namespace.
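A minimal sketch of this chunk-level data model, with hypothetical type and field names: a chunk is named by an immutable 64-bit handle, and the cluster tracks which chunkservers currently hold each replica, three per chunk by default.

package gfsmeta

// ChunkHandle is the immutable, globally unique 64-bit chunk identifier
// assigned by the master when the chunk is created.
type ChunkHandle uint64

// DefaultReplication is the default number of replicas kept per chunk;
// users may request a different level per region of the namespace.
const DefaultReplication = 3

// ChunkInfo records where the replicas of one chunk currently live.
// The field names are illustrative, not taken from GFS itself.
type ChunkInfo struct {
    Handle   ChunkHandle
    Replicas []string // network addresses of chunkservers holding a copy
}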
The master maintains all file system metadata. This in-
cludes the namespace, access control information, the map-
ping from files to chunks, and the current locations of chunks.
It also controls system-wide activities such as chunk lease
management, garbage collection of orphaned chunks, and
chunk migration between chunkservers. The master peri-
odically communicates with each chunkserver in HeartBeat
messages to give it instructions and collect its state.
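The master state and the HeartBeat exchange described here might be sketched as follows; the structs and message fields are assumptions made for illustration, not the actual implementation.

package gfsmaster

// masterState is a hypothetical sketch of the metadata the master keeps:
// the namespace with access control, the file-to-chunk mapping, and the
// chunkserver-reported locations of each chunk's replicas.
type masterState struct {
    namespace map[string]fileMeta // pathname -> per-file metadata
    chunks    map[uint64][]string // chunk handle -> replica locations
}

type fileMeta struct {
    acl    []string // stand-in for access control information
    chunks []uint64 // chunk handles, in file order
}

// heartbeatRequest sketches what the master sends a chunkserver: instructions
// such as chunks to delete (garbage collection) or to migrate elsewhere.
type heartbeatRequest struct {
    chunksToDelete  []uint64
    chunksToMigrate []uint64
}

// heartbeatReply sketches what the chunkserver reports back: the chunks it
// holds and a rough picture of its state.
type heartbeatReply struct {
    heldChunks []uint64
    diskFreeMB int64
}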
GFS client code linked into each application implements
the file system API and communicates with the master and
chunkservers to read or write data on behalf of the applica-
tion. Clients interact with the master for metadata opera-
tions, but all data-bearing communication goes directly to
the chunkservers. We do not provide the POSIX API and
therefore need not hook into the Linux vnode layer.
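The division of labor in that read path (metadata from the master, data directly from a chunkserver) could be sketched as below, assuming the client maps a byte offset to a chunk index using the fixed chunk size; askMaster and readFromChunkserver are hypothetical stand-ins for the real RPCs, and the chunk size constant is only illustrative here.

package gfsread

import "fmt"

// chunkSize is an illustrative fixed chunk size; the value GFS actually uses
// is discussed later in the paper.
const chunkSize = 64 << 20

// readAt shows control flow only: translate (file, offset) into a chunk index,
// ask the master where that chunk lives, then fetch the byte range directly
// from one of the chunkservers. Data never flows through the master.
func readAt(path string, offset int64, length int) ([]byte, error) {
    chunkIndex := offset / chunkSize
    handle, replicas, err := askMaster(path, chunkIndex) // metadata only
    if err != nil {
        return nil, err
    }
    if len(replicas) == 0 {
        return nil, fmt.Errorf("no replicas known for chunk %d of %s", handle, path)
    }
    return readFromChunkserver(replicas[0], handle, offset%chunkSize, length)
}

// The two helpers below are placeholders for the actual RPCs.
func askMaster(path string, chunkIndex int64) (handle uint64, replicas []string, err error) {
    return 0, nil, fmt.Errorf("askMaster: not implemented in this sketch")
}

func readFromChunkserver(addr string, handle uint64, offsetInChunk int64, length int) ([]byte, error) {
    return nil, fmt.Errorf("readFromChunkserver: not implemented in this sketch")
}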
Neither the client nor the chunkserver caches file data.
Client caches offer little benefit because most applications
stream through huge files or have working sets too large
to be cached. Not having them simplifies the client and
the overall system by eliminating cache coherence issues.
(Clients do cache metadata, however.) Chunkservers need
not cache file data because chunks are stored as local files
and so Linux’s buffer cache already keeps frequently accessed
data in memory.
2.4 Single Master
Having a single master vastly simplifies our design and
enables the master to make sophisticated chunk placement and replication decisions using global knowledge.