visualization systems presenting dashboards for human con-
sumption, or from human operators wishing to diagnose an
observed problem.
State transitions. We wish to identify issues that emerge
from a new software release, an unexpected side effect of a
configuration change, a network cut and other issues that re-
sult in a significant state transition. Thus, we wish for our
TSDB to support fine-grained aggregations over short time
windows. The ability to display state transitions within tens
of seconds is particularly prized, as it allows automation to
quickly remediate problems before they become widespread.
High availability. Even if a network partition or other
failure leads to disconnection between different datacenters,
systems operating within any given datacenter ought to be
able to write data to local TSDB machines and be able to
retrieve this data on demand.
Fault tolerance. We wish to replicate all writes to multi-
ple regions so we can survive the loss of any given datacenter
or geographic region due to a disaster.
Gorilla is Facebook’s new TSDB that satisfies these con-
straints. Gorilla functions as a write-through cache of the
most recent data entering the monitoring system. We aim
to ensure that most queries run within tens of milliseconds.
The insight in Gorilla’s design is that users of monitor-
ing systems do not place much emphasis on individual data
points but rather on aggregate analysis. Additionally, these
systems do not store any user data, so traditional ACID guar-
antees are not a core requirement for TSDBs. However, a
high percentage of writes must succeed at all times, even
in the face of disasters that might render entire datacenters
unreachable. Additionally, recent data points are of higher
value than older points given the intuition that knowing if
a particular system or service is broken right now is more
valuable to an operations engineer than knowing if it was
broken an hour ago. Gorilla optimizes for remaining highly
available for writes and reads, even in the face of failures, at
the expense of possibly dropping small amounts of data on
the write path.
The challenge then arises from high data insertion rate,
total data quantity, real-time aggregation, and reliability re-
quirements. We addressed each of these in turn. To address
the first two requirements, we analyzed the Operational
Data Store (ODS) TSDB, an older monitoring system that
was widely used at Facebook. We noticed that at least 85%
of all queries to ODS were for data collected in the past 26
hours. Further analysis allowed us to determine that we
might be able to serve our users best if we could replace a
disk-based database with an in-memory database. Further,
by treating this in-memory database as a cache of the persis-
tent disk-based store, we could achieve the insertion speed
of an in-memory system with the persistence of a disk-based
database.
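As a rough illustration of this arrangement, the sketch below shows a write path that lands in both stores and a read path that prefers memory for recent data. The in_memory_store and hbase_store objects and their append/scan methods are hypothetical stand-ins, not the actual ODS or Gorilla interfaces.

```python
import time

RECENT_WINDOW_SECS = 26 * 60 * 60  # roughly the window serving most reads (assumption)

class WriteThroughTSDB:
    """Illustrative write-through cache: every write lands in both stores;
    recent reads are served from memory, older reads fall back to disk."""

    def __init__(self, in_memory_store, hbase_store):
        self.mem = in_memory_store   # fast, bounded to the recent window
        self.disk = hbase_store      # persistent, holds the full history

    def write(self, series_key, timestamp, value):
        self.mem.append(series_key, timestamp, value)
        self.disk.append(series_key, timestamp, value)

    def read(self, series_key, start_ts, end_ts):
        if start_ts >= time.time() - RECENT_WINDOW_SECS:
            return self.mem.scan(series_key, start_ts, end_ts)
        return self.disk.scan(series_key, start_ts, end_ts)
```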
As of Spring 2015, Facebook’s monitoring systems gener-
ate more than 2 billion unique time series of counters, with
about 12 million data points added per second. This repre-
sents over 1 trillion points per day. At 16 bytes per point,
the resulting 16TB of RAM would be too resource intensive
for practical deployment. We addressed this by repurposing
an existing XOR-based floating point compression scheme to
work in a streaming manner that allows us to compress time
series to an average of 1.37 bytes per point, a 12x reduction
in size.
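The intuition behind the XOR approach is that consecutive values in a time series rarely change much, so XORing the bit patterns of adjacent doubles yields mostly zero bits that a variable-length encoding can elide. The snippet below is a simplified sketch of that intuition only, with a hypothetical xor_deltas helper; the real streaming encoder is more involved than this.

```python
import struct

def float_to_bits(value):
    """Reinterpret a double as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack('>Q', struct.pack('>d', value))[0]

def xor_deltas(values):
    """Yield the bit pattern of the first value, then the XOR of each
    subsequent value's bits with its predecessor's. Identical values
    XOR to zero, and small changes leave long runs of zero bits that a
    variable-length encoding can store compactly."""
    prev = float_to_bits(values[0])
    yield prev
    for v in values[1:]:
        cur = float_to_bits(v)
        yield cur ^ prev
        prev = cur

# A slowly changing counter: most deltas are zero or nearly zero.
for delta in xor_deltas([63.5, 63.5, 63.5, 63.6, 63.6]):
    print(f'{delta:064b}')
```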
We addressed the reliability requirements by running mul-
tiple instances of Gorilla in different datacenter regions and
streaming data to each without attempting to guarantee
consistency. Read queries are directed at the closest avail-
able Gorilla instance. Note that this design leverages our
observation that individual data points can be lost without
compromising data aggregation unless there is a significant
discrepancy between the Gorilla instances.
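A hedged sketch of that routing policy follows; the region objects, their write/scan/is_healthy methods, and the failure handling shown are illustrative assumptions rather than Gorilla's actual interfaces.

```python
def write_all_regions(regions, series_key, timestamp, value):
    """Fan a write out to every Gorilla region without coordinating the
    outcomes; a failed write is simply dropped rather than retried into
    consistency, so any single instance may be missing some points."""
    for region in regions:
        try:
            region.write(series_key, timestamp, value)
        except ConnectionError:
            pass  # tolerate the loss; aggregate queries remain useful

def read_nearest(regions_by_distance, series_key, start_ts, end_ts):
    """Serve a read from the closest region that is currently reachable."""
    for region in regions_by_distance:
        if region.is_healthy():
            return region.scan(series_key, start_ts, end_ts)
    raise RuntimeError('no Gorilla region reachable')
```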
Gorilla is currently running in production at Facebook
and is used daily by engineers for real-time firefighting and
debugging in conjunction with other monitoring and analy-
sis systems like Hive [27] and Scuba [3] to detect and diag-
nose problems.
2. BACKGROUND & REQUIREMENTS
2.1 Operational Data Store (ODS)
Operating and managing Facebook's large infrastructure,
which comprises hundreds of systems distributed across mul-
tiple data centers, would be very difficult without a moni-
toring system that can track their health and performance.
The Operational Data Store (ODS) is an important portion
of the monitoring system at Facebook. ODS comprises
a time series database (TSDB), a query service, and a de-
tection and alerting system. ODS’s TSDB is built atop the
HBase storage system as described in [26]. Figure 1 repre-
sents a high-level view of how ODS is organized. Time series
data from services running on Facebook hosts is collected by
the ODS write service and written to HBase.
There are two consumers of ODS time series data. The
first consumers are engineers who rely on a charting system
that generates graphs and other visual representations of
time series data from ODS for interactive analysis. The
second consumer is our automated alerting system, which
reads counters from ODS, compares them to preset thresholds
for health, performance, and diagnostic metrics, and fires
alarms to oncall engineers and automated remediation systems.
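The alerting path can be pictured with the small sketch below; the ods_client, rule fields, and notify callback are hypothetical names, not the real ODS interfaces.

```python
def check_thresholds(ods_client, alert_rules, notify):
    """Compare the latest counter values against preset thresholds and
    fire an alarm for every rule that is violated."""
    for rule in alert_rules:
        latest = ods_client.latest_value(rule.series_key)
        if latest is not None and latest > rule.threshold:
            notify(oncall=rule.oncall_team,
                   message=f'{rule.series_key}={latest} exceeds {rule.threshold}')
```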
2.1.1 Monitoring system read performance issues
In early 2013, Facebook’s monitoring team realized that
its HBase time series storage system could not scale to handle
future read loads. While the average read latency was ac-
ceptable for interactive charts, the 90th percentile query
time had increased to multiple seconds, blocking our au-
tomation. Additionally, users were self-censoring their us-
age as interactive analysis of even medium-sized queries of
a few thousand time series took tens of seconds to execute.
Larger queries executing over sparse datasets would time
out, as the HBase data store was tuned to prioritize writes.
While our HBase-based TSDB was inefficient, we quickly re-
jected wholesale replacement of the storage system as ODS’s
HBase store held about 2 PB of data [5]. Facebook’s data
warehouse solution, Hive, was also unsuitable because its
query latency was already orders of magnitude higher than
that of ODS, and query latency and efficiency were our main
concerns [27].
We next turned our attention to in-memory caching. ODS
already used a simple read-through cache but it was pri-
marily targeted at charting systems where multiple dash-
boards shared the same time series. A particularly difficult
scenario was when dashboards queried for the most recent
data point, missed in the cache, and then issued requests