2007 Conference on Analysis of Algorithms (AofA 07), DMTCS proc. AH, 2007, 127–146
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

Philippe Flajolet¹, Éric Fusy¹, Olivier Gandouet², and Frédéric Meunier¹

¹ Algorithms Project, INRIA–Rocquencourt, F-78153 Le Chesnay (France)
² LIRMM, 161 rue Ada, 34392 Montpellier (France)
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
Introduction
The purpose of this note is to present and analyse an efficient algorithm for estimating the number of
distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets
and are usually massive streams (read-once sequences). This problem has received a great deal of attention
over the past two decades, finding an ever growing number of applications in networking and traffic
monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service),
and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence
of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed
by a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in various
time slices is an important indication for detecting attacks and monitoring traffic, as it records the number
of distinct active flows. Indeed, worms and viruses typically propagate by opening a large number of
different connections, and though they may well pass unnoticed amongst a huge traffic, their activity
becomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]).
Other applications of cardinality estimators include data mining of massive data sets of sorts—natural
language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where
the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators.
Clearly, the cardinality of a multiset can be exactly determined with a storage complexity essentially
proportional to its number of elements. However, in most applications, the multiset to be treated is far too
large to be kept in core memory. A crucial idea is then to relax the constraint of computing the value n of
the cardinality exactly, and to develop probabilistic algorithms dedicated to estimating n approximately.
(In many practical applications, a tolerance of a few percent on the result is acceptable.) A whole range of algorithms has been developed that only require a sublinear memory [2, 6, 10, 11, 15, 16], or, at worst, a linear memory, but with a small implied constant [24].
All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash
functions. The elements to be counted belonging to a certain data domain D, we assume given a hash function, h : D → {0, 1}^∞; that is, we assimilate hashed values to infinite binary strings of {0, 1}^∞, or equivalently to real numbers of the unit interval. (In practice, hashing on 32 bits will suffice to estimate cardinalities in excess of 10^9; see Section 4 for a discussion.) We postulate that the hash function has been designed in such a way that the hashed values closely resemble a uniform model of randomness, namely, bits of hashed values are assumed to be independent and to have each probability 1/2 of occurring; practical methods are known [20], which vindicate this assumption, based on cyclic redundancy codes (CRC), modular arithmetics, or a simplified cryptographic use of boolean algebra (e.g., sha1).
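As a purely illustrative sketch (the paper does not prescribe a particular implementation), the following Python snippet shows how a standard hash such as sha1, mentioned above, lets each element be viewed both as a prefix of a binary string and as a real number of the unit interval.

import hashlib

def hash_bits(element: str, nbits: int = 32) -> int:
    """Hash an element to an nbits-bit integer using sha1 (one of the
    practical choices mentioned above); the bits are treated as a prefix
    of an 'infinite' random binary string."""
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - nbits)

def hash_real(element: str, nbits: int = 32) -> float:
    """The same hashed value read as a real number in [0, 1)."""
    return hash_bits(element, nbits) / (1 << nbits)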
The best known cardinality estimators rely on making suitable, concise enough, observations on the
hashed values h(M) of the input multiset M, then inferring a plausible estimate of the unknown cardinality n. Define an observable of a multiset S ≡ h(M) of {0, 1}^∞ strings (or, equivalently, of real [0, 1] numbers) to be a function that only depends on the set underlying S, that is, a quantity independent of replications. Then two broad categories of cardinality observables have been studied.
Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit-pattern 0^(ρ−1)1 is more or less a likely indication that the cardinality n of S is at least 2^ρ. The algorithms known as Probabilistic Counting, due to Flajolet–Martin [15], together with the more recent LOGLOG of Durand–Flajolet [10] belong to this category.
Order statistics observables: these are based on order statistics, like the smallest (real) values, that
appear in S. For instance, if X = min(S), we may legitimately hope that n is roughly of the order
of 1/X, since, as regards expectations, one has E(X) = 1/(n + 1). The algorithms of Bar-Yossef
et al. [2] and Giroire’s MINCOUNT [16, 18] are of this type.
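To make the two kinds of observables concrete, the following Python sketch (a hypothetical illustration, not code from the paper) maintains one register of each kind over a stream of 32-bit hashed values such as those produced by hash_bits above: the largest value of ρ, where ρ(x) is the position of the leftmost 1-bit of x, and the smallest hashed value read as a real number.

def rho(x: int, nbits: int = 32) -> int:
    """Position (1-indexed) of the leftmost 1-bit of x viewed as an
    nbits-bit string; rho(x) = r means x starts with the pattern 0^(r-1) 1."""
    if x == 0:
        return nbits + 1
    return nbits - x.bit_length() + 1

def rough_estimates(hashed_values, nbits: int = 32):
    """Single-register versions of the two observables: a bit-pattern
    observable (maximum of rho) and an order-statistics observable
    (minimum of the hash read as a real in [0, 1))."""
    max_rho, min_real = 0, 1.0
    for h in hashed_values:
        max_rho = max(max_rho, rho(h, nbits))
        min_real = min(min_real, h / (1 << nbits))
    estimate_from_pattern = 2 ** max_rho                 # "n is at least about 2^rho"
    estimate_from_min = (1.0 / min_real) if min_real > 0 else float("inf")
    return estimate_from_pattern, estimate_from_min

Both registers are insensitive to replications, as an observable must be, but each one alone is far too variable to yield an accurate estimate, which is exactly the difficulty discussed next.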
The observables just described can be maintained with just one or a few registers. However, as such,
they only provide a rough indication of the sought cardinality n, via log_2 n or 1/n. One difficulty is due to a rather high variability, so that one observation, corresponding to the maintenance of a single variable, cannot suffice to obtain accurate predictions. An immediate idea is then to perform several experiments in parallel: if each of a collection of m random variables has standard deviation σ, then their arithmetic mean has standard deviation σ/√m, which can be made as small as we please by increasing m. That
simplistic strategy has however two major drawbacks: it is costly in terms of computation time (we would
need to compute m hashed values per element scanned), and, worse, it would necessitate a large set of
independent hashing functions, for which no construction is known [1].
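As a concrete instance of the √m law: with m = 1024 the standard error would shrink by a factor of √1024 = 32, but under this naive scheme at the cost of 1024 hash computations per scanned element and as many independent hash functions.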
The solution introduced in [15] under the name of stochastic averaging, consists in emulating the
effect of m experiments with a single hash function. Roughly speaking, we divide the input stream
h(M) into m substreams, corresponding to a partition of the unit interval of hashed values into [0, 1/m[, [1/m, 2/m[, ..., [(m−1)/m, 1[.
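Concretely, one commonly takes m = 2^b and reads the substream index off the first b bits of each hashed value, the remaining bits feeding the per-substream observable. The Python sketch below (an illustration of the principle only, reusing rho from above; it is not the paper's full HYPERLOGLOG estimator) maintains one bit-pattern register per substream.

M_BITS = 6          # b, the number of index bits, so m = 2**b = 64 substreams
M = 1 << M_BITS

def stochastic_averaging(hashed_values_32bit):
    """Emulate m experiments with a single hash function: the first b bits
    of each 32-bit hashed value select a substream (register), and the
    remaining 32 - b bits are fed to the bit-pattern observable rho."""
    registers = [0] * M
    for h in hashed_values_32bit:
        j = h >> (32 - M_BITS)                   # substream index from the first b bits
        rest = h & ((1 << (32 - M_BITS)) - 1)    # the remaining 32 - b bits
        registers[j] = max(registers[j], rho(rest, 32 - M_BITS))
    return registers    # one observation per substream, to be combined into an estimate

The m register values are then combined into a single cardinality estimate; the particular combination used by HYPERLOGLOG, based on a normalized harmonic mean, is what the analysis in the remainder of the paper is devoted to, and it is what yields the 1.04/√m standard error quoted in the abstract.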