
128 P. Flajolet, É. Fusy, O. Gandouet, F. Meunier
Clearly, the cardinality of a multiset can be exactly determined with a storage complexity essentially
proportional to its number of elements. However, in most applications, the multiset to be treated is far too
large to be kept in core memory. A crucial idea is then to relax the constraint of computing the value n of
the cardinality exactly, and to develop probabilistic algorithms dedicated to estimating n approximately.
(In many practical applications, a tolerance of a few percent on the result is acceptable.) A whole range
of algorithms have been developed that require only a sublinear memory [2, 6, 10, 11, 15, 16] or, at worst,
a linear memory, but with a small implied constant [24].
All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash
functions. Since the elements to be counted belong to a certain data domain D, we assume we are given a hash
function, h : D → {0, 1}^∞; that is, we assimilate hashed values to infinite binary strings of {0, 1}^∞, or
equivalently to real numbers of the unit interval. (In practice, hashing on 32 bits will suffice to estimate
cardinalities in excess of 10^9; see Section 4 for a discussion.) We postulate that the hash function has been
designed in such a way that the hashed values closely resemble a uniform model of randomness, namely,
bits of hashed values are assumed to be independent and to have each probability 1/2 of occurring—
practical methods are known [20], which vindicate this assumption, based on cyclic redundancy codes
(CRC), modular arithmetics, or a simplified cryptographic use of boolean algebra (e.g., sha1).
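As a purely illustrative sketch (the paper does not prescribe a particular construction), such a hash can be simulated with sha1 by truncating the digest to a finite prefix of the idealized infinite bit string; the function name and the 32-bit truncation are choices of this example, not the paper's:

```python
import hashlib

def hash_to_unit_interval(element: str, bits: int = 32) -> float:
    """Map an element to a (pseudo-)uniform real in [0, 1) by reading
    the first `bits` bits of its SHA-1 digest as a binary fraction."""
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    # Keep the top `bits` bits of the 160-bit digest.
    value = int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)
    return value / (1 << bits)

# Duplicates hash identically, so replications cannot perturb an observable.
assert hash_to_unit_interval("foo") == hash_to_unit_interval("foo")
```

With 32 bits, hashed values are uniform multiples of 2^-32, which is indistinguishable from the idealized real-valued model as long as cardinalities stay well below 2^32.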
The best known cardinality estimators rely on making suitable, concise enough, observations on the
hashed values h(M) of the input multiset M, then inferring a plausible estimate of the unknown cardinality n.
Define an observable of a multiset S ≡ h(M) of {0, 1}^∞ strings (or, equivalently, of real [0, 1]
numbers) to be a function that only depends on the set underlying S, that is, a quantity independent of
replications. Then two broad categories of cardinality observables have been studied.
— Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of
the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit-pattern
0^(ρ−1)1 is more or less a likely indication that the cardinality n of S is at least 2^ρ. The
algorithms known as Probabilistic Counting, due to Flajolet-Martin [15], together with the more
recent LOGLOG of Durand-Flajolet [10] belong to this category.
— Order statistics observables: these are based on order statistics, such as the smallest (real) values that
appear in S. For instance, if X = min(S), we may legitimately hope that n is roughly of the order
of 1/X, since, as regards expectations, one has E(X) = 1/(n + 1). The algorithms of Bar-Yossef
et al. [2] and Giroire’s MINCOUNT [16, 18] are of this type.
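Both kinds of observables can be sketched in a few lines; in the toy example below, the helper names, the sha1-based 32-bit hash, and the test stream are choices of this illustration rather than the paper's:

```python
import hashlib

def hash01(x: str, bits: int = 32) -> float:
    """32-bit sha1-based hash of x, viewed as a real in [0, 1)."""
    h = int.from_bytes(hashlib.sha1(x.encode()).digest(), "big") >> (160 - bits)
    return h / (1 << bits)

def rho(x: float, bits: int = 32) -> int:
    """Bit-pattern observable: position of the leftmost 1-bit of x's binary
    expansion, i.e. the rho such that x begins with 0^(rho-1) 1."""
    r = 1
    while x < 0.5 and r < bits:
        x *= 2.0
        r += 1
    return r

stream = [str(i % 500) for i in range(10_000)]  # multiset with n = 500 distinct
hashes = [hash01(s) for s in stream]

# Bit-pattern observable: the maximal rho seen suggests n is of order 2^max_rho.
max_rho = max(rho(x) for x in hashes)

# Order-statistics observable: X = min(S) suggests n is of order 1/X.
X = min(hashes)

print(2 ** max_rho, 1.0 / X)  # crude, high-variance estimates of n = 500
```

Note that both quantities depend only on the set underlying the stream: replications leave the maximum of rho and the minimum value unchanged, which is exactly the defining property of an observable.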
The observables just described can be maintained with just one or a few registers. However, as such,
they only provide a rough indication of the sought cardinality n, via log_2 n or 1/n. One difficulty is due
to a rather high variability, so that one observation, corresponding to the maintenance of a single variable,
cannot suffice to obtain accurate predictions. An immediate idea is then to perform several experiments
in parallel: if each of a collection of m random variables has standard deviation σ, then their arithmetic
mean has standard deviation σ/√m, which can be made as small as we please by increasing m. That
simplistic strategy has however two major drawbacks: it is costly in terms of computation time (we would
need to compute m hashed values per element scanned), and, worse, it would necessitate a large set of
independent hashing functions, for which no construction is known [1].
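The σ/√m variance reduction behind this idea is easy to verify numerically; the following toy simulation (sample sizes and the uniform distribution are choices of this sketch) compares the dispersion of a single observation with that of the mean of m = 64 observations:

```python
import random
import statistics

random.seed(1)

def std_of_mean(m: int, trials: int = 2000) -> float:
    """Empirical standard deviation of the arithmetic mean of m
    i.i.d. uniform(0, 1) observations (each has sigma = 1/sqrt(12))."""
    means = [statistics.fmean(random.random() for _ in range(m))
             for _ in range(trials)]
    return statistics.stdev(means)

s1, s64 = std_of_mean(1), std_of_mean(64)
# Averaging 64 experiments shrinks the deviation by roughly sqrt(64) = 8.
print(s1 / s64)
```

The simulation only restates the classical √m law; the point of the paragraph above is that realizing it naively would cost m hash evaluations per element, which stochastic averaging avoids.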
The solution, introduced in [15] under the name of stochastic averaging, consists in emulating the
effect of m experiments with a single hash function. Roughly speaking, we divide the input stream
h(M) into m substreams, corresponding to a partition of the unit interval of hashed values into [0, 1/m[,