2007 Conference on Analysis of Algorithms (AofA 07), DMTCS proc. AH, 2007, 127–146
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

Philippe Flajolet¹, Éric Fusy¹, Olivier Gandouet², and Frédéric Meunier¹

¹ Algorithms Project, INRIA–Rocquencourt, F-78153 Le Chesnay (France)
² LIRMM, 161 rue Ada, 34392 Montpellier (France)
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
Introduction
The purpose of this note is to present and analyse an efficient algorithm for estimating the number of
distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets
and are usually massive streams (read-once sequences). This problem has received a great deal of attention
over the past two decades, finding an ever growing number of applications in networking and traffic
monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service),
and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence
of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed
by a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in various
time slices is an important indication for detecting attacks and monitoring traffic, as it records the number
of distinct active flows. Indeed, worms and viruses typically propagate by opening a large number of
different connections, and though they may well pass unnoticed amongst a huge traffic, their activity
becomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]).
Other applications of cardinality estimators include data mining of massive data sets of sorts—natural
language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where
the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators.
Clearly, the cardinality of a multiset can be exactly determined with a storage complexity essentially
proportional to its number of elements. However, in most applications, the multiset to be treated is far too
large to be kept in core memory. A crucial idea is then to relax the constraint of computing the value n of
the cardinality exactly, and to develop probabilistic algorithms dedicated to estimating n approximately.
(In many practical applications, a tolerance of a few percent on the result is acceptable.) A whole range of algorithms has been developed that only require a sublinear memory [2, 6, 10, 11, 15, 16], or, at worst, a linear memory, but with a small implied constant [24].
All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash
functions. The elements to be counted belonging to a certain data domain D, we assume given a hash function, h : D → {0, 1}^∞; that is, we assimilate hashed values to infinite binary strings of {0, 1}^∞, or equivalently to real numbers of the unit interval. (In practice, hashing on 32 bits will suffice to estimate cardinalities in excess of 10^9; see Section 4 for a discussion.) We postulate that the hash function has been designed in such a way that the hashed values closely resemble a uniform model of randomness, namely, bits of hashed values are assumed to be independent and to have each probability 1/2 of occurring; practical methods are known [20], which vindicate this assumption, based on cyclic redundancy codes (CRC), modular arithmetics, or a simplified cryptographic use of boolean algebra (e.g., sha1).
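As a purely illustrative sketch (the paper does not prescribe a particular implementation), the following Python snippet shows how a standard hash such as sha1, mentioned above, lets each element be viewed both as a prefix of a binary string and as a real number of the unit interval.

import hashlib

def hash_bits(element: str, nbits: int = 32) -> int:
    """Hash an element to an nbits-bit integer using sha1 (one of the
    practical choices mentioned above); the bits are treated as a prefix
    of an 'infinite' random binary string."""
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - nbits)

def hash_real(element: str, nbits: int = 32) -> float:
    """The same hashed value read as a real number in [0, 1)."""
    return hash_bits(element, nbits) / (1 << nbits)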
The best known cardinality estimators rely on making suitable, concise enough, observations on the
hashed values h(M) of the input multiset M, then inferring a plausible estimate of the unknown cardinality n. Define an observable of a multiset S ≡ h(M) of {0, 1}^∞ strings (or, equivalently, of real [0, 1] numbers) to be a function that only depends on the set underlying S, that is, a quantity independent of replications. Then two broad categories of cardinality observables have been studied.
Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit-pattern 0^(ρ−1)1 is more or less a likely indication that the cardinality n of S is at least 2^ρ. The algorithms known as Probabilistic Counting, due to Flajolet–Martin [15], together with the more recent LOGLOG of Durand–Flajolet [10] belong to this category.
Order statistics observables: these are based on order statistics, like the smallest (real) values, that
appear in S. For instance, if X = min(S), we may legitimately hope that n is roughly of the order
of 1/X, since, as regards expectations, one has E(X) = 1/(n + 1). The algorithms of Bar-Yossef
et al. [2] and Giroire’s MINCOUNT [16, 18] are of this type.
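To make the two kinds of observables concrete, the following Python sketch (a hypothetical illustration, not code from the paper) maintains one register of each kind over a stream of 32-bit hashed values such as those produced by hash_bits above: the largest value of ρ, where ρ(x) is the position of the leftmost 1-bit of x, and the smallest hashed value read as a real number.

def rho(x: int, nbits: int = 32) -> int:
    """Position (1-indexed) of the leftmost 1-bit of x viewed as an
    nbits-bit string; rho(x) = r means x starts with the pattern 0^(r-1) 1."""
    if x == 0:
        return nbits + 1
    return nbits - x.bit_length() + 1

def rough_estimates(hashed_values, nbits: int = 32):
    """Single-register versions of the two observables: a bit-pattern
    observable (maximum of rho) and an order-statistics observable
    (minimum of the hash read as a real in [0, 1))."""
    max_rho, min_real = 0, 1.0
    for h in hashed_values:
        max_rho = max(max_rho, rho(h, nbits))
        min_real = min(min_real, h / (1 << nbits))
    estimate_from_pattern = 2 ** max_rho                 # "n is at least about 2^rho"
    estimate_from_min = (1.0 / min_real) if min_real > 0 else float("inf")
    return estimate_from_pattern, estimate_from_min

Both registers are insensitive to replications, as an observable must be, but each one alone is far too variable to yield an accurate estimate, which is exactly the difficulty discussed next.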
The observables just described can be maintained with just one or a few registers. However, as such,
they only provide a rough indication of the sought cardinality n, via log_2 n or 1/n. One difficulty is due to a rather high variability, so that one observation, corresponding to the maintenance of a single variable, cannot suffice to obtain accurate predictions. An immediate idea is then to perform several experiments in parallel: if each of a collection of m random variables has standard deviation σ, then their arithmetic mean has standard deviation σ/√m, which can be made as small as we please by increasing m. That
simplistic strategy has however two major drawbacks: it is costly in terms of computation time (we would
need to compute m hashed values per element scanned), and, worse, it would necessitate a large set of
independent hashing functions, for which no construction is known [1].
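As a concrete instance of the √m law: with m = 1024 the standard error would shrink by a factor of √1024 = 32, but under this naive scheme at the cost of 1024 hash computations per scanned element and as many independent hash functions.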
The solution introduced in [15] under the name of stochastic averaging, consists in emulating the
effect of m experiments with a single hash function. Roughly speaking, we divide the input stream
h(M) into m substreams, corresponding to a partition of the unit interval of hashed values into [0, 1/m[, [1/m, 2/m[, ..., [(m−1)/m, 1[.
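Concretely, one commonly takes m = 2^b and reads the substream index off the first b bits of each hashed value, the remaining bits feeding the per-substream observable. The Python sketch below (an illustration of the principle only, reusing rho from above; it is not the paper's full HYPERLOGLOG estimator) maintains one bit-pattern register per substream.

M_BITS = 6          # b, the number of index bits, so m = 2**b = 64 substreams
M = 1 << M_BITS

def stochastic_averaging(hashed_values_32bit):
    """Emulate m experiments with a single hash function: the first b bits
    of each 32-bit hashed value select a substream (register), and the
    remaining 32 - b bits are fed to the bit-pattern observable rho."""
    registers = [0] * M
    for h in hashed_values_32bit:
        j = h >> (32 - M_BITS)                   # substream index from the first b bits
        rest = h & ((1 << (32 - M_BITS)) - 1)    # the remaining 32 - b bits
        registers[j] = max(registers[j], rho(rest, 32 - M_BITS))
    return registers    # one observation per substream, to be combined into an estimate

The m register values are then combined into a single cardinality estimate; the particular combination used by HYPERLOGLOG, based on a normalized harmonic mean, is what the analysis in the remainder of the paper is devoted to, and it is what yields the 1.04/√m standard error quoted in the abstract.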