SIGMOD 2025_Efficient and Accurate Differentially Private Cardinality Continual Releases_OceanBase.pdf

迹部景吾

27页

0次

2025-07-30

免费下载

Eicient and Accurate Dierentially Private Cardinality

Continual Releases

DONGDONG XIE, MOE KLINNS Lab, Xi’an Jiaotong University, OceanBase, Ant Group, China

PINGHUI WANG

∗

, MOE KLINNS Lab, Xi’an Jiaotong University, China

QUANQING XU, OceanBase, Ant Group, China

CHUANHUI YANG, OceanBase, Ant Group, China

RUNDONG LI, MOE KLINNS Lab, Xi’an Jiaotong University, China

Accurately estimating the number of unique elements that appear in data streams in real time is a funda-

mental problem with applications including network trac monitoring and real-time social media analytics.

Traditional sketch-based algorithms such as FM Sketch and HyperLogLog oer memory-friendly solutions

for cardinality estimation but fall short in scenarios where the stream elements are privacy-sensitive and

require dierential privacy. Although recent approaches have incorporated dierential privacy into the above

cardinality estimators, they are limited to single-query settings, restricting their applicability. Previous methods

for private cardinality continual release settings—i.e., releasing the cardinality after each new element in the

stream—demand large memory resources and are thus dicult to apply in practice. In this paper, we present

a novel cardinality estimation framework, FC, which ensures dierential privacy under continual releases

while simultaneously achieving low memory usage, high accuracy, and ecient computation. Our approach

innovatively leverages an ecient cardinality estimator and privacy-preserving mechanisms to overcome the

limitations of existing methods. Comprehensive experiments demonstrate that our method reduces memory

usage by up to 504 times compared to the best previous method while maintaining nearly the same accuracy.

Additionally, under identical memory constraints, our method improves the estimation accuracy by orders of

magnitude.

CCS Concepts: • Security and privacy

→

Database and storage security; • Information systems

→

Data management systems; • Theory of computation → Approximation algorithms analysis.

Additional Key Words and Phrases: Cardinality Estimation, Dierential Privacy, Stream Processing

ACM Reference Format:

Dongdong Xie, Pinghui Wang, Quanqing Xu, Chuanhui Yang, and Rundong Li. 2025. Ecient and Accurate

Dierentially Private Cardinality Continual Releases. Proc. ACM Manag. Data 3, 3 (SIGMOD), Article 151

(June 2025), 27 pages. https://doi.org/10.1145/3725288

1 Introduction

Releasing query results based on sensitive data while preserving privacy is a fundamental task that

has raised much attention. For example, [

] demonstrates that a database can be reconstructed

∗

Pinghui Wang is the corresponding author.

Authors’ Contact Information: Dongdong Xie, MOE KLINNS Lab, Xi’an Jiaotong University, OceanBase, Ant Group,

Xi’an, China, xdddd1@stu.xjtu.edu.cn; Pinghui Wang, MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, China,

wangpinghui369@gmail.com; Quanqing Xu, OceanBase, Ant Group, Hangzhou, China, xuquanqing.xqq@oceanbase.com;

Chuanhui Yang, OceanBase, Ant Group, Hangzhou, China, rizhao.ych@oceanbase.com; Rundong Li, MOE KLINNS Lab,

Xi’an Jiaotong University, Xi’an, China, xjtulirundong@stu.xjtu.edu.cn.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee

provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the

full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be honored.

Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires

prior specic permission and/or a fee. Request permissions from permissions@acm.org.

ACM 2836-6573/2025/6-ART151

https://doi.org/10.1145/3725288

Proc. ACM Manag. Data, Vol. 3, No. 3 (SIGMOD), Article 151. Publication date: June 2025.

BBAAD9C20180234D78A0072836F0BBD0B2B9B20A1D186BB0A1D9893CB13D2BA93B49B438015B7B0F22392208984657EB6AE9215AE1D01BB11BBFC21E724E39D7241D99AD552D898764AD2FC763F743F4560CE4C5735D62A508FC319F10D29C08DFB6269FFE3

151:2 Dongdong Xie, Pinghui Wang, anqing Xu, Chuanhui Yang, and Rundong Li

by observing sucient query results over the database and provides perturbations for avoiding

privacy leakage. In recent years, dierential privacy [

] (DP) has become the de facto standard for

privacy-preserving query releases and has been applied by Apple, Google, Microsoft, etc. Since it

oers an elegant and rigorous denition of privacy. When releasing the query result of a database,

the best way to preserve the privacy of an element is to just remove that element from the database

and then make queries. That leads to the principle idea of DP: deleting or altering any element

in the dataset does not signicantly change the probability distribution of the query outputs.

Consequently, observers of these outputs cannot reliably infer information about the presence of

any individual element within the dataset.

Instead of querying static datasets [

], in many real-world applications, queries are computed

over ever-evolving data. For example, a webpage may display the number of its visitors in real

time. Therefore, it is essential to consider dierential privacy under continual releases [

]. In this

setting, time is divided into discrete periods, with no more than one element coming at each period.

We can model the input as a data stream

𝒆 = (𝑒

, . . . , 𝑒

𝑇

)

. A new query result is given at every time

step

𝑡

, which is computed over the seen stream prex

(𝑒

, 𝑒

, . . . , 𝑒

𝑡

)

. Similar to the static setting,

DP under continual releases should ensure that for any pair of data streams that dier only at one

time period, the probability distributions of the pair of query results should be hard to distinguish.

In recent years, DP under continual releases has been extensively studied [

–

]. Researchers have

applied DP under continual releases for tasks such as computing private sums [

], answering range

queries [

], answering join queries on databases [

], and more. Among these tasks, dierentially

private cardinality continual releases (DPCCR) - which involves continually releasing the number of

unique elements in a stream where duplicates may appear - is one of the most fundamental queries

with many applications. For example, e-commerce platforms may track unique user visits to product

pages, helping store owners predict product popularity and optimize inventory management. Online

video-sharing platforms like YouTube might monitor the number of unique viewers of a video,

providing a more accurate measure of a video’s popularity than simply tracking the number of

views. The results can help users assess the quality of a video. This can also assist advertisers in

estimating the reach - the number of distinct ad exposures - of their advertisements within the

video content. Additionally, understanding the number of dierent connections can inform security

measures and resource allocation in network trac monitoring.

Directly revealing the exact cardinalities would violate DP under continual releases and leak

privacy. An adversary who observes the output cardinalities

{𝑛

, . . . ,𝑛

𝑇

}

could potentially deduce

the existence of specic elements in the stream, especially when combined with additional side

information. A trivial example is, suppose

𝑛

𝑇

exactly equals the element universe size, observers

can deduce that any element in the universe is in the stream. There is another example. Suppose a

webpage displays the exact number of distinct viewers in real time. The attacker sends the link of

the webpage to another one and requires the link receiver to click on the link. If the number of

distinct viewers does not grow, then the attacker deduces that the receiver has already visited the

webpage before, compromising the privacy of the link receiver. Therefore, to preserve privacy, the

disturbed cardinalities {

𝑛

, . . . ,

𝑛

𝑇

} should be released instead of the exact cardinalities.

Sketch-based methods like the Flajolet-Martin (FM) sketch [

] and HyperLogLog (HLL) [

]

are widely used for their compact memory usage and accurate cardinality estimations. They adopt

a compact data structure (called a sketch) to record summaries of the stream elements and give

estimations based on the updated data structure. Previous works [

–

] have proved that some of

them are inherently or can be modied to preserve DP. However, their results are restricted in the

single query settings, i.e., only output the estimation once. One may execute the above methods

at every time step. However, such a solution is not viable. In DP, the privacy budget

𝜀

serves as

a metric for quantifying the level of privacy guarantee. The formal denition of

𝜀

is provided in

Proc. ACM Manag. Data, Vol. 3, No. 3 (SIGMOD), Article 151. Publication date: June 2025.

of 27

免费下载

文档被以下合辑收录

数据库顶会 SIGMOD 2025 论文下载（共12篇）

数据库顶会 SIGMOD 2025 论文下载

关注

文档被以下合辑收录

评论