暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
SIGMOD 2024_Homomorphic Compression Making Text Processing on Compression Unlimited_腾讯云.pdf
91
28页
1次
2025-05-28
免费下载
271
Homomorphic Compression: Making Text Processing on
Compression Unlimited
JIAWEI GUAN, Key laboratory of Data Engineering and Knowledge Engineering (MOE), and School of
Information, Renmin University of China, China
FENG ZHANG
, Key laboratory of Data Engineering and Knowledge Engineering (MOE), and School of
Information, Renmin University of China, China
SIQI MA, School of Systems and Computing, University of New South Wales, Australia
KUANGYU CHEN, Key laboratory of Data Engineering and Knowledge Engineering (MOE), and School
of Information, Renmin University of China, China
YIHUA HU, Key laboratory of Data Engineering and Knowledge Engineering (MOE), and School of Infor-
mation, Renmin University of China, China
YUXING CHEN, Tencent Inc., China
ANQUN PAN, Tencent Inc., China
XIAOYONG DU, Key laboratory of Data Engineering and Knowledge Engineering (MOE), and School of
Information, Renmin University of China, China
Lossless data compression is an eective way to handle the huge transmission and storage overhead of massive
text data. Its utility is even more signicant today when data volumes are skyrocketing. The concept of
operating on compressed data infuses new blood into ecient text management by enabling mainly access-
oriented text processing tasks to be done directly on compressed data without decompression. Facing limitations
of the existing compressed text processing schemes such as limited types of operations supported, low eciency,
and high space occupation, we address these problems by proposing a homomorphic compression theory. It
enables the generalization and characterization of algorithms with compression processing capabilities. On
this basis, we develop HOCO, an ecient text data management engine that supports a variety of processing
tasks on compressed text. We select three representative compression schemes and implement them combined
with homomorphism in HOCO. HOCO supports the extension of homomorphic compression schemes through
a modular and object-oriented design and has convenient interfaces for text processing tasks. We evaluate
HOCO on six real-world datasets. The three schemes implemented in HOCO show trade-os in terms of
compression ratio, supported operation types, and eciency. Experiments also show that HOCO can achieve
Feng Zhang is the corresponding author of this paper.
Authors’ addresses: Jiawei Guan, guanjw@ruc.edu.cn, Key laboratory of Data Engineering and Knowledge Engineering
(MOE), and School of Information, Renmin University of China, China; Feng Zhang, Key laboratory of Data Engineering
and Knowledge Engineering (MOE), and School of Information, Renmin University of China, China, fengzhang@ruc.edu.cn;
Siqi Ma, School of Systems and Computing, University of New South Wales, Australia, siqi.ma@adfa.edu.au; Kuangyu Chen,
Key laboratory of Data Engineering and Knowledge Engineering (MOE), and School of Information, Renmin University
of China, China, kuangyuchen@ruc.edu.cn; Yihua Hu, Key laboratory of Data Engineering and Knowledge Engineering
(MOE), and School of Information, Renmin University of China, China, yh3485@columbia.edu; Yuxing Chen, Tencent Inc.,
China, axingguchen@tencent.com; Anqun Pan, Tencent Inc., China, aaronpan@tencent.com; Xiaoyong Du, Key laboratory
of Data Engineering and Knowledge Engineering (MOE), and School of Information, Renmin University of China, China,
duyong@ruc.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the
full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specic permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2836-6573/2023/12-ART271 $15.00
https://doi.org/10.1145/3626765
Proc. ACM Manag. Data, Vol. 1, No. 4 (SIGMOD), Article 271. Publication date: December 2023.
271:2 Jiawei Guan, et al.
higher throughput in random access and modication operations (averagely 9.18
×
than the state-of-the-art)
and lower latency in text analytic tasks (averagely 7.16
×
than processing on uncompressed text) without
compromising compression ecacy.
CCS Concepts: Information systems
Data compression; Theory of computation
Data com-
pression.
Additional Key Words and Phrases: compression, homomorphism, operating on compressed data
ACM Reference Format:
Jiawei Guan, Feng Zhang, Siqi Ma, Kuangyu Chen, Yihua Hu, Yuxing Chen, Anqun Pan, and Xiaoyong Du.
2023. Homomorphic Compression: Making Text Processing on Compression Unlimited. Proc. ACM Manag.
Data 1, 4 (SIGMOD), Article 271 (December 2023), 28 pages. https://doi.org/10.1145/3626765
1 INTRODUCTION
Data plays a crucial role in providing information in modern society. A vast amount of data is
generated every day for purposes of recording, analyzing, and decision-making. However, the
rapid increase in data has created challenges in terms of storing and transmitting it eciently. To
cope with the problem, compression techniques have become more important than ever [
9
,
13
,
24
,
28
,
29
,
31
,
37
,
39
,
42
,
44
,
46
,
47
,
51
,
53
58
,
60
,
69
,
77
,
78
,
80
,
82
,
90
,
91
]. It is not advisable to
select a compression algorithm based solely on its compression ratio, as a high ratio often comes
with a high overhead for compression and decompression [
43
]. This can negatively impact the
performance of data processing, as the compressed data must be decompressed before it can be
used by applications [
75
]. Therefore, a system should consider both the benets of reduced storage
and transmission costs and the additional computational cost of the compression algorithm to
achieve an optimal balance.
Approaches of operating on compressed data have been proposed to bypass the expensive compres-
sion and decompression steps [
6
,
67
,
85
89
]. Although these solutions show promise in achieving
both high compression ratios and performance improvements in data analytic applications, they
have the following limitations:
First, most existing solutions focus on read-only text processing tasks but lack support for
text modication operations, such as insertion and deletion. As a result, these solutions have to
decompress and re-compress to process modication tasks every time, which is costly and hinders
practical implementation. For example, Succinct [
6
], a database that supports queries on compressed
data using an index and sux array-based compression technique, experiences extremely inecient
data updates for small units due to the interdependent nature of its compressed elements.
Second, existing solutions usually record modications to the text data in auxiliary data structures
instead of reecting modications directly in the data. For example, TADOC [
87
89
] provides
an insert operation on the compressed text data, which records the information in another data
structure without modifying the compressed data. Then the subsequent operations need to pay
additional visits to the recording structure for correctness.
Third, existing solutions store the compressed text data in structures that are hard to reect text
modications directly in them. It is crucial to redesign data structures for the compressed text
data that are conducive to direct modications in order to avoid involving complex auxiliary data
structures.
To address these limitations, we propose homomorphic compression (HC) to generalize and
characterize compression algorithms with direct processing capability in a principled way. Based on
this, we present
Ho
momorphic
Co
mpression System (HOCO), a text data management system
that supports operating on compressed data. HOCO includes three homomorphic compression
schemes with dierent compression ratios and direct processing abilities. It provides unied
Proc. ACM Manag. Data, Vol. 1, No. 4 (SIGMOD), Article 271. Publication date: December 2023.
of 28
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜