International Journal of Software and Informatics, ISSN 1673-7288

http://www.ijsi.org, ijsi@iscas.ac.cn, +86-10-62661048

IJSI, 2023, 13(1): 87–115, doi: 10.21655/ijsi.1673-7288.00297

Research Article

BDMasker: Dynamic Data Protection System for Open Big Data Environment

Yaofeng Tu (屠要峰)¹,², Jiahao Niu (牛家浩), Dezheng Wang (王德政)¹,², Hong Gao (高洪), Jin Xu (徐进), Ke Hong (洪科), Fang Yang (阳方)

(State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518057, China)

(ZTE Corporation, Nanjing 210014, China)

Corresponding author: Jiahao Niu, niu.jiahao@zte.com.cn

Abstract Big data has become a basic national strategic resource, and the opening and sharing of data is the core of China’s big data strategy. Cloud native technology and the lake-house architecture are reconstructing the big data infrastructure and promoting data sharing and value dissemination. The development of the big data industry and technology requires stronger data security and data sharing capabilities. However, data security in an open environment has become a bottleneck that restricts the development and utilization of big data technology. Data security and privacy protection issues have become increasingly prominent in both the open source big data ecosystem and commercial big data systems. Dynamic data protection systems in the open big data environment now face challenges in data availability, processing efficiency, and system scalability. This paper proposes BDMasker, a dynamic data protection system for the open big data environment. Through precise query analysis and query rewriting based on a query dependency model, it accurately perceives the original business request without changing it, so the whole dynamic masking process has zero impact on the business. Furthermore, its multi-engine-oriented unified security strategy framework supports both the vertical expansion of dynamic data protection capabilities and the horizontal expansion across multiple computing engines. The distributed computing capability of the big data execution engine can be used to improve the system's data protection processing performance. The experimental results show that the precise SQL analysis and rewriting technology proposed by BDMasker is effective. The system has good scalability and performance, and the overall performance fluctuates within 3% in the TPC-DS and YCSB benchmark tests.

Keywords big data; data masking; dynamic data masking; SQL rewriting; query dependency

Citation Tu YF, Niu JH, Wang DZ, Gao H, Xu J, Hong K, Yang F. BDMasker: Dynamic data protection system for open big data environment. International Journal of Software and Informatics, 2023, 13(1): 87–115. http://www.ijsi.org/1673-7288/297.htm

This is the English version of Chinese article “面向开放大数据环境的动态数据保护系统, 2022, 34(3): 1213–1235. doi: 10.13328/j.cnki.jos.006783”

Funding items: National Key R&D Program of China (2021YFB3101100)

Received 2022-05-14; Revised 2022-07-29, 2022-09-07; Accepted 2022-09-23; IJSI published online 2023-03-30

In the era of big data, data serves as a basic national strategic resource. Attaching great importance to the development of big data, China has begun to implement its national big data strategy in an all-round manner, with the opening and sharing of data at the core of the big data competition strategy. In terms of technology trends, new technical architectures and big data support platforms are emerging, among which cloud native technology and the lake-house architecture are reconstructing the big data infrastructure. Stronger data security and data sharing capabilities are required, from access to the data lake and data warehouse to cross-database and cross-domain sharing. However, both the open source big data ecosystem and commercial big data systems fall worryingly behind business development in their ability to protect big data in an open environment. Privacy disclosures in recent years show that releasing or sharing unmasked data is highly likely to reveal private data, especially sensitive personal information. In 2018, the data of 87 million users of Facebook, a US social media company, was illegally used by the consulting company Cambridge Analytica, and Facebook paid a $5 billion fine over the incident. In 2021, another data leak involved 533 million Facebook users. Security in the open environment has become a bottleneck in the development and utilization of big data technology. Accordingly, how to protect the privacy of sensitive data in an open and complex environment while ensuring good data availability and computing efficiency has become a research focus in big data security [1, 2].

Data security in the open big data environment differs greatly from traditional data security, with changes in the protection method, the protection object, and the relationship between management and technology. Open big data application scenarios are committed to the opening and sharing of data: more diverse roles take part in data processing, and data flow becomes the norm, which sets higher standards for data security protection. Traditional data security measures such as data encryption and static masking are therefore no longer adequate. According to relevant research, privacy protection and dynamic data masking technologies are important means of achieving safe data flow and sharing and credible big data services [3, 4]. By keeping data sources available without leaking sensitive information in the data flow, dynamic data masking technology offers good utility and broad application prospects. In an open big data environment, how to dynamically protect sensitive data in an automated, efficient, and scalable manner while minimizing the impact on normal business, amid massive multimodal data and highly concurrent access requests, is a complex problem that requires prompt solutions [5–7]. The following challenges are mainly involved.

(1) Scalability of heterogeneous environments. To satisfy the timeliness requirements of different data queries and data computing tasks in an open big data scenario, several kinds of big data computing engines are often deployed simultaneously on the same cluster. For instance, Apache Spark [8] is suited to high-latency batch processing of static data, and Apache Flink [9] is suited to low-latency or real-time streaming data processing. Faced with complex and diverse business scenarios and multiple computing engines, we should explore how to create, manage, and maintain a uniform data protection strategy for heterogeneous engines and provide standardized access methods for the horizontal expansion of heterogeneous environments. Beyond the capability of dynamic data masking, it is also necessary to study how to flexibly support multiple dynamic data protection capabilities under one framework and support the vertical expansion of the dynamic data protection capability of a single engine.

(2) High efficiency of processing performance. In an open big data environment, data is generated at an ever faster pace, and its volume grows exponentially. To meet the response time requirements of high-performance real-time protection of massive data, data security protection must operate automatically according to rules and optimize the load of the whole processing pipeline. In this way, it can make full use of the distributed computing capability of the big data execution engine to enhance processing performance.
