
88 International Journal of Software and Informatics, 2023, 13(1)
data strategy in an all-round manner, in which the opening and sharing of data lies at the core
of the big data competition strategy. From the perspective of the technological development
trend, new technical architectures and big data support platforms are emerging, among which the
cloud native and lake-house architecture are reconstructing the big data infrastructure. Stronger
capabilities of data security and data sharing are required from access to the data lake and
data warehouse to cross-database and cross-domain sharing. Both the open source big data
ecosystem and the commercial big data system, however, fall worryingly behind the business
development in the security protection capability of big data in an open environment. Privacy
disclosures in recent years show that releasing or sharing unmasked data
readily exposes private data, especially individual sensitive information. In 2018, the
data of 87 million users of Facebook, a US social media company, was illegally used by Cambridge
Analytica, a consulting company, and Facebook paid a $5 billion fine over the incident. In
2021, another data leak involved 533 million individual Facebook users. The
security problem in the open environment has become a bottleneck in the development and
utilization of big data technology. Accordingly, how to protect the privacy of sensitive data
in an open and complex environment while ensuring good data availability and computing
efficiency has become one of the research focuses in big data security [1, 2].
Data security in the open big data environment differs greatly from traditional data security,
with changes seen in the protection method, protection object, and the relationship between
management and technology. Open big data applications are committed to the
opening and sharing of data: more diverse roles are involved in data processing, and the flow of
data is the norm, which sets higher standards for data security protection. Traditional measures for
data security such as data encryption and static masking are therefore outmoded. According to
relevant research, privacy protection and dynamic data masking technologies represent important
means for safe data flow and sharing and credible big data services [3, 4]. By maintaining the
availability of data sources without leaking sensitive information in the data flow, dynamic
data masking technology offers good utility and broad application prospects. In an open big
data environment, with massive multimodal data and highly concurrent access requests, how to
dynamically protect sensitive data in an automated, efficient, and scalable manner while
minimizing the impact on normal business is a complex problem that urgently needs solving [5–7].
The following challenges are mainly involved.
(1) Scalability of heterogeneous environments. To satisfy the timeliness requirement of
different data queries and data computing under an open big data scenario, many kinds of big
data computing engines are often deployed simultaneously on the same cluster. For instance,
Apache Spark [8] is suitable for batch processing of static data, where higher latency is
tolerable, and Apache Flink [9] is for low-latency or real-time streaming data processing.
Faced with complex and
diverse business scenarios and multiple computing engines, we should explore how to create,
manage, and maintain a uniform data protection strategy for heterogeneous engines and provide
standardized access methods for the horizontal expansion of heterogeneous environments. In
addition to the capability of dynamic data masking, it is necessary to study how to flexibly
support multiple capabilities of dynamic data protection under one framework and support the
vertical expansion of the dynamic data protection capability of a single engine.
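To make the idea of a uniform protection strategy concrete, the following minimal Python sketch defines one masking policy and reuses it behind two adapters. Everything in it is hypothetical and chosen for illustration: the names `MaskingRule`, `MaskingPolicy`, `batch_adapter`, and `stream_adapter`, the roles, and the masking functions are assumptions. The adapters are plain-Python stand-ins for a batch engine and a streaming engine and do not call any Spark or Flink API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Iterator, List

Row = Dict[str, str]


def mask_keep_last4(value: str) -> str:
    """Replace every character except the last four with '*'."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


@dataclass(frozen=True)
class MaskingRule:
    """Hypothetical rule: mask `column` unless the caller's role is exempt."""
    column: str
    exempt_roles: FrozenSet[str]
    mask: Callable[[str], str]


class MaskingPolicy:
    """Engine-neutral policy: a set of rules applied to one row at a time."""

    def __init__(self, rules: List[MaskingRule]) -> None:
        self._rules = {rule.column: rule for rule in rules}

    def apply(self, row: Row, role: str) -> Row:
        # Work on a copy so the stored source row is never modified.
        out = dict(row)
        for column, rule in self._rules.items():
            if column in out and role not in rule.exempt_roles:
                out[column] = rule.mask(out[column])
        return out


def batch_adapter(policy: MaskingPolicy, rows: List[Row], role: str) -> List[Row]:
    """Stand-in for a batch engine: masks a whole static data set at once."""
    return [policy.apply(row, role) for row in rows]


def stream_adapter(policy: MaskingPolicy, rows: Iterable[Row], role: str) -> Iterator[Row]:
    """Stand-in for a streaming engine: masks records one by one as they arrive."""
    for row in rows:
        yield policy.apply(row, role)
```

In this sketch, adding a new engine only requires a new adapter (horizontal expansion), whereas adding a new protection capability only requires a new rule or masking function in the shared policy (vertical expansion).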
(2) High efficiency of processing performance. In an open big data environment, data is
generated at a faster pace, and its volume grows exponentially. To meet the response time
requirements of high-performance real-time protection of massive data, data security protection
must be able to operate automatically according to rules and optimize the workload of the entire
processing flow. In this way, it can make full use of the distributed computing capability of
the big data execution engine to enhance the processing performance.
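As a rough illustration of rule-driven masking that is spread over parallel workers, the sketch below partitions the rows, masks each partition independently, and merges the results in the original order. It rests on several assumptions: the rule table `RULES`, the function `mask_email`, and the partitioning scheme are hypothetical, and a local thread pool merely stands in for the workers of a distributed execution engine. It shows the structure only and makes no claim about performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, str]


def mask_email(value: str) -> str:
    """Keep the first character of the local part and the domain."""
    local, sep, domain = value.partition("@")
    if not sep or not local:
        return "***"  # not a recognizable address: hide it completely
    return local[0] + "***@" + domain


# Hypothetical rule table: column name -> masking function.
RULES: Dict[str, Callable[[str], str]] = {"email": mask_email}


def mask_partition(partition: List[Row]) -> List[Row]:
    """Apply the rule table to every row of one partition."""
    return [
        {col: RULES[col](val) if col in RULES else val for col, val in row.items()}
        for row in partition
    ]


def split(rows: List[Row], parts: int) -> List[List[Row]]:
    """Cut the rows into at most `parts` contiguous partitions."""
    size = max(1, -(-len(rows) // parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def mask_parallel(rows: List[Row], workers: int = 4) -> List[Row]:
    """Mask partitions concurrently; Executor.map keeps the input order."""
    partitions = split(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        masked = pool.map(mask_partition, partitions)
        return [row for part in masked for row in part]
```

Because each partition is masked without reference to any other, the same rule table could in principle be shipped to the workers of a real engine, which is the property that lets protection scale with the cluster.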