EncChain: Enhancing Large Language Model Applications with
Advanced Privacy Preservation Techniques
Zhe Fu
Alibaba Cloud
je.fz@alibaba-inc.com
Mo Sha
Alibaba Cloud
shamo.sm@alibaba-inc.com
Yiran Li
Alibaba Cloud
yiranli.lyr@alibaba-inc.com
Huorong Li
Alibaba Cloud
huorong.lhr@alibaba-inc.com
Yubing Ma
Alibaba Cloud
yubing.myb@alibaba-inc.com
Sheng Wang
Alibaba Cloud
sh.wang@alibaba-inc.com
Feifei Li
Alibaba Cloud
lifeifei@alibaba-inc.com
ABSTRACT
In response to escalating concerns about data privacy in the Large Language Model (LLM) domain, we demonstrate EncChain, a pioneering solution designed to bolster data security in LLM applications. EncChain presents an all-encompassing approach to data protection, encrypting both the knowledge bases and user interactions. It empowers confidential computing and implements stringent access controls, offering a significant leap in securing LLM usage. Designed as an accessible Python package, EncChain ensures straightforward integration into existing systems, bolstered by its operation within secure environments and the utilization of remote attestation technologies to verify its security measures. The effectiveness of EncChain in fortifying data privacy and security in LLM technologies underscores its importance, positioning it as a critical advancement for the secure and private utilization of LLMs.
PVLDB Reference Format:
Zhe Fu, Mo Sha, Yiran Li, Huorong Li, Yubing Ma, Sheng Wang, and Feifei
Li. EncChain: Enhancing Large Language Model Applications with
Advanced Privacy Preservation Techniques. PVLDB, 17(12): 4413 - 4416,
2024.
doi:10.14778/3685800.3685888
1 INTRODUCTION
Since late 2022, interest in Large Language Models (LLMs) [1] has surged dramatically. ChatGPT, for instance, amassed over 100 million active users within just two months of its launch, representing an unprecedented technological uptake. The profound capabilities of LLMs across diverse domains have catalyzed their widespread adoption and integration across various use cases, demonstrating substantial benefits in augmenting productivity and efficiency.
However, the rapid advancement of LLMs has highlighted significant data security and privacy issues. These concerns are not merely theoretical. In March 2023, the Italian Data Protection Authority banned ChatGPT due to privacy concerns. In April, Samsung was accused of leaking sensitive semiconductor data to ChatGPT in three incidents over 20 days. By November, Microsoft prohibited employees from using ChatGPT at work, blocking related AI tools on company devices. These instances indicate a shift from initial enthusiasm to a more measured approach, recognizing the pronounced issues with LLMs in practical applications.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 12 ISSN 2150-8097.
doi:10.14778/3685800.3685888
The aggregation of extensive knowledge bases and user queries, often containing sensitive data, introduces substantial security vulnerabilities when processed by LLMs. Typical LLM applications, such as third-party tailored domain-specific APIs, require significant computational resources and specialized hardware, often favoring cloud deployments. This setup introduces various security threats, including data exposure due to negligent or malicious service providers, multi-tenant architecture risks, and the potential misuse of sensitive user data for model refinement. The lack of theoretical tools to mitigate the risk of LLMs inadvertently revealing sensitive content further complicates the issue. As technological applications deepen, data security emerges as a pivotal constraint on the advancement of LLM technologies.
In this paper, we demonstrate the proposed EncChain, a novel privacy preservation solution tailored for LLM applications, underpinned by confidential data handling practices. The strategic application of EncChain significantly enhances data security measures within LLM frameworks, diminishing the likelihood of unauthorized data access and exploitation. More specifically, EncChain exhibits the following key attributes:
Encrypted Knowledge Base and User Interactions: All knowledge base and interaction records are encrypted using distinct keys before leaving the secure perimeter, which ensures that information remains perpetually in ciphered form, thereby precluding access to its unencrypted counterpart, even for application architects.
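As a concrete illustration of this data flow, the following sketch encrypts each record under a per-purpose key before it leaves the client. The keystream construction, function names, and key handling here are placeholder assumptions, not EncChain's actual implementation; a real deployment would use an authenticated cipher such as AES-GCM.

```python
import hashlib
import secrets

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    # Derive n pseudo-random bytes by hashing key || nonce || counter.
    # Placeholder construction for illustration only.
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_record(key: bytes, plaintext: bytes) -> tuple:
    # A fresh nonce per record keeps keystreams from repeating.
    nonce = secrets.token_bytes(16)
    ks = keystream(key, nonce, len(plaintext))
    return nonce, bytes(a ^ b for a, b in zip(plaintext, ks))

def decrypt_record(key: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    ks = keystream(key, nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, ks))

# Distinct keys for the knowledge base and for user interactions,
# generated inside the secure perimeter and never exposed to the app.
kb_key = secrets.token_bytes(32)
chat_key = secrets.token_bytes(32)

nonce, ct = encrypt_record(kb_key, b"internal design document")
assert decrypt_record(kb_key, nonce, ct) == b"internal design document"
```

The point of the sketch is the key separation: the application only ever sees `(nonce, ct)` pairs, so even the application architect cannot recover plaintext without a key that never leaves the secure perimeter.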
Condential Data Computing Capability: EncChain provides
a suite of core functionalities, including condential knowledge
base loading, condential similarity search, condential prompt
generation, and condential large model inference. These capabili-
ties enable developers to handle and process encrypted data without
accessing plaintext, meeting the requirements for constructing busi-
ness logic while protecting data privacy and security.
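A hypothetical facade over these four primitives might look as follows. All class and method names are illustrative assumptions rather than EncChain's actual API, and in a real deployment unsealing and inference would happen only inside the trusted environment.

```python
class ConfidentialChain:
    """Illustrative facade for the four confidential primitives."""

    def __init__(self):
        self._kb = []  # plaintext exists only inside the enclave

    def load_knowledge_base(self, encrypted_docs, key):
        # (1) confidential knowledge base loading
        self._kb = [self._unseal(d, key) for d in encrypted_docs]

    def similarity_search(self, sealed_query, key, top_k=1):
        # (2) confidential similarity search (toy token-overlap score)
        q = set(self._unseal(sealed_query, key).split())
        ranked = sorted(self._kb,
                        key=lambda d: len(q & set(d.split())),
                        reverse=True)
        return ranked[:top_k]

    def generate_prompt(self, sealed_query, context, key):
        # (3) confidential prompt generation
        question = self._unseal(sealed_query, key)
        return f"Context: {'; '.join(context)}\nQuestion: {question}"

    def infer(self, prompt):
        # (4) confidential LLM inference -- stubbed with an echo
        return f"[model answer for] {prompt.splitlines()[-1]}"

    @staticmethod
    def _unseal(blob, key):
        # stand-in for in-enclave decryption
        return blob

chain = ConfidentialChain()
chain.load_knowledge_base(["EncChain encrypts user queries"], key=b"kb")
ctx = chain.similarity_search("how are queries protected", key=b"q")
answer = chain.infer(chain.generate_prompt("how are queries protected", ctx, key=b"q"))
```

The design point is that business logic composes these calls without ever touching plaintext outside the enclave boundary.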
Fine-grained Access Control: Through rigorous access control, EncChain enforces precise user permissions for knowledge bases. By defining roles like "questioner" and "knowledge base owner" and assigning access based on unique identifiers for these roles, it mitigates unauthorized data access and potential exfiltration.
Streamlined Integration and Application: As a Python package, EncChain offers straightforward integration into third-party applications, facilitating adoption by allowing developers to easily incorporate its features. This ease of use, combined with support for both encrypted and plaintext queries, significantly reduces the complexity for developers new to the system.
Execution Safety in Trusted Environments: EncChain and its associated LLMs are deployable within trusted execution environments, leveraging advanced hardware security features to safeguard virtual machine memory privacy and integrity. This setup ensures that sensitive data is shielded from both the host operating system and the virtual machine manager, enhancing operational security.
Remote Attestation for Enhanced Trust: EncChain enables the use of remote attestation technologies to confirm the security and trustworthiness of the execution environments for itself and the deployed LLM, providing users with additional confidence in the security measures of LLM applications.
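The verifier-side trust decision can be sketched as follows. This is a highly simplified stand-in: an HMAC tag plays the role of the hardware-signed quote, and the expected measurement value is invented for illustration; real TDX attestation verifies quotes signed with Intel-provisioned keys.

```python
import hashlib
import hmac

# Known-good measurement of the EncChain + LLM image (illustrative value).
EXPECTED = hashlib.sha256(b"encchain-v1 + llm-image-v1").hexdigest()

def attest(report: dict, mac_key: bytes, expected: str = EXPECTED) -> bool:
    # Accept only if the report is authentic (tag verifies) AND the
    # reported measurement matches the known-good value.
    tag = hmac.new(mac_key, report["measurement"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, report["mac"]) and report["measurement"] == expected

# The trusted environment produces a report over its measurement.
key = b"shared-verification-key"  # placeholder for a real PKI chain
measurement = EXPECTED
report = {"measurement": measurement,
          "mac": hmac.new(key, measurement.encode(), hashlib.sha256).hexdigest()}

assert attest(report, key)        # genuine environment is accepted
assert not attest({"measurement": "tampered", "mac": report["mac"]}, key)
```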
2 PRELIMINARIES
Retrieval Augmented Generation. The RAG [3] architecture represents a significant advancement in addressing the challenge of hallucination in LLMs, and has emerged as a dominant pattern in developing LLM applications, particularly for enhancing logical reasoning and data comprehension over private knowledge bases to augment question-answering (QA) capabilities. It is pivotal in scenarios like knowledge-based questioning and intelligent assistance. The RAG framework involves segmenting private knowledge into embedding vectors stored in a database. Upon receiving a question, the system converts it into a vector, retrieves the most relevant knowledge via vector similarity search, and merges this with the question to form a comprehensive prompt for the LLM.
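The RAG loop described above can be sketched in a few lines, with a toy bag-of-words "embedding" standing in for the neural encoder a real system would use:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words embedding; real systems use a neural encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Segment private knowledge and store its vectors.
knowledge = [
    "EncChain encrypts the knowledge base before upload",
    "TDX isolates virtual machines with hardware protections",
]
index = [(embed(doc), doc) for doc in knowledge]

# Vectorize the question, retrieve by similarity, and augment the prompt.
question = "how is the knowledge base encrypted"
qvec = embed(question)
best = max(index, key=lambda pair: cosine(qvec, pair[0]))[1]
prompt = f"Context: {best}\nQuestion: {question}"
```

The resulting `prompt` is what gets sent to the LLM; the sensitive part is that both `knowledge` and `question` are plaintext here, which is exactly the exposure EncChain targets.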
Trusted Execution Environment. TEEs [4, 5] provide a cornerstone technology by offering secure and isolated execution spaces within processors, enhancing the security of data and code against potential threats from compromised operating systems or hypervisors in the complex landscape of cybersecurity and data privacy. Within this spectrum, Intel's Trust Domain Extensions (TDX) [2] serve as an evolved form of TEEs, tailored to bring their benefits into the realm of virtualization. TDX introduces the concept of trusted domains, in which virtual machines operate in isolation with hardware-level protections. This innovation directly addresses the intricate challenges of maintaining data privacy and security in environments such as cloud computing and data centers.
3 EncChain SOLUTION
3.1 Threat Model
The RAG architecture in QA leads to two primary threats: unauthorized access and data exfiltration. Firstly, its reliance on plaintext storage of knowledge bases and user queries permits developers unfettered access, creating a vector for data leaks in cases of malicious intent or system compromise. Secondly, the architecture lacks rigorous access controls, enabling users to potentially retrieve sensitive information beyond their clearance through intentionally designed queries. These threats collectively jeopardize data integrity and confidentiality, necessitating an immediate implementation of enhanced security protocols to mitigate the risks of unauthorized access and ensure the privacy protection of LLM applications.
3.2 Architecture Overview
The EncChain architecture, delineated in Figure 1 for LLM appli-
cation deployment, emphasizes security and operational integrity.
[Figure omitted: it depicts a client terminal (web browser with a GPT-4-style chatbot) connecting to a 3rd-party application in a legacy VM, while the EncChain service and LLM service run inside a confidential VM; both VMs sit atop the firmware, hypervisor, host OS, and an Intel TDX CPU with other hardware.]
Figure 1: The architecture of the EncChain demonstration.
It treats the client terminal as secure, encrypting data before it exits, protecting it during transmission. Third-party applications are hosted on virtual machines (VMs), establishing a clear operational divide. EncChain and its models operate within secure virtual environments utilizing advanced VM technologies like TDX for enhanced runtime security. These environments are reinforced by hardware security extensions, safeguarding virtual memory from unauthorized access by the host OS and hypervisor. Third-party applications leverage EncChain's APIs for encrypted data interactions and secure business logic development. Remote attestation technology allows users to verify the security of the EncChain and LLM environments, adding a layer of trust. EncChain's security protocol includes data encryption at domain entry and exit, strict access control, and the synergistic use of secure VMs and remote attestation, providing a robust framework for secure LLM application deployment, addressing the critical need for data security.
3.3 Fine-grained Knowledge Control
EncChain enhances the privacy attributes of LLM applications that use RAG-based private knowledge base inference by leveraging fine-grained knowledge control. This mechanism, derived from Operon's privacy-protected data management [6], embodies the concept of the Behavior Control List (BCL). Specifically, EncChain allows "knowledge owners" to establish a binary relationship between the "questioners" and the "knowledge bases." When a questioner poses a question, triggering the LLM's inference, EncChain ensures that the search for relevant knowledge vectors occurs exclusively within an authorized subset of vector databases, generating answers based on this relationship. This solves an issue traditionally addressed either by employing multiple distinct LLM instances to segregate knowledge for privacy protection (sacrificing efficiency and increasing costs) or by utilizing a single system while facing privacy risks. EncChain's innovation lies in its ability to protect privacy while optimizing the retrieval and integration of knowledge, thereby finding an effective equilibrium between privacy security and knowledge utilization.
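The BCL reduces to a binary authorization relation consulted before retrieval. A minimal sketch of that idea follows; the class and function names are illustrative assumptions, not Operon's or EncChain's actual API:

```python
class BehaviorControlList:
    """Owner-maintained relation: (questioner, knowledge base) grants."""

    def __init__(self):
        self._grants = set()

    def grant(self, questioner: str, kb_id: str) -> None:
        self._grants.add((questioner, kb_id))

    def authorized(self, questioner: str, kb_id: str) -> bool:
        return (questioner, kb_id) in self._grants

def search(bcl, questioner, kb_store, match):
    # Retrieval only ever touches knowledge bases the questioner
    # is granted; unauthorized bases are excluded before searching.
    allowed = {k: docs for k, docs in kb_store.items()
               if bcl.authorized(questioner, k)}
    return [doc for docs in allowed.values() for doc in docs if match(doc)]

bcl = BehaviorControlList()
bcl.grant("alice", "hr_kb")
store = {"hr_kb": ["salary policy"], "legal_kb": ["contract terms"]}

hits = search(bcl, "alice", store, lambda d: True)
# "legal_kb" content is never searched for alice, so a single system
# can serve many tenants without cross-tenant leakage.
```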
3.4 System Workow
We present the procedural workow of EncChain through a spe-
cic example, as illustrated in Figure 2. In this scenario, we assume
four distinct roles:
A
knowledge base data owners;
B
question-
ers;
C
third-party software developers providing QA applications;
and
D
TEEs (e.g., cloud infrastructure) for deploying LLMs with
EncChain. We note that, in practical scenarios,
A
and
B
might
represent the same entity, or
B
could be a controlled party of
A
(for