SIGMOD 2023 - FEAST, A Communication-efficient Federated Feature Selection Framework for Relational Data.pdf


FEAST: A Communication-eicient Federated Feature

Selection Framework for Relational Data

RUI FU, Beijing Institute of Technology, China

YUNCHENG WU, National University of Singapore, Singapore

QUANQING XU, OceanBase, Ant Group, China

MEIHUI ZHANG∗, Beijing Institute of Technology, China

Vertical federated learning (VFL) is an emerging paradigm for cross-silo organizations to build more accurate machine learning (ML) models. In this setting, multiple organizations (i.e., parties) hold the same set of samples with different features. However, different parties may have redundant or highly correlated features, leading to inefficient and ineffective VFL model training. Effective feature selection in VFL is therefore essential to mitigate such a problem and improve model effectiveness, as well as computation and communication efficiency. To this end, in this paper, we propose a federated feature selection framework, called FEAST, which leverages conditional mutual information (CMI) to select more informative features while having low redundancy. Furthermore, we design a communication-efficient method to reduce the information exchanged among the parties while protecting the parties’ raw data. Extensive experiments on four real-world datasets demonstrate that the proposed framework achieves state-of-the-art performance in terms of accuracy, communication and computation costs.

CCS Concepts: • Computing methodologies → Feature selection; Cooperation and coordination; Supervised learning by classification; • Mathematics of computing → Information theory.

Additional Key Words and Phrases: feature selection, vertical federated learning, communication-efficient, conditional mutual information

ACM Reference Format:

Rui Fu, Yuncheng Wu, Quanqing Xu, and Meihui Zhang. 2023. FEAST: A Communication-efficient Federated Feature Selection Framework for Relational Data. Proc. ACM Manag. Data 1, 1, Article 107 (May 2023), 28 pages.

https://doi.org/10.1145/3588961

1 INTRODUCTION

Recent years have witnessed a growing interest in exploiting data from cross-silo organizations to design more accurate machine learning (ML) [ ] models and provide better customer services [ ]. However, the raw data held by the distributed organizations cannot be shared with each other due to privacy concerns. To this end, the federated learning (FL) [ ] paradigm was proposed, which enables cross-silo organizations to collaboratively build ML models without disclosing their raw data. FL can be categorized into different settings based on the data partitioning. In this paper, we consider the vertically partitioned setting (a.k.a. VFL), where the organizations (a.k.a. parties) hold the same set of samples but with different features, and only one party owns the labels. We call the party that holds the labels the active party and the other parties passive parties. VFL targets feature-level collaborative learning among parties; therefore, it is especially useful for structured data analytics [ ], which has gained growing interest in the database community, and can be adopted in a wide spectrum of applications, such as healthcare and economics analytics.

∗Meihui Zhang is the corresponding author.

Authors’ addresses: Rui Fu, Beijing Institute of Technology, Beijing, China, 3120201016@bit.edu.cn; Yuncheng Wu, National University of Singapore, Singapore, Singapore, wuyc@comp.nus.edu.sg; Quanqing Xu, OceanBase, Ant Group, Hangzhou, China, xuquanqing.xqq@antgroup.com; Meihui Zhang, Beijing Institute of Technology, Beijing, China, meihui_zhang@bit.edu.cn.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

2836-6573/2023/5-ART107 $15.00

https://doi.org/10.1145/3588961

Proc. ACM Manag. Data, Vol. 1, No. 1, Article 107. Publication date: May 2023.

Fig. 1. An illustration of VFL and overlapping features.

Figure 1 illustrates a VFL example, where a bank (i.e., the active party) aims to build a model for predicting whether it should approve a customer’s loan application by consolidating more features from an insurance company (i.e., the passive party). Feature selection is particularly important in VFL because the participating organizations may collect similar or highly correlated customer information and process the information differently. For example, we can observe three types of overlapping features in Figure 1. First, the ‘gender’ features of the two parties are duplicated. Second, the ‘age’ feature at the bank and the ‘birth’ feature at the insurance company carry the same information. Third, the ‘income’ feature reflects a customer’s monthly salary, and the ‘package’ feature reflects the annual earnings. Although they are not the same, they are highly correlated. Taken together, these overlapping features are likely to contribute little additional information and may degrade model effectiveness as well as computation and communication efficiency [9, 28, 58].

Feature selection in VFL has special requirements in two aspects. The first is privacy. For organizations such as the bank and insurance company in the above example, the data are related to user privacy (e.g., income), and companies cannot undertake the risk of information leakage. Therefore, the parties are typically unwilling to share their raw data, which makes centralized feature selection methods hardly applicable in VFL. The second is communication and computation efficiency. VFL naturally expands the feature dimensionality of the analytical tasks, as each organization nowadays may have hundreds of features. This poses a challenge to efficiency, since both the communication cost and the computation time grow significantly with more parties and features.

While most recent works in VFL [ ] focus on model training and model prediction, several open-source FL systems (e.g., FATE [ ]) support a number of feature selection algorithms for VFL, such as information value [ ] and Pearson correlation coefficient [ ]. These algorithms, however, consider only the relationship between each feature and the label, without examining the correlation among the features. As a result, they may select overlapping features from different parties. One way to remove overlapping features in the VFL setting is to apply secure schema matching techniques [ ]. However, they are still inadequate to identify correlated features (e.g., income vs. package) and select informative features. Although it is possible to utilize
