SIGMOD 2023 - FEAST, A Communication-efficient Federated Feature Selection Framework for Relational Data.pdf
FEAST: A Communication-eicient Federated Feature
Selection Framework for Relational Data
RUI FU, Beijing Institute of Technology, China
YUNCHENG WU, National University of Singapore, Singapore
QUANQING XU, OceanBase, Ant Group, China
MEIHUI ZHANG, Beijing Institute of Technology, China
Vertical federated learning (VFL) is an emerging paradigm for cross-silo organizations to build more accurate
machine learning (ML) models. In this setting, multiple organizations (i.e., parties) hold the same set of samples
with dierent features. However, dierent parties may have redundant or highly correlated features, leading
to inecient and ineective VFL model training. Eective feature selection in VFL is therefore essential to
mitigate such a problem and improve model eectiveness, as well as computation and communication eciency.
To this end, in this paper, we propose a federated feature selection framework, called FEAST, which leverages
conditional mutual information (CMI) to select more informative features while having low redundancy.
Furthermore, we design a communication-efficient method to reduce the information exchanged among the
parties while protecting the parties’ raw data. Extensive experiments on four real-world datasets demonstrate
that the proposed framework achieves state-of-the-art performance in terms of accuracy, communication and
computation costs.
CCS Concepts: • Computing methodologies → Feature selection; Cooperation and coordination; Supervised
learning by classification; • Mathematics of computing → Information theory.
Additional Key Words and Phrases: feature selection, vertical federated learning, communication-efficient,
conditional mutual information
ACM Reference Format:
Rui Fu, Yuncheng Wu, Quanqing Xu, and Meihui Zhang. 2023. FEAST: A Communication-efficient Federated
Feature Selection Framework for Relational Data. Proc. ACM Manag. Data 1, 1, Article 107 (May 2023), 28 pages.
https://doi.org/10.1145/3588961
1 INTRODUCTION
Recent years have witnessed a growing interest in exploiting data from cross-silo organizations
to design more accurate machine learning (ML) [46, 59] models and provide better customer
services [16, 39, 66]. However, the raw data held by the distributed organizations cannot be shared
with each other due to privacy concerns. To this end, the federated learning (FL) [6, 41, 62] paradigm
is proposed, which enables cross-silo organizations to collaboratively build ML models without
disclosing their raw data. FL can be categorized into different settings based on the data partitioning.
In this paper, we consider the vertically-partitioned setting (aka. VFL), where the organizations
Meihui Zhang is the corresponding author.
Authors’ addresses: Rui Fu, Beijing Institute of Technology, Beijing, China, 3120201016@bit.edu.cn; Yuncheng Wu, National
University of Singapore, Singapore, Singapore, wuyc@comp.nus.edu.sg; Quanqing Xu, OceanBase, Ant Group, Hangzhou,
China, xuquanqing.xqq@antgroup.com; Meihui Zhang, Beijing Institute of Technology, Beijing, China, meihui_zhang@bit.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2836-6573/2023/5-ART107 $15.00
https://doi.org/10.1145/3588961
Proc. ACM Manag. Data, Vol. 1, No. 1, Article 107. Publication date: May 2023.
Fig. 1. An illustration of VFL and overlapping features.
(aka. parties) hold the same set of samples but with different features, and only one party owns
the labels. We call the party that holds the labels the active party and the other parties passive
parties. VFL targets feature-level collaborative learning among parties; therefore, it is especially
useful for structured data analytics [10, 23, 44, 63], which has gained growing interest in the
database community, and can be adopted in a wide spectrum of applications, such as healthcare
and economics analytics.
Figure 1 illustrates a VFL example, where a bank (i.e., the active party) aims to build a model for
predicting whether it should approve a customer’s loan application by consolidating more features
from an insurance company (i.e., the passive party). Feature selection is particularly important
in VFL because the participating organizations may collect similar or highly correlated customer
information and process the information differently. For example, we can observe three types of
overlapping features in Figure 1. First, the ‘gender’ features of the two parties are duplicated. Second,
the ‘age’ feature at the bank and the ‘birth’ feature at the insurance company carry the same
information. Third, the ‘income’ feature reflects a customer’s monthly salary, and the ‘package’
feature reflects the annual earnings. Although they are not the same, they are highly correlated.
These overlapping features are likely to contribute little additional information in total and may degrade
model effectiveness as well as computation and communication efficiency [9, 28, 58].
Feature selection in VFL has special requirements in two aspects. The first is privacy.
For organizations such as the bank and insurance company in the above example, their data are
related to user privacy (e.g., income), and companies cannot undertake the risk of information
leakage. Therefore, the parties are typically unwilling to share their raw data, which makes
centralized feature selection methods hardly applicable in VFL. The second is communication and
computation efficiency. VFL naturally expands the feature dimensionality in the analytical tasks as
each organization nowadays may have hundreds of features. This poses a challenge to efficiency
since both the communication cost and computation time will be significant with more parties and
features.
While most recent works in VFL [12, 14, 18, 19, 36, 38] focus on model training and model
prediction, several open-source FL systems (e.g., FATE [35]) support several feature selection
algorithms for VFL, such as information value [26] and Pearson correlation coefficient [53]. However,
these algorithms only consider the relationship between each feature and the label, without
examining the correlation among the features. As a result, they may select overlapping features
from different parties. One way to remove overlapping features in the VFL setting is to apply secure
schema matching techniques [15, 52]. However, they are still inadequate to identify correlated
features (e.g., income vs. package) and select informative features. Although it is possible to utilize