暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
EDBT2024_Taste:Towards Practical Deep Learning-based Approaches for Semantic Type Detection in the Cloud_天翼云.pdf
142
13页
1次
2024-10-10
免费下载
Taste: Towards Practical Deep Learning-based Approaches for
Semantic Type Detection in the Cloud
Tao Li
Database Group
State Cloud, China Telecom
lit51@chinatelecom.cn
Feng Liang
AI Research Institute
Shenzhen MSU-BIT University
iang@smbu.edu.cn
Jinqi Quan
Database Group
State Cloud, China Telecom
quanjq1@chinatelecom.cn
Chuang Huang
Database Group
State Cloud, China Telecom
huangc41@chinatelecom.cn
Teng Wang
Database Group
State Cloud, China Telecom
wangt_5@chinatelecom.cn
Runhuai Huang
Intelligent Edge Department
State Cloud, China Telecom
huangrh@chinatelecom.cn
Jie Wu
Cloud Computing Research Institute
China Telecom
wujie@chinatelecom.cn
Xiping Hu
AI Research Institute
Shenzhen MSU-BIT University
huxp@smbu.edu.cn
ABSTRACT
In recent years, we have witnessed more and more data manage-
ment, preparation, and wrangling services appearing in the cloud.
Semantic type detection is important for these services that rely
on semantic types to interpret data and provide useful functions
accordingly. Meanwhile, deep learning (DL) has been introduced
for semantic type detection and transforming the eld. However,
existing DL-based approaches, albeit successfully achieving high
F1 scores, are not practical in the real cloud environment because
they suer from issues like low eciency and high intrusiveness
to user data sources.
To address these issues, we present Taste, a novel semantic
type detection framework with two phases, each associated with
a DL-driven detection task. The intuition behind this framework
is that metadata (e.g., column name, statistics) contain rich tech-
nical and business information, which can be leveraged to detect
semantic types eectively while only incurring lightweight impact
on user data sources. Thus, we design the detection task in the
rst phase purely using native metadata from user data sources
as input. In contrast, the second phase is optional and only acti-
vated when there is a high uncertainty with the rst task’s result. It
then needs to retrieve both metadata and column content to derive
semantic types more reliably. Furthermore, we adopt multi-task
These authors contributed equally to this work, ordered alphabetically by surname.
Corresponding authors.
Also aliated with Guangdong-Hong Kong-Macao Joint Laboratory for Emotional
Intelligence and Pervasive Computing, Shenzhen MSU-BIT University, China
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for prot or commercial advantage and that copies bear this notice and the full citation
on the rst page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specic permission
and/or a fee. Request permissions from permissions@acm.org.
EDBT, March 25–28, 2025, Barcelona, Spain
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/18/06
https://doi.org/10.1145/nnnnnnn.nnnnnnn
learning and develop a novel DL model, called Asymmetric Double-
Tower Detection (ADTD), to support the two tasks simultaneously.
This design enables caching and reuse of the latent representations
from the rst task to reduce inference time. In the implementation,
we further introduce a pipelined execution mechanism to boost
performance for massive user table processing. Evaluation results
show that Taste achieves state-of-the-art performance in execution
time and F1 score, and is more robust under dierent data privacy
settings, demonstrating its potential for real application in cloud
environment.
1
ACM Reference Format:
Tao Li, Feng Liang, Jinqi Quan, Chuang Huang, Teng Wang, Runhuai Huang,
Jie Wu, and Xiping Hu. 2024. Taste: Towards Practical Deep Learning-based
Approaches for Semantic Type Detection in the Cloud. In Proceedings of 28th
International Conference on Extending Database Technology (EDBT). ACM,
New York, NY, USA, 13 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
The importance of semantic type detection has been well recognized
in the domain of data management [
5
,
18
], data preparation [
3
,
19
],
data wrangling [
23
] and web table understanding [
12
]. Semantic
types uncover the semantic meaning of data and usually map to real-
world concepts or entities. Thus, they can help human understand
data, and more importantly, can be utilized by data management
and preparation systems to provide proper functions (like search-
ing, transformation, and cleaning) or automate certain tasks. For
example, data with semantic type “credit card number” can be auto-
matically identied as sensitive by data protection systems, which
further can provide data masking functions.
Relying on human eorts to detect semantic types is obviously
impractical in the big data era. Instead, we resort to various al-
gorithms, from simple regular expressions to complex machine
learning models, to achieve automatic semantic type detection. In re-
cent years, many deep learning-based (DL-based) approaches have
been proposed to detect semantic types, such as [
14
,
17
,
19
,
30
].
1
The code is available at https://github.com/ncols-bytes/taste.git
EDBT, March 25–28, 2025, Barcelona, Spain Tao Li et al.
Table Data
Inference
(Metadata-based)
Complete/Sampled
content
User data sources (e.g., OLTP systems)
Metadata
Metadata
Uncertain
columns
Phase 1 (P1) Phase 2 (P2)
Resolved columns (with admitted types)
P1's Output P2's Output
Final output of
admitted types
Table/Column names,
comments, data types,
...
Metadata
Data preparation Data preparation
Inference
(Full data-based)
Figure 1: Overview of the Taste framework.
They have shown great potential and established new state-of-the-
art performance (in terms of F1 score). For example, Sherlock [
19
]
extracts 1
,
588 features from column content as the input for a deep
neural network. It achieves an F1 score of 0
.
89 and outperforms a
wide range of traditional methods, including dictionary lookup, reg-
ular expression matching and decision tree. Subsequent work, such
as TURL [
17
] and Doduo [
30
], have raised an F1 score to another
level by further incorporating pre-trained language models.
Meanwhile, as the cloud computing paradigm has become ma-
ture, traditional on-premise data management and preparation
systems have been migrating to the cloud. Major cloud service
providers are also providing similar services, such as Microsoft
Purview [
5
], AWS Glue Data Catalog [
2
], Google Looker Studio [
3
],
and Alibaba Cloud Data Security Center [
1
]. Unfortunately, exist-
ing DL-based approaches like [
14
,
17
,
19
,
30
], when applied in the
cloud context, can suer from two major drawbacks. First, they are
intrusive to user data sources and thus may not be acceptable to
users. These DL models rely on features extracted from column
content, and thus they must scan user data sources to retrieve data.
For example, the model in [
30
] needs to get all column content and
concatenate them as model input. On one hand, scanning column
content causes increased I/O and connections on user data sources
(e.g., OLTP systems), which can potentially disrupt their business.
On the other hand, users with strong concerns about data leak-
age, auditing, and compliance tend to disallow cloud services to
access and examine their column content [
9
]. Second, the existing
approaches are not execution ecient. Column scanning opera-
tions are costly and abundant column features can increase the
complexity of the DL model. Moreover, these approaches generally
run in sequential execution mode, which processes batches of tables
one by one without utilizing cross-batch concurrency. Considering
the large number (as high as billions) of tables and columns from
diverse tenants to deal with and the high detection frequency in the
cloud, eciency is a primary concern to cloud service providers.
Even small savings in time for one column can collectively bring
signicant benets. In this study, we try to explore an ecient
DL-based semantic type detection solution that can be practically
employed in cloud services, especially for relational data (tables)
which are major enterprise data forms in the cloud.
To circumvent the above drawbacks, we develop a
t
wo-ph
a
se
s
emantic
t
ype d
e
tection framework, called Taste. The execution
ow of the Taste framework is depicted in Fig. 1. We divide the
traditional one-shot prediction into two phases, each of which
runs a separate DL-based detection task. The rst phase (i.e., P1),
purely leverages native metadata from the data sources to predict
semantic types, whereas the second phase (i.e., P2) utilizes full table
information, including both metadata and column content. This
design is motivated by the observation that metadata commonly
contains rich information in the technical level (e.g., data types),
business level (e.g., comments and names), and content level (e.g.,
histogram), which can be exploited to infer correct semantic types.
For example, cloud tenants tend to use meaningful table/column
names and comments, and it is common for enterprises to enforce
various standards for dening table schema. Meanwhile, retrieving
metadata is a much less costly and intrusive operation compared
to data scan. Therefore, we can quickly get a preliminary under-
standing about the columns through a metadata-based model in
P1. Based on the output of P1, we are able to assess the relevance
between each semantic type and each column. P2 is on-demand,
and only activated if we need to involve more data to verify un-
certain semantic types. For example, for a column with the name
“num” without additional comments, P1 is uncertain that it belongs
to type “phone number” or “credit card number”. It is necessary
to launch P2 to check the column content to verify. Through the
combination of P1 and P2, Taste can not only run eciently and
mitigate the negative impact on user data sources, but also achieve
robust semantic type detection when metadata quality is not satis-
factory. Furthermore, the Taste framework is exible in the sense
that cloud tenants concerned about data exposure to cloud service
providers can choose to disable P2 completely.
To t Taste’s two-phase execution framework, we further de-
sign a novel Asymmetric Double-Tower Detection (ADTD) model by
incorporating multi-task learning for the two phases. This asym-
metric Transformer-based architecture has been widely adopted
in modern large language models (LLMs) including GPT-3 and its
variants and competitors [
11
,
15
,
28
], and we adapt it to facilitate
the execution of the two-phase framework. ADTD consists of two
towers of multiple layers of Transformer blocks, named metadata
tower and content tower, respectively. These two towers can be con-
sidered as two expert submodels that are responsible for encoding
latent representations of metadata and column content, respec-
tively. The nal prediction of ADTD is based on the combination
of the two towers, using the automatic weighted loss for multi-task
learning as the objective. During inference, the metadata tower is
extracted from the ADTD model and serves as the model for P1,
while the whole ADTD model with both towers is used to serve
P2. In this way, the ADTD model is trained once but applied for
dierent tasks in P1 and P2. The asymmetry of ADTD’s structure
lies in that the content tower relies on the metadata tower, but not
vice versa. The input of every layer of content tower requires not
only content latent representations, but also metadata latent rep-
resentations from the metadata tower. Exploiting this asymmetric
dependency, we build latent caches in the metadata tower to store
metadata latent representations, which can be reused by P2. This
thus avoids duplicate computation of these values and improves
inference eciency. For capturing textual and tabular context in
the Transformer blocks of the two towers, they share the same
parameters and are pre-trained on a table corpus before they are
later trained for the semantic type detection task.
of 13
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜