
Figure 1: Overview of the Taste framework. (P1: metadata-based inference using metadata such as table/column names, comments, and data types retrieved from user data sources, e.g., OLTP systems; P2: on-demand, full data-based inference over complete/sampled column content for columns P1 leaves uncertain; final output: resolved columns with admitted semantic types.)
They have shown great potential and established new state-of-the-art performance (in terms of F1 score). For example, Sherlock [19] extracts 1,588 features from column content as the input for a deep neural network. It achieves an F1 score of 0.89 and outperforms a wide range of traditional methods, including dictionary lookup, regular expression matching, and decision trees. Subsequent work, such as TURL [17] and Doduo [30], has raised the F1 score to another level by further incorporating pre-trained language models.
Meanwhile, as the cloud computing paradigm has matured, traditional on-premise data management and preparation systems have been migrating to the cloud. Major cloud service providers also offer similar services, such as Microsoft Purview [5], AWS Glue Data Catalog [2], Google Looker Studio [3], and Alibaba Cloud Data Security Center [1]. Unfortunately, existing DL-based approaches like [14, 17, 19, 30], when applied in the cloud context, can suffer from two major drawbacks. First, they are intrusive to user data sources and thus may not be acceptable to users. These DL models rely on features extracted from column content, and thus they must scan user data sources to retrieve data. For example, the model in [30] needs to fetch all column content and concatenate it as model input. On one hand, scanning column content increases I/O and connections on user data sources (e.g., OLTP systems), which can potentially disrupt their business. On the other hand, users with strong concerns about data leakage, auditing, and compliance tend to prohibit cloud services from accessing and examining their column content [9]. Second, the existing approaches are not execution-efficient. Column scanning operations are costly, and the abundance of column features increases the complexity of the DL model. Moreover, these approaches generally run in a sequential execution mode, which processes batches of tables one by one without exploiting cross-batch concurrency. Considering the large number (as high as billions) of tables and columns from diverse tenants and the high detection frequency in the cloud, efficiency is a primary concern for cloud service providers. Even small savings in time for one column can collectively bring significant benefits. In this study, we explore an efficient DL-based semantic type detection solution that can be practically employed in cloud services, especially for relational data (tables), which is a major form of enterprise data in the cloud.
To circumvent the above drawbacks, we develop a two-phase semantic type detection framework, called Taste. The execution flow of the Taste framework is depicted in Fig. 1. We divide the
traditional one-shot prediction into two phases, each of which runs a separate DL-based detection task. The first phase (i.e., P1) purely leverages native metadata from the data sources to predict semantic types, whereas the second phase (i.e., P2) utilizes full table information, including both metadata and column content. This design is motivated by the observation that metadata commonly contains rich information at the technical level (e.g., data types), the business level (e.g., comments and names), and the content level (e.g., histograms), which can be exploited to infer correct semantic types. For example, cloud tenants tend to use meaningful table/column names and comments, and it is common for enterprises to enforce various standards for defining table schemas. Meanwhile, retrieving metadata is a much less costly and intrusive operation compared to a data scan. Therefore, we can quickly obtain a preliminary understanding of the columns through a metadata-based model in P1. Based on the output of P1, we are able to assess the relevance between each semantic type and each column. P2 is on-demand and is activated only if more data is needed to verify uncertain semantic types. For example, for a column named "num" without additional comments, P1 is uncertain whether it belongs to the type "phone number" or "credit card number"; P2 must then be launched to examine the column content for verification. Through the combination of P1 and P2, Taste not only runs efficiently and mitigates the negative impact on user data sources, but also achieves robust semantic type detection when metadata quality is not satisfactory. Furthermore, the Taste framework is flexible in the sense that cloud tenants concerned about data exposure to cloud service providers can choose to disable P2 completely.
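As a rough illustration of this control flow rather than the authors' implementation, the sketch below shows how P2 could be gated on P1's confidence. The predictor functions, the confidence threshold, and the per-column dictionary representation are all hypothetical assumptions, not the paper's API.

```python
# Illustrative sketch of Taste's two-phase control flow (assumed interfaces, not the paper's code).
from typing import Callable, Dict, List, Tuple

def detect_semantic_types(
    columns: List[dict],                                          # each dict holds metadata and a content handle
    predict_with_metadata: Callable[[dict], Tuple[str, float]],   # P1: (semantic type, confidence)
    predict_with_content: Callable[[dict], Tuple[str, float]],    # P2: (semantic type, confidence)
    confidence_threshold: float = 0.9,                            # assumed cut-off for "uncertain" columns
    p2_enabled: bool = True,                                      # tenants may disable P2 entirely
) -> Dict[str, str]:
    """Return an admitted semantic type per column, invoking P2 only on demand."""
    admitted: Dict[str, str] = {}
    uncertain: List[dict] = []

    # Phase 1: metadata-only inference (cheap, non-intrusive).
    for col in columns:
        sem_type, conf = predict_with_metadata(col)
        if conf >= confidence_threshold:
            admitted[col["name"]] = sem_type
        else:
            uncertain.append(col)

    # Phase 2: full-data inference, only for columns that P1 could not resolve.
    for col in uncertain:
        if p2_enabled:
            sem_type, _ = predict_with_content(col)    # scans (sampled) column content
        else:
            sem_type, _ = predict_with_metadata(col)   # P2 disabled: keep P1's best guess
        admitted[col["name"]] = sem_type

    return admitted
```

In this sketch, the expensive content-based predictor is called only for columns whose metadata-based prediction falls below the (assumed) confidence threshold, mirroring the on-demand nature of P2 described above.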
To fit Taste's two-phase execution framework, we further design a novel Asymmetric Double-Tower Detection (ADTD) model by incorporating multi-task learning for the two phases. This asymmetric Transformer-based architecture has been widely adopted in modern large language models (LLMs), including GPT-3 and its variants and competitors [11, 15, 28], and we adapt it to facilitate
the execution of the two-phase framework. ADTD consists of two towers of multiple Transformer-block layers, named the metadata tower and the content tower, respectively. These two towers can be considered as two expert submodels that are responsible for encoding latent representations of metadata and column content, respectively. The final prediction of ADTD is based on the combination of the two towers, using the automatic weighted loss for multi-task learning as the objective. During inference, the metadata tower is extracted from the ADTD model and serves as the model for P1, while the whole ADTD model with both towers serves P2. In this way, the ADTD model is trained once but applied to different tasks in P1 and P2. The asymmetry of ADTD's structure lies in that the content tower relies on the metadata tower, but not vice versa: the input of every layer of the content tower requires not only content latent representations, but also metadata latent representations from the metadata tower. Exploiting this asymmetric dependency, we build latent caches in the metadata tower to store metadata latent representations, which can be reused by P2. This avoids duplicate computation of these values and improves inference efficiency. To capture textual and tabular context, the Transformer blocks of the two towers share the same parameters and are pre-trained on a table corpus before being further trained for the semantic type detection task.
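To make the asymmetric dependency and latent caching concrete, the following is a minimal PyTorch-style sketch based on our reading of the description above; it is not the paper's implementation. The dimensions, classification heads, and the concatenation-based fusion of cached metadata latents into each content-tower layer are assumptions, and parameter sharing between the towers as well as table-corpus pre-training are omitted.

```python
# Minimal sketch of an asymmetric double-tower layout (assumed design choices, not the released ADTD code).
import torch
import torch.nn as nn

class ADTDSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4, n_types=78):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.meta_layers = nn.ModuleList([make_layer() for _ in range(n_layers)])
        self.content_layers = nn.ModuleList([make_layer() for _ in range(n_layers)])
        self.p1_head = nn.Linear(d_model, n_types)   # metadata-only prediction (P1)
        self.p2_head = nn.Linear(d_model, n_types)   # combined prediction (P2)

    def encode_metadata(self, meta_emb):
        """Run the metadata tower and cache per-layer latents for later reuse by P2."""
        cache, h = [], meta_emb
        for layer in self.meta_layers:
            h = layer(h)
            cache.append(h)                          # latent cache, one entry per layer
        return h, cache

    def forward_p1(self, meta_emb):
        h, cache = self.encode_metadata(meta_emb)
        return self.p1_head(h.mean(dim=1)), cache    # P1 logits plus reusable metadata latents

    def forward_p2(self, content_emb, cache):
        """Content tower: every layer also consumes the cached metadata latents."""
        h = content_emb
        for layer, meta_h in zip(self.content_layers, cache):
            h = layer(torch.cat([meta_h, h], dim=1))  # asymmetric dependency on the metadata tower
            h = h[:, meta_h.size(1):, :]              # keep only the content positions
        return self.p2_head(h.mean(dim=1))
```

In this sketch, `forward_p1` alone plays the role of the P1 model, and its cached per-layer metadata latents are handed to `forward_p2` so that P2 does not recompute them, which is the efficiency benefit the asymmetric structure is meant to enable.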