
Figure 1: Overview of the Taste framework. (P1: metadata-based inference using metadata such as table/column names, comments, and data types retrieved from user data sources, e.g., OLTP systems; P2: on-demand, full data-based inference over complete/sampled column content for columns P1 leaves uncertain; final output: resolved columns with admitted semantic types.)
They have shown great potential and established new state-of-the-art performance (in terms of F1 score). For example, Sherlock [19] extracts 1,588 features from column content as the input for a deep neural network. It achieves an F1 score of 0.89 and outperforms a wide range of traditional methods, including dictionary lookup, regular expression matching, and decision trees. Subsequent work, such as TURL [17] and Doduo [30], has raised the F1 score to another level by further incorporating pre-trained language models.
Meanwhile, as the cloud computing paradigm has matured, traditional on-premise data management and preparation systems have been migrating to the cloud. Major cloud service providers also offer similar services, such as Microsoft Purview [5], AWS Glue Data Catalog [2], Google Looker Studio [3], and Alibaba Cloud Data Security Center [1]. Unfortunately, existing DL-based approaches like [14, 17, 19, 30], when applied in the cloud context, can suffer from two major drawbacks. First, they are intrusive to user data sources and thus may not be acceptable to users. These DL models rely on features extracted from column content, and thus they must scan user data sources to retrieve data. For example, the model in [30] needs to fetch all column content and concatenate it as model input. On one hand, scanning column content increases I/O and connections on user data sources (e.g., OLTP systems), which can potentially disrupt their business. On the other hand, users with strong concerns about data leakage, auditing, and compliance tend to prohibit cloud services from accessing and examining their column content [9]. Second, the existing approaches are not execution-efficient. Column scanning operations are costly, and the abundance of column features increases the complexity of the DL model. Moreover, these approaches generally run in a sequential execution mode, which processes batches of tables one by one without exploiting cross-batch concurrency. Considering the large number (as high as billions) of tables and columns from diverse tenants and the high detection frequency in the cloud, efficiency is a primary concern for cloud service providers. Even small savings in time for one column can collectively bring significant benefits. In this study, we explore an efficient DL-based semantic type detection solution that can be practically employed in cloud services, especially for relational data (tables), which is a major form of enterprise data in the cloud.
To circumvent the above drawbacks, we develop a two-phase semantic type detection framework, called Taste. The execution flow of the Taste framework is depicted in Fig. 1. We divide the
traditional one-shot prediction into two phases, each of which runs a separate DL-based detection task. The first phase (i.e., P1) purely leverages native metadata from the data sources to predict semantic types, whereas the second phase (i.e., P2) utilizes full table information, including both metadata and column content. This design is motivated by the observation that metadata commonly contains rich information at the technical level (e.g., data types), the business level (e.g., comments and names), and the content level (e.g., histograms), which can be exploited to infer correct semantic types. For example, cloud tenants tend to use meaningful table/column names and comments, and it is common for enterprises to enforce various standards for defining table schemas. Meanwhile, retrieving metadata is a much less costly and intrusive operation compared to a data scan. Therefore, we can quickly obtain a preliminary understanding of the columns through a metadata-based model in P1. Based on the output of P1, we are able to assess the relevance between each semantic type and each column. P2 is on-demand and is activated only if more data is needed to verify uncertain semantic types. For example, for a column named "num" without additional comments, P1 is uncertain whether it belongs to the type "phone number" or "credit card number"; P2 must then be launched to examine the column content for verification. Through the combination of P1 and P2, Taste not only runs efficiently and mitigates the negative impact on user data sources, but also achieves robust semantic type detection when metadata quality is not satisfactory. Furthermore, the Taste framework is flexible in the sense that cloud tenants concerned about data exposure to cloud service providers can choose to disable P2 completely.
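As a rough illustration of this control flow rather than the authors' implementation, the sketch below shows how P2 could be gated on P1's confidence. The predictor functions, the confidence threshold, and the per-column dictionary representation are all hypothetical assumptions, not the paper's API.

```python
# Illustrative sketch of Taste's two-phase control flow (assumed interfaces, not the paper's code).
from typing import Callable, Dict, List, Tuple

def detect_semantic_types(
    columns: List[dict],                                          # each dict holds metadata and a content handle
    predict_with_metadata: Callable[[dict], Tuple[str, float]],   # P1: (semantic type, confidence)
    predict_with_content: Callable[[dict], Tuple[str, float]],    # P2: (semantic type, confidence)
    confidence_threshold: float = 0.9,                            # assumed cut-off for "uncertain" columns
    p2_enabled: bool = True,                                      # tenants may disable P2 entirely
) -> Dict[str, str]:
    """Return an admitted semantic type per column, invoking P2 only on demand."""
    admitted: Dict[str, str] = {}
    uncertain: List[dict] = []

    # Phase 1: metadata-only inference (cheap, non-intrusive).
    for col in columns:
        sem_type, conf = predict_with_metadata(col)
        if conf >= confidence_threshold:
            admitted[col["name"]] = sem_type
        else:
            uncertain.append(col)

    # Phase 2: full-data inference, only for columns that P1 could not resolve.
    for col in uncertain:
        if p2_enabled:
            sem_type, _ = predict_with_content(col)    # scans (sampled) column content
        else:
            sem_type, _ = predict_with_metadata(col)   # P2 disabled: keep P1's best guess
        admitted[col["name"]] = sem_type

    return admitted
```

In this sketch, the expensive content-based predictor is called only for columns whose metadata-based prediction falls below the (assumed) confidence threshold, mirroring the on-demand nature of P2 described above.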
To fit Taste's two-phase execution framework, we further design a novel Asymmetric Double-Tower Detection (ADTD) model by incorporating multi-task learning for the two phases. This asymmetric Transformer-based architecture has been widely adopted in modern large language models (LLMs), including GPT-3 and its variants and competitors [11, 15, 28], and we adapt it to facilitate
the execution of the two-phase framework. ADTD consists of two towers of multiple Transformer-block layers, named the metadata tower and the content tower, respectively. These two towers can be considered as two expert submodels that are responsible for encoding latent representations of metadata and column content, respectively. The final prediction of ADTD is based on the combination of the two towers, using the automatic weighted loss for multi-task learning as the objective. During inference, the metadata tower is extracted from the ADTD model and serves as the model for P1, while the whole ADTD model with both towers serves P2. In this way, the ADTD model is trained once but applied to different tasks in P1 and P2. The asymmetry of ADTD's structure lies in that the content tower relies on the metadata tower, but not vice versa: the input of every layer of the content tower requires not only content latent representations, but also metadata latent representations from the metadata tower. Exploiting this asymmetric dependency, we build latent caches in the metadata tower to store metadata latent representations, which can be reused by P2. This avoids duplicate computation of these values and improves inference efficiency. To capture textual and tabular context, the Transformer blocks of the two towers share the same parameters and are pre-trained on a table corpus before being further trained for the semantic type detection task.
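To make the asymmetric dependency and latent caching concrete, the following is a minimal PyTorch-style sketch based on our reading of the description above; it is not the paper's implementation. The dimensions, classification heads, and the concatenation-based fusion of cached metadata latents into each content-tower layer are assumptions, and parameter sharing between the towers as well as table-corpus pre-training are omitted.

```python
# Minimal sketch of an asymmetric double-tower layout (assumed design choices, not the released ADTD code).
import torch
import torch.nn as nn

class ADTDSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4, n_types=78):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.meta_layers = nn.ModuleList([make_layer() for _ in range(n_layers)])
        self.content_layers = nn.ModuleList([make_layer() for _ in range(n_layers)])
        self.p1_head = nn.Linear(d_model, n_types)   # metadata-only prediction (P1)
        self.p2_head = nn.Linear(d_model, n_types)   # combined prediction (P2)

    def encode_metadata(self, meta_emb):
        """Run the metadata tower and cache per-layer latents for later reuse by P2."""
        cache, h = [], meta_emb
        for layer in self.meta_layers:
            h = layer(h)
            cache.append(h)                          # latent cache, one entry per layer
        return h, cache

    def forward_p1(self, meta_emb):
        h, cache = self.encode_metadata(meta_emb)
        return self.p1_head(h.mean(dim=1)), cache    # P1 logits plus reusable metadata latents

    def forward_p2(self, content_emb, cache):
        """Content tower: every layer also consumes the cached metadata latents."""
        h = content_emb
        for layer, meta_h in zip(self.content_layers, cache):
            h = layer(torch.cat([meta_h, h], dim=1))  # asymmetric dependency on the metadata tower
            h = h[:, meta_h.size(1):, :]              # keep only the content positions
        return self.p2_head(h.mean(dim=1))
```

In this sketch, `forward_p1` alone plays the role of the P1 model, and its cached per-layer metadata latents are handed to `forward_p2` so that P2 does not recompute them, which is the efficiency benefit the asymmetric structure is meant to enable.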