LIFTus: An Adaptive Multi-aspect Column Representation Learning for Table Union Search

Ermu Qiu, Jun Gao, Yaofeng Tu, and Jingru Yang§

Key Laboratory of High Confidence Software Technologies, CS, Peking University, China
ZTE Corporation
§ National Key Laboratory of Data Space Technology and System
qem@stu.pku.edu.cn, gaojun@pku.edu.cn, tu.yaofeng@zte.com.cn, okiyang@pku.edu.cn
* Corresponding authors
Abstract—Table union search (TUS) represents a fundamental operation in data lakes to find tables unionable to the given one. Recent approaches to TUS mainly learn column representations for searching by introducing Pre-trained Language Models (PLMs), especially on columns with linguistic data. However, a significant amount of non-linguistic data, notably represented by domain-specific strings and numerical data in the data lake, is still under-explored in the existing methods. To address this issue, we propose LIFTus, an adaptive multi-aspect column representation for table union search, where aspect refers to a concept more flexible than data types, so that a single column can exhibit multiple aspects simultaneously. LIFTus aims at combining different aspects of a column (including both linguistic and non-linguistic aspects) to promote the effectiveness and generalization of TUS in a self-supervised manner. Specifically, besides employing PLMs to extract the linguistic aspects from an individual column, LIFTus trains a pattern encoder to learn possible character-level sequential patterns for the column, and builds a number encoder to capture numerical aspects of the column, including the distribution and magnitude features. LIFTus further utilizes a hierarchical cross-attention aided by aspect-relevant statistics to combine these aspects adaptively in producing the final column representations, which are indexed by vector retrieval techniques to achieve efficient search. Extensive experimental results demonstrate that LIFTus outperforms the current state-of-the-art methods in terms of effectiveness, and achieves much better generalization capability to support unseen data.
Index Terms—Table Union Search, Data Lake, Aspect, Non-Linguistic Column, Generalization
I. INTRODUCTION
With the rapid growth of data from various sources, the data lake [1]–[3] becomes an attractive infrastructure to accommodate flexible types of data in their raw form directly. Unlike data warehouses [4], data lakes allow storing data including relational tables (sometimes with metadata unavailable) and other types without transforming them into a predefined schema [5]. Thus, a fundamental step for an analytic operation in a data lake is to locate the relevant data. TUS is such an operation to identify a set of unionable tables in a data lake that share similar columns with the given query table [6].
TUS methods attempt to achieve effectiveness, efficiency, and generalization. Data instances collected in tables within a data lake often exhibit considerable heterogeneity, encompassing linguistic data (e.g. words, long texts, dates), non-linguistic data (e.g. domain-specific strings, numerical values), and even their combinations [2], [7], [8]. It is challenging to find the desired unionable target tables effectively. In addition, column representation learning, a key step in TUS, needs to capture features of flexible data instances and utilize external knowledge, which implies high time consumption. Moreover, TUS methods should be adaptive to new and evolving data, as the volume of data in the lake grows rapidly. Consequently, a generalized model for column representation is highly desirable.
In the context of the TUS problem, unionability refers to the potential for attributes or tables to be unioned based on shared domain features [9]. Early approaches to TUS have focused on designing rules to measure the unionability between two tables [2], [9], [10], but face challenges in capturing latent linguistic information. With the development of PLMs such as BERT [11] and its variants [12], [13], recent works [14]–[17] have incorporated these models into TUS, as the data in the lake have flexible formats including textual features [18]. Besides PLMs, knowledge bases (KBs) [19], [20] can also provide information for the determination of unionability [7]. Some studies further notice the role of inter-column relationships and compose a token sequence by sampling data from different columns into BERT-like models, thereby learning context-aware column representations [14], [15]. The learned column representations are then fed into a vector index for fast similarity search [14]. Other studies calculate pairwise similarities of columns to build a semantic graph for TUS [19], achieving promising performance but at the cost of additional preprocessing and online search time [7].
While PLM-based methods have become mainstream in TUS, their preference for linguistic data imposes potential limitations on performance, especially in the context of relational tables. We illustrate these issues in Fig. 1. Table A and Table B can be unionable as they both pertain to the entities of washing machines manufactured by the same company and their corresponding columns are unionable, while Table C serves as an example that is prone to model misjudgment. We can see that all three tables possess linguistic data in their Location and Description columns, and Table C also includes the Color column as part of the linguistic data. Besides, Fig. 1 illustrates the widespread presence of non-linguistic data in the data lake, including but not limited to identifiers (e.g. the Product Code column) and numerical data (e.g. the ID, Capacity, and Price columns).
For example, the strings from the Product Code and Function Abbr columns (e.g. "SDL/JHX/BP/ZH" and "MB65V35E") are not tokens in natural language but highly domain-specific, as they are codes designed by washing machine manufacturers.
Table A:
ID    | Product Code  | Capacity(L) | Function Abbr | Location | Price | Description
20001 | EMS75B37mate6 | 7.5         | SDL/JHX/BP/ZH | WalMart  | 3100  | This is a top-loading washing machine that deli...
20002 | ES100B35Mate5 | 10          | DDL/JHX/DP    | WalMart  | 2950  | Featuring a 10L capacity, it can handle large loa...
20003 | ES90B26Mate6  | 9           | SDL/BP/FCR    | WalMart  | 2200  | The machine's smart sensors adjust the water...

Table B:
ID    | Capacity(L) | Function Abbr | Location  | Price | Description
20981 | 6.5         | DDL/DP/DLCJ   | Carrefour | 1850  | With a variety of wash cycles, including a quick...
20982 | 8           | DDL/DP        | Carrefour | 2100  | Its durable stainless steel drum is gentle on fab...
20983 | 9.5         | SDL/BP/FCR/ZH | Carrefour | 3500  | Equip with a self-cleaning function, eliminating...

Table C:
ID  | Product Code | Power(W) | Color  | Location | Price | Description
301 | MB65V35E     | 340      | White  | WalMart  | 1900  | The machine's sleek design and compact size m...
302 | MB75V35E     | 400      | Silver | WalMart  | 2150  | For those with allergies, the UltraClean 3000's...
303 | MB65V55E     | 360      | Black  | WalMart  | 2350  | With a quiet operation and a 5-year warranty, t...
Fig. 1. An example of a data lake. In this data lake, Table A and Table B are unionable with each other, while Table C is easily misjudged as being unionable with Table A. This misjudgment often stems from an inadequate understanding of non-linguistic columns (e.g. ID, Product Code, Function Abbr), which are widely distributed across the data lake and play a crucial role in determining the unionability of tables. The metadata shown here is included for ease of understanding, but is not used in the subsequent processing.
We illustrate the issues faced by the existing PLM-based methods. First, domain-specific strings, represented by commonly used identifiers, fall outside the typical token set of PLMs [11]–[13], and cannot be captured by PLMs effectively. Taking Fig. 1 as an example, as the token sets in the PLMs do not cover the values in the Product Code and Function Abbr columns, PLMs handle them using frequent sub-strings, like BPE [21], computed from a large common corpus, which may differ from the key sub-parts in these domain-specific strings. We can see that the similarity between "EMS100B37mate6" and "EMS65B26Mate6" should be significantly greater than that between "EMS100B37mate6" and "MB65V35E". However, PLMs may decompose the string into different parts (e.g. "E", "MS1"), which damages the understanding of the domain-specific string. It is imperative to employ character-level sequential modeling for domain-specific strings to capture their similarities.
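To make the tokenization issue concrete, the following sketch contrasts a PLM's subword split of these codes with a plain character-level comparison. It assumes the Hugging Face transformers library and the standard bert-base-uncased vocabulary; the exact sub-tokens depend on the vocabulary, and the bigram Jaccard here is only an illustrative stand-in for character-level modeling, not LIFTus's pattern encoder.

# Sketch: how WordPiece fragments domain-specific codes, vs. a simple
# character-level view. Assumes the Hugging Face `transformers` library;
# exact sub-tokens depend on the model vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
for code in ["EMS100B37mate6", "EMS65B26Mate6", "MB65V35E"]:
    # Codes outside the natural-language vocabulary are split into frequent
    # sub-strings from a common corpus, not manufacturer-designed sub-parts.
    print(code, "->", tokenizer.tokenize(code))

def char_bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

a = char_bigrams("EMS100B37mate6")
b = char_bigrams("EMS65B26Mate6")
c = char_bigrams("MB65V35E")
# Character-level overlap keeps the shared sub-parts visible:
print("EMS... vs EMS...:", len(a & b) / len(a | b))  # clearly > 0
print("EMS... vs MB... :", len(a & c) / len(a | c))  # 0.0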
Second, the accurate perception of numerical values to match columns is an open problem [22]. Adequately understanding numerical values remains a live issue for PLMs [23]. PLMs' token sets cannot cover the infinite numerical values distributed in continuous space. Thus, the PLM-based methods [14]–[16] struggle to fully understand real numbers, which leads to a loss in the similarity computation over arbitrary numbers in real-life applications. The set-coverage-based methods [9] may yield Jaccard similarity scores of nearly zero (e.g. {7.5, 10, 9} in the Capacity column of Table A vs. {6.5, 8, 9.5} in the Capacity column of Table B in Fig. 1), despite the fact that these numbers differ only slightly. We also notice that there has been research on encoding numerical types in tables [24]. However, there remains a significant gap between encoding individual numbers and encoding entire numerical columns in TUS, where the latter needs to capture the crucial criteria to determine the similarity among columns, including distribution and magnitude features.
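As a worked example of the set-coverage failure mode, the Capacity columns of Tables A and B share no exact value, so their Jaccard similarity is zero, while simple distribution and magnitude features rate them as close. The sketch below is illustrative only; the mean/spread features are a stand-in we chose, not the paper's number encoder.

# Illustrative sketch: exact-match set overlap vs. simple distribution and
# magnitude features (a stand-in, not the paper's number encoder).
import statistics

cap_a = [7.5, 10, 9]   # Capacity(L) column, Table A
cap_b = [6.5, 8, 9.5]  # Capacity(L) column, Table B

# Set-coverage view: no value appears in both columns, so Jaccard = 0.
print(len(set(cap_a) & set(cap_b)) / len(set(cap_a) | set(cap_b)))  # 0.0

# Distribution/magnitude view: the columns have similar means and spreads,
# so a similarity built on such features would rate them as close.
def features(xs):
    return (statistics.mean(xs), statistics.pstdev(xs), min(xs), max(xs))

print(features(cap_a))  # approx (8.83, 1.03, 7.5, 10)
print(features(cap_b))  # (8.0, approx 1.22, 6.5, 9.5)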
Third, columns may include combined features, and relying solely on data-type-specific approaches may be too coarse to accurately capture column features. As we can see from Fig. 1, some columns in a data lake are rich in semantic information, characterized by a high proportion of natural language words, such as the Description columns, which rely more on external corpora or PLMs. In contrast, the Function Abbr columns in Table A and Table B should guide the model to prioritize character-level similarity during unionable search. Additionally, some columns are predominantly numerical, such as the ID and Price columns across the three tables. In such a case, the determination should focus on the distribution and magnitude features of the numbers. Moreover, certain columns contain a mixture of numerical data and non-linguistic strings, such as the Product Code columns in the three tables. We can see that even if we can obtain the data type for a column, the data-type-specific methods are not sufficient to capture the flexible features inside the column.
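To see why per-column statistics are finer-grained than a single data type, consider the simple sketch below, which computes character-class proportions over a column's cells. These particular statistics are our own illustrative choice; the paper's aspect-relevant statistics may differ.

# Illustrative sketch: simple per-column statistics expose multiple aspects
# in one column. The chosen statistics are an assumption for illustration.
def is_numeric(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

def aspect_stats(column):
    text = "".join(column)
    n = len(text) or 1
    return {
        "alpha_ratio": sum(ch.isalpha() for ch in text) / n,
        "digit_ratio": sum(ch.isdigit() for ch in text) / n,
        "numeric_cell_ratio": sum(is_numeric(v) for v in column) / len(column),
    }

# Product Code mixes letters and digits: no single data type fits it.
print(aspect_stats(["EMS75B37mate6", "ES100B35Mate5", "ES90B26Mate6"]))
# Price is purely numerical: the numeric aspect dominates.
print(aspect_stats(["3100", "2950", "2200"]))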
To overcome the aforementioned challenges and issues of the existing methods, we make the following design choices. (1) We introduce the notion of a column aspect, which is finer than a column data type, as each column has only one data type but may have multiple aspects. Thus, instead of invoking one method after determining the column data type, we apply multiple aspect extractors on one column simultaneously, so as to handle the different forms of same-typed columns or mixed-typed data. We also expect that the full exploration of aspects in an individual column can enhance the model's generalization to unseen data. (2) We do not solely rely on PLMs to handle all kinds of columns, but introduce effective methods to capture distinct features in columns, including linguistic data, domain-specific strings, and numerical values. For aggregating these aspects, we design a learning method rather than heuristic rules to enable robust adaptation to scenarios involving tables with diverse information. (3) We mainly study the exploration of features in one column, and leave inter-column relationships to future work, since we believe the different features inside a single column are still under-explored. (4) The method yields column embeddings, instead of a pairwise similarity model, in order to leverage existing vector indexes like HNSW (Hierarchical Navigable Small World) [25] to improve the efficiency of online search.
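For design choice (4), the following is a minimal sketch of indexing column embeddings with HNSW. It assumes the hnswlib library, and the random vectors are placeholders for learned LIFTus column representations.

# Minimal sketch of online search over column embeddings with an HNSW index.
# Assumes the `hnswlib` library; random vectors stand in for learned
# LIFTus column representations.
import hnswlib
import numpy as np

dim, num_columns = 128, 10000
column_embeddings = np.random.rand(num_columns, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_columns, ef_construction=200, M=16)
index.add_items(column_embeddings, np.arange(num_columns))
index.set_ef(50)  # trades recall against query latency

# At query time, embed the query table's columns and retrieve the nearest
# data-lake columns; unionable tables are then decided from column matches.
query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)
print(labels, distances)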
This paper proposes LIFTus, an adaptive multi-aspect column representation for table union search. The contributions are as follows:
• We design methods to capture the key non-linguistic aspects in LIFTus, including a pattern encoder for domain-specific strings to infer similarities between strings, and a number encoder to convert the numerical aspect into an embedding that preserves the distribution and magnitude features.
• We devise a hierarchical cross-attention integrator to adaptively combine different aspects of columns, in which