
example, the strings from the Product Code and Function Abbr
columns (e.g., "SDL/JHX/BP/ZH" and "MB65V35E") are not
natural-language tokens but highly domain-specific codes
designed by washing machine manufacturers.
Fig. 1. An example of a data lake. In this data lake, Table A and Table B
are unionable with each other, while Table C is easily misjudged as being
unionable with Table A. This misjudgment often stems from an inadequate
understanding of non-linguistic columns (e.g., ID, Product Code, Function
Abbr), which are widely distributed across the data lake and play a crucial
role in determining the unionability of tables. The metadata shown here is
included for ease of understanding and is not used in the subsequent processing.
We illustrate the issues faced by existing PLM-based
methods. First, domain-specific strings, typified by
commonly used identifiers, fall outside the typical token sets
of PLMs [11]–[13] and cannot be captured by PLMs
effectively. Taking Fig. 1 as an example, since the token sets
of PLMs do not cover the values in the Product Code and
Function Abbr columns, PLMs handle them via frequent
sub-strings (e.g., using BPE [21]) computed from a large general
corpus, which may differ from the key sub-parts of these
domain-specific strings. We can see that the similarity between
"EMS100B37mate6" and "EMS65B26Mate6" should be
significantly greater than that between "EMS100B37mate6"
and "MB65V35E". However, PLMs may decompose the string
into unrelated parts (e.g., "E", "MS1"), which impairs the
understanding of the domain-specific string. It is therefore imperative
to employ character-level sequential modeling for domain-
specific strings to capture their similarities.
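To make the contrast concrete, the following minimal sketch (illustrative only, not the LIFTus pattern encoder; the function names are ours) measures character-level trigram overlap between the codes from Fig. 1, recovering the similarity that sub-word tokenization can miss:

```python
# A minimal sketch: character n-gram overlap between domain-specific codes
# that sub-word tokenization may split into unrelated pieces.

def char_ngrams(s: str, n: int = 3) -> set:
    """Set of character n-grams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character n-grams."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if (ga | gb) else 0.0

# Codes from Fig. 1: the first pair shares the "EMS...ate6" pattern.
print(ngram_jaccard("EMS100B37mate6", "EMS65B26Mate6"))  # ~0.21
print(ngram_jaccard("EMS100B37mate6", "MB65V35E"))       # 0.0
```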
Second, the accurate perception of numerical values when
matching columns is an open problem [22], and adequately
understanding numerical values remains a live issue for PLMs [23].
PLMs' token sets cannot cover the infinitely many numerical
values distributed in a continuous space. Thus, PLM-based
methods [14]–[16] struggle to fully understand real numbers,
which leads to a loss in similarity computation over arbitrary
numbers in real-life applications. Set-coverage-based methods [9]
may yield Jaccard similarity scores near zero (e.g., {7.5, 10, 9}
in the Capacity column of Table A versus {6.5, 8, 9.5} in the
Capacity column of Table B in Fig. 1), despite the fact that these
numbers differ only slightly. We also notice that there has
been research on encoding numerical types in tables [24].
However, there remains a significant gap between encoding
individual numbers and encoding entire numerical columns in
TUS, where the latter must capture the criteria that
determine the similarity among columns, including distribution
and magnitude features.
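The sketch below (illustrative only, not the LIFTus number encoder) contrasts the two views on the Capacity columns of Fig. 1: exact-match set coverage scores zero, while simple distribution and magnitude summaries remain close:

```python
# Set coverage vs. distribution/magnitude features on the Capacity columns.
import statistics

def jaccard(a, b) -> float:
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dist_features(xs):
    """Summarize a numerical column by its mean and population std dev."""
    return statistics.mean(xs), statistics.pstdev(xs)

cap_a = [7.5, 10, 9]    # Capacity column, Table A
cap_b = [6.5, 8, 9.5]   # Capacity column, Table B

print(jaccard(cap_a, cap_b))   # 0.0: no exact value overlap
print(dist_features(cap_a))    # (~8.83, ~1.03)
print(dist_features(cap_b))    # (8.0, ~1.22): close to Table A's summary
```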
Third, columns may exhibit combined features, and relying
solely on data-type-specific approaches may be too coarse
to accurately capture column features. As we can see from
Fig. 1, some columns in a data lake are rich in semantic
information, characterized by a high proportion of natural-language
words, such as the Description columns, which rely
more on external corpora or PLMs. In contrast, the Function
Abbr columns in Table A and Table B should guide the model
to prioritize character-level similarity during unionable search.
Additionally, some columns are predominantly numerical,
such as the ID and Price columns across the three tables; in
such cases, the determination should focus on the distribution
and magnitude features of the numbers. Moreover, certain
columns contain a mixture of numerical data and non-linguistic
strings, such as the Product Code columns in the three tables.
Hence, even if we can obtain the data type of a
column, data-type-specific methods are not sufficient to
capture the flexible features inside the column.
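A small sketch (illustrative only; not part of LIFTus) makes the point: even a crude character-composition profile shows that a single column's values can mix letters, digits, and separators, so no single data type describes them:

```python
# Character composition of a column's values: digits, letters, and other
# symbols can all appear within the same column.

def composition(values) -> dict:
    """Fraction of characters that are digits, letters, or other symbols."""
    text = "".join(values)
    n = max(len(text), 1)
    return {
        "digit": sum(c.isdigit() for c in text) / n,
        "alpha": sum(c.isalpha() for c in text) / n,
        "other": sum(not c.isalnum() for c in text) / n,
    }

print(composition(["SDL/JHX/BP/ZH"]))               # letters plus separators
print(composition(["EMS100B37mate6", "MB65V35E"]))  # letters and digits mixed
```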
To overcome the aforementioned challenges and the issues of
existing methods, we make the following design choices.
(1) We introduce the notion of a column aspect, which is finer-grained
than a column data type: each column has only one data
type but multiple aspects. Thus, instead of invoking a single
method after determining the column data type, we apply
multiple aspect extractors to one column simultaneously, so
as to handle the different forms of a same-typed column
or mixed-type data. We also expect that fully exploring the
aspects of an individual column can enhance the model's
generalization to unseen data. (2) We do not rely solely on a
PLM to handle all kinds of columns, but introduce effective
methods to capture the distinct features in columns, including
linguistic data, domain-specific strings, and numerical
values. To aggregate these aspects, we design a learning
method rather than heuristic rules, enabling robust adaptation
to scenarios involving tables with diverse information. (3) We
mainly study the exploration of features within one column and
leave inter-column relationships to future work, since,
we believe, the different features inside a single column are
still underexplored. (4) Our method yields a column
embedding rather than a pairwise similarity model, in order
to leverage existing vector indexes such as HNSW (Hierarchical
Navigable Small World) [25] to improve the efficiency of
online search.
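To make these choices concrete, the sketch below traces the intended pipeline with hypothetical interfaces (the extractors are random placeholders, not the LIFTus encoders, and hnswlib stands in for the HNSW index [25]): every aspect extractor runs on every column, the aspect embeddings are aggregated into one column embedding, and the embeddings are indexed for nearest-neighbor search:

```python
# Multi-aspect column embedding plus an HNSW index for online search.
import numpy as np
import hnswlib  # HNSW vector index [25]

DIM = 64
rng = np.random.default_rng(0)

# Placeholders for the linguistic, pattern, and number aspect extractors;
# a real system would learn these encoders.
def linguistic_aspect(values): return rng.random(DIM, dtype=np.float32)
def pattern_aspect(values):    return rng.random(DIM, dtype=np.float32)
def number_aspect(values):     return rng.random(DIM, dtype=np.float32)

def column_embedding(values):
    # All extractors are applied to the same column; a learned integrator
    # would weight them adaptively (approximated here by a plain average).
    aspects = [f(values) for f in
               (linguistic_aspect, pattern_aspect, number_aspect)]
    return np.mean(aspects, axis=0)

columns = {
    0: ["EMS100B37mate6", "EMS65B26Mate6"],  # domain-specific strings
    1: ["7.5", "10", "9"],                   # numerical values
    2: ["A top-loading washing machine"],    # natural language
}

# Index each column embedding so that online unionable search becomes a
# k-NN lookup instead of an all-pairs similarity computation.
index = hnswlib.Index(space="cosine", dim=DIM)
index.init_index(max_elements=len(columns), ef_construction=200, M=16)
for cid, vals in columns.items():
    index.add_items(column_embedding(vals)[None, :], np.array([cid]))

labels, distances = index.knn_query(
    column_embedding(["MB65V35E"])[None, :], k=2)
```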
This paper proposes LIFTus, an adaptive multi-aspect column
representation for table unionable search. The contributions
are as follows:
• We design methods to capture the key non-linguistic aspects
in LIFTus, including a pattern encoder for domain-specific
strings to infer similarities between strings, and a number
encoder that converts the numerical aspect into an embedding
preserving the distribution and magnitude features.
• We devise a hierarchical cross-attention integrator to
adaptively combine different aspects of columns, in which