
example, the strings from the Product Code and Function Abbr
columns (e.g., "SDL/JHX/BP/ZH" and "MB65V35E") are not
natural-language tokens but highly domain-specific codes
designed by washing machine manufacturers.
Fig. 1. An example of a data lake. In this data lake, Table A and Table B
are unionable with each other, while Table C is easily misjudged as being
unionable with Table A. This misjudgment often stems from an inadequate
understanding of non-linguistic columns (e.g., ID, Product Code, Function
Abbr), which are widely distributed across the data lake and play a crucial
role in determining the unionability of tables. The metadata shown here is
included for ease of understanding and is not used in the subsequent processing.
We illustrate the issues faced by existing PLM-based
methods. First, domain-specific strings, typified by
commonly used identifiers, fall outside the typical token sets
of PLMs [11]–[13] and cannot be captured by PLMs
effectively. Taking Fig. 1 as an example, since the token sets
of PLMs do not cover the values in the Product Code and
Function Abbr columns, PLMs handle them via frequent
sub-strings (e.g., using BPE [21]) computed from a large general
corpus, which may differ from the key sub-parts of these
domain-specific strings. We can see that the similarity between
"EMS100B37mate6" and "EMS65B26Mate6" should be
significantly greater than that between "EMS100B37mate6"
and "MB65V35E". However, PLMs may decompose the string
into unrelated parts (e.g., "E", "MS1"), which impairs the
understanding of the domain-specific string. It is therefore imperative
to employ character-level sequential modeling for domain-
specific strings to capture their similarities.
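To make the contrast concrete, the following minimal sketch (illustrative only, not the LIFTus pattern encoder; the function names are ours) measures character-level trigram overlap between the codes from Fig. 1, recovering the similarity that sub-word tokenization can miss:

```python
# A minimal sketch: character n-gram overlap between domain-specific codes
# that sub-word tokenization may split into unrelated pieces.

def char_ngrams(s: str, n: int = 3) -> set:
    """Set of character n-grams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character n-grams."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if (ga | gb) else 0.0

# Codes from Fig. 1: the first pair shares the "EMS...ate6" pattern.
print(ngram_jaccard("EMS100B37mate6", "EMS65B26Mate6"))  # ~0.21
print(ngram_jaccard("EMS100B37mate6", "MB65V35E"))       # 0.0
```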
Second, the accurate perception of numerical values when
matching columns is an open problem [22], and adequately
understanding numerical values remains a live issue for PLMs [23].
PLMs' token sets cannot cover the infinitely many numerical
values distributed in a continuous space. Thus, PLM-based
methods [14]–[16] struggle to fully understand real numbers,
which leads to a loss in similarity computation over arbitrary
numbers in real-life applications. Set-coverage-based methods [9]
may yield Jaccard similarity scores near zero (e.g., {7.5, 10, 9}
in the Capacity column of Table A versus {6.5, 8, 9.5} in the
Capacity column of Table B in Fig. 1), despite the fact that these
numbers differ only slightly. We also notice that there has
been research on encoding numerical types in tables [24].
However, there remains a significant gap between encoding
individual numbers and encoding entire numerical columns in
TUS, where the latter must capture the criteria that
determine the similarity among columns, including distribution
and magnitude features.
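The sketch below (illustrative only, not the LIFTus number encoder) contrasts the two views on the Capacity columns of Fig. 1: exact-match set coverage scores zero, while simple distribution and magnitude summaries remain close:

```python
# Set coverage vs. distribution/magnitude features on the Capacity columns.
import statistics

def jaccard(a, b) -> float:
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dist_features(xs):
    """Summarize a numerical column by its mean and population std dev."""
    return statistics.mean(xs), statistics.pstdev(xs)

cap_a = [7.5, 10, 9]    # Capacity column, Table A
cap_b = [6.5, 8, 9.5]   # Capacity column, Table B

print(jaccard(cap_a, cap_b))   # 0.0: no exact value overlap
print(dist_features(cap_a))    # (~8.83, ~1.03)
print(dist_features(cap_b))    # (8.0, ~1.22): close to Table A's summary
```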
Third, columns may exhibit combined features, and relying
solely on data-type-specific approaches may be too coarse
to accurately capture column features. As we can see from
Fig. 1, some columns in a data lake are rich in semantic
information, characterized by a high proportion of natural-language
words, such as the Description columns, which rely
more on external corpora or PLMs. In contrast, the Function
Abbr columns in Table A and Table B should guide the model
to prioritize character-level similarity during unionable search.
Additionally, some columns are predominantly numerical,
such as the ID and Price columns across the three tables; in
such cases, the determination should focus on the distribution
and magnitude features of the numbers. Moreover, certain
columns contain a mixture of numerical data and non-linguistic
strings, such as the Product Code columns in the three tables.
Hence, even if we can obtain the data type of a
column, data-type-specific methods are not sufficient to
capture the flexible features inside the column.
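A small sketch (illustrative only; not part of LIFTus) makes the point: even a crude character-composition profile shows that a single column's values can mix letters, digits, and separators, so no single data type describes them:

```python
# Character composition of a column's values: digits, letters, and other
# symbols can all appear within the same column.

def composition(values) -> dict:
    """Fraction of characters that are digits, letters, or other symbols."""
    text = "".join(values)
    n = max(len(text), 1)
    return {
        "digit": sum(c.isdigit() for c in text) / n,
        "alpha": sum(c.isalpha() for c in text) / n,
        "other": sum(not c.isalnum() for c in text) / n,
    }

print(composition(["SDL/JHX/BP/ZH"]))               # letters plus separators
print(composition(["EMS100B37mate6", "MB65V35E"]))  # letters and digits mixed
```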
To overcome the aforementioned challenges and the issues of
existing methods, we make the following design choices.
(1) We introduce the notion of a column aspect, which is finer-grained
than a column data type: each column has only one data
type but multiple aspects. Thus, instead of invoking a single
method after determining the column data type, we apply
multiple aspect extractors to one column simultaneously, so
as to handle the different forms of a same-typed column
or mixed-type data. We also expect that fully exploring the
aspects of an individual column can enhance the model's
generalization to unseen data. (2) We do not rely solely on a
PLM to handle all kinds of columns, but introduce effective
methods to capture the distinct features in columns, including
linguistic data, domain-specific strings, and numerical
values. To aggregate these aspects, we design a learning
method rather than heuristic rules, enabling robust adaptation
to scenarios involving tables with diverse information. (3) We
mainly study the exploration of features within one column and
leave inter-column relationships to future work, since,
we believe, the different features inside a single column are
still underexplored. (4) Our method yields a column
embedding rather than a pairwise similarity model, in order
to leverage existing vector indexes such as HNSW (Hierarchical
Navigable Small World) [25] to improve the efficiency of
online search.
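To make these choices concrete, the sketch below traces the intended pipeline with hypothetical interfaces (the extractors are random placeholders, not the LIFTus encoders, and hnswlib stands in for the HNSW index [25]): every aspect extractor runs on every column, the aspect embeddings are aggregated into one column embedding, and the embeddings are indexed for nearest-neighbor search:

```python
# Multi-aspect column embedding plus an HNSW index for online search.
import numpy as np
import hnswlib  # HNSW vector index [25]

DIM = 64
rng = np.random.default_rng(0)

# Placeholders for the linguistic, pattern, and number aspect extractors;
# a real system would learn these encoders.
def linguistic_aspect(values): return rng.random(DIM, dtype=np.float32)
def pattern_aspect(values):    return rng.random(DIM, dtype=np.float32)
def number_aspect(values):     return rng.random(DIM, dtype=np.float32)

def column_embedding(values):
    # All extractors are applied to the same column; a learned integrator
    # would weight them adaptively (approximated here by a plain average).
    aspects = [f(values) for f in
               (linguistic_aspect, pattern_aspect, number_aspect)]
    return np.mean(aspects, axis=0)

columns = {
    0: ["EMS100B37mate6", "EMS65B26Mate6"],  # domain-specific strings
    1: ["7.5", "10", "9"],                   # numerical values
    2: ["A top-loading washing machine"],    # natural language
}

# Index each column embedding so that online unionable search becomes a
# k-NN lookup instead of an all-pairs similarity computation.
index = hnswlib.Index(space="cosine", dim=DIM)
index.init_index(max_elements=len(columns), ef_construction=200, M=16)
for cid, vals in columns.items():
    index.add_items(column_embedding(vals)[None, :], np.array([cid]))

labels, distances = index.knn_query(
    column_embedding(["MB65V35E"])[None, :], k=2)
```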
This paper proposes LIFTus, an adaptive multi-aspect column
representation for table unionable search. The contributions
are as follows:
• We design methods to capture the key non-linguistic aspects
in LIFTus, including a pattern encoder for domain-specific
strings to infer similarities between strings, and a number
encoder that converts the numerical aspect into an embedding
preserving the distribution and magnitude features.
• We devise a hierarchical cross-attention integrator to
adaptively combine different aspects of columns, in which