Can Large Language Models Predict Data Correlations from
Column Names?
Immanuel Trummer
Cornell Database Group
Ithaca, NY, USA
itrummer@cornell.edu
ABSTRACT
Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text.
This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.
PVLDB Reference Format:
Immanuel Trummer. Can Large Language Models Predict Data
Correlations from Column Names?. PVLDB, 16(13): 4310 - 4323, 2023.
doi:10.14778/3625054.3625066
PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at
https://github.com/itrummer/DataCorrelationPredictionWithNLP.
1 INTRODUCTION
Consider a table named “cars” with columns named “maker” and “model”. Most people would assume, based on column names and commonsense knowledge, that maker and model columns are correlated (i.e., knowing the maker will restrict options for the model). Such reasoning is possible if column names are meaningful. Assigning meaningful column names is good practice, but of course there are rare exceptions which we are not concerned with here. In this paper, we study the question of whether automated tuning tools
could apply a similar kind of reasoning, exploiting recent innovations in the domain of natural language processing (NLP): pre-trained language models [10].
This research question is motivated by my recent work [48, 52], suggesting to use NLP on database schema elements to inform database tuning, in particular, to help prioritize data profiling operations. The underlying hypothesis behind those suggestions, namely, whether language models are able to infer relevant information with sufficiently high reliability, has not been investigated in detail. This paper closes that gap, focusing on extracting information about data correlations.
Detecting correlations in data has been a topic of significant interest in the database research community [5, 16]. Knowing data correlation is useful in many scenarios. For instance, query optimizers [43] (as well as other tuning tools) often depend on accurate predictions of intermediate result sizes. Classical prediction models assume uncorrelated data and are therefore often misled in practice [24]. As pointed out in prior work [16], knowing about correlations can help to correct cardinality estimates. Alternatively, knowing about possible data correlations can help to prune options with correlation-related uncertainty from the search space (e.g., the optimizer can favor join orders where intermediate result sizes do not depend on columns that are likely correlated).
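To make the cost of the independence assumption concrete, consider a minimal sketch in Python (using pandas); the tiny “cars” table below is hypothetical toy data, not data from the benchmark:

```python
import pandas as pd

# Hypothetical toy version of the "cars" table from the introduction:
# maker restricts the options for model, so the columns are correlated.
cars = pd.DataFrame({
    "maker": ["Toyota", "Toyota", "Toyota", "Honda", "Honda", "Honda"],
    "model": ["Corolla", "Camry", "Corolla", "Civic", "Accord", "Civic"],
})

# True selectivity of the conjunctive predicate.
true_sel = ((cars["maker"] == "Toyota") & (cars["model"] == "Corolla")).mean()

# Estimate under the independence assumption: the product of the
# single-column selectivities, as in classical cardinality models.
est_sel = (cars["maker"] == "Toyota").mean() * (cars["model"] == "Corolla").mean()

print(f"true selectivity: {true_sel:.3f}")      # 0.333
print(f"independence estimate: {est_sel:.3f}")  # 0.5 * 0.333 = 0.167
```

Even in this six-row example, the independence estimate understates the true selectivity by a factor of two; on real data, such errors compound across joins.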
Detecting data correlations requires comparing data in different columns, often making correlation detection more expensive than operations that focus on different columns in separation. This has motivated dedicated research on algorithms that make correlation detection more efficient [5, 16]. Typically, those prior algorithms do not exploit information gained via analysis of the database schema, using language models. However, as suggested in my prior work [48, 52], such analysis could be helpful in order to better allocate and prioritize profiling efforts. For instance, given a limited profiling budget, the analysis scope could be restricted to column subsets that are more likely to be correlated, based on the results of NLP. Within those column subsets, any of the existing algorithms for correlation detection could be used. This assumes, however, that NLP is indeed useful to extract relevant information from the database schema. Whether or not that is actually the case is the subject of the current study.
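The budget-restricted workflow sketched above might look as follows; score_fn (an NLP-based predictor that reads only column names) and detect_fn (any existing data-based detection algorithm) are hypothetical placeholders, not interfaces from the paper:

```python
from itertools import combinations

def prioritized_profiling(column_names, score_fn, detect_fn, budget):
    """Hedged sketch: rank column pairs by an NLP-based correlation
    score and spend a limited profiling budget on the most promising
    pairs only."""
    pairs = list(combinations(column_names, 2))
    # score_fn maps two column names to an estimated probability of
    # correlation; it never touches the actual data.
    ranked = sorted(pairs, key=lambda p: score_fn(*p), reverse=True)
    results = {}
    for left, right in ranked[:budget]:
        # detect_fn runs an existing (data-based) correlation
        # detection algorithm on the columns themselves.
        results[(left, right)] = detect_fn(left, right)
    return results
```

The design point is that score_fn only reads column names, so its cost is independent of data size, while the expensive detect_fn is invoked at most budget times.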
The hope of extracting useful information from database schema names alone is fueled by recent advances in the field of natural language processing. Primarily, those advances are due to two key developments: a novel neural network architecture, the so-called Transformer [55], as well as new training methods that exploit large amounts of unlabeled training data [40]. Among other advantages, Transformer models enable efficient training of large neural network models with hundreds of millions [10] to hundreds of billions [8, 11] of trainable parameters. Generating task-specific training data at sufficiently large scale is often prohibitively expensive. Fortunately, it is typically possible to reduce the required amount of task-specific training data significantly by a pre-training stage that uses large amounts of unlabeled data (e.g., Web text) [13].
This study evaluates pre-trained Transformer models, fine-tuned with a moderate amount of training data that is specific to the task of correlation detection. Whereas large Transformer models with hundreds of billions of parameters are nowadays available, typically hosted remotely by providers such as OpenAI [11], this study focuses on much smaller models (with parameter counts in the hundreds of millions “only”) that can be run with moderate overheads on commodity machines. This seems reasonable as overheads due to using large language models may otherwise eclipse data profiling overheads altogether.
This study is based on a newly generated benchmark for data correlation detection. Prior benchmarks of algorithms for correlation detection typically use a small number of data sets [16]. This is reasonable, as long as performance depends on data properties but not on data semantics. When analyzing column names via language models, however, the data domain may have significant impact on prediction performance (e.g., benefiting application domains that appear more frequently in the pre-training data). Hence, to evaluate language models under realistic conditions, this study uses a benchmark generated from around 4,000 tabular data sets, downloaded from the Kaggle platform. For those data sets, the benchmark analyzes correlation between column pairs according to multiple popular correlation metrics, namely Pearson correlation [56], Spearman's correlation coefficient [4], and Theil's U [37]. While this data is useful to test the primary hypothesis evaluated in this paper, i.e., that relevant information can be extracted from schema elements via language models, it can also be used to test NLP-enhanced data
proling approaches. We will see one example of that in Section
??
.
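For reference, the three metrics could be computed along the following lines; this is a hedged sketch of standard implementations using scipy and pandas, not the benchmark-generation code from the paper's repository:

```python
import pandas as pd
from scipy.stats import entropy, pearsonr, spearmanr

def theils_u(x, y):
    """Uncertainty coefficient U(x|y): the fraction of uncertainty
    about x removed by knowing y (asymmetric, in [0, 1])."""
    h_x = entropy(pd.Series(x).value_counts(normalize=True))
    if h_x == 0:
        return 1.0  # x is constant; there is no uncertainty to remove
    df = pd.DataFrame({"x": x, "y": y})
    # Conditional entropy H(x|y), averaged over the values of y.
    h_x_given_y = sum(
        (len(g) / len(df)) * entropy(g["x"].value_counts(normalize=True))
        for _, g in df.groupby("y")
    )
    return (h_x - h_x_given_y) / h_x

# Pearson and Spearman apply to numerical column pairs ...
a, b = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
print(pearsonr(a, b)[0], spearmanr(a, b)[0])
# ... while the entropy-based Theil's U also covers categorical columns.
print(theils_u(["Corolla", "Camry", "Civic"], ["Toyota", "Toyota", "Honda"]))
```

Because Theil's U is entropy-based and asymmetric, it covers categorical column pairs that Pearson and Spearman cannot handle; this distinction becomes relevant when results are broken down by column type in Section 9.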
In summary, the original scientific contributions in this experimental paper are the following.
• The paper introduces a new benchmark, useful to test correlation prediction, based on column names, and to evaluate approaches for NLP-enhanced database tuning.
• The paper tests the ability of language models to infer information on data correlation from column names, considering different correlation metrics, scenarios, and models.
• The paper evaluates a simple baseline algorithm for efficient correlation prediction, exploiting information gained via natural language analysis.
The remainder of this paper is organized as follows. Section 2 provides background on the techniques used throughout the paper and discusses related work. Section 3 describes the generation of the benchmark, used to evaluate correlation detection methods. Next, Section 4 analyzes the benchmark data set, in terms of data statistics and correlation properties. Section 5 compares different methods for predicting data correlations from column names, including pre-trained models and simpler baselines. Section 6 studies the impact of several scenario properties, including the amount and quality of training data, on prediction performance. Section 7 analyzes prediction performance for different data subsets separately, breaking down, for instance, by column name length among other properties. Section 8 considers different correlation metrics, thereby obtaining insights into how well the prior findings generalize. Finally, Section 9 evaluates the impact of column types on prediction performance.
2 BACKGROUND AND RELATED WORK
This section discusses prior work related to this study. Section 2.1 discusses prior work on data profiling, a primary application domain for the approaches evaluated in this paper. Section 2.2 discusses, more specifically, prior work on data correlation analysis. Section 2.3 discusses the technology that this study is based upon: pre-trained language models. Finally, Section 2.4 discusses prior work applying such or similar technology in the context of data management.
2.1 Data Profiling
The goal of data profiling is to generate statistics and meta-data about a given data set [31]. Specialized tools have been developed for data profiling, including systems from industry [15, 17] as well as academia [5, 16, 32, 36]. Typically, users specify a target data set for profiling as well as specific types of meta-data to consider. Data profiling is expensive and may have to be repeated periodically as the data changes. Hence, profiling tools often allow users to restrict profiling overheads, e.g., by setting time limits [32, 45].
Proling methods have been proposed for mining dierent kinds
of meta-data, ranging from statistics over single columns [
9
] to
more expensive operations such as unique column combination
discovery [
1
,
34
], detecting inclusion dependencies [
33
], foreign
keys [
39
], order dependencies [
18
,
23
], or statistical data correla-
tions [5, 16], the focus of this study.
2.2 Detecting Correlations
The fact that data correlations are important has motivated work aimed at finding correlations in data sets [5, 16]. To guide profiling efforts, such tools typically analyze data samples. The sample size is often chosen as a function of total data size. In contrast, the time for predicting correlation based on column names does not depend on the data size. Significant work has been dedicated to the problem of selectivity estimation with correlations [6, 29, 54]. Here, correlations play an important role in estimating aggregate selectivity of predicate groups. More recently, machine learning has been proposed as a method to solve various types of tuning problems in the context of databases [28, 35, 53, 57, 59]. Correlated data is a primary reason to replace more traditional cost models, often based on the independence assumption, with learned models. This stream of work connects to this study as it applies machine learning for predicting correlations. However, this study uses machine learning in the form of NLP-based analysis of database schema elements.
2.3 Language Models
Pre-trained language models, based on the Transformer architecture [55], have recently led to significant advances on a multitude of NLP tasks [58]. Pre-trained language models are based on the idea of “transfer learning”. For many specialized NLP tasks, it is difficult to accumulate a large enough body of training data. Also, overheads related to the training of large neural networks from scratch are significant.