billions [8, 11] of trainable parameters. Generating task-specific
training data at sufficiently large scale is often prohibitively ex-
pensive. Fortunately, it is typically possible to reduce the required
amount of task-specific training significantly by a pre-training
stage that uses large amounts of unlabeled data (e.g., Web text) [13].
This study evaluates pre-trained Transformer models, fine-tuned
with a moderate amount of training data that is specific to the
task of correlation detection. While large Transformer models
with hundreds of billions of parameters are nowadays available,
typically hosted remotely by providers such as OpenAI [11], this
study focuses on much smaller models (with parameter counts in
the hundreds of millions “only”) that can be run with moderate
overheads on commodity machines. This seems reasonable, as over-
heads due to using large language models may otherwise eclipse
data profiling overheads altogether.
This study is based on a newly generated benchmark for data
correlation detection. Prior benchmarks of algorithms for correla-
tion detection typically use a small number of data sets [16]. This is
reasonable as long as performance depends on data properties but
not on data semantics. When analyzing column names via language
models, however, the data domain may have significant impact on
prediction performance (e.g., benefiting application domains that
appear more frequently in the pre-training data). Hence, to evalu-
ate language models under realistic conditions, this study uses a
benchmark generated from around 4,000 tabular data sets, down-
loaded from the Kaggle platform. For those data sets, the benchmark
analyzes correlation between column pairs according to multiple
popular correlation metrics, namely Pearson correlation [56], Spear-
man’s correlation coefficient [4], and Theil’s U [37]. While this data
is useful to test the primary hypothesis evaluated in this paper, i.e.,
that relevant information can be extracted from schema elements
via language models, it can also be used to test NLP-enhanced data
profiling approaches. We will see one example of that in Section ??.
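To make the three metrics concrete, the following sketch computes Pearson correlation, Spearman’s coefficient, and Theil’s U for a pair of columns. This is an illustrative pure-Python implementation, not code from the benchmark itself; library functions (e.g., from SciPy) would typically be used in practice.

```python
import math
from collections import Counter

def pearson(xs, ys):
    # Linear correlation in [-1, 1]; assumes non-constant columns.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def _ranks(vs):
    # Average ranks (1-based), with ties sharing their mean rank.
    order = sorted(range(len(vs)), key=lambda i: vs[i])
    ranks = [0.0] * len(vs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Rank correlation: Pearson correlation of the ranks.
    return pearson(_ranks(xs), _ranks(ys))

def _entropy(vs):
    n = len(vs)
    return -sum((c / n) * math.log(c / n) for c in Counter(vs).values())

def theils_u(xs, ys):
    # Uncertainty coefficient U(X|Y) in [0, 1]: the fraction of X's
    # entropy explained by Y. Note that U is asymmetric, unlike
    # Pearson and Spearman.
    hx = _entropy(xs)
    if hx == 0:
        return 1.0
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    y_counts = Counter(ys)
    # Conditional entropy H(X | Y).
    hxy = -sum((c / n) * math.log(c / y_counts[y])
               for (_, y), c in pair_counts.items())
    return (hx - hxy) / hx
```

Pearson and Spearman target numerical columns, while Theil’s U also applies to categorical ones, which is one reason benchmarks report multiple metrics.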
In summary, the original scientific contributions in this experi-
mental paper are the following.
• The paper introduces a new benchmark, useful for testing corre-
lation prediction based on column names and for evaluating
approaches for NLP-enhanced database tuning.
• The paper tests the ability of language models to infer infor-
mation on data correlation from column names, considering
different correlation metrics, scenarios, and models.
• The paper evaluates a simple baseline algorithm for efficient
correlation prediction, exploiting information gained via
natural language analysis.
The remainder of this paper is organized as follows. Section 2
provides background on the techniques used throughout the paper
and discusses related work. Section 3 describes the generation of the
benchmark used to evaluate correlation detection methods. Next,
Section 4 analyzes the benchmark data set in terms of data statistics
and correlation properties. Section 5 compares different methods
for predicting data correlations from column names, including pre-
trained models and simpler baselines. Section 6 studies the impact
of several scenario properties, including the amount and quality
of training data, on prediction performance. Section 7 analyzes
prediction performance for different data subsets separately,
breaking down, for instance, by column name length, among other
properties. Section 8 considers different correlation metrics,
thereby obtaining insights into how well the prior findings
generalize. Finally, Section 9 evaluates the impact of column types
on prediction performance.
2 BACKGROUND AND RELATED WORK
This section discusses prior work related to this study. Section 2.1
discusses prior work on data profiling, a primary application do-
main for the approaches evaluated in this paper. Section 2.2 dis-
cusses, more specifically, prior work on data correlation analysis.
Section 2.3 discusses the technology that this study is based upon:
pre-trained language models. Finally, Section 2.4 discusses prior
work applying such or similar technology in the context of data
management.
2.1 Data Profiling
The goal of data profiling is to generate statistics and meta-data
about a given data set [31]. Specialized tools have been developed
for data profiling, including systems from industry [15, 17] as well
as academia [5, 16, 32, 36]. Typically, users specify a target data set
for profiling as well as specific types of meta-data to consider. Data
profiling is expensive and may have to be repeated periodically as
the data changes. Hence, profiling tools often allow users to restrict
profiling overheads, e.g., by setting time limits [32, 45].
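The basic workflow described above can be sketched as follows. The table representation (a dict of column lists) and the per-column statistics are illustrative choices, not the interface of any particular profiling system; the optional time limit echoes the budget feature mentioned above.

```python
import time

def profile(table, time_limit_s=None):
    """Collect simple per-column statistics for a table given as a
    dict mapping column name -> list of values (None marks a null).
    An optional time limit stops profiling early and returns partial
    results, mimicking a profiling time budget."""
    start = time.monotonic()
    stats = {}
    for col, values in table.items():
        if time_limit_s is not None and time.monotonic() - start > time_limit_s:
            break  # budget spent: return what was profiled so far
        non_null = [v for v in values if v is not None]
        s = {
            "rows": len(values),
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
        # Numerical columns additionally get min/max/mean.
        if non_null and all(isinstance(v, (int, float)) for v in non_null):
            s["min"], s["max"] = min(non_null), max(non_null)
            s["mean"] = sum(non_null) / len(non_null)
        stats[col] = s
    return stats
```

For example, `profile({"price": [1.0, 3.0, None]})` reports three rows, one null, two distinct values, and a mean of 2.0 for the `price` column.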
Profiling methods have been proposed for mining different kinds
of meta-data, ranging from statistics over single columns [9] to
more expensive operations such as unique column combination
discovery [1, 34], detecting inclusion dependencies [33], foreign
keys [39], order dependencies [18, 23], or statistical data correla-
tions [5, 16], the focus of this study.
2.2 Detecting Correlations
The fact that data correlations are important has motivated work
aimed at finding correlations in data sets [5, 16]. To guide profil-
ing efforts, such tools typically analyze data samples. The sample
size is often chosen as a function of the total data size. In contrast,
the time for predicting correlation based on column names does not
depend on the data size. Significant work has been dedicated to the
problem of selectivity estimation with correlations [6, 29, 54]. Here,
correlations play an important role in estimating the aggregate selectiv-
ity of predicate groups. More recently, machine learning has been
proposed as a method to solve various types of tuning problems
in the context of databases [28, 35, 53, 57, 59]. Correlated data is a
primary reason to replace more traditional cost models, often based
on the independence assumption, with learned models. This stream
of work connects to this study as it applies machine learning for
predicting correlations. However, this study uses machine learning
in the form of NLP-based analysis of database schema elements.
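As a contrast to name-based prediction, the sampling-based approach described above can be sketched as follows. The sample-size rule (a fixed fraction of the data with a fixed minimum) is a hypothetical choice for illustration; actual tools use their own rules.

```python
import random

def sample_correlation(xs, ys, frac=0.1, min_n=100, seed=0):
    """Estimate Pearson correlation from a random sample whose size
    grows with the total data size (illustrative rule, not taken from
    any specific system). Assumes non-constant sampled columns."""
    n = len(xs)
    k = min(n, max(min_n, int(frac * n)))
    rng = random.Random(seed)
    idx = rng.sample(range(n), k)
    sx = [xs[i] for i in idx]
    sy = [ys[i] for i in idx]
    # Plain Pearson correlation on the sample.
    mx, my = sum(sx) / k, sum(sy) / k
    cov = sum((a - mx) * (b - my) for a, b in zip(sx, sy))
    vx = sum((a - mx) ** 2 for a in sx) ** 0.5
    vy = sum((b - my) ** 2 for b in sy) ** 0.5
    return cov / (vx * vy)
```

Note that the cost of this estimator grows with the sample (and hence the data) size, whereas a column-name-based predictor runs in time independent of the number of rows.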
2.3 Language Models
Pre-trained language models, based on the Transformer architec-
ture [55], have recently led to significant advances on a multitude
of NLP tasks [58]. Pre-trained language models are based on the
idea of “transfer learning”. For many specialized NLP tasks, it is
difficult to accumulate a large enough body of training data. Also,
overheads related to the training of large neural networks from