billions [8, 11] of trainable parameters. Generating task-specific
training data at sufficiently large scale is often prohibitively ex-
pensive. Fortunately, it is typically possible to reduce the required
amount of task-specific training significantly by a pre-training
stage that uses large amounts of unlabeled data (e.g., Web text) [13].
This study evaluates pre-trained Transformer models, fine-tuned
with a moderate amount of training data that is specific to the
task of correlation detection. While large Transformer models
with hundreds of billions of parameters are nowadays available,
typically hosted remotely by providers such as OpenAI [11], this
study focuses on much smaller models (with parameter counts in
the hundreds of millions “only”) that can be run with moderate
overheads on commodity machines. This seems reasonable, as over-
heads due to using large language models may otherwise eclipse
data profiling overheads altogether.
This study is based on a newly generated benchmark for data
correlation detection. Prior benchmarks of algorithms for correla-
tion detection typically use a small number of data sets [16]. This is
reasonable as long as performance depends on data properties but
not on data semantics. When analyzing column names via language
models, however, the data domain may have significant impact on
prediction performance (e.g., benefiting application domains that
appear more frequently in the pre-training data). Hence, to evalu-
ate language models under realistic conditions, this study uses a
benchmark generated from around 4,000 tabular data sets, down-
loaded from the Kaggle platform. For those data sets, the benchmark
analyzes correlation between column pairs according to multiple
popular correlation metrics, namely Pearson correlation [56], Spear-
man’s correlation coefficient [4], and Theil’s U [37]. While this data
is useful to test the primary hypothesis evaluated in this paper, i.e.,
that relevant information can be extracted from schema elements
via language models, it can also be used to test NLP-enhanced data
profiling approaches. We will see one example of that in Section ??.
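To make the three metrics concrete, the following sketch computes Pearson correlation, Spearman’s coefficient, and Theil’s U for a pair of columns. This is an illustrative pure-Python implementation, not code from the benchmark itself; library functions (e.g., from SciPy) would typically be used in practice.

```python
import math
from collections import Counter

def pearson(xs, ys):
    # Linear correlation in [-1, 1]; assumes non-constant columns.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def _ranks(vs):
    # Average ranks (1-based), with ties sharing their mean rank.
    order = sorted(range(len(vs)), key=lambda i: vs[i])
    ranks = [0.0] * len(vs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Rank correlation: Pearson correlation of the ranks.
    return pearson(_ranks(xs), _ranks(ys))

def _entropy(vs):
    n = len(vs)
    return -sum((c / n) * math.log(c / n) for c in Counter(vs).values())

def theils_u(xs, ys):
    # Uncertainty coefficient U(X|Y) in [0, 1]: the fraction of X's
    # entropy explained by Y. Note that U is asymmetric, unlike
    # Pearson and Spearman.
    hx = _entropy(xs)
    if hx == 0:
        return 1.0
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    y_counts = Counter(ys)
    # Conditional entropy H(X | Y).
    hxy = -sum((c / n) * math.log(c / y_counts[y])
               for (_, y), c in pair_counts.items())
    return (hx - hxy) / hx
```

Pearson and Spearman target numerical columns, while Theil’s U also applies to categorical ones, which is one reason benchmarks report multiple metrics.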
In summary, the original scientific contributions in this experi-
mental paper are the following.
• The paper introduces a new benchmark, useful for testing corre-
lation prediction based on column names and for evaluating
approaches for NLP-enhanced database tuning.
• The paper tests the ability of language models to infer infor-
mation on data correlation from column names, considering
different correlation metrics, scenarios, and models.
• The paper evaluates a simple baseline algorithm for efficient
correlation prediction, exploiting information gained via
natural language analysis.
The remainder of this paper is organized as follows. Section 2
provides background on the techniques used throughout the paper
and discusses related work. Section 3 describes the generation of the
benchmark used to evaluate correlation detection methods. Next,
Section 4 analyzes the benchmark data set in terms of data statistics
and correlation properties. Section 5 compares different methods
for predicting data correlations from column names, including pre-
trained models and simpler baselines. Section 6 studies the impact
of several scenario properties, including the amount and quality
of training data, on prediction performance. Section 7 analyzes
prediction performance for different data subsets separately,
breaking down, for instance, by column name length, among other
properties. Section 8 considers different correlation metrics,
thereby obtaining insights into how well the prior findings
generalize. Finally, Section 9 evaluates the impact of column types
on prediction performance.
2 BACKGROUND AND RELATED WORK
This section discusses prior work related to this study. Section 2.1
discusses prior work on data profiling, a primary application do-
main for the approaches evaluated in this paper. Section 2.2 dis-
cusses, more specifically, prior work on data correlation analysis.
Section 2.3 discusses the technology that this study is based upon:
pre-trained language models. Finally, Section 2.4 discusses prior
work applying such or similar technology in the context of data
management.
2.1 Data Profiling
The goal of data profiling is to generate statistics and meta-data
about a given data set [31]. Specialized tools have been developed
for data profiling, including systems from industry [15, 17] as well
as academia [5, 16, 32, 36]. Typically, users specify a target data set
for profiling as well as specific types of meta-data to consider. Data
profiling is expensive and may have to be repeated periodically as
the data changes. Hence, profiling tools often allow users to restrict
profiling overheads, e.g., by setting time limits [32, 45].
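The basic workflow described above can be sketched as follows. The table representation (a dict of column lists) and the per-column statistics are illustrative choices, not the interface of any particular profiling system; the optional time limit echoes the budget feature mentioned above.

```python
import time

def profile(table, time_limit_s=None):
    """Collect simple per-column statistics for a table given as a
    dict mapping column name -> list of values (None marks a null).
    An optional time limit stops profiling early and returns partial
    results, mimicking a profiling time budget."""
    start = time.monotonic()
    stats = {}
    for col, values in table.items():
        if time_limit_s is not None and time.monotonic() - start > time_limit_s:
            break  # budget spent: return what was profiled so far
        non_null = [v for v in values if v is not None]
        s = {
            "rows": len(values),
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
        # Numerical columns additionally get min/max/mean.
        if non_null and all(isinstance(v, (int, float)) for v in non_null):
            s["min"], s["max"] = min(non_null), max(non_null)
            s["mean"] = sum(non_null) / len(non_null)
        stats[col] = s
    return stats
```

For example, `profile({"price": [1.0, 3.0, None]})` reports three rows, one null, two distinct values, and a mean of 2.0 for the `price` column.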
Profiling methods have been proposed for mining different kinds
of meta-data, ranging from statistics over single columns [9] to
more expensive operations such as unique column combination
discovery [1, 34], detecting inclusion dependencies [33], foreign
keys [39], order dependencies [18, 23], or statistical data correla-
tions [5, 16], the focus of this study.
2.2 Detecting Correlations
The fact that data correlations are important has motivated work
aimed at finding correlations in data sets [5, 16]. To guide profil-
ing efforts, such tools typically analyze data samples. The sample
size is often chosen as a function of the total data size. In contrast,
the time for predicting correlation based on column names does not
depend on the data size. Significant work has been dedicated to the
problem of selectivity estimation with correlations [6, 29, 54]. Here,
correlations play an important role in estimating the aggregate selectiv-
ity of predicate groups. More recently, machine learning has been
proposed as a method to solve various types of tuning problems
in the context of databases [28, 35, 53, 57, 59]. Correlated data is a
primary reason to replace more traditional cost models, often based
on the independence assumption, with learned models. This stream
of work connects to this study as it applies machine learning for
predicting correlations. However, this study uses machine learning
in the form of NLP-based analysis of database schema elements.
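As a contrast to name-based prediction, the sampling-based approach described above can be sketched as follows. The sample-size rule (a fixed fraction of the data with a fixed minimum) is a hypothetical choice for illustration; actual tools use their own rules.

```python
import random

def sample_correlation(xs, ys, frac=0.1, min_n=100, seed=0):
    """Estimate Pearson correlation from a random sample whose size
    grows with the total data size (illustrative rule, not taken from
    any specific system). Assumes non-constant sampled columns."""
    n = len(xs)
    k = min(n, max(min_n, int(frac * n)))
    rng = random.Random(seed)
    idx = rng.sample(range(n), k)
    sx = [xs[i] for i in idx]
    sy = [ys[i] for i in idx]
    # Plain Pearson correlation on the sample.
    mx, my = sum(sx) / k, sum(sy) / k
    cov = sum((a - mx) * (b - my) for a, b in zip(sx, sy))
    vx = sum((a - mx) ** 2 for a in sx) ** 0.5
    vy = sum((b - my) ** 2 for b in sy) ** 0.5
    return cov / (vx * vy)
```

Note that the cost of this estimator grows with the sample (and hence the data) size, whereas a column-name-based predictor runs in time independent of the number of rows.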
2.3 Language Models
Pre-trained language models, based on the Transformer architec-
ture [55], have recently led to significant advances on a multitude
of NLP tasks [58]. Pre-trained language models are based on the
idea of “transfer learning”. For many specialized NLP tasks, it is
difficult to accumulate a large enough body of training data. Also,
overheads related to the training of large neural networks from