However, a crucial problem with deep learning approaches is that they require an enormous amount of training data in order to build accurate models [21, 38]. The aforementioned approaches have largely ignored this problem and assumed the availability of large, manually-curated training datasets (e.g., using crowdsourcing). In almost all cases, however, gathering and cleaning such data is a substantial undertaking that requires a significant amount of time, effort, and money.
Moreover, existing approaches for NL2SQL translation attempt to build models that generalize to new and unseen databases, yielding performance that is generally decent but falls short of the accuracy achieved when queries target the databases used for training. That is, the training data used to translate queries for one specific database, such as queries containing words and phrases pertaining to patients in a hospital, does not always allow the model to generalize to queries in other domains, such as databases of geographical locations or flights.
In order to address these fundamental limitations, we propose DBPal, a fully pluggable NL2SQL training pipeline that
can be used with any existing NL2SQL deep learning model
to improve translation accuracy. DBPal implements a novel
training pipeline for NLIDBs that synthesizes its training
data using the principle of weak supervision [11, 15].
The basic idea of weak supervision is to leverage various
heuristics and existing datasets to automatically generate
large (and potentially noisy) training data instead of manually handcrafting training examples. In its basic form, only
the database schema is required as input to generate a large
collection of pairs of NL queries and their corresponding
SQL statements that can be used to train any NL2SQL deep
learning model.
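To illustrate the principle, the following is a minimal sketch of how (NL, SQL) training pairs could be instantiated from a schema; the schema, templates, and helper function are hypothetical illustrations of the weak supervision idea, not DBPal's actual generator.

```python
# A minimal sketch of schema-driven (NL, SQL) pair synthesis. The schema,
# templates, and function below are hypothetical illustrations, not
# DBPal's actual generator.
from itertools import product

schema = {
    "patient": {"columns": ["name", "age", "diagnosis"]},
}

# Paired NL/SQL templates with slots for table and column names.
templates = [
    ("show the {col} of all {table}s",
     "SELECT {col} FROM {table}"),
    ("what is the maximum {col} of all {table}s",
     "SELECT MAX({col}) FROM {table}"),
]

def synthesize(schema, templates):
    """Instantiate every template with every (table, column) combination."""
    pairs = []
    for table, meta in schema.items():
        for col, (nl_t, sql_t) in product(meta["columns"], templates):
            pairs.append((nl_t.format(table=table, col=col),
                          sql_t.format(table=table, col=col)))
    return pairs

for nl, sql in synthesize(schema, templates):
    print(nl, "->", sql)
```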
In order to maximize our coverage across natural linguistic variations, DBPal also uses additional input sources to automatically augment the training data through a variety of techniques. One such augmentation step, for example, is an automatic paraphrasing process using an off-the-shelf paraphrasing database [29]. The goal of these augmentation steps is to make the model robust to different linguistic variations of the same question (e.g., “What is the age distribution of patients who stayed longest in the hospital?” and “For patients with the longest hospital stay, what is the distribution of age?”).
In our evaluation, we show that DBPal, which requires no manually crafted training data, can effectively improve the performance of a state-of-the-art deep learning model for NL2SQL translation. Our results demonstrate that an NLIDB can be effectively bootstrapped without requiring manual training data for each new database schema or target domain.
Furthermore, if manually curated training data is available,
such data can still be used to complement our proposed data
generation pipeline.
In summary, we make the following contributions:
• We present DBPal, a fully pluggable natural language to SQL (NL2SQL) training pipeline that automatically synthesizes training data in order to improve the translation accuracy of an existing deep learning model.
• We propose several data augmentation techniques that give the model better coverage and make it more robust towards linguistic variation in NL queries.
• We propose a new benchmark that systematically tests the robustness of an NLIDB to different linguistic variations.
• Using a state-of-the-art deep learning model, we show that our training pipeline can improve translation accuracy by up to almost 40%.
The remainder of this paper is organized as follows. First,
in Section 2, we introduce the overall system architecture of
DBPal. Next, in Section 3, we describe the details of DBPal’s
novel training pipeline, which is based on weak supervision.
We then show how the learned model for NL2SQL translation
is applied at runtime in Section 4. Furthermore, we discuss
the handling of more complex queries like joins and nested
queries in Section 5. In order to demonstrate the effectiveness of DBPal, we present the results of our extensive evaluation
in Section 6. Finally, we discuss related work in Section 7
and then conclude in Section 8.
2 OVERVIEW
In the following, we first discuss the overall architecture of an NLIDB and then discuss DBPal, our proposed training
pipeline based on weak supervision that synthesizes the
training data from a given database schema.
2.1 System Architecture
Figure 1 shows an overview of the architecture of our fully
functional prototype NLIDB, which consists of multiple components, including a user interface that allows users to pose
NL questions that are automatically translated into SQL. The
results from the user’s NL query are then returned to the
user in an easy-to-read tabular visualization.
At the core of our prototype is a Neural Translator, trained by DBPal’s pipeline, that translates incoming NL queries from a user into SQL queries. Importantly, our fully pluggable training pipeline is agnostic to the actual translation model; that is, DBPal is designed to improve the accuracy of existing NL2SQL deep learning models (e.g., SyntaxSQLNet [46]) by generating training data for a given database schema.
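To sketch what this pluggability could look like in practice, the snippet below writes synthesized pairs to a simple line-delimited JSON file that a downstream model’s training script could consume; the file format and field names are assumptions for illustration, not a format prescribed by DBPal.

```python
# A minimal sketch of handing a synthesized corpus to an existing NL2SQL
# model. The JSONL format and field names below are hypothetical; DBPal
# itself is agnostic to how the downstream model ingests its training data.
import json

def export_training_data(pairs, path):
    """Serialize synthesized (NL, SQL) pairs, one JSON object per line."""
    with open(path, "w") as f:
        for nl, sql in pairs:
            f.write(json.dumps({"question": nl, "query": sql}) + "\n")

pairs = [("show the age of all patients", "SELECT age FROM patient")]
export_training_data(pairs, "synthesized_train.jsonl")
# An existing NL2SQL model (e.g., SyntaxSQLNet) would then be trained on
# synthesized_train.jsonl using its own training script.
```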
2.1.1 Training Phase. During the training phase, DBPal’s
training pipeline provides existing NL2SQL deep learning
models with large corpora of synthesized training data. This