However, a crucial problem with deep learning approaches is that they require an enormous amount of training data in order to build accurate models [21, 38]. The aforementioned approaches have largely ignored this problem and assumed the availability of large, manually-curated training datasets (e.g., using crowdsourcing). In almost all cases, however, gathering and cleaning such data is a substantial undertaking that requires a significant amount of time, effort, and money.
Moreover, existing approaches for NL2SQL translation attempt to build models that generalize to new and unseen databases, yielding performance that is generally decent but falls short of the accuracy achieved when queries target the databases used for training. That is, the training data used to translate queries for one specific database, such as queries containing words and phrases pertaining to patients in a hospital, does not always allow the model to generalize to queries in other domains, such as databases of geographical locations or flights.
In order to address these fundamental limitations, we propose DBPal, a fully pluggable NL2SQL training pipeline that
can be used with any existing NL2SQL deep learning model
to improve translation accuracy. DBPal implements a novel
training pipeline for NLIDBs that synthesizes its training
data using the principle of weak supervision [11, 15].
The basic idea of weak supervision is to leverage various
heuristics and existing datasets to automatically generate
large (and potentially noisy) training data instead of manually handcrafting training examples. In its basic form, only
the database schema is required as input to generate a large
collection of pairs of NL queries and their corresponding
SQL statements that can be used to train any NL2SQL deep
learning model.
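To illustrate the principle, the following is a minimal sketch of how (NL, SQL) training pairs could be instantiated from a schema; the schema, templates, and helper function are hypothetical illustrations of the weak supervision idea, not DBPal's actual generator.

```python
# A minimal sketch of schema-driven (NL, SQL) pair synthesis. The schema,
# templates, and function below are hypothetical illustrations, not
# DBPal's actual generator.
from itertools import product

schema = {
    "patient": {"columns": ["name", "age", "diagnosis"]},
}

# Paired NL/SQL templates with slots for table and column names.
templates = [
    ("show the {col} of all {table}s",
     "SELECT {col} FROM {table}"),
    ("what is the maximum {col} of all {table}s",
     "SELECT MAX({col}) FROM {table}"),
]

def synthesize(schema, templates):
    """Instantiate every template with every (table, column) combination."""
    pairs = []
    for table, meta in schema.items():
        for col, (nl_t, sql_t) in product(meta["columns"], templates):
            pairs.append((nl_t.format(table=table, col=col),
                          sql_t.format(table=table, col=col)))
    return pairs

for nl, sql in synthesize(schema, templates):
    print(nl, "->", sql)
```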
In order to maximize our coverage across natural linguistic variations, DBPal also uses additional input sources to automatically augment the training data through a variety of techniques. One such augmentation step, for example, is an automatic paraphrasing process using an off-the-shelf paraphrasing database [29]. The goal of these augmentation steps is to make the model robust to different linguistic variations of the same question (e.g., “What is the age distribution of patients who stayed longest in the hospital?” and “For patients with the longest hospital stay, what is the distribution of age?”).
In our evaluation, we show that DBPal, which requires no manually crafted training data, can effectively improve the performance of a state-of-the-art deep learning model for NL2SQL translation. Our results demonstrate that an NLIDB can be effectively bootstrapped without requiring manual training data for each new database schema or target domain.
Furthermore, if manually curated training data is available,
such data can still be used to complement our proposed data
generation pipeline.
In summary, we make the following contributions:
• We present DBPal, a fully pluggable natural language to SQL (NL2SQL) training pipeline that automatically synthesizes training data in order to improve the translation accuracy of an existing deep learning model.
• We propose several data augmentation techniques that give the model better coverage and make it more robust towards linguistic variation in NL queries.
• We propose a new benchmark that systematically tests the robustness of an NLIDB to different linguistic variations.
• Using a state-of-the-art deep learning model, we show that our training pipeline can improve translation accuracy by up to almost 40%.
The remainder of this paper is organized as follows. First,
in Section 2, we introduce the overall system architecture of
DBPal. Next, in Section 3, we describe the details of DBPal’s
novel training pipeline, which is based on weak supervision.
We then show how the learned model for NL2SQL translation
is applied at runtime in Section 4. Furthermore, we discuss
the handling of more complex queries like joins and nested
queries in Section 5. In order to demonstrate the effectiveness of DBPal, we present the results of our extensive evaluation
in Section 6. Finally, we discuss related work in Section 7
and then conclude in Section 8.
2 OVERVIEW
In the following, we first discuss the overall architecture of an NLIDB and then discuss DBPal, our proposed training
pipeline based on weak supervision that synthesizes the
training data from a given database schema.
2.1 System Architecture
Figure 1 shows an overview of the architecture of our fully
functional prototype NLIDB, which consists of multiple components, including a user interface that allows users to pose
NL questions that are automatically translated into SQL. The
results from the user’s NL query are then returned to the
user in an easy-to-read tabular visualization.
At the core of our prototype is a Neural Translator, trained by DBPal’s pipeline, that translates incoming NL queries from a user into SQL queries. Importantly, our fully pluggable training pipeline is agnostic to the actual translation model; that is, DBPal is designed to improve the accuracy of existing NL2SQL deep learning models (e.g., SyntaxSQLNet [46]) by generating training data for a given database schema.
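To sketch what this pluggability could look like in practice, the snippet below writes synthesized pairs to a simple line-delimited JSON file that a downstream model’s training script could consume; the file format and field names are assumptions for illustration, not a format prescribed by DBPal.

```python
# A minimal sketch of handing a synthesized corpus to an existing NL2SQL
# model. The JSONL format and field names below are hypothetical; DBPal
# itself is agnostic to how the downstream model ingests its training data.
import json

def export_training_data(pairs, path):
    """Serialize synthesized (NL, SQL) pairs, one JSON object per line."""
    with open(path, "w") as f:
        for nl, sql in pairs:
            f.write(json.dumps({"question": nl, "query": sql}) + "\n")

pairs = [("show the age of all patients", "SELECT age FROM patient")]
export_training_data(pairs, "synthesized_train.jsonl")
# An existing NL2SQL model (e.g., SyntaxSQLNet) would then be trained on
# synthesized_train.jsonl using its own training script.
```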
2.1.1 Training Phase. During the training phase, DBPal’s
training pipeline provides existing NL2SQL deep learning
models with large corpora of synthesized training data. This