DBPal: A Fully Pluggable NL2SQL Training Pipeline
Nathaniel Weir¹, Prasetya Utama², Alex Galakatos³, Andrew Crotty³, Amir Ilkhechi³, Shekar Ramaswamy³, Rohin Bhushan³, Nadja Geisler², Benjamin Hättasch², Steffen Eger², Ugur Cetintemel³, Carsten Binnig²
¹ Johns Hopkins University {nweir3@jhu.edu}
² TU Darmstadt {first.last@cs.tu-darmstadt.de}
³ Brown University {first_last@brown.edu}
ABSTRACT
Natural language is a promising alternative interface to
DBMSs because it enables non-technical users to formulate
complex questions in a more concise manner than SQL. Re-
cently, deep learning has gained traction for translating natu-
ral language to SQL, since similar ideas have been successful
in the related domain of machine translation. However, the
core problem with existing deep learning approaches is that
they require an enormous amount of training data in or-
der to provide accurate translations. This training data is
extremely expensive to curate, since it generally requires
humans to manually annotate natural language examples
with the corresponding SQL queries (or vice versa).
Based on these observations, we propose DBPal, a new
approach that augments existing deep learning techniques
in order to improve the performance of models for natural
language to SQL translation. More specically, we present
a novel training pipeline that automatically generates syn-
thetic training data in order to (1) improve overall translation
accuracy, (2) increase robustness to linguistic variation, and
(3) specialize the model for the target database. As we show,
our DBPal training pipeline is able to improve both the ac-
curacy and linguistic robustness of state-of-the-art natural
language to SQL translation models.
ACM Reference Format:
Nathaniel Weir et al. 2020. DBPal: A Fully Pluggable NL2SQL Train-
ing Pipeline. In Proceedings of the 2020 ACM SIGMOD Interna-
tional Conference on Management of Data (SIGMOD’20), June 14–
19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 15 pages.
https://doi.org/10.1145/3318464.3380589
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for prot or commercial advantage and that copies bear
this notice and the full citation on the rst page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specic permission and/or a fee. Request
permissions from permissions@acm.org.
SIGMOD’20, June 14–19, 2020, Portland, OR, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-6735-6/20/06...$15.00
https://doi.org/10.1145/3318464.3380589
1 INTRODUCTION
In order to eectively leverage their data, DBMS users are re-
quired to not only have prior knowledge about the database
schema (e.g., table and column names, entity relationships)
but also a working understanding of the syntax and seman-
tics of SQL. Unfortunately, despite its expressiveness, SQL
can often hinder non-technical users from exploring and
making use of data stored in a DBMS. These requirements
set “a high barrier to entry” for data exploration and have
therefore triggered new eorts to develop alternative inter-
faces that allow non-technical users to explore and interact
with their data conveniently.
For example, imagine that a doctor wants to look at the
age distribution of patients with the longest stays in a hospi-
tal. To answer this question, the doctor would either need to
write a complex nested SQL query or work with an analyst
to craft the query. Even with a visual exploration tool (e.g.,
Tableau [1], Vizdom [12]), posing such a query is nontrivial,
since it requires the user to perform multiple interactions
with an understanding of the nested query semantics. Alter-
natively, with a natural language (NL) interface, the query is
as simple as stating: “What is the age distribution of patients
who stayed longest in the hospital?”
Based on this observation, a number of Natural Language
Interfaces to Databases (NLIDBs) have been proposed that
aim to translate natural language to SQL (NL2SQL). The rst
category of solutions are rule-based systems (e.g., NaLIR [
25
,
26
]), which use xed rules for performing translations. Al-
though eective in specic instances, these approaches are
brittle and do not generalize well without substantial addi-
tional eort to support new use cases. More recently, deep
learning techniques [
22
,
43
,
44
] have gained traction for
NL2SQL, since similar ideas have achieved success in the
related domain of machine translation. For example, generic
sequence-to-sequence (seq2seq) [51] models have been suc-
cessfully used in practice for NL2SQL translation, and more
advanced approaches like SyntaxSQLNet [46], which aug-
ments deep learning models with a structured model that
considers the syntax and semantics of SQL, have also been
proposed.
However, a crucial problem with deep learning approaches
is that they require an enormous amount of training data in
order to build accurate models [21, 38]. The aforementioned
approaches have largely ignored this problem and assumed
the availability of large, manually-curated training datasets
(e.g., using crowdsourcing). In almost all cases, however,
gathering and cleaning such data is a substantial undertaking
that requires a signicant amount of time, eort, and money.
Moreover, existing approaches for NL2SQL translation
attempt to build models that generalize to new and unseen
databases, yielding performance that is generally decent but
falls short of what the same models achieve when new queries
target the databases used for training. That is, the training data
used to translate queries for one specific database, such as
queries containing words and phrases pertaining to patients in
a hospital, does not always allow the model to generalize to
queries in other domains, such as databases of geographical
locations or flights.
In order to address these fundamental limitations, we pro-
pose DBPal, a fully pluggable NL2SQL training pipeline that
can be used with any existing NL2SQL deep learning model
to improve translation accuracy. DBPal implements a novel
training pipeline for NLIDBs that synthesizes its training
data using the principle of weak supervision [11, 15].
The basic idea of weak supervision is to leverage various
heuristics and existing datasets to automatically generate
large (and potentially noisy) training data instead of manu-
ally handcrafting training examples. In its basic form, only
the database schema is required as input to generate a large
collection of pairs of NL queries and their corresponding
SQL statements that can be used to train any NL2SQL deep
learning model.
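To illustrate the idea, the following is a minimal sketch of schema-driven data synthesis; the table, columns, and templates are hypothetical stand-ins, not DBPal's actual slot-filling templates.

```python
# Minimal sketch of weakly supervised training-data generation: instantiate
# NL/SQL templates with elements of a schema. All names below are
# illustrative assumptions, not DBPal's actual generator.
import itertools

# Hypothetical schema: table name -> column names.
SCHEMA = {
    "patients": ["name", "age", "length_of_stay", "diagnosis"],
}

# Each template pairs an NL pattern with its SQL counterpart.
TEMPLATES = [
    ("show the {col} of all {table}",
     "SELECT {col} FROM {table}"),
    ("what is the maximum {col} of {table}",
     "SELECT MAX({col}) FROM {table}"),
    ("how many {table} are there",
     "SELECT COUNT(*) FROM {table}"),
]

def generate_pairs(schema):
    """Instantiate every template with every table/column combination."""
    seen = set()
    for (table, columns), (nl_tpl, sql_tpl) in itertools.product(
            schema.items(), TEMPLATES):
        for col in columns:
            pair = (nl_tpl.format(table=table, col=col),
                    sql_tpl.format(table=table, col=col))
            if pair not in seen:      # templates without a {col} slot
                seen.add(pair)        # would otherwise repeat per column
                yield pair

if __name__ == "__main__":
    for nl, sql in generate_pairs(SCHEMA):
        print(f"{nl:<45} -> {sql}")
```

Real pipelines of this kind typically cover far more query shapes (filters, aggregations, joins) and accept some noise in the generated pairs, which is the essence of weak supervision.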
In order to maximize our coverage across natural linguis-
tic variations, DBPal also uses additional input sources to
automatically augment the training data through a variety
of techniques. One such augmentation step, for example, is
an automatic paraphrasing process using an o-the-shelf
paraphrasing database [
29
]. The goal of these augmentation
steps is to make the model robust to dierent linguistic varia-
tions of the same question (e.g., “What is the age distribution
of patients who stayed longest in the hospital?” and “For pa-
tients with the longest hospital stay, what is the distribution
of age?”).
In our evaluation, we show that DBPal, which requires no
manually crafted training data, can eectively improve the
performance of a state-of-the-art deep learning model for
NL2SQL translation. Our results demonstrate that an NLIDB
can be eectively bootstrapped without requiring manual
training data for each new database schema or target domain.
Furthermore, if manually curated training data is available,
such data can still be used to complement our proposed data
generation pipeline.
In summary, we make the following contributions:
• We present DBPal, a fully pluggable natural language to SQL (NL2SQL) training pipeline that automatically synthesizes training data in order to improve the translation accuracy of an existing deep learning model.
• We propose several data augmentation techniques that give the model better coverage and make it more robust towards linguistic variation in NL queries.
• We propose a new benchmark that systematically tests the robustness of an NLIDB to different linguistic variations.
• Using a state-of-the-art deep learning model, we show that our training pipeline can improve translation accuracy by up to almost 40%.
The remainder of this paper is organized as follows. First,
in Section 2, we introduce the overall system architecture of
DBPal. Next, in Section 3, we describe the details of DBPal’s
novel training pipeline, which is based on weak supervision.
We then show how the learned model for NL2SQL translation
is applied at runtime in Section 4. Furthermore, we discuss
the handling of more complex queries like joins and nested
queries in Section 5. In order to demonstrate the eectiveness
of DBPal, we present the results of our extensive evaluation
in Section 6. Finally, we discuss related work in Section 7
and then conclude in Section 8.
2 OVERVIEW
In the following, we rst discuss the overall architecture
of a NLIDB and then discuss DBPal, our proposed training
pipeline based on weak supervision that synthesizes the
training data from a given database schema.
2.1 System Architecture
Figure 1 shows an overview of the architecture of our fully
functional prototype NLIDB, which consists of multiple com-
ponents, including a user interface that allows users to pose
NL questions that are automatically translated into SQL. The
results from the user’s NL query are then returned to the
user in an easy-to-read tabular visualization.
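The query-time flow can be summarized roughly as follows; the translator object and function names are illustrative placeholders, not the actual DBPal interface.

```python
# Rough sketch of the runtime flow: an NL question is translated to SQL by
# the trained model, executed against the DBMS, and the result is returned
# as a table for display. Names here are illustrative placeholders.

def answer(nl_question, translator, connection):
    """Translate an NL question, run the resulting SQL, and return a table."""
    sql = translator.translate(nl_question)    # trained NL2SQL model
    cursor = connection.execute(sql)           # any DB-API connection (e.g. sqlite3)
    columns = [desc[0] for desc in cursor.description]
    return columns, cursor.fetchall()          # rendered as a table in the UI
```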
At the core of our prototype is a Neural Translator, trained
by DBPal's pipeline, which translates incoming NL queries
from a user into SQL queries. Importantly,
our fully pluggable training pipeline is agnostic to the actual
translation model; that is, DBPal is designed to improve
the accuracy of existing NL2SQL deep learning models (e.g.,
SyntaxSQLNet [46]) by generating training data for a given database schema.
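Viewed as code, this pluggable contract could look roughly like the sketch below; the interface and names are our own illustration, and any model that can be trained on (NL, SQL) pairs fits the same slot.

```python
# Sketch of the "fully pluggable" contract: the pipeline only produces
# (NL, SQL) training pairs, so any model exposing a fit/translate interface
# can consume them. The Protocol and names are illustrative assumptions.
from typing import Iterable, Protocol, Tuple

Pair = Tuple[str, str]  # (natural language question, SQL query)

class NL2SQLModel(Protocol):
    def fit(self, pairs: Iterable[Pair]) -> None: ...
    def translate(self, nl_question: str) -> str: ...

def bootstrap(schema, model: NL2SQLModel, generate_pairs, augment):
    """Synthesize training data from the schema alone and train the given model."""
    pairs = list(augment(generate_pairs(schema)))   # e.g. the sketches above
    model.fit(pairs)
    return model
```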
2.1.1 Training Phase. During the training phase, DBPal’s
training pipeline provides existing NL2SQL deep learning
models with large corpora of synthesized training data. This