暂无图片
暂无图片
暂无图片
暂无图片
暂无图片
中国人民大学-RESDSQL- Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL.pdf
846
9页
1次
2023-03-10
免费下载
RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL
Haoyang Li
1,2,3
, Jing Zhang
1,2,3
*
, Cuiping Li
1,2,3
, Hong Chen
1,2,3
1
Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China
2
Engineering Research Center of Ministry of Education on Database and BI
3
Information School, Renmin University of China
{lihaoyang.cs, zhang-jing, licuiping, chong}@ruc.edu.cn
Abstract
One of the recent best attempts at Text-to-SQL is the pre-
trained language model. Due to the structural property of the
SQL queries, the seq2seq model takes the responsibility of
parsing both the schema items (i.e., tables and columns) and
the skeleton (i.e., SQL keywords). Such coupled targets in-
crease the difficulty of parsing the correct SQL queries es-
pecially when they involve many schema items and logic
operators. This paper proposes a ranking-enhanced encod-
ing and skeleton-aware decoding framework to decouple the
schema linking and the skeleton parsing. Specifically, for a
seq2seq encoder-decode model, its encoder is injected by
the most relevant schema items instead of the whole un-
ordered ones, which could alleviate the schema linking ef-
fort during SQL parsing, and its decoder first generates the
skeleton and then the actual SQL query, which could im-
plicitly constrain the SQL parsing. We evaluate our pro-
posed framework on Spider and its three robustness vari-
ants: Spider-DK, Spider-Syn, and Spider-Realistic. The ex-
perimental results show that our framework delivers promis-
ing performance and robustness. Our code is available at
https://github.com/RUCKBReasoning/RESDSQL.
1 Introduction
Relational databases that are used to store heterogeneous
data types including text, integer, float, etc., are omnipresent
in modern data management systems. However, ordinary
users usually cannot make the best use of databases be-
cause they are not good at translating their requirements to
the database language—i.e., the structured query language
(SQL). To assist these non-professional users in querying
the databases, researchers propose the Text-to-SQL task (Yu
et al. 2018a; Cai et al. 2018), which aims to automati-
cally translate users’ natural language questions into SQL
queries. At the same time, related benchmarks are becom-
ing increasingly complex, from the single-domain bench-
marks such as ATIS (Iyer et al. 2017) and GeoQuery (Zelle
and Mooney 1996) to the cross-domain benchmarks such as
WikiSQL (Zhong, Xiong, and Socher 2017) and Spider (Yu
et al. 2018c). Most of the recent works are done on Spi-
der because it is the most challenging Text-to-SQL bench-
mark which involves many complex SQL operators (such
*
Jing Zhang is the corresponding author.
Copyright © 2023, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Serialize database schema into a schema sequence (using original names)
Question
What are flight numbers of flights departing from City "Aberdeen"?
Database schema
airlines (airlines)
country
(country)
abbreviation
(abbreviation)
uid
(airline id)
airline
(airline name)
airports (airports)
country
(country)
airportname
(airport name)
city
(city)
airportcode
(airport code)
countryabbrev
(country abbrev)
flights (flights)
destairport
(destination airport)
sourceairport
(source airport)
airline
(airline)
flightno
(flight number)
airlines : uid , airline , abbreviation , county | airports :
city , airportcode , airportname , county , countyabbrev |
flights : airline , flightno , sourceairport , destairport
Question + schema sequence
Target SQL
select flights.flightno from flights join airports on flights.
sourceairport = airports.airportcode where airports.city = “Aberdeen”
Foreign key relations
Sequence-to-sequence PLM (such as BART and T5)
Figure 1: Illustration of a Text-to-SQL instance solved by a
seq2seq PLM. In the database schema, each schema item is
denoted by “original name (semantic name)”.
as GROUP BY, ORDER BY, and HAVING, etc.) and nested
SQL queries.
With the recent advances in pre-trained language mod-
els (PLMs), many existing works formulate the Text-to-SQL
task as a semantic parsing problem and use a sequence-to-
sequence (seq2seq) model to solve it (Scholak, Schucher,
and Bahdanau 2021; Shi et al. 2021; Shaw et al. 2021). Con-
cretely, as shown in Figure 1, given a question and a database
schema, the schema items are serialized into a schema se-
quence where the order of the schema items is either default
or random. Then, a seq2seq PLM, such as BART (Lewis
et al. 2020) and T5 (Raffel et al. 2020), is leveraged to
generate the SQL query based on the concatenation of the
question and the schema sequence. The SQL query contains
not only the skeleton that reveals the logic of the question
but also the required schema items. For instance, for a SQL
query: “SELECT petid FROM pets WHERE pet age = 1”,
its skeleton is “SELECT FROM WHERE and its re-
arXiv:2302.05965v2 [cs.CL] 22 Feb 2023
quired schema items are “petid”, “pets”, and “pet age”.
Since Text-to-SQL needs to perform not only the schema
linking which aligns the mentioned entities in the question
to schema items in the database schema, but also the skele-
ton parsing which parses out the skeleton of the SQL query,
the major challenges are caused by a large number of re-
quired schema items and the complex composition of opera-
tors such as GROUP BY, HAVING, and JOIN ON involved
in a SQL query. The intertwining of the schema linking and
the skeleton parsing complicates learning even more.
To investigate whether the Text-to-SQL task could be-
come easier if the skeleton parsing and the schema link-
ing are decoupled, we conduct a preliminary experiment
on Spider’s dev set. Concretely, we fine-tune a T5-base
model to generate the pure skeletons based on the ques-
tions (i.e., skeleton parsing task). We observe that the exact
match accuracy on such a task achieves about 80% using the
fine-tuned T5-base. However, even the T5-3B model only
achieves about 70% accuracy (Shaw et al. 2021; Scholak,
Schucher, and Bahdanau 2021). This pre-experiment indi-
cates that decoupling such two objectives could be a poten-
tial way of reducing the difficulty of Text-to-SQL.
To realize the above decoupling idea, we propose
a Ranking-enhanced Encoding plus a Skeleton-aware
Decoding framework for Text-to-SQL (RESDSQL). The
former injects a few but most relevant schema items into
the seq2seq model’s encoder instead of all schema items. In
other words, the schema linking is conducted beforehand to
filter out most of the irrelevant schema items in the database
schema, which can alleviate the difficulty of the schema
linking for the seq2seq model. For such purpose, we train an
additional cross-encoder to classify the tables and columns
simultaneously based on the input question, and then rank
and filter them according to the classification probabilities
to form a ranked schema sequence.
The latter does not add any new modules but simply al-
lows the seq2seq model’s decoder to first generate the SQL
skeleton, and then the actual SQL query. Since skeleton
parsing is much easier than whole SQL parsing, the first gen-
erated skeleton could implicitly guide the subsequent SQL
parsing via the masked self-attention mechanism in the de-
coder.
Contributions. (1) We investigate a potential way of de-
coupling the skeleton parsing and the schema linking to
reduce the difficulty of Text-to-SQL. Specifically, we pro-
pose a ranking-enhanced encoder to alleviate the effort of
the schema linking and a skeleton-aware decoder to implic-
itly guide the SQL parsing by the skeleton. (2) We conduct
extensive evaluations and analysis and show that our frame-
work not only achieves the new SOTA performance on Spi-
der but also exhibits strong robustness.
2 Problem Definition
Database Schema. A relational database is denoted as
D. The database schema S of D includes (1) a set of
N tables T = {t
1
, t
2
, · · · , t
N
}, (2) a set of columns
C = {c
1
1
, · · · , c
1
n
1
, c
2
1
, · · · , c
2
n
2
, · · · , c
N
1
, · · · , c
N
n
N
} associ-
ated with the tables, where n
i
is the number of columns in
the i-th table, (3) and a set of foreign key relations R =
{(c
i
k
, c
j
h
)|c
i
k
, c
j
h
C}, where each (c
i
k
, c
j
h
) denotes a for-
eign key relation between column c
i
k
and column c
j
h
. We use
M =
P
N
i=1
n
i
to denote the total number of columns in D.
Original Name and Semantic Name. We use “schema
items” to uniformly refer to tables and columns in the
database. Each schema item can be represented by an origi-
nal name and a semantic name. The semantic name can in-
dicate the semantics of the schema item more precisely. As
shown in Figure 1, it is obvious that the semantic names “air-
line id” and “destination airport” are more clear than their
original names “uid” and “destairport”. Sometimes the se-
mantic name is the same as the original name.
Text-to-SQL Task. Formally, given a question q in natural
language and a database D with its schema S, the Text-to-
SQL task aims to translate q into a SQL query l that can be
executed on D to answer the question q.
3 Methodology
In this section, we first give an overview of the proposed
RESDSQL framework and then delve into its design details.
3.1 Model Overview
Following Shaw et al. (2021); Scholak, Schucher, and Bah-
danau (2021), we treat Text-to-SQL as a translation task,
which can be solved by an encoder-decoder transformer
model. Facing the above problems, we extend the existing
seq2seq Text-to-SQL methods by injecting the most relevant
schema items in the input sequence and the SQL skeleton
in the output sequence, which results in a ranking-enhanced
encoder and a skeleton-aware decoder. We provide the high-
level overview of the proposed RESDSQL model in Fig-
ure 2. The encoder of the seq2seq model receives the ranked
schema sequence, such that the schema linking effort could
be alleviated during SQL parsing. To obtain such a ranked
schema sequence, an additional cross-encoder is proposed
to classify the schema items according to the given ques-
tion, and then we rank and filter them based on the classifi-
cation probabilities. The decoder of the seq2seq model first
parses out the SQL skeleton and then the actual SQL query,
such that the SQL generation can be implicitly constrained
by the previously parsed skeleton. By doing this, to a certain
extent, the schema linking and the skeleton parsing are not
intertwined but decoupled.
3.2 Ranking-enhanced Encoder
Instead of injecting all schema items, we only consider the
most relevant schema items in the input of the encoder. For
this purpose, we devise a cross-encoder to classify the tables
and columns simultaneously and then rank them based on
their probabilities. Based on the ranking order, on one hand,
we filter out the irrelevant schema items. On the other hand,
we use the ranked schema sequence instead of the unordered
schema sequence, so that the seq2seq model could capture
potential position information for schema linking.
As for the input of the cross-encoder, we flatten the
schema items into a schema sequence in their default or-
der and concatenate it with the question to form an input
sequence: X = q | t
1
: c
1
1
, · · · , c
1
n
1
| · · · | t
N
: c
N
1
, · · · , c
N
n
N
,
of 9
免费下载
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文档的来源(墨天轮),文档链接,文档作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论

关注
最新上传
暂无内容,敬请期待...
下载排行榜
Top250 周榜 月榜