against baseline methods. We also achieve new state-of-the-art
results on the leaderboard at the time of writing this paper.
Our proposed mechanisms show advantages in the following aspects. (1) TSP and CSP model conversational context from a natural-language-understanding perspective and a database-schema-aware perspective, respectively. (2) Both mechanisms work as auxiliary tasks in multi-task learning, which, compared with pre-training methods, avoids troublesome collection of synthetic conversational data and extensive computational cost. (3) We boost baseline methods significantly and achieve new state-of-the-art results on a large-scale cross-domain benchmark.
2. Related Work
2.1. Semantic Parsing and Text-to-SQL
Semantic parsing has been studied for a long time. Early semantic parsers are generally based on either expert-designed rules [5, 6, 7] or statistical techniques [8, 9, 10]. In recent years, neural semantic parsers have come to the fore. They generally treat semantic parsing as a sequence-to-sequence task and solve it with an encoder-decoder framework [11, 12, 13, 14, 15].
Text-to-SQL takes a large share of all semantic parsing tasks. Previous text-to-SQL work mainly focuses on relatively simple in-domain scenarios, where state-of-the-art models show promising performance [16, 17, 18]. Recently, a cross-domain multi-table text-to-SQL dataset called Spider was proposed [19]. Compared with in-domain text-to-SQL, cross-domain multi-table text-to-SQL requires models to generalize better in both natural language and database schema understanding. To better solve this task, besides pure sequence-to-sequence methods, a skeleton-then-detail paradigm has been proposed and widely applied: it generates a SQL skeleton first and then fills the skeleton with database schema tokens. Models belonging to this paradigm include SQLNet [20], TypeSQL [21], SQLova [22], Coarse2Fine [23], XSQL [24], HydraNet [25], etc. Besides, other strategies have been proposed to enhance text-to-SQL parsers, including intermediate representation enhancement [26, 27, 28], reasoning through GNN models [29, 30, 4, 31, 32], and data augmentation [33, 34].
2.2. Conversational Text-to-SQL
Compared with single-turn text-to-SQL, conversational text-to-SQL requires semantic parsers to understand the context of conversations to make correct SQL predictions. Recently, two large-scale cross-domain benchmarks for conversational text-to-SQL (i.e., SParC and CoSQL [1, 35]) have been constructed, and several studies have been conducted on these two benchmarks. EditSQL [3] takes the predicted SQL from the previous turn and the natural language utterance of the current turn as input, and edits the previous SQL according to the current turn to generate the newly predicted SQL. This method tends to fail when users ask a new question less related to the conversational context. IGSQL [36] addresses this problem by building a graph over the database schema and turns of queries to model context consistency during a conversation. IST-SQL [37] borrows the idea of dialogue state tracking and regards columns as slots whose values are their usage; these slot-value pairs are stored to represent the dialogue state. R²SQL [38] introduces a dynamic schema-linking graph network and several dynamic memory decay mechanisms to track dialogue states, and uses a reranker to filter out easily-detected incorrect predicted SQLs. Yu et al. propose SCoRe [39], a language-model pre-training method tailored for conversational text-to-SQL, which achieves state-of-the-art results on both datasets. However, this method requires large quantities of synthesized conversational semantic parsing data and a relatively high training cost.
3. Problem Formalization
Conversational text-to-SQL is the task of mapping multi-turn natural language queries $u = [u_1, u_2, \cdots, u_T]$ into corresponding SQL logical forms $y = [y_1, y_2, \cdots, y_T]$ w.r.t. a pre-defined database schema $s$, where $T$ is the number of turns of a conversation. A database schema $s = [s_1, s_2, \cdots, s_m]$ denotes all tables and columns of a multi-table database, where each $s_i$ represents a $\langle$Table, Column$\rangle$ pair. The goal of a neural semantic parser is to maximize the probability of predicting the correct SQL $y_t$ given all natural language turns up to turn $t$, i.e.,

$$\max \prod_{t=1}^{T} P(y_t \mid u_{1,\cdots,t}; s) \qquad (1)$$

Different from single-turn semantic parsing, when parsing $y_t$, all utterance turns up to and including the $t$-th turn, i.e., $[u_1, u_2, \cdots, u_t]$, should be considered.
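To make the objective concrete, here is a minimal Python sketch of Eq. (1) turned into a per-conversation training loss. The `parser` object and its `log_prob` method are hypothetical stand-ins for any neural semantic parser, not an interface from this paper.

```python
# A minimal sketch of the objective in Eq. (1); `parser.log_prob` is a
# hypothetical scoring method, not the authors' actual interface.
def conversation_nll(parser, utterances, sqls, schema):
    """Negative log-likelihood of one T-turn conversation.

    utterances: [u_1, ..., u_T], natural language turns
    sqls:       [y_1, ..., y_T], gold SQL logical forms
    schema:     database schema s (<Table, Column> pairs)
    """
    nll = 0.0
    for t in range(len(utterances)):
        # When parsing y_t, all turns u_1..u_t serve as context.
        context = utterances[: t + 1]
        nll -= parser.log_prob(sqls[t], context, schema)  # -log P(y_t | u_{1..t}; s)
    return nll  # minimizing this maximizes the product in Eq. (1)
```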
4. Methodology
In this paper, we propose RAT-SQL-TC for conversational text-to-SQL, which adds two auxiliary tasks to the widely applied RAT-SQL. We introduce the framework of our proposed model and the two proposed tasks in the following sections.
4.1. Overview of RAT-SQL-TC
RAT-SQL is one of the state-of-the-art neural semantic parsers of recent years [4]. It is a unified framework that jointly encodes the relational structure of the database schema and the given question for SQL generation. We take RAT-SQL as the basis of our model. Concretely, we use a relation-aware transformer-based encoder to encode a natural language query into vectors, and a decoder to translate the encoded vectors into an abstract syntax tree (AST). This AST can be further converted into SQL.
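The following sketch summarizes this encode-then-decode flow; the component names and signatures are hypothetical placeholders for the corresponding RAT-SQL pieces, not the actual implementation.

```python
# A hedged sketch of the encode-then-decode flow described above. The
# parameters (`encoder`, `decoder`, `ast_to_sql`, `relations`) are
# hypothetical placeholders for the corresponding RAT-SQL components.
def parse_question(encoder, decoder, ast_to_sql,
                   question_tokens, schema_tokens, relations):
    # The relation-aware transformer jointly encodes question and schema,
    # biasing self-attention with the given schema-linking relations.
    encodings = encoder(question_tokens + schema_tokens, relations)
    # A grammar-based decoder emits an abstract syntax tree (AST),
    # one production rule at a time, attending over the encodings.
    ast = decoder.generate(encodings)
    # The AST is deterministically converted into an executable SQL string.
    return ast_to_sql(ast)
```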
Let $u = [u_1, u_2, \cdots, u_T]$ be a sequential query with $T$ turns, and let $u_i = [u_i^1, u_i^2, \cdots, u_i^{|u_i|}]$, where $u_i^j$ is the $j$-th token of the $i$-th query. Let $s = [s_1, s_2, \cdots, s_M]$ be the corresponding database schema with column names. We obtain the input of the encoder model by joining each turn and each column name. Specifically, we concatenate turns of queries with a special token "⟨s⟩" to indicate the boundary of each turn, and each column name is concatenated with another special token "⟨/s⟩". The combination of the query and the database schema is then fed into the encoder, as shown in Figure 2. This input sequence is processed by a transformer-based encoder similar to that of RAT-SQL, producing a set of encoder vectors with the same length as the input sequence. We follow the AST decoding paradigm of RAT-SQL and use a decoder to generate the predicted SQL from those vectors, with the decoding loss defined as
$$\mathcal{L}_{\text{dec}} = \sum_{i=1}^{|Y|} y_i \log P(y_i \mid y_{<i}, u; s), \qquad (2)$$
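To illustrate the input construction described above, the following runnable sketch flattens the conversation turns and the schema into a single encoder input. The ASCII tokens "<s>" and "</s>" stand in for the special tokens ⟨s⟩ and ⟨/s⟩, and the whitespace tokenization is a simplification for illustration.

```python
# A minimal, runnable sketch of the encoder-input construction: each turn
# is preceded by "<s>" and each column name is followed by "</s>".
def build_encoder_input(turns, schema):
    """turns:  list of tokenized utterances, e.g. [["show", "cities"], ...]
    schema: list of column names, e.g. ["city name", "city population"]
    """
    tokens = []
    for turn in turns:
        tokens += ["<s>"] + turn             # "<s>" marks each turn boundary
    for column in schema:
        tokens += column.split() + ["</s>"]  # "</s>" closes each schema item
    return tokens

# Example:
# build_encoder_input([["show", "cities"], ["only", "large", "ones"]],
#                     ["city name", "city population"])
# -> ['<s>', 'show', 'cities', '<s>', 'only', 'large', 'ones',
#     'city', 'name', '</s>', 'city', 'population', '</s>']
```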