against baseline methods. We also achieve new state-of-the-art
results on the leaderboard at the time of writing this paper.
Our proposed mechanisms show advantages in the following aspects. (1) TSP and CSP model conversational context from a natural-language-understanding perspective and a database-schema-aware perspective, respectively. (2) Both mechanisms work as auxiliary tasks in multi-task learning, which, compared with pre-training methods, avoids troublesome collection of synthetic conversational data and extensive computational cost. (3) We boost baseline methods significantly and achieve new state-of-the-art results on a large-scale cross-domain benchmark.
2. Related Work
2.1. Semantic Parsing and Text-to-SQL
Semantic parsing has been studied for a long time. Early semantic parsers are generally based on either expert-designed rules [5, 6, 7] or statistical techniques [8, 9, 10]. In recent years, neural semantic parsers have come to the fore. They generally treat semantic parsing as a sequence-to-sequence task and solve it with an encoder-decoder framework [11, 12, 13, 14, 15].
Text-to-SQL takes a large share of all semantic parsing tasks. Previous text-to-SQL work mainly focuses on relatively simple in-domain scenarios, where state-of-the-art models show promising performance [16, 17, 18]. Recently, a cross-domain multi-table text-to-SQL dataset called Spider was proposed [19]. Compared with in-domain text-to-SQL, cross-domain multi-table text-to-SQL requires models to generalize better in both natural language and database schema understanding. To better solve this task, besides pure sequence-to-sequence methods, a skeleton-then-detail paradigm has been proposed and widely applied: it generates a SQL skeleton first and then fills the skeleton with database schema tokens. Models belonging to this paradigm include SQLNet [20], TypeSQL [21], SQLova [22], Coarse2Fine [23], XSQL [24], HydraNet [25], etc. Besides, other strategies have been proposed to enhance text-to-SQL parsers, including intermediate representation enhancement [26, 27, 28], reasoning through GNN models [29, 30, 4, 31, 32], and data augmentation [33, 34].
2.2. Conversational Text-to-SQL
Compared with single-turn text-to-SQL, conversational text-to-SQL requires semantic parsers to understand the context of conversations to make correct SQL predictions. Recently, two large-scale cross-domain benchmarks for conversational text-to-SQL (i.e., SParC and CoSQL [1, 35]) have been constructed, and several studies have been conducted on these two benchmarks. EditSQL [3] takes the predicted SQL from the previous turn and the natural language utterance of the current turn as input, and edits the previous SQL according to the current turn to generate the newly predicted SQL. This method tends to fail when users ask a new question less related to the conversational context. IGSQL [36] addresses this problem by building a graph over the database schema and turns of queries to model context consistency during a conversation. IST-SQL [37] borrows the idea of dialogue state tracking and regards columns as slots whose values are their usage; these slot-value pairs are stored to represent the dialogue state. R²SQL [38] introduces a dynamic schema-linking graph network and several dynamic memory decay mechanisms to track dialogue states, and uses a reranker to filter out easily-detected incorrect predicted SQLs. Yu et al. propose SCoRe [39], a language-model pre-training method tailored for conversational text-to-SQL, which achieves state-of-the-art results on both datasets. However, this method requires large quantities of synthesized conversational semantic parsing data and a relatively high training cost.
3. Problem Formalization
Conversational text-to-SQL is the task of mapping multi-turn natural language queries $u = [u_1, u_2, \cdots, u_T]$ into corresponding SQL logical forms $y = [y_1, y_2, \cdots, y_T]$ w.r.t. a pre-defined database schema $s$, where $T$ is the number of turns of a conversation. A database schema $s = [s_1, s_2, \cdots, s_m]$ denotes all tables and columns of a multi-table database, where each $s_i$ represents a $\langle$Table, Column$\rangle$ pair. The goal of a neural semantic parser is to maximize the probability of predicting the correct SQL $y_t$ given all natural language turns up to turn $t$, i.e.,

$$\max \prod_{t=1}^{T} P(y_t \mid u_{1,\cdots,t}; s) \qquad (1)$$

Different from single-turn semantic parsing, when parsing $y_t$, all utterance turns up to and including the $t$-th turn, i.e., $[u_1, u_2, \cdots, u_t]$, should be considered.
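To make the objective concrete, here is a minimal Python sketch of Eq. (1) turned into a per-conversation training loss. The `parser` object and its `log_prob` method are hypothetical stand-ins for any neural semantic parser, not an interface from this paper.

```python
# A minimal sketch of the objective in Eq. (1); `parser.log_prob` is a
# hypothetical scoring method, not the authors' actual interface.
def conversation_nll(parser, utterances, sqls, schema):
    """Negative log-likelihood of one T-turn conversation.

    utterances: [u_1, ..., u_T], natural language turns
    sqls:       [y_1, ..., y_T], gold SQL logical forms
    schema:     database schema s (<Table, Column> pairs)
    """
    nll = 0.0
    for t in range(len(utterances)):
        # When parsing y_t, all turns u_1..u_t serve as context.
        context = utterances[: t + 1]
        nll -= parser.log_prob(sqls[t], context, schema)  # -log P(y_t | u_{1..t}; s)
    return nll  # minimizing this maximizes the product in Eq. (1)
```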
4. Methodology
In this paper, we propose RAT-SQL-TC for conversational text-to-SQL, which adds two auxiliary tasks to the widely applied RAT-SQL. We introduce the framework of our proposed model and the two proposed tasks in the following sections.
4.1. Overview of RAT-SQL-TC
RAT-SQL is one of the state-of-the-art neural semantic parsers of recent years [4]. It is a unified framework that jointly encodes the relational structure of the database schema and the given question for SQL generation. We take RAT-SQL as the basis of our model. Concretely, we use a relation-aware transformer-based encoder to encode a natural language query into vectors, and a decoder to translate the encoded vectors into an abstract syntax tree (AST). This AST can be further converted into SQL.
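The following sketch summarizes this encode-then-decode flow; the component names and signatures are hypothetical placeholders for the corresponding RAT-SQL pieces, not the actual implementation.

```python
# A hedged sketch of the encode-then-decode flow described above. The
# parameters (`encoder`, `decoder`, `ast_to_sql`, `relations`) are
# hypothetical placeholders for the corresponding RAT-SQL components.
def parse_question(encoder, decoder, ast_to_sql,
                   question_tokens, schema_tokens, relations):
    # The relation-aware transformer jointly encodes question and schema,
    # biasing self-attention with the given schema-linking relations.
    encodings = encoder(question_tokens + schema_tokens, relations)
    # A grammar-based decoder emits an abstract syntax tree (AST),
    # one production rule at a time, attending over the encodings.
    ast = decoder.generate(encodings)
    # The AST is deterministically converted into an executable SQL string.
    return ast_to_sql(ast)
```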
Let $u = [u_1, u_2, \cdots, u_T]$ be a sequential query with $T$ turns, and let $u_i = [u_i^1, u_i^2, \cdots, u_i^{|u_i|}]$, where $u_i^j$ is the $j$-th token of the $i$-th query. Let $s = [s_1, s_2, \cdots, s_M]$ be the corresponding database schema with column names. We obtain the input of the encoder model by joining each turn and each column name. Specifically, we concatenate turns of queries with a special token "⟨s⟩" to indicate the boundary of each turn, and each column name is concatenated with another special token "⟨/s⟩". The combination of the query and the database schema is then fed into the encoder, as shown in Figure 2. This input sequence is processed by a transformer-based encoder similar to that of RAT-SQL, producing a set of encoder vectors with the same length as the input sequence. We follow the AST decoding paradigm of RAT-SQL and use a decoder to generate the predicted SQL from those vectors, with the decoding loss defined as
$$\mathcal{L}_{\text{dec}} = \sum_{i=1}^{|Y|} y_i \log P(y_i \mid y_{<i}, u; s), \qquad (2)$$
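To illustrate the input construction described above, the following runnable sketch flattens the conversation turns and the schema into a single encoder input. The ASCII tokens "<s>" and "</s>" stand in for the special tokens ⟨s⟩ and ⟨/s⟩, and the whitespace tokenization is a simplification for illustration.

```python
# A minimal, runnable sketch of the encoder-input construction: each turn
# is preceded by "<s>" and each column name is followed by "</s>".
def build_encoder_input(turns, schema):
    """turns:  list of tokenized utterances, e.g. [["show", "cities"], ...]
    schema: list of column names, e.g. ["city name", "city population"]
    """
    tokens = []
    for turn in turns:
        tokens += ["<s>"] + turn             # "<s>" marks each turn boundary
    for column in schema:
        tokens += column.split() + ["</s>"]  # "</s>" closes each schema item
    return tokens

# Example:
# build_encoder_input([["show", "cities"], ["only", "large", "ones"]],
#                     ["city name", "city population"])
# -> ['<s>', 'show', 'cities', '<s>', 'only', 'large', 'ones',
#     'city', 'name', '</s>', 'city', 'population', '</s>']
```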