
quired schema items are “petid”, “pets”, and “pet age”.
Since Text-to-SQL requires not only schema linking, which aligns the entities mentioned in the question with schema items in the database schema, but also skeleton parsing, which derives the skeleton of the SQL query, the major challenges stem from the large number of required schema items and the complex composition of operators (such as GROUP BY, HAVING, and JOIN ON) involved in a SQL query. The intertwining of schema linking and skeleton parsing complicates learning even more.
To investigate whether the Text-to-SQL task becomes easier when skeleton parsing and schema linking are decoupled, we conduct a preliminary experiment on Spider's dev set. Concretely, we fine-tune a T5-base model to generate pure skeletons from the questions (i.e., the skeleton parsing task). We observe that the exact match accuracy on this task reaches about 80% with the fine-tuned T5-base, whereas even a T5-3B model achieves only about 70% accuracy on the full Text-to-SQL task (Shaw et al. 2021; Scholak, Schucher, and Bahdanau 2021). This preliminary experiment indicates that decoupling the two objectives could be a promising way to reduce the difficulty of Text-to-SQL.
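To make the skeleton parsing target concrete: a skeleton keeps the SQL keywords and masks schema items and values with a placeholder. The following is a minimal regex-based sketch; the function name, keyword list, and masking rules are our own simplifications for illustration, not the paper's implementation:

```python
import re

# Illustrative (non-exhaustive) keyword list; anything else is masked.
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "having", "order",
    "limit", "join", "on", "as", "distinct", "and", "or", "not",
    "in", "exists", "union", "intersect", "except", "asc", "desc",
}

def extract_skeleton(sql: str) -> str:
    """Mask every non-keyword token with '_' and collapse consecutive
    placeholders, yielding a query skeleton."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_.]*|\S", sql)
    masked = [t.lower() if t.lower() in SQL_KEYWORDS else "_" for t in tokens]
    skeleton = []
    for t in masked:
        # Collapse runs of '_' (e.g. "pet_age > 1" becomes one "_").
        if t == "_" and skeleton and skeleton[-1] == "_":
            continue
        skeleton.append(t)
    return " ".join(skeleton)
```

Under this simplification, a query such as `SELECT petid FROM pets WHERE pet_age > 1` reduces to `select _ from _ where _`.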
To realize the above decoupling idea, we propose a Ranking-enhanced Encoding plus a Skeleton-aware Decoding framework for Text-to-SQL (RESDSQL). The former injects only the most relevant schema items into the seq2seq model's encoder instead of all schema items. In other words, schema linking is conducted beforehand to filter out most of the irrelevant schema items in the database schema, which alleviates the schema linking burden on the seq2seq model. To this end, we train an additional cross-encoder to classify the tables and columns simultaneously based on the input question, and then rank and filter them according to the classification probabilities to form a ranked schema sequence.
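The rank-then-filter step described above can be sketched as follows. The probability dictionaries, the cut-off values, and the output format are illustrative assumptions, not the paper's exact hyperparameters:

```python
def rank_and_filter(table_probs, column_probs, top_tables=4, top_columns=5):
    """Build a ranked schema sequence from classification probabilities.

    table_probs:  {table_name: probability of being relevant}
    column_probs: {table_name: {column_name: probability}}
    The cut-offs (top_tables, top_columns) are illustrative values.
    """
    # Keep the most probable tables, in descending probability order.
    tables = sorted(table_probs, key=table_probs.get, reverse=True)[:top_tables]
    parts = []
    for t in tables:
        # Within each kept table, keep its most probable columns.
        cols = sorted(column_probs[t], key=column_probs[t].get,
                      reverse=True)[:top_columns]
        parts.append(f"{t}: {', '.join(cols)}")
    return " | ".join(parts)
```

Because the sequence is ordered by probability, the seq2seq model can also exploit position as a relevance signal.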
The latter adds no new modules but simply requires the seq2seq model's decoder to first generate the SQL skeleton and then the actual SQL query. Since skeleton parsing is much easier than full SQL parsing, the first-generated skeleton can implicitly guide the subsequent SQL parsing via the masked self-attention mechanism in the decoder.
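One plausible way to realize this decoding order is to train on a target sequence that concatenates the skeleton and the query; the separator token below is our own assumption, not necessarily the paper's delimiter:

```python
def build_target(skeleton: str, sql: str, sep: str = " | ") -> str:
    """Training target for skeleton-aware decoding: the decoder emits the
    (easier) skeleton first, then the full query. Because decoding is
    autoregressive with masked self-attention, the query tokens can attend
    to the already-generated skeleton tokens."""
    return skeleton + sep + sql.lower()
```

At inference time the final SQL query is recovered by splitting on the separator and taking the last part.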
Contributions. (1) We investigate a potential way of decoupling skeleton parsing and schema linking to reduce the difficulty of Text-to-SQL. Specifically, we propose a ranking-enhanced encoder to alleviate the schema linking effort and a skeleton-aware decoder to implicitly guide SQL parsing with the skeleton. (2) We conduct extensive evaluations and analyses, showing that our framework not only achieves new SOTA performance on Spider but also exhibits strong robustness.
2 Problem Definition
Database Schema. A relational database is denoted as $D$. The database schema $S$ of $D$ includes (1) a set of $N$ tables $T = \{t_1, t_2, \cdots, t_N\}$, (2) a set of columns $C = \{c^1_1, \cdots, c^1_{n_1}, c^2_1, \cdots, c^2_{n_2}, \cdots, c^N_1, \cdots, c^N_{n_N}\}$ associated with the tables, where $n_i$ is the number of columns in the $i$-th table, and (3) a set of foreign key relations $R = \{(c^i_k, c^j_h) \mid c^i_k, c^j_h \in C\}$, where each $(c^i_k, c^j_h)$ denotes a foreign key relation between column $c^i_k$ and column $c^j_h$. We use $M = \sum_{i=1}^{N} n_i$ to denote the total number of columns in $D$.
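The schema definition above maps naturally onto a plain data structure. The sketch below is only a rendering of the notation (names and the "table.column" encoding of foreign keys are our own choices):

```python
from dataclasses import dataclass

@dataclass
class DatabaseSchema:
    """Plain-data rendering of the schema S = (T, C, R)."""
    tables: list[str]                    # t_1, ..., t_N
    columns: dict[str, list[str]]        # n_i columns per table t_i
    foreign_keys: list[tuple[str, str]]  # (c_k^i, c_h^j) pairs, "table.column"

    @property
    def num_columns(self) -> int:
        # M = sum_{i=1}^{N} n_i
        return sum(len(cols) for cols in self.columns.values())
```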
Original Name and Semantic Name. We use "schema items" to refer uniformly to the tables and columns in the database. Each schema item can be represented by an original name and a semantic name, where the semantic name indicates the semantics of the schema item more precisely. As shown in Figure 1, the semantic names "airline id" and "destination airport" are clearly more interpretable than their original names "uid" and "destairport". Sometimes the semantic name is identical to the original name.
Text-to-SQL Task. Formally, given a question $q$ in natural language and a database $D$ with its schema $S$, the Text-to-SQL task aims to translate $q$ into a SQL query $l$ that can be executed on $D$ to answer the question $q$.
3 Methodology
In this section, we first give an overview of the proposed
RESDSQL framework and then delve into its design details.
3.1 Model Overview
Following Shaw et al. (2021) and Scholak, Schucher, and Bahdanau (2021), we treat Text-to-SQL as a translation task that can be solved by an encoder-decoder transformer model. To address the above challenges, we extend existing seq2seq Text-to-SQL methods by injecting the most relevant schema items into the input sequence and the SQL skeleton into the output sequence, which results in a ranking-enhanced encoder and a skeleton-aware decoder. We provide a high-level overview of the proposed RESDSQL model in Figure 2. The encoder of the seq2seq model receives the ranked schema sequence, so that the schema linking effort can be alleviated during SQL parsing. To obtain such a ranked schema sequence, an additional cross-encoder is proposed to classify the schema items according to the given question, and then we rank and filter them based on the classification probabilities. The decoder of the seq2seq model first parses out the SQL skeleton and then the actual SQL query, so that SQL generation can be implicitly constrained by the previously parsed skeleton. In this way, schema linking and skeleton parsing are, to a certain extent, no longer intertwined but decoupled.
3.2 Ranking-enhanced Encoder
Instead of injecting all schema items, we only consider the
most relevant schema items in the input of the encoder. For
this purpose, we devise a cross-encoder to classify the tables
and columns simultaneously and then rank them based on
their probabilities. Based on the ranking order, on one hand,
we filter out the irrelevant schema items. On the other hand,
we use the ranked schema sequence instead of the unordered
schema sequence, so that the seq2seq model could capture
potential position information for schema linking.
As for the input of the cross-encoder, we flatten the schema items into a schema sequence in their default order and concatenate it with the question to form an input sequence: $X = q \mid t_1 : c^1_1, \cdots, c^1_{n_1} \mid \cdots \mid t_N : c^N_1, \cdots, c^N_{n_N}$,