
volve more than one database, and require the
model to be able to generalize to and handle unseen
databases during evaluation. To meet this
need, Zhong et al. (2017) released the WikiSQL
dataset. It consists of 80,654 question/SQL pairs
for 24,241 single-table databases.
They propose a new data split setting to ensure that
databases in train/dev/test do not overlap. However,
they focus on very simple SQL queries containing
one SELECT statement with one WHERE clause.
In addition, Sun et al. (2020) released TableQA, a
Chinese dataset similar to the WikiSQL dataset.
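The non-overlapping split described above can be sketched as follows. This is a minimal illustration, not the WikiSQL authors' procedure: the function name `split_by_database`, the `db_id` field, the 80/10/10 ratios, and the toy examples are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_database(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole databases to train/dev/test so that no database is shared."""
    # Group question/SQL pairs by the database they are asked against.
    by_db = defaultdict(list)
    for ex in examples:
        by_db[ex["db_id"]].append(ex)

    # Shuffle database ids (not individual examples) with a fixed seed.
    db_ids = sorted(by_db)
    random.Random(seed).shuffle(db_ids)

    n_train = int(len(db_ids) * ratios[0])
    n_dev = int(len(db_ids) * ratios[1])
    groups = {
        "train": db_ids[:n_train],
        "dev": db_ids[n_train:n_train + n_dev],
        "test": db_ids[n_train + n_dev:],
    }
    # Every example follows its database into exactly one split.
    return {name: [ex for db in ids for ex in by_db[db]]
            for name, ids in groups.items()}


# Hypothetical toy data: 50 question/SQL pairs spread over 10 databases.
examples = [{"db_id": f"db{i % 10}", "question": f"q{i}", "sql": "SELECT 1"}
            for i in range(50)]
splits = split_by_database(examples)
```

Splitting at the database level, rather than at the example level, is what forces a model to handle schemas it has never seen during training.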
Yu et al. (2018b) released a more challenging
Spider dataset, consisting of 10,181 question/SQL
pairs against 200 multi-table databases. Compared
with WikiSQL and TableQA, Spider is much more
complex for two reasons: 1) the need to select
relevant tables; 2) many nested queries and
advanced SQL clauses such as GROUP BY and ORDER
BY.
As far as we know, most existing datasets are
constructed for English. Another issue is that they
do not take the question distribution of real-world
applications into account during data construction.
Take Spider as an example: given a database, annotators
are asked to write many SQL queries from
scratch. The only requirement is that the SQL queries
cover a list of SQL clauses and nested
queries. Meanwhile, the annotators write the NL questions
corresponding to the SQL queries. In particular,
all these datasets contain very few questions involv-
ing calculations between rows or columns, which
we find are very common in real applications.
This paper presents DuSQL, a large-scale and
pragmatic Chinese text-to-SQL dataset, contain-
ing 200 databases, 813 tables, and 23,797 ques-
tion/SQL pairs. Specifically, our contributions are
summarized as follows.
• In order to determine a more realistic distribution
of SQL queries, we collect user questions from
three representative database-oriented applica-
tions and perform manual analysis. In particular,
we find that a considerable proportion of ques-
tions require row/column calculations, which are
not included in existing datasets.
• We adopt an effective data construction frame-
work via human-computer collaboration. The ba-
sic idea is automatically generating SQL queries
based on the SQL grammar and constrained by
the given database. For each SQL query, we first
generate a pseudo question by traversing it in the
execution order and then ask annotators to
paraphrase it into an NL question.
Figure 2: The SQL query distributions of the three
applications. Note that a query may belong
to multiple types.
• We conduct experiments on DuSQL using
three open-source parsing models. In par-
ticular, we extend the state-of-the-art IRNet
(Guo et al., 2019) model to accommodate
the characteristics of DuSQL. Results and
analysis show that DuSQL is a very challenging
dataset. We will release our data at
https://github.com/luge-ai/luge-ai/tree/master/semantic-parsing.
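The pseudo-question step in the second contribution can be illustrated with a small sketch. It assumes a query is already available as a dictionary of clauses and walks those clauses in the logical execution order of SQL (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT). The templates, the dictionary layout, and the English wording are hypothetical; the text above does not give the actual templates, and DuSQL questions are in Chinese.

```python
# Logical order in which a SQL engine evaluates the clauses of a query.
EXECUTION_ORDER = ["from", "where", "group_by", "having",
                   "select", "order_by", "limit"]

# Hypothetical templates, one per clause (not taken from the paper).
TEMPLATES = {
    "from": "in table {}",
    "where": "where {}",
    "group_by": "for each {}",
    "having": "keeping groups with {}",
    "select": "give {}",
    "order_by": "sorted by {}",
    "limit": "top {} only",
}


def pseudo_question(query):
    """Render a stilted pseudo question by visiting clauses in execution order."""
    parts = [TEMPLATES[clause].format(query[clause])
             for clause in EXECUTION_ORDER if clause in query]
    return ", ".join(parts)


# Hypothetical query over a made-up `city` table.
query = {
    "select": "name and population / area",
    "from": "city",
    "where": "area > 100",
    "order_by": "population / area descending",
}
print(pseudo_question(query))
# in table city, where area > 100, give name and population / area, sorted by population / area descending
```

The output is deliberately mechanical: in the framework described above, it only serves as a prompt that a human annotator paraphrases into a natural question.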
2 SQL Query Distribution
As far as we know, existing text-to-SQL datasets
mainly consider the complexity of SQL syntax
when creating SQL queries. For example, WikiSQL
has only simple SQL queries containing SELECT
and WHERE clauses. Spider covers 15 SQL
clauses including SELECT, WHERE, ORDER BY,
GROUP BY, etc., and allows nested queries.
However, to build a pragmatic text-to-SQL sys-
tem that allows ordinary users to directly interact
with databases via NL questions, it is very impor-
tant to know the SQL query distribution in real-
world applications, from the aspect of user need
rather than SQL syntax. Our analysis shows that
Spider mainly covers three types of SQL queries,
i.e., matching, sorting, and clustering, whereas
WikiSQL only has matching queries. Neither of
them contains the calculation type, which we find
makes up a large portion of questions in certain
real-world applications.
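To make the four query types concrete, the sketch below runs one query of each kind against an in-memory SQLite database. The `city` table, its values, and the mapping from type to SQL construct (matching as a WHERE filter, sorting as ORDER BY, clustering as GROUP BY, calculation as arithmetic between columns or between rows) are illustrative assumptions, not examples taken from DuSQL.

```python
import sqlite3

# Hypothetical single-table database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city (name TEXT, province TEXT, population REAL, area REAL)"
)
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?, ?)",
    [("A", "P1", 100.0, 50.0), ("B", "P1", 300.0, 100.0), ("C", "P2", 200.0, 40.0)],
)

QUERIES = {
    # Matching: filter rows by a condition.
    "matching": "SELECT name FROM city WHERE area > 90",
    # Sorting: rank rows and keep the top one.
    "sorting": "SELECT name FROM city ORDER BY population DESC LIMIT 1",
    # Clustering: aggregate over groups.
    "clustering": "SELECT province, COUNT(*) FROM city GROUP BY province",
    # Column calculation: arithmetic between two columns of the same row.
    "column_calculation": "SELECT name, population / area FROM city WHERE name = 'C'",
    # Row calculation: arithmetic between values taken from two different rows.
    "row_calculation": (
        "SELECT a.population - b.population FROM city a, city b "
        "WHERE a.name = 'B' AND b.name = 'A'"
    ),
}

results = {kind: conn.execute(sql).fetchall() for kind, sql in QUERIES.items()}
for kind, rows in results.items():
    print(kind, rows)
```

The last two queries are the kind that the analysis above finds missing from WikiSQL and Spider: the answer is not stored in any cell and has to be computed from several cells.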
To find out the SQL query distribution in real-
life applications, we consider the following three