[Figure 1 — three panels: "Large and Realistic Database Values", "External Knowledge Reasoning", and "SQL Execution Efficiency", each pairing a natural-language question with a database snippet, the required external knowledge, and the resulting SQL query (e.g., converting salary strings such as "US$57,500.00" to REAL, restricting loan eligibility to "OWNER" accounts, and rewriting an IN subquery as a faster EXISTS query).]
Figure 1: Examples of challenges in our BIRD benchmark. 1) Databases contain values of noisy data types [14, 24, 19, 32]. In the left example, the average salary can be fetched only after converting the data type from string (TEXT in SQLite) to float (REAL in SQLite) and deleting the special tokens "US$" and ",". 2) External knowledge and reasoning are required. In the middle example, models must recognize that only "OWNER" accounts are eligible for loans. 3) Query execution efficiency needs to be considered. In the right example, adopting a more efficient SQL query leads to significant gains in speed, which is of great value in industry.
models, including those based on large language models (LLMs), have led to impressive performance on existing benchmarks such as Spider [55] and WikiSQL [60]. For instance, the execution accuracy of the top-performing model on the Spider leaderboard has increased from 53.5% [61] to 85.3% [35] over the past three years. The latest SOTA parser [35] on Spider benefits from the powerful understanding and coding capabilities of LLMs, and such excellent performance leads us to ask: can LLMs already serve as a database interface?
The answer is no. As shown in Figure 1, we find that current state-of-the-art models still struggle to generalize to more realistic situations characterized by large database sizes and noisy content. Moreover, uncovering the information hidden in massive database values requires external knowledge and reasoning. Furthermore, existing benchmarks do not account for SQL execution efficiency, which holds significant practical importance in real-life applications, notably for large databases. Motivated by these observations, we aim to develop a new text-to-SQL benchmark that better represents real-life scenarios and narrows the gap between experimental and practical settings.
In this work, we propose BIRD, a BIg Bench for LaRge-Scale Database Grounded in Text-to-SQLs for real-world applications. BIRD contains 12,751 complex examples of querying information over 95 big databases with a total size of 33.4 GB, spanning 37 professional domains. For training, we collected 80 open-source relational databases from real analysis platforms (Kaggle, Relation.vit); for evaluation, we curated 15 additional relational databases. Given these databases, we rely on crowdsourcing to collect natural language instructions and the corresponding SQL. First, our database experts create a description file for each database explaining all column names, abbreviated values, value types, and external knowledge, to help annotators better understand the database contents. Then, on one side, we hire and train native speakers to ask questions over these databases; on the other, a SQL annotation team of data engineers and database students is recruited to write SQL that answers the questions. To account for efficiency, we propose a new metric, the Valid Efficiency Score (VES), which evaluates the efficiency of generated SQL in addition to standard execution accuracy. To the best of our knowledge, BIRD is the first text-to-SQL benchmark to incorporate efficiency, promoting more efficient query methods within the context of massive and noisy database contents.
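To make the intent behind VES concrete, the sketch below shows one plausible formulation; it is an illustrative assumption, not necessarily the paper's exact definition. The idea is that a predicted query contributes to the score only if its result is correct, and its contribution is scaled by its runtime relative to the ground-truth query:

```python
# Illustrative (assumed) formulation of an efficiency-weighted accuracy
# metric in the spirit of VES. The function name and the sqrt weighting
# are hypothetical choices for this sketch.
import math

def efficiency_weighted_score(results):
    """results: list of (is_correct, gold_runtime_s, pred_runtime_s).
    A correct prediction contributes sqrt(gold_time / pred_time), so a
    correct query faster than the gold SQL scores above 1 per example;
    an incorrect prediction contributes 0."""
    total = 0.0
    for is_correct, gold_t, pred_t in results:
        if is_correct:
            total += math.sqrt(gold_t / pred_t)
    return total / len(results)

# One correct query running 4x faster than gold, one incorrect query:
score = efficiency_weighted_score([(True, 4.0, 1.0), (False, 1.0, 1.0)])
print(score)  # -> 1.0
```

Under such a metric, a parser cannot trade correctness for speed: only correct queries are rewarded, and among correct ones, faster SQL scores higher.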
We evaluate the performance of state-of-the-art text-to-SQL parsers using two popular methodologies: fine-tuning with T5 [38], and in-context learning with large language models (LLMs) such as Codex [6] (code-davinci-002) and ChatGPT [33] (gpt-3.5-turbo). Our experimental results reveal that current models struggle to generalize well. Specifically, the Spider SOTA model, which relies solely on the database schema, achieves execution accuracies of only 25.88% and 28.95% on the development and test sets, respectively. This performance still lags far behind human performance, which we also provide in this benchmark. We encourage further research to address the more realistic settings presented in this benchmark.