
pandasql工具
pip install -U pandasql
pandasql 中的主要函数是 sqldf,它接收两个参数:一个 SQL 查询语句,还有一组环境变量 globals() 或 locals()。
类型为DataFrame的变量名会自动会注册为pandasql的表名,查询时表名用DataFrame的变量名即可。
关于pandasql支持的sql语法可参考:https://www.sqlite.org/lang.html
首先,导入库,并准备测试数据:
import pandas as pdfrom pandasql import sqldfdf1 = pd.DataFrame({'name': ['ZhangFei', 'GuanYu', 'a', 'b', 'c'],'data1': range(5)})df1
Out[1]:
| name | data1 | |
|---|---|---|
| 0 | ZhangFei | 0 |
| 1 | GuanYu | 1 |
| 2 | a | 2 |
| 3 | b | 3 |
| 4 | c | 4 |
然后测试pandasql:
sqldf("select * from df1 where name ='ZhangFei'", globals())
Out[2]:
| name | data1 | |
|---|---|---|
| 0 | ZhangFei | 0 |
为了不用每次写sql都传globals(),可以定义一个函数:
pysqldf = lambda sql: sqldf(sql, globals())
再次写sql时就只需要:
pysqldf("select * from df1 order by data1 desc")
Out[3]:
| name | data1 | |
|---|---|---|
| 0 | c | 4 |
| 1 | b | 3 |
| 2 | a | 2 |
| 3 | GuanYu | 1 |
| 4 | ZhangFei | 0 |
官方测试数据:
from pandasql import sqldf, load_meat, load_birthspysqldf = lambda q: sqldf(q, globals())meat = load_meat()births = load_births()pysqldf("SELECT * FROM meat LIMIT 10;").head()Out:date beef veal pork lamb_and_mutton broilers other_chicken turkey0 1944-01-01 00:00:00 751 85 1280 89 None None None1 1944-02-01 00:00:00 713 77 1169 72 None None None2 1944-03-01 00:00:00 741 90 1128 75 None None None3 1944-04-01 00:00:00 650 89 978 66 None None None4 1944-05-01 00:00:00 681 106 1029 78 None None None
连接查询:
q = """SELECTm.date, m.beef, b.birthsFROM meat mINNER JOIN births b ON m.date = b.date;"""joined = pysqldf(q)joined.head()
Out:
| date | beef | births | |
|---|---|---|---|
| 0 | 1975-01-01 00:00:00.000000 | 2106.0 | 265775 |
| 1 | 1975-02-01 00:00:00.000000 | 1845.0 | 241045 |
| 2 | 1975-03-01 00:00:00.000000 | 1891.0 | 268849 |
| 3 | 1975-04-01 00:00:00.000000 | 1895.0 | 247455 |
| 4 | 1975-05-01 00:00:00.000000 | 1849.0 | 254545 |
聚合查询:
q = """selectstrftime('%Y', date) as year,SUM(beef) as beef_totalFROM meatGROUP BY year;"""pysqldf(q).head()
Out:
| year | beef_total | |
|---|---|---|
| 0 | 1944 | 8801.0 |
| 1 | 1945 | 9936.0 |
| 2 | 1946 | 9010.0 |
| 3 | 1947 | 10096.0 |
| 4 | 1948 | 8766.0 |

DataFrame的query方法过滤数据
pip install pandas --upgrade
df = pd.DataFrame({'A': range(1, 6),'B': range(10, 0, -2),'C C': range(10, 5, -1)})df
| A | B | C C | |
|---|---|---|---|
| 0 | 1 | 10 | 10 |
| 1 | 2 | 8 | 9 |
| 2 | 3 | 6 | 8 |
| 3 | 4 | 4 | 7 |
| 4 | 5 | 2 | 6 |
df[df.A > df.B]
| A | B | C C | |
|---|---|---|---|
| 4 | 5 | 2 | 6 |
df.query('A > B')
| A | B | C C | |
|---|---|---|---|
| 4 | 5 | 2 | 6 |
对于存在空格的列名,可以使用反引号`:
df.query('B == `C C`')
Out:
| A | B | C C | |
|---|---|---|---|
| 0 | 1 | 10 | 10 |
df[(df['A'] > df["A"].mean()) & (df['B'] < 8)]
| A | B | C C | |
|---|---|---|---|
| 3 | 4 | 4 | 7 |
| 4 | 5 | 2 | 6 |
mean = df["A"].mean()df.query("A > @mean & B < 8")
| A | B | C C | |
|---|---|---|---|
| 3 | 4 | 4 | 7 |
| 4 | 5 | 2 | 6 |
在query函数中行列索引名、字符串、and/not/or/&/|/~/not in/in/==/!=、四则运算符都是合法的。例如对于一个大数据集:
df.head()
| Address | Weight | |
|---|---|---|
| ID | ||
| 1101 | street_1 | 63 |
| 1102 | street_2 | 73 |
| 1103 | street_2 | 82 |
| 1104 | street_2 | 81 |
| 1105 | street_4 | 64 |
df[df['Address'].isin(["street_6", "street_7"]) & (df['Weight'] > (70 + 10)) &(df.index.isin([1303, 2304, 2402]))]
| Address | Weight | |
|---|---|---|
| ID | ||
| 1303 | street_7 | 82 |
| 2304 | street_6 | 81 |
| 2402 | street_7 | 82 |
df.query('(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])')
| Address | Weight | |
|---|---|---|
| ID | ||
| 1303 | street_7 | 82 |
| 2304 | street_6 | 81 |
| 2402 | street_7 | 82 |
- END -


文章转载自漫谈大数据与数据分析,如果涉嫌侵权,请发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




