How to write SQL window functions in pandas
Question:
Is there an idiomatic equivalent to SQL’s window functions in Pandas? For example, what’s the most compact way to write the equivalent of this in Pandas?
SELECT state_name,
state_population,
SUM(state_population)
OVER() AS national_population
FROM population
ORDER BY state_name
Or this?:
SELECT state_name,
state_population,
region,
SUM(state_population)
OVER(PARTITION BY region) AS regional_population
FROM population
ORDER BY state_name
Answers:
For the first SQL:
SELECT state_name,
state_population,
SUM(state_population)
OVER() AS national_population
FROM population
ORDER BY state_name
Pandas:
df.assign(national_population=df.state_population.sum()).sort_values('state_name')
For the second SQL:
SELECT state_name,
state_population,
region,
SUM(state_population)
OVER(PARTITION BY region) AS regional_population
FROM population
ORDER BY state_name
Pandas:
df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
  .sort_values('state_name')
DEMO:
In [238]: df
Out[238]:
   region state_name  state_population
0       1        aaa               100
1       1        bbb               110
2       2        ccc               200
3       2        ddd               100
4       2        eee               100
5       3        xxx                55
national_population:
In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
   region state_name  state_population  national_population
0       1        aaa               100                  665
1       1        bbb               110                  665
2       2        ccc               200                  665
3       2        ddd               100                  665
4       2        eee               100                  665
5       3        xxx                55                  665
regional_population:
In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) \
     ...:   .sort_values('state_name')
Out[239]:
   region state_name  state_population  regional_population
0       1        aaa               100                  210
1       1        bbb               110                  210
2       2        ccc               200                  400
3       2        ddd               100                  400
4       2        eee               100                  400
5       3        xxx                55                   55
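To make the DEMO reproducible outside an IPython session, here is a minimal self-contained sketch. The DataFrame constructor call is my own reconstruction of the six rows shown above, and the variable names national and regional are just illustrative.

```python
import pandas as pd

# Same six rows as in the DEMO output (reconstructed by hand).
df = pd.DataFrame({
    "region": [1, 1, 2, 2, 2, 3],
    "state_name": ["aaa", "bbb", "ccc", "ddd", "eee", "xxx"],
    "state_population": [100, 110, 200, 100, 100, 55],
})

# SUM(state_population) OVER (): a single scalar broadcast to every row.
national = df.assign(
    national_population=df["state_population"].sum()
).sort_values("state_name")

# SUM(state_population) OVER (PARTITION BY region): transform('sum') returns
# a Series aligned with df's index, so each row gets its own group's total.
regional = df.assign(
    regional_population=df.groupby("region")["state_population"].transform("sum")
).sort_values("state_name")

print(national)
print(regional)
```

The key difference from a plain groupby(...).sum() is that transform keeps one output row per input row, which is what a window function does in SQL.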
Another common window is OVER(ORDER BY ...). For example:
SELECT *
,SUM(values) OVER(ORDER BY date) AS cum_sum
FROM df;
The pandas equivalent is cumsum(), applied after sorting the DataFrame by date:
df['cum_sum'] = df.sort_values(by='date')['values'].cumsum()
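A small worked sketch of the running total, using made-up sample data (the dates and values below are hypothetical, chosen so the rows start out of date order). It sorts the whole DataFrame by date and then takes the cumulative sum of the values column.

```python
import pandas as pd

# Hypothetical sample data; rows are deliberately not in date order.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-03", "2021-01-01", "2021-01-02"]),
    "values": [30, 10, 20],
})

# Sort by date, take the running total, then assign back. Assignment aligns
# on the index, so each row receives the total for its own date even though
# df itself keeps its original row order.
df["cum_sum"] = df.sort_values(by="date")["values"].cumsum()

print(df)
```

One caveat, as far as I understand the SQL defaults: with duplicate dates, SUM(...) OVER(ORDER BY date) gives tied rows the same total, whereas cumsum() keeps incrementing row by row, so the two can differ when the ordering column has ties.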
Another common window function is ROW_NUMBER().
SELECT *
,ROW_NUMBER() OVER () AS row_number
FROM df;
In pandas, the equivalent can be built with range():
df['row_number'] = range(1, len(df)+1)
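As a sketch of how this extends to a partitioned window, the example below numbers rows both globally and within each region. The partitioned variant with groupby(...).cumcount() is my addition rather than part of the answer above, and the column name row_in_region is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "region": [1, 1, 2, 2, 2, 3],
    "state_name": ["aaa", "bbb", "ccc", "ddd", "eee", "xxx"],
})

# ROW_NUMBER() OVER (): number rows 1..n in their current order.
df["row_number"] = range(1, len(df) + 1)

# Roughly ROW_NUMBER() OVER (PARTITION BY region ORDER BY state_name):
# cumcount() numbers rows within each group starting at 0, so add 1.
df["row_in_region"] = (
    df.sort_values("state_name").groupby("region").cumcount() + 1
)

print(df)
```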
There is also a module, pandasql, built on pandas, that lets you run SQL queries on local DataFrames. So if you're comfortable with SQL, you can run a query directly on a DataFrame.
# !pip install pandasql
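from pandasql import sqldf
# sqldf looks up table names among the DataFrame variables in scope,
# so bind the frame to the name used in the query
population = df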
df = sqldf("""
SELECT state_name,
state_population,
SUM(state_population)
OVER() AS national_population
FROM population
ORDER BY state_name
""")