How to write SQL window functions in pandas

Question:

Is there an idiomatic equivalent to SQL’s window functions in Pandas? For example, what’s the most compact way to write the equivalent of this in Pandas?

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 

Or this?:

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name
Asked By: 2daaa

||

Answers:

For the first SQL:

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 

Pandas:

df.assign(national_population=df.state_population.sum()).sort_values('state_name')

For the second SQL:

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name

Pandas:

df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) 
  .sort_values('state_name')

DEMO:

In [238]: df
Out[238]:
   region state_name  state_population
0       1        aaa               100
1       1        bbb               110
2       2        ccc               200
3       2        ddd               100
4       2        eee               100
5       3        xxx                55

national_population:

In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
   region state_name  state_population  national_population
0       1        aaa               100                  665
1       1        bbb               110                  665
2       2        ccc               200                  665
3       2        ddd               100                  665
4       2        eee               100                  665
5       3        xxx                55                  665

regional_population:

In [239]: df.assign(regional_population=df.groupby('region')['state_population'].transform('sum')) 
     ...:   .sort_values('state_name')
Out[239]:
   region state_name  state_population  regional_population
0       1        aaa               100                  210
1       1        bbb               110                  210
2       2        ccc               200                  400
3       2        ddd               100                  400
4       2        eee               100                  400
5       3        xxx                55                   55

Another common window is OVER(ORDER BY ...). For example, the following.

SELECT *
    ,SUM(values) OVER(ORDER BY date) AS cum_sum
FROM df;

The pandas equivalent is cumsum()

df['cum_sum'] = df['values'].sort_values(by='date').cumsum()

Another common window function is ROW_NUMBER().

SELECT *
    ,ROW_NUMBER() OVER () AS row_number
FROM df;

It’s equivalent pandas equivalent is range().

df['row_number'] = range(1, len(df)+1)

Also there is a module pandasql that’s built on pandas that lets you run sql queries on local dataframes. So if you’re comfortable with sql, then you can run a query directly on a dataframe.

# !pip isntall pandasql
from pandasql import sqldf
df = sqldf("""
SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 
""")
Answered By: cottontail