How to write SQL window functions in pandas

Question

Is there an idiomatic equivalent to SQL’s window functions in Pandas? For example, what’s the most compact way to write the equivalent of this in Pandas?

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name

Or this?

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name

Asked By: 2daaa

Answer 1

For the first SQL:

SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name

Pandas:

df.assign(national_population=df.state_population.sum()).sort_values('state_name')

For the second SQL:

SELECT state_name,  
       state_population,
       region,
       SUM(state_population)
        OVER(PARTITION BY region) AS regional_population
FROM population    
ORDER BY state_name

Pandas:

(df.assign(regional_population=df.groupby('region')['state_population'].transform('sum'))
   .sort_values('state_name'))

DEMO:

In [238]: df
Out[238]:
   region state_name  state_population
0       1        aaa               100
1       1        bbb               110
2       2        ccc               200
3       2        ddd               100
4       2        eee               100
5       3        xxx                55

national_population:

In [246]: df.assign(national_population=df.state_population.sum()).sort_values('state_name')
Out[246]:
   region state_name  state_population  national_population
0       1        aaa               100                  665
1       1        bbb               110                  665
2       2        ccc               200                  665
3       2        ddd               100                  665
4       2        eee               100                  665
5       3        xxx                55                  665

regional_population:

In [239]: (df.assign(regional_population=df.groupby('region')['state_population'].transform('sum'))
     ...:    .sort_values('state_name'))
Out[239]:
   region state_name  state_population  regional_population
0       1        aaa               100                  210
1       1        bbb               110                  210
2       2        ccc               200                  400
3       2        ddd               100                  400
4       2        eee               100                  400
5       3        xxx                55                   55
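
For a copy-paste check, the demo can be reproduced as a standalone script. This is only a minimal sketch: the frame is rebuilt from the values printed above, and the variable name result is illustrative.

```python
import pandas as pd

# Sample data copied from the demo output above.
df = pd.DataFrame({
    "region": [1, 1, 2, 2, 2, 3],
    "state_name": ["aaa", "bbb", "ccc", "ddd", "eee", "xxx"],
    "state_population": [100, 110, 200, 100, 100, 55],
})

result = (
    df.assign(
        # SUM(...) OVER (): a single scalar broadcast to every row
        national_population=df["state_population"].sum(),
        # SUM(...) OVER (PARTITION BY region): per-group sum aligned to the original rows
        regional_population=df.groupby("region")["state_population"].transform("sum"),
    )
    .sort_values("state_name")
    .reset_index(drop=True)
)
print(result)
```

transform('sum') returns a Series with the same index as df, which is why it can be assigned directly as a new column, unlike a plain groupby(...).sum().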

Answered By: MaxU – stop russian terror

Answer 2

Another common window is OVER(ORDER BY ...), for example:

SELECT *
    ,SUM(values) OVER(ORDER BY date) AS cum_sum
FROM df;

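The pandas equivalent is cumsum(), applied after sorting the frame by date; index alignment then puts each running total back on its original row:

df['cum_sum'] = df.sort_values('date')['values'].cumsum()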
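
A small runnable sketch of the running-total pattern, using a made-up three-row frame (the dates and values are assumptions for illustration only). It sorts by date, takes the cumulative sum, and relies on index alignment to write each total back to its original row.

```python
import pandas as pd

# Hypothetical data, deliberately not in date order.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-02"]),
    "values": [30, 10, 20],
})

# Sort by date, accumulate, then let pandas align the result on the index,
# so the original row order of df is preserved.
df["cum_sum"] = df.sort_values("date")["values"].cumsum()
print(df)
```

One caveat: with duplicate dates, SQL’s default frame for OVER(ORDER BY date) gives tied rows the same total, while cumsum() keeps accumulating row by row.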

Another common window function is ROW_NUMBER().

SELECT *
    ,ROW_NUMBER() OVER () AS row_number
FROM df;

Its pandas equivalent is range().

df['row_number'] = range(1, len(df)+1)
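
When the row number has to restart per group, i.e. ROW_NUMBER() OVER (PARTITION BY region ORDER BY state_population DESC), one possible approach is groupby(...).cumcount(). The sketch below assumes the sample frame from Answer 1 and breaks ties by state_name, which is an arbitrary choice made here.

```python
import pandas as pd

# Sample frame borrowed from Answer 1 (assumed here for illustration).
df = pd.DataFrame({
    "region": [1, 1, 2, 2, 2, 3],
    "state_name": ["aaa", "bbb", "ccc", "ddd", "eee", "xxx"],
    "state_population": [100, 110, 200, 100, 100, 55],
})

# ORDER BY state_population DESC, with state_name as a deterministic tie-breaker.
ordered = df.sort_values(["state_population", "state_name"], ascending=[False, True])

# cumcount() numbers rows 0, 1, 2, ... within each group in their current order;
# adding 1 matches SQL's 1-based ROW_NUMBER(). Assignment aligns on the index.
df["row_number"] = ordered.groupby("region").cumcount() + 1
print(df)
```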

There is also a module, pandasql, built on pandas that lets you run SQL queries on local DataFrames. If you’re comfortable with SQL, you can run the query directly against a DataFrame.

# !pip install pandasql
from pandasql import sqldf
# `population` must be a DataFrame in scope; sqldf resolves table names from Python variables
df = sqldf("""
SELECT state_name,  
       state_population,
       SUM(state_population)
        OVER() AS national_population
FROM population   
ORDER BY state_name 
""")

Answered By: cottontail
