Seaborn: cumulative sum and hue

Question:

I have the following dataframe in pandas:

data = {
    'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
    'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
    'value': np.random.rand(30),
}

df = pd.DataFrame(data)

I want a line plot of the cumulative sum of "value" along "idx", computed separately for each "hue_val".
The result should be three curves that only go up (since the values are positive), one each for "A", "B" and "C".

I found this code in several sources:

sns.lineplot(x="idx", y="value", hue="hue_val", data=df, estimator="cumsum")

That does not do the trick: both the curves and the x-axis come out wrong.

Asked By: Hemmelig

||

Answers:

You can calculate the cumsum separately and plot the result:

df['cumsum'] = df.groupby('hue_val').value.transform('cumsum')
sns.lineplot(x="idx", y="cumsum", hue="hue_val", data=df)
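
One caveat worth spelling out: cumsum accumulates in row order, not in "idx" order. OP's rows already arrive sorted, so the two lines above are enough there. The following is only a sketch for the case where they do not; the fixed seed, the shuffling step and the names shuffled, ordered and is_monotonic are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
df = pd.DataFrame({
    "idx": list(range(1, 11)) * 3,
    "hue_val": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "value": rng.random(30),
})

# Shuffle the rows to mimic data that is not ordered by idx.
shuffled = df.sample(frac=1, random_state=0)

# Sort by group and idx first, then accumulate within each group.
ordered = shuffled.sort_values(["hue_val", "idx"]).reset_index(drop=True)
ordered["cumsum"] = ordered.groupby("hue_val")["value"].cumsum()

# Each group's running total should never decrease along idx.
is_monotonic = ordered.groupby("hue_val")["cumsum"].apply(
    lambda s: s.is_monotonic_increasing
)
```

The ordered frame can then be passed to sns.lineplot exactly as above.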


Answered By: pieterbons

Given OP's dataframe

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = {
    'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
    'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
    'value': np.random.rand(30)
}

df = pd.DataFrame(data)

There are two things one needs to do:

  1. Calculate the cum sum for each hue_val

  2. Plot it


1. Calculate the cum sum for each hue_val

To calculate the cumulative sum, one can use pandas.DataFrame.groupby together with cumsum. As per OP's request, the column name is stored in a variable, so that the column to accumulate is easy to change

column = 'value'
df['cum_sum'] = df.groupby('hue_val')[column].cumsum()

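As NumPy is already used to generate some of the dataframe values, one can also use it to calculate the cum sum, by combining the groupby apply method with numpy.cumsum. Passing group_keys=False keeps the original row index; without it, recent pandas versions (2.0 and later, as far as I can tell) prepend the group label to the index and the assignment back to the dataframe no longer aligns

df['cum_sum'] = df.groupby('hue_val', group_keys=False)[column].apply(lambda x: np.cumsum(x))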
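
To check that the different spellings agree, here is a small sketch that compares them on a toy frame. The toy data and the names via_cumsum, via_transform, via_numpy, same and index_kept are made up for this check, and transform is used for the NumPy variant because it always returns a result aligned with the original index.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed, only for reproducibility
df = pd.DataFrame({
    "hue_val": ["A"] * 4 + ["B"] * 4,
    "value": rng.random(8),
})

grouped = df.groupby("hue_val")["value"]

via_cumsum = grouped.cumsum()                           # built-in groupby cumsum
via_transform = grouped.transform("cumsum")             # same operation, by name
via_numpy = grouped.transform(lambda s: np.cumsum(s))   # NumPy inside transform

# All three should give the same numbers, row for row.
same = np.allclose(via_cumsum, via_transform) and np.allclose(via_cumsum, via_numpy)

# The result keeps the original index, so it can be assigned as a column.
index_kept = via_cumsum.index.equals(df.index)
```

If same and index_kept are both True, any of the three results can be assigned to df['cum_sum'] directly.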

2. Plot it

Then one can plot it with seaborn.lineplot as follows

sns.lineplot(data=df, x='idx', y='cum_sum', hue='hue_val')



Notes:

  • Having a variable to specify the column makes the code easier to reuse when the dataframe has more columns, as in the one below (one of OP's concerns): one simply changes the variable to, let's say, value3

    data = {
        'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
        'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
        'value': np.random.rand(30),
        'value1': np.random.rand(30),
        'value2': np.random.rand(30),
        'value3': np.random.rand(30),
    }
    
    df = pd.DataFrame(data)
    
    column = 'value3'
    df['cum_sum'] = df.groupby('hue_val')[column].cumsum()
    
    sns.lineplot(data=df, x='idx', y='cum_sum', hue='hue_val')
    


  • There are strong opinions on using .apply(). I would recommend reading this: When should I (not) want to use pandas apply() in my code?
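
Following up on the first note: if the cumulative sum is wanted for several value columns at once rather than one at a time, the groupby cumsum can be applied to a list of columns. This is only a sketch; the cum_ prefix, the fixed seed and the reduced set of columns are choices made here for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # fixed seed, only for reproducibility
df = pd.DataFrame({
    "idx": list(range(1, 11)) * 3,
    "hue_val": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "value": rng.random(30),
    "value3": rng.random(30),
})

value_cols = ["value", "value3"]

# Accumulate every listed column within each hue_val group in one call.
cum = df.groupby("hue_val")[value_cols].cumsum().add_prefix("cum_")

# The result shares the original index, so a plain join lines the rows up.
df = df.join(cum)
```

Any of the new columns, for example cum_value3, can then be passed as y to sns.lineplot.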

Answered By: Gonçalo Peres