Seaborn: cumulative sum and hue

Question

I have the following dataframe in pandas:

data = {
    'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
    'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
    'value': np.random.rand(30),
}

df = pd.DataFrame(data)

Now I want a line plot of the cumulative sum of "value" along "idx", with one line per "hue_val".
So in the end there would be three curves that only go up (since the values are positive), one each for "A", "B" and "C".

I found this code in several sources:

sns.lineplot(x="idx", y="value", hue="hue_val", data=df, estimator="cumsum")

That does not do the trick: both the curve and the x-axis come out wrong.

Asked By: Hemmelig


Answer 1

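The estimator argument of sns.lineplot aggregates multiple observations at the same x value into a single point, so it cannot produce a running total. You can calculate the cumsum separately and plot the result: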

df['cumsum'] = df.groupby('hue_val').value.transform('cumsum')
sns.lineplot(x="idx", y="cumsum", hue="hue_val", data=df)
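One caveat worth keeping in mind: cumsum accumulates in row order, not in "idx" order. The OP's frame is already ordered, but if the rows could arrive shuffled, sorting first keeps the running total aligned with the x-axis. A minimal sketch, using a small made-up frame (the values are illustrative, not the OP's data):

```python
import pandas as pd

# Toy data, deliberately shuffled within each group (illustrative values only).
df = pd.DataFrame({
    "idx":     [3, 1, 2, 2, 1, 3],
    "hue_val": ["A", "A", "A", "B", "B", "B"],
    "value":   [3.0, 1.0, 2.0, 20.0, 10.0, 30.0],
})

# Sort so that the cumulative sum follows idx within each hue_val group.
df = df.sort_values(["hue_val", "idx"]).reset_index(drop=True)

# The running total restarts for every group.
df["cumsum"] = df.groupby("hue_val")["value"].cumsum()

print(df)
```

The resulting "cumsum" column can then be passed to sns.lineplot as y, exactly as in the answer above.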

Answered By: pieterbons

Answer 2

Given OP's dataframe:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = {
    'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
    'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
    'value': np.random.rand(30)
}

df = pd.DataFrame(data)

There are two things one needs to do:

1. Calculate the cumulative sum for each hue_val
2. Plot it

1. Calculate the cum sum for each hue_val

To calculate the cumulative sum, one can group the rows with pandas.DataFrame.groupby and call cumsum on the selected column. As per OP's request, a variable holds the name of the column to accumulate:

column = 'value'
df['cum_sum'] = df.groupby('hue_val')[column].cumsum()

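Since NumPy is already used to generate the dataframe values, one can also compute the cumulative sum with numpy.cumsum through the groupby's apply. Passing group_keys=False keeps the original row index, so the result can be assigned back as a column (recent pandas versions otherwise add the group key to the index, and the assignment fails):

df['cum_sum'] = df.groupby('hue_val', group_keys=False)[column].apply(lambda x: np.cumsum(x))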
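The different spellings of the per-group cumulative sum should give the same numbers. A quick check, sketched under two assumptions: the data is seeded so the run is reproducible (the seed is arbitrary), and group_keys=False is passed so that apply keeps the original row index:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({
    "idx": list(range(1, 11)) * 3,
    "hue_val": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "value": rng.random(30),
})

# group_keys=False keeps the original row index in the apply() result.
grouped = df.groupby("hue_val", group_keys=False)["value"]

via_cumsum = grouped.cumsum()                      # built-in groupby method
via_transform = grouped.transform("cumsum")        # same thing via transform
via_apply = grouped.apply(lambda s: np.cumsum(s))  # NumPy via apply

same = bool(
    np.allclose(via_cumsum, via_transform)
    and np.allclose(via_cumsum, via_apply)
)
print("all three agree:", same)
```

The built-in cumsum is the shortest of the three and avoids the per-group Python call that apply makes.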

2. Plot it

Then one can plot it with seaborn.lineplot as follows

sns.lineplot(data=df, x='idx', y='cum_sum', hue='hue_val')

Notes:

Using a variable to specify the column makes this easier to reuse when the dataframe has more columns (one of OP's concerns), as in the example below: one simply changes the variable to, say, value3.

data = {
    'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
    'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
    'value': np.random.rand(30),
    'value1': np.random.rand(30),
    'value2': np.random.rand(30),
    'value3': np.random.rand(30),
}

df = pd.DataFrame(data)

column = 'value3'
df['cum_sum'] = df.groupby('hue_val')[column].cumsum()

sns.lineplot(data=df, x='idx', y='cum_sum', hue='hue_val')
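If several value columns are needed at once, the grouped cumsum also accepts a list of columns. This is a sketch that goes slightly beyond the answer: the "cum_" prefix and the choice of columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({
    "idx": list(range(1, 11)) * 3,
    "hue_val": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "value": rng.random(30),
    "value3": rng.random(30),
})

columns = ["value", "value3"]  # columns to accumulate (illustrative choice)

# Selecting a list of columns yields one cumulative column per input column,
# indexed like the original frame.
cum = df.groupby("hue_val")[columns].cumsum().add_prefix("cum_")

# Attach the new columns next to the originals (join aligns on the row index).
df = df.join(cum)

print(df.head())
```

Either of the new columns, cum_value or cum_value3, can then be passed as y to sns.lineplot.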

Opinions on using .apply() are strong. It is worth reading: When should I (not) want to use pandas apply() in my code?

Answered By: Gonçalo Peres
