Seaborn: cumulative sum and hue
Question:
I have the following dataframe in pandas:
data = {
'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
'value': np.random.rand(30),
}
df = pd.DataFrame(data)
Now I want to have a line plot with the cumulative sum over the value by following the "idx" for each "hue_val".
So in the end it would be three curves going strictly up (since they are positive numbers), one for "A", "B" and "C".
I found this code in several sources:
sns.lineplot(x="idx", y="value", hue="hue_val", data=df, estimator="cumsum")
That does not do the trick: both the curve and the x-axis come out wrong.
Answers:
Given OP's dataframe:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = {
'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
'value': np.random.rand(30)
}
df = pd.DataFrame(data)
There are two things one needs to do:
- Calculate the cum sum for each hue_val
- Plot it
1. Calculate the cum sum for each hue_val
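In order to calculate the cumulative sum, one can use pandas.DataFrame.groupby and cumsum (the grouped counterpart of pandas.Series.cumsum). As per OP's request, use a variable column to select the column one wants to consider. Note that cumsum accumulates in row order, so the rows of each group should already be ordered by idx, as they are here: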
column = 'value'
df['cum_sum'] = df.groupby('hue_val')[column].cumsum()
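Because the grouped cumsum follows row order rather than the idx values, it can help to sort explicitly before accumulating. The following is a minimal sketch on a small made-up frame (the values are illustrative, not OP's data) whose rows are deliberately out of order:

```python
import pandas as pd

# Small deterministic frame; rows are intentionally NOT sorted by idx.
df = pd.DataFrame({
    "idx":     [2, 1, 3, 1, 3, 2],
    "hue_val": ["A", "A", "A", "B", "B", "B"],
    "value":   [0.5, 1.0, 0.25, 2.0, 1.0, 0.5],
})

column = "value"

# Sort so that each group's rows follow idx, then accumulate per group.
df = df.sort_values(["hue_val", "idx"]).reset_index(drop=True)
df["cum_sum"] = df.groupby("hue_val")[column].cumsum()

print(df)
```

After sorting, group A accumulates to 1.0, 1.5, 1.75 and group B to 2.0, 2.5, 3.5, each restarting from zero at the group boundary.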
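As one is using NumPy to generate some dataframe values, one can also use it to calculate the cum sum, with groupby apply and numpy.cumsum. Passing group_keys=False keeps the original row index on the result, which the column assignment needs on pandas 2.0 and later (where the default group_keys=True would add hue_val as an extra index level):
df['cum_sum'] = df.groupby('hue_val', group_keys=False)[column].apply(lambda x: np.cumsum(x))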
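As a further option, groupby transform also accepts a NumPy-based function and always returns a result aligned with the original index. The sketch below, on a seeded copy of OP's frame (the seed is an assumption added only for reproducibility), checks that it agrees with the built-in grouped cumsum:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed is an assumption, for reproducibility only
df = pd.DataFrame({
    "idx": list(range(1, 11)) * 3,
    "hue_val": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "value": rng.random(30),
})
column = "value"

# Built-in grouped cumulative sum.
via_cumsum = df.groupby("hue_val")[column].cumsum()

# NumPy-based variant; transform keeps the original row index.
via_transform = df.groupby("hue_val")[column].transform(lambda s: np.cumsum(s))

# Same index, same values up to floating-point rounding.
same_index = via_transform.index.equals(df.index)
same_values = np.allclose(via_cumsum, via_transform)
print(same_index, same_values)

df["cum_sum"] = via_transform
```

The comparison uses np.allclose rather than exact equality because the two code paths may round the running total slightly differently.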
2. Plot it
Then one can plot it with seaborn.lineplot as follows:
sns.lineplot(data=df, x='idx', y='cum_sum', hue='hue_val')
Notes:
- Having a variable to specify the column makes it more user-friendly if the dataframe has more columns, like the one below (one of OP's concerns): one simply changes the variable to, let's say, value3
data = {
'idx': [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10],
'hue_val': ["A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C","C","C",],
'value': np.random.rand(30),
'value1': np.random.rand(30),
'value2': np.random.rand(30),
'value3': np.random.rand(30),
}
df = pd.DataFrame(data)
column = 'value3'
df['cum_sum'] = df.groupby('hue_val')[column].cumsum()
sns.lineplot(data=df, x='idx', y='cum_sum', hue='hue_val')
- There are strong opinions on using .apply(). Would recommend reading this: When should I (not) want to use pandas apply() in my code?
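Following up on the first note: if one wants the cumulative sum of several value columns at once instead of switching the column variable, the grouped cumsum also works on a list of columns. This is a sketch under that assumption; the columns list and the cum_ prefix are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed is an assumption, for reproducibility only
df = pd.DataFrame({
    "idx": list(range(1, 11)) * 3,
    "hue_val": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "value": rng.random(30),
    "value1": rng.random(30),
    "value2": rng.random(30),
    "value3": rng.random(30),
})

# Hypothetical selection of the columns to accumulate.
columns = ["value", "value3"]

# Grouped cumsum over several columns; the result keeps the original row index.
cum = df.groupby("hue_val")[columns].cumsum().add_prefix("cum_")
df = df.join(cum)

print(df[["idx", "hue_val", "cum_value", "cum_value3"]].head())
```

Any of the new columns can then be passed as y to seaborn.lineplot in the same way as cum_sum.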