Python plotly express line chart with cumulative sum
Question:
In essence, I would like to plot a line chart of my data y ~ x | g, that is, I would like to plot the cumulative sums of y separately for each group and colored by group, without having to add these to the data. Why? Because there are many such columns that I would like to plot, and I do not want to add a cumulative column for each one. Consider the following example.
import pandas as pd
df = pd.DataFrame({
"y" : [1,1,1,2,2,2],
"x" : [1,2,3,1,2,3],
"g" : ["a","a","a","b","b","b"]
})
import plotly.express as px
px.line(df, y="y", x="x", color="g")
I am looking for a way to add an argument of sorts to tell plotly to plot the cumulative sum by groups. Is there such a feature or workaround?
Answers:
- Simple pandas: join() the original dataframe to its cumsum(), computed with the required groupby().
- The dataframe passed to plotly express then gives access to every cumsum column as well as every column that has not been summed.
import pandas as pd
df = pd.DataFrame({
"y" : [1,1,1,2,2,2],
"x" : [1,2,3,1,2,3],
"g" : ["a","a","a","b","b","b"]
})
import plotly.express as px
px.line(df.join(df.groupby("g", as_index=False).cumsum(), rsuffix="_cumsum"), y="y_cumsum", x="x", color="g")
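Since the question mentions many value columns, the same idea can be wrapped in a small helper. This is only a sketch: the helper name with_group_cumsum and the second value column "z" are made up for illustration, and the explicit sort is an assumption that the running total should follow increasing x within each group.

```python
import pandas as pd

df = pd.DataFrame({
    "y": [1, 1, 1, 2, 2, 2],
    "z": [10, 20, 30, 40, 50, 60],  # hypothetical second value column
    "x": [1, 2, 3, 1, 2, 3],
    "g": ["a", "a", "a", "b", "b", "b"],
})


def with_group_cumsum(frame, value_cols, group_col, order_col):
    """Return a sorted copy of frame with a '<col>_cumsum' column per value column."""
    # Sort so the running total accumulates in order of x inside each group.
    ordered = frame.sort_values([group_col, order_col])
    # cumsum() keeps the row index, so the result lines up with `ordered`.
    cumulative = ordered.groupby(group_col)[value_cols].cumsum().add_suffix("_cumsum")
    return ordered.join(cumulative)


plot_df = with_group_cumsum(df, ["y", "z"], group_col="g", order_col="x")
print(plot_df)
```

The result can be passed straight to plotly express, for example px.line(plot_df, x="x", y="y_cumsum", color="g"), and the original df is left untouched.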
I thought @rob-raymond's answer would save me, but it still took me days to crack this.
Best case: all "categories" have the same 'x' values. Then the proposed solution works and the problem is solved.
However, in real life, the categories often have different x values, and the generated cumulative area charts quickly look very odd.
Each time a category is missing an 'x' value, it shows a cumulative value of zero at that 'x', even though the points before and after it accumulate correctly.
It was a real headache to figure out, followed by a fair amount of trial and error to fix.
Here is the complete solution that handles both data with and without gaps:
import pandas as pd
import plotly.express as px

# df_data holds the raw records with my 3 columns: ('type', 'nb_objects', 'dt')
df = pd.DataFrame(df_data)
# Create the cumulative sum per type, indexed by dt:
df.set_index('type', inplace=True)
cumsum = df.groupby(level=0).apply(lambda x: pd.Series(x['nb_objects'].cumsum().values, index=x['dt']))
# Fill in the gaps so that plotly has a value for every type at each x:
try:
    # Wide format (one column per dt), then carry the previous running total forward
    no_gap_df_wide_format = cumsum.unstack(level=1).ffill(axis=1)
    # Get back to column (long) format:
    no_gap_df = no_gap_df_wide_format.stack().rename_axis(index={None: 'type', 'dt': 'dt'}).rename('nb_objects_cumsum').reset_index()
except ValueError:
    # There is no gap to fill: a plain grouped cumsum is enough
    raw = pd.DataFrame(df_data)
    no_gap_df = raw.assign(nb_objects_cumsum=raw.groupby("type")["nb_objects"].cumsum())
fig = px.area(no_gap_df, x='dt', y='nb_objects_cumsum', color="type")
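For anyone who wants to see the gap filling on concrete numbers, here is a minimal sketch with made-up sample data. It uses pivot(), ffill() and melt() rather than the unstack()/stack() route above, so treat it as an alternative under assumptions, not as the same code: it assumes each (type, dt) pair appears at most once, and that a type with no data yet should sit at 0.

```python
import pandas as pd

# Hypothetical sample data: "a" has no row for dt=3, "b" has none for dt=2.
df = pd.DataFrame({
    "type": ["a", "a", "a", "b", "b", "b"],
    "dt": [1, 2, 4, 1, 3, 4],
    "nb_objects": [1, 1, 1, 2, 2, 2],
})

# Running total per type, in order of dt.
df = df.sort_values(["type", "dt"])
df["nb_objects_cumsum"] = df.groupby("type")["nb_objects"].cumsum()

# Wide format: one row per dt, one column per type; missing pairs become NaN.
wide = df.pivot(index="dt", columns="type", values="nb_objects_cumsum")

# Carry the last known total forward; before a type's first point, use 0 (assumption).
wide = wide.ffill().fillna(0)

# Back to long format, one row per (dt, type), ready for plotly express.
no_gap_df = wide.reset_index().melt(
    id_vars="dt", var_name="type", value_name="nb_objects_cumsum"
)
print(no_gap_df)
```

After this, every type has a value at every dt, so px.area(no_gap_df, x="dt", y="nb_objects_cumsum", color="type") no longer drops to zero at the missing points.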