Cumulative count with altair

Question:

I’m trying to plot data with a date as the X axis and several cumulative counts as Ys.

I have a set of items such as:

id1 date1 user1
id2 date2 user1
id3 date3 user2

With this example, I’d want the plot to have 2 lines, the X axis would have three entries (date1, date2, date3) and user1 would have a Y-value of 1 at date1, 2 at date2, 2 at date3; user2 would have 0 at date1, 0 at date2, 1 at date3.

Directly making a Chart, I can’t see what I should use to have this cumulative count. E.g.

Chart(data).mark_line().encode(x='date:T', y='count(*)', color='username')

obviously creates a chart where most values are 0 (few entries have exactly the same date).

Ideally,

Chart(data).mark_line().encode(x='date:T', y='cumcount(*)', color='username')

would work, but there doesn’t seem to be an equivalent in the documentation.

In my real case, I have ~10 users and a few thousand entries over a few months.

Asked By: Sildar

||

Answers:

I don’t think cumulative count aggregation is yet available in Altair. In the meanwhile, one can do the corresponding manipulation in Pandas. Here is one such way. I am sure there can be more efficient ways of doing so.

import pandas as pd
import numpy as np
np.random.seed(0)
user_list = ['user1', 'user2']
df = pd.DataFrame({'date':range(2000, 2010),
                  'username':np.random.choice(user_list, 10)})

This is how df looks like.

    date    username
0   2000    user1
1   2001    user2
2   2002    user2
3   2003    user1
4   2004    user2
5   2005    user2
6   2006    user2
7   2007    user2
8   2008    user2
9   2009    user2

Cross-tabulation

d = pd.crosstab(df.date, columns=df.username).cumsum()
d = d.stack().reset_index()
d = d.rename(columns={0:'CummulativeCount'})

This is the output of d.head().

date    username    CummulativeCount
0   2000    user1   1
1   2000    user2   0
2   2001    user1   1
3   2001    user2   1
4   2002    user1   1

Now, we can use Altair without bothering about any aggregation.

from altair import Chart
c = Chart(d)
c.mark_line().encode(x='date:T', y='CummulativeCount:Q', color='username')

enter image description here

Answered By: Nipun Batra

You can now do this directly in Altair as in this example in the docs. Here is another example showing a cumulatitve count with a line chart (often called an empirical cumulative distribution function (ecdf) although technically I think the y-axis is often normalized for those):

import pandas as pd
import altair as alt


df = pd.DataFrame({'values': np.random.normal(size=100)})

alt.Chart(df).mark_line().encode(
    x='values',
    y='cumulative_count:Q',
).transform_window(
    cumulative_count='count()',
    sort=[{'field': 'values'}]
)

enter image description here

Answered By: joelostblom