How to create subindex efficiently?

Question:

I would like to create a subindex for my dataframe based on the index. For example, I have a dataframe like this:

      Content        Date
ID                       
Bob  birthday  2010.03.01
Bob    school  2010.04.01
Tom  shopping  2010.02.01
Tom      work  2010.09.01
Tom   holiday  2010.10.01

I’d like create a subindex for for my ID and the resulting dataframe looks like below:

               Content        Date
ID  subindex                      
Bob 1         birthday  2010.03.01
    2           school  2010.04.01
Tom 1         shopping  2010.02.01
    2             work  2010.09.01
    3          holiday  2010.10.01

To do this I need to first create my subindex list. I searched in the help document and it seems to most neat way is to use transform:

subindex = df['Date'].groupby(df.index).transform(lambda x: np.arange(1, len(x) + 1))

However, it is really slow. I looked around and found apply can do the work too:

subindex = df['Date'].groupby(df.index).apply(lambda x: np.arange(1, len(x) + 1))

Of course the subindex needed to be flattened since it is a list of lists here. This works much faster than the transform method. Then I tested with a for loop of my own:

subindex_size = df.groupby(df.index, sort = False).size()
subindex = []
for i in np.arange(len(subindex_size)):
    subindex.extend(np.arange(1,subindex_size[i]+1))

It’s even faster. With my larger dataset (about 90k rows), the transform method takes about 44 secs on my computer, apply takes ~2 secs and the for loop takes only ~1 secs. I need to work on much larger dataset so even the time difference between the apply and for loop makes a difference to me. However, the for loop looks ugly and may not be easily applied if I need to create other group-based variables.

So my question is, why the built-in functions that are supposed to do the right thing are slower? Am I missing something here or is there a reason for this? Is there any other way to improve this process?

Asked By: Zhen Sun

||

Answers:

You can use cumcount to do this:

In [11]: df.groupby(level=0).cumcount()
Out[11]: 
ID
Bob    0
Bob    1
Tom    0
Tom    1
Tom    2
dtype: int64

In [12]: df['subindex'] = df.groupby(level=0).cumcount()  # possibly + 1 here.

In [13]: df.set_index('subindex', append=True)
Out[13]: 
               Content        Date
ID  subindex                      
Bob 0         birthday  2010.03.01
    1           school  2010.04.01
Tom 0         shopping  2010.02.01
    1             work  2010.09.01
    2          holiday  2010.10.01

To start at 1 (rather than 0) just add 1 to the result of cumcount.

Answered By: Andy Hayden