Pydatatable enumerate rows within each group

Question:

Given the following datatable

DT = dt.Frame({'A':['A','A','A','B','B','B'],
               'B':['a','a','b','a','a','a'],
               })

I’d like to create column ‘C’, which numbers the rows within each group in columns A and B like this:

     A   B   C
 0   A   a   1
 1   A   a   2
 2   A   b   1
 3   B   a   1
 4   B   a   2
 5   B   a   3

According to this thread for pandas cumcount() or rank() would be options, but it does not seem to be defined for pydatatable:

DT = DT[:, f[:].extend({'C': cumcount()}),by(f.A,f.B)]
DT = DT[:, f[:].extend({'C': rank(f.B)}),by(f.A,f.B)]

a) How can I number the rows within groups?

b) Is there a comprehensive resource with all the currently available functions for pydatatable?

Asked By: Zappageck

||

Answers:

Update:

Datatable now has a cumcount function in dev :

DT[:, {'C':dt.cumcount() + 1}, by('A', 'B')]

   | A      B          C
   | str32  str32  int64
-- + -----  -----  -----
 0 | A      a          1
 1 | A      a          2
 2 | A      b          1
 3 | B      a          1
 4 | B      a          2
 5 | B      a          3
[6 rows x 3 columns]

Old answer:

This is a hack, in time there should be an in-built way to do cumulative count, or even take advantage of itertools or other performant tools within python while still being very fast :

Step 1 : Get count of columns A and B and export to list

result = DT[:, dt.count(), by("A","B")][:,'count'].to_list()

Step 2 : Use a combination of itertools chain and list comprehension to get the cumulative counts :

from itertools import chain

cumcount = chain.from_iterable([i+1 for i in range(n)] for n in result[0])

Step 3: Assign result back to DT

DT['C'] = dt.Frame(tuple(cumcount))

print(DT)


    A       B   C
   ▪▪▪▪   ▪▪▪▪  ▪▪▪▪
0   A      a    1
1   A      a    2
2   A      b    1
3   B      a    1
4   B      a    2
5   B      a    3
6 rows × 3 columns
Answered By: sammywemmy
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.