How to find unique values by group in datatable Frame
Question:
I have created a datatable Frame as follows:
import datatable as dt
from datatable import f, by, count
DT_EX = dt.Frame({'cid': [1,2,1,2,3,2,4,2,4,5],
                  'cust_life_cycle': ['Lead','Active','Lead','Active','Inactive','Lead','Active','Lead','Inactive','Lead']})
It has three unique customer life cycles, and the number of rows for each one is found with:
DT_EX[:, count(), by(f.cust_life_cycle)]
It also has five unique customer IDs, and the number of rows for each one is found with:
DT_EX[:, count(), by(f.cid)]
Now I would like to see how many unique customer IDs exist for each customer life cycle:
DT_EX[:, {'unique_cids':dt.unique(f.cid)}, by(f.cust_life_cycle)]
It should show that Lead has 3 unique customer IDs (1, 2, 5), Active has 2 unique customer IDs (2, 4), and so on.
This does not give the result I expected. Could you please let me know how to fix it?
FYI: I tried the same thing with an R data.table, and it works:
DT_EX[, uniqueN(cid), by=cust_life_cycle]
Answers:
The dt.unique function does not apply by groups (yet). So one way to achieve what you need is to first group by life cycle + customer ID, and then, in a second step, re-group by life cycle only:
>>> DT_EX[:, count(), by(f.cust_life_cycle, f.cid)] \
...      [:, {"unique_cids": count()}, by(f.cust_life_cycle)]
   | cust_life_cycle  unique_cids
-- + ---------------  -----------
 0 | Active                     2
 1 | Inactive                   2
 2 | Lead                       3
[3 rows x 2 columns]
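To see why the two-step grouping works, here is a minimal plain-Python sketch of the same idea, using only the standard library rather than datatable. The names distinct_pairs and unique_cids are illustrative and do not come from the answer above.

```python
from collections import Counter

cid = [1, 2, 1, 2, 3, 2, 4, 2, 4, 5]
cust_life_cycle = ['Lead', 'Active', 'Lead', 'Active', 'Inactive',
                   'Lead', 'Active', 'Lead', 'Inactive', 'Lead']

# Step 1: grouping by (life cycle, cid) leaves one entry per distinct pair.
distinct_pairs = set(zip(cust_life_cycle, cid))

# Step 2: re-grouping by life cycle alone counts those distinct pairs,
# which is the number of unique customer IDs in each life cycle.
unique_cids = Counter(life_cycle for life_cycle, _ in distinct_pairs)

print(sorted(unique_cids.items()))
# [('Active', 2), ('Inactive', 2), ('Lead', 3)]
```

Each (life cycle, cid) pair survives the first step exactly once, so counting rows per life cycle in the second step is the same as counting distinct IDs.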
@pasha
I have also written a custom function for practice, shown below:
from itertools import groupby
from operator import itemgetter
import datatable as dt
from datatable import f

def pydt_unique_per_group(DT, by_col, uni_col):
    # Pull the two columns out of the Frame as plain Python lists.
    DT_dict = DT[:, (f[by_col], f[uni_col])].to_dict()
    pairs = list(zip(DT_dict[by_col], DT_dict[uni_col]))
    # Collect the uni_col values of each group (groupby needs sorted input).
    unique_per_col_dict = {k: list(map(itemgetter(1), v)) for k, v in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))}
    # Count the distinct values in each group.
    unique_per_col_count = {group: len(set(values)) for group, values in unique_per_col_dict.items()}
    # Order the groups by descending count.
    unique_per_col_count_sort = {k: v for k, v in sorted(unique_per_col_count.items(), key=lambda x: x[1], reverse=True)}
    by_group_summary_dict = {by_col: [], 'count': []}
    for k, v in unique_per_col_count_sort.items():
        by_group_summary_dict[by_col].append(k)
        by_group_summary_dict['count'].append(v)
    return dt.Frame(by_group_summary_dict)
Output:
In [8]: pydt_unique_per_group(DT_EX, 'cust_life_cycle', 'cid')
Out[8]:
   | cust_life_cycle  count
-- + ---------------  -----
 0 | Lead                 3
 1 | Active               2
 2 | Inactive             2
[3 rows x 2 columns]
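The same per-group count can be written more compactly. The following is only a sketch under stated assumptions: it works on plain Python lists (for example, the columns returned by to_dict()) rather than on a Frame, and unique_per_group is a hypothetical helper name, not part of datatable.

```python
from collections import defaultdict

def unique_per_group(groups, values):
    """Return {group: number of distinct values}, largest count first."""
    seen = defaultdict(set)
    for group, value in zip(groups, values):
        seen[group].add(value)  # a set keeps each value only once
    counts = {group: len(vals) for group, vals in seen.items()}
    # sorted() is stable, so groups with equal counts keep first-seen order.
    return dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))

cid = [1, 2, 1, 2, 3, 2, 4, 2, 4, 5]
cust_life_cycle = ['Lead', 'Active', 'Lead', 'Active', 'Inactive',
                   'Lead', 'Active', 'Lead', 'Inactive', 'Lead']

result = unique_per_group(cust_life_cycle, cid)
print(result)
# {'Lead': 3, 'Active': 2, 'Inactive': 2}
```

Collecting values into a set per group avoids the sort that itertools.groupby requires, at the cost of holding every distinct value in memory.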
datatable now has a nunique implementation that works per group:
DT_EX[:, f.cid.nunique(), 'cust_life_cycle']
   | cust_life_cycle    cid
   | str32            int64
-- + ---------------  -----
 0 | Active               2
 1 | Inactive             2
 2 | Lead                 3
[3 rows x 2 columns]