The most efficient way to number the instance of a value in a Series
Question:
I have a dataframe of visits that includes a person ID column, and a given person may have more than one visit. I want to number the visits for a given person.
I can sort the dataframe by visit date, then iterate and count the number of times a person ID has occurred, or its occurrences between the first index and current index. Is there a more efficient (ideally vectorized) way? (Note that I don’t want the total counts as from value_counts()
, I want a new series that is a running count of visits per person.)
Here is a dummy dataframe where person ‘a’ has two visits:
In[1]: df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5], 'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})
In[2]: df['Person_ID'].value_counts()
Out[2]:
a 2
b 1
c 1
d 1
e 1
Name: Person_ID, dtype: int64
Example 1: incrementing dictionary counter
In[3]: person_visit_count = {}
for index, row in df.iterrows():
if row['Person_ID'] not in person_visit_count:
person_visit_count[row['Person_ID']] = 1
else:
person_visit_count[row['Person_ID']] += 1
df.loc[index, 'Person_Visit'] = person_visit_count[row['Person_ID']]
In[4]: df
Out[4]:
Visit Person_ID Person_Visit
0 0 a 1.0
1 1 b 1.0
2 2 c 1.0
3 3 a 2.0
4 4 d 1.0
5 5 e 1.0
Example 2: index-based counter
In[5]: for index, row in df.iterrows():
df.loc[index, 'Person_Visit2'] = (df.loc[0:index, 'Person_ID'] == row['Person_ID']).sum()
In[6]: df
Out[6]:
Visit Person_ID Person_Visit Person_Visit2
0 0 a 1.0 1.0
1 1 b 1.0 1.0
2 2 c 1.0 1.0
3 3 a 2.0 2.0
4 4 d 1.0 1.0
5 5 e 1.0 1.0
Answers:
Here’s one idea. I thought of it looking at value_counts
output you created
d = df['Person_ID'].value_counts().apply(lambda n: iter(range(1, n+1)))
df['PersonVisit'] = df['Person_ID'].apply(lambda id: next(d[id]))
In this solution, d
acts as a key-value data structure where key is the person id and the value is an iterable, where each call to next() increments the value by 1.
Iterating is rarely the best option, you can group and do a cumulative count of the rows
df['Person_Visit']=df.groupby('Person_ID').transform('cumcount')+1
Visit Person_ID Person_Visit
0 0 a 1
1 1 b 1
2 2 c 1
3 3 a 2
4 4 d 1
5 5 e 1
Just for comparison, here are the timings for the solutions you gave and this one (given by the timeit module)
groupby with cumcount
1.37 ms ± 177 µs per loop
incrementing dictionary counter
2.02 ms ± 321 µs per loop
index-based counter
3.61 ms ± 389 µs per loop
I have a dataframe of visits that includes a person ID column, and a given person may have more than one visit. I want to number the visits for a given person.
I can sort the dataframe by visit date, then iterate and count the number of times a person ID has occurred, or its occurrences between the first index and current index. Is there a more efficient (ideally vectorized) way? (Note that I don’t want the total counts as from value_counts()
, I want a new series that is a running count of visits per person.)
Here is a dummy dataframe where person ‘a’ has two visits:
In[1]: df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5], 'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})
In[2]: df['Person_ID'].value_counts()
Out[2]:
a 2
b 1
c 1
d 1
e 1
Name: Person_ID, dtype: int64
Example 1: incrementing dictionary counter
In[3]: person_visit_count = {}
for index, row in df.iterrows():
if row['Person_ID'] not in person_visit_count:
person_visit_count[row['Person_ID']] = 1
else:
person_visit_count[row['Person_ID']] += 1
df.loc[index, 'Person_Visit'] = person_visit_count[row['Person_ID']]
In[4]: df
Out[4]:
Visit Person_ID Person_Visit
0 0 a 1.0
1 1 b 1.0
2 2 c 1.0
3 3 a 2.0
4 4 d 1.0
5 5 e 1.0
Example 2: index-based counter
In[5]: for index, row in df.iterrows():
df.loc[index, 'Person_Visit2'] = (df.loc[0:index, 'Person_ID'] == row['Person_ID']).sum()
In[6]: df
Out[6]:
Visit Person_ID Person_Visit Person_Visit2
0 0 a 1.0 1.0
1 1 b 1.0 1.0
2 2 c 1.0 1.0
3 3 a 2.0 2.0
4 4 d 1.0 1.0
5 5 e 1.0 1.0
Here’s one idea. I thought of it looking at value_counts
output you created
d = df['Person_ID'].value_counts().apply(lambda n: iter(range(1, n+1)))
df['PersonVisit'] = df['Person_ID'].apply(lambda id: next(d[id]))
In this solution, d
acts as a key-value data structure where key is the person id and the value is an iterable, where each call to next() increments the value by 1.
Iterating is rarely the best option, you can group and do a cumulative count of the rows
df['Person_Visit']=df.groupby('Person_ID').transform('cumcount')+1
Visit Person_ID Person_Visit
0 0 a 1
1 1 b 1
2 2 c 1
3 3 a 2
4 4 d 1
5 5 e 1
Just for comparison, here are the timings for the solutions you gave and this one (given by the timeit module)
groupby with cumcount
1.37 ms ± 177 µs per loop
incrementing dictionary counter
2.02 ms ± 321 µs per loop
index-based counter
3.61 ms ± 389 µs per loop