Add a sequence number to each element in a group using python
Question:
I have a dataframe of individuals who each have multiple records. I want to enumerate the record in the sequence for each individual in python. Essentially I would like to create the ‘sequence’ column in the following table:
patient date sequence
145 20Jun2009 1
145 24Jun2009 2
145 15Jul2009 3
582 09Feb2008 1
582 21Feb2008 2
987 14Mar2010 1
987 02May2010 2
987 12May2010 3
This is essentially the same question as here, but I am working in python and unable to implement the sql solution. I suspect I can use a groupby statement with an iterable count, but have so far been unsuccessful.
Answers:
The question is how do I sort on multiple columns of data.
One simple trick is to use the key
parameter to the sorted function.
You’ll be sorting by a string built from the columns of the array.
rows = ...# your source data
def date_to_sortable_string(date):
# use datetime package to convert string to sortable date.
pass
# Assume x[0] === patient_id and x[1] === encounter date
# Sort by patient_id and date
rows_sorted = sorted(rows, key=lambda x: "%0.5d-%s" % (x[0], date_to_sortable_string(x[1])))
for row in rows_sorted:
print row
I stumbled upon the answer which was embarrassingly simple. The groupby statement has a ‘cumcount()’ option which will enumerate group items.
df['sequence']=df.groupby('patient').cumcount()
The caveat is that the records have to be in the order you want them enumerated.
Firstly you want to convert the date column to be a pandas datetime (rather than strings):
In [11]: pd.to_datetime(df['date'], format='%d%b%Y')
Out[11]:
0 2009-06-20
1 2009-06-24
2 2009-07-15
3 2008-02-09
4 2008-02-21
5 2010-03-14
6 2010-05-02
7 2010-05-12
Name: date, dtype: datetime64[ns]
Note: see docs for possible format options.
In [12]: df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
In [13]: df
Out[13]:
patient date sequence
0 145 2009-06-20 1
1 145 2009-06-24 2
2 145 2009-07-15 3
3 582 2008-02-09 1
4 582 2008-02-21 2
5 987 2010-03-14 1
6 987 2010-05-02 2
7 987 2010-05-12 3
If this isn’t in date order (for each patient), I would sort it first:
In [14]: df = df.sort('date')
Now you can groupby and cumcount:
In [15]: g = df.groupby('patient')
In [16]: g.cumcount() + 1
Out[16]:
2 1
3 2
0 1
1 2
4 1
5 2
6 3
dtype: int64
Which is what you want (althout it’s out of order):
In [17]: df['sequence'] = g.cumcount() + 1
In [18]: df
Out[18]:
patient date sequence
2 582 2008-02-09 1
3 582 2008-02-21 2
0 145 2009-06-24 1
1 145 2009-07-15 2
4 987 2010-03-14 1
5 987 2010-05-02 2
6 987 2010-05-12 3
To rearrange (though you may not need to) use sort_index
(or we could reindex if we saved the initial DataFrame’s index):*
In [19]: df.sort_index()
Out[19]:
patient date sequence
0 145 2009-06-24 1
1 145 2009-07-15 2
2 582 2008-02-09 1
3 582 2008-02-21 2
4 987 2010-03-14 1
5 987 2010-05-02 2
6 987 2010-05-12 3
This will start orderings starting at 1 instead of 0.
df['sequence']=df.groupby('patient').cumcount()+1
If you want the sequence to be sorted based on the values of another column, first sort the dataframe and then add the new sequence column.
For instance, if we want the sequence for patients visits sorted based on their visit date, the following code can be used.
df['sequence'] = df.sort_values(by=['patient', 'date']).groupby('patient']).cumcount() + 1
I have a dataframe of individuals who each have multiple records. I want to enumerate the record in the sequence for each individual in python. Essentially I would like to create the ‘sequence’ column in the following table:
patient date sequence
145 20Jun2009 1
145 24Jun2009 2
145 15Jul2009 3
582 09Feb2008 1
582 21Feb2008 2
987 14Mar2010 1
987 02May2010 2
987 12May2010 3
This is essentially the same question as here, but I am working in python and unable to implement the sql solution. I suspect I can use a groupby statement with an iterable count, but have so far been unsuccessful.
The question is how do I sort on multiple columns of data.
One simple trick is to use the key
parameter to the sorted function.
You’ll be sorting by a string built from the columns of the array.
rows = ...# your source data
def date_to_sortable_string(date):
# use datetime package to convert string to sortable date.
pass
# Assume x[0] === patient_id and x[1] === encounter date
# Sort by patient_id and date
rows_sorted = sorted(rows, key=lambda x: "%0.5d-%s" % (x[0], date_to_sortable_string(x[1])))
for row in rows_sorted:
print row
I stumbled upon the answer which was embarrassingly simple. The groupby statement has a ‘cumcount()’ option which will enumerate group items.
df['sequence']=df.groupby('patient').cumcount()
The caveat is that the records have to be in the order you want them enumerated.
Firstly you want to convert the date column to be a pandas datetime (rather than strings):
In [11]: pd.to_datetime(df['date'], format='%d%b%Y')
Out[11]:
0 2009-06-20
1 2009-06-24
2 2009-07-15
3 2008-02-09
4 2008-02-21
5 2010-03-14
6 2010-05-02
7 2010-05-12
Name: date, dtype: datetime64[ns]
Note: see docs for possible format options.
In [12]: df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
In [13]: df
Out[13]:
patient date sequence
0 145 2009-06-20 1
1 145 2009-06-24 2
2 145 2009-07-15 3
3 582 2008-02-09 1
4 582 2008-02-21 2
5 987 2010-03-14 1
6 987 2010-05-02 2
7 987 2010-05-12 3
If this isn’t in date order (for each patient), I would sort it first:
In [14]: df = df.sort('date')
Now you can groupby and cumcount:
In [15]: g = df.groupby('patient')
In [16]: g.cumcount() + 1
Out[16]:
2 1
3 2
0 1
1 2
4 1
5 2
6 3
dtype: int64
Which is what you want (althout it’s out of order):
In [17]: df['sequence'] = g.cumcount() + 1
In [18]: df
Out[18]:
patient date sequence
2 582 2008-02-09 1
3 582 2008-02-21 2
0 145 2009-06-24 1
1 145 2009-07-15 2
4 987 2010-03-14 1
5 987 2010-05-02 2
6 987 2010-05-12 3
To rearrange (though you may not need to) use sort_index
(or we could reindex if we saved the initial DataFrame’s index):*
In [19]: df.sort_index()
Out[19]:
patient date sequence
0 145 2009-06-24 1
1 145 2009-07-15 2
2 582 2008-02-09 1
3 582 2008-02-21 2
4 987 2010-03-14 1
5 987 2010-05-02 2
6 987 2010-05-12 3
This will start orderings starting at 1 instead of 0.
df['sequence']=df.groupby('patient').cumcount()+1
If you want the sequence to be sorted based on the values of another column, first sort the dataframe and then add the new sequence column.
For instance, if we want the sequence for patients visits sorted based on their visit date, the following code can be used.
df['sequence'] = df.sort_values(by=['patient', 'date']).groupby('patient']).cumcount() + 1