Calculate the difference between successive dates in one column, grouped by another column, in pandas?

Question:

I have a pandas DataFrame:

data = pd.DataFrame([['Car','2019-01-06T21:44:09Z'],
                     ['Train','2019-01-06T19:44:09Z'],
                     ['Train','2019-01-02T19:44:09Z'],
                     ['Car','2019-01-08T06:44:09Z'],
                     ['Car','2019-01-06T18:44:09Z'],
                     ['Train','2019-01-04T19:44:09Z'],
                     ['Car','2019-01-05T16:34:09Z'],
                     ['Train','2019-01-08T19:44:09Z'],
                     ['Car','2019-01-07T14:44:09Z'],
                     ['Car','2019-01-06T11:44:09Z'],
                     ['Train','2019-01-10T19:44:09Z'],
                     ], 
                    columns=['Type', 'Date'])

I need to find the difference between successive dates for each Type, after sorting them by date.

The final data should look like this:

data = pd.DataFrame([['Car','2019-01-06T21:44:09Z',1],
                     ['Train','2019-01-06T19:44:09Z',4],
                     ['Train','2019-01-02T19:44:09Z',0],
                     ['Car','2019-01-08T06:44:09Z',3],
                     ['Car','2019-01-06T18:44:09Z',1],
                     ['Train','2019-01-04T19:44:09Z',2],
                     ['Car','2019-01-05T16:34:09Z',0],
                     ['Train','2019-01-08T19:44:09Z',6],
                     ['Car','2019-01-07T14:44:09Z',2],
                     ['Car','2019-01-06T11:44:09Z',1],
                     ['Train','2019-01-10T19:44:09Z',8],
                     ], 
                    columns=['Type', 'Date','diff'])

Here, for Type Car, min(Date) is 2019-01-05T16:34:09Z, so its diff is 0. The next dates, 2019-01-06T11:44:09Z and 2019-01-06T18:44:09Z, each get a diff of 1 day (I am not sure whether the time of day can be included), and so on.
For Type Train, min(Date) is 2019-01-02T19:44:09Z, so its diff is 0; the next date, 2019-01-04T19:44:09Z, gets a diff of 2 days.

I tried groupby, but I am not sure how to sort by date within each group:

data['diff'] = data.groupby('Type')['Date'].diff() / np.timedelta64(1, 'D')
Asked By: hanzgs

||

Answers:

Use pandas.DataFrame.groupby with dt.date (Date must already be a datetime column, e.g. converted with pd.to_datetime):

df['diff'] = df.groupby('Type')['Date'].apply(lambda x: x.dt.date - x.min().date())

Output:

     Type                      Date   diff
0     Car 2019-01-06 21:44:09+00:00 1 days
1   Train 2019-01-06 19:44:09+00:00 4 days
2   Train 2019-01-02 19:44:09+00:00 0 days
3     Car 2019-01-08 06:44:09+00:00 3 days
4     Car 2019-01-06 18:44:09+00:00 1 days
5   Train 2019-01-04 19:44:09+00:00 2 days
6     Car 2019-01-05 16:34:09+00:00 0 days
7   Train 2019-01-08 19:44:09+00:00 6 days
8     Car 2019-01-07 14:44:09+00:00 2 days
9     Car 2019-01-06 11:44:09+00:00 1 days
10  Train 2019-01-10 19:44:09+00:00 8 days

If you want them to be int, add dt.days:

df['diff'] = df.groupby('Type')['Date'].apply(lambda x: x.dt.date - x.min().date()).dt.days

Output:

     Type                      Date  diff
0     Car 2019-01-06 21:44:09+00:00     1
1   Train 2019-01-06 19:44:09+00:00     4
2   Train 2019-01-02 19:44:09+00:00     0
3     Car 2019-01-08 06:44:09+00:00     3
4     Car 2019-01-06 18:44:09+00:00     1
5   Train 2019-01-04 19:44:09+00:00     2
6     Car 2019-01-05 16:34:09+00:00     0
7   Train 2019-01-08 19:44:09+00:00     6
8     Car 2019-01-07 14:44:09+00:00     2
9     Car 2019-01-06 11:44:09+00:00     1
10  Train 2019-01-10 19:44:09+00:00     8
Answered By: Chris
  • First, convert Date to a plain date and store it in a helper column.
  • Use a lambda function to subtract each group's minimum date, then take dt.days to get the number of days.
  • Finally, drop the helper column.
data['Date_date'] = pd.to_datetime(data['Date']).dt.date
data['diff'] = data.groupby(['Type'])['Date_date'].apply(lambda x:(x-x.min()).dt.days)
data.drop(['Date_date'],axis=1,inplace=True,errors='ignore')
print(data)
     Type                  Date  diff
0     Car  2019-01-06T21:44:09Z     1
1   Train  2019-01-06T19:44:09Z     4
2   Train  2019-01-02T19:44:09Z     0
3     Car  2019-01-08T06:44:09Z     3
4     Car  2019-01-06T18:44:09Z     1
5   Train  2019-01-04T19:44:09Z     2
6     Car  2019-01-05T16:34:09Z     0
7   Train  2019-01-08T19:44:09Z     6
8     Car  2019-01-07T14:44:09Z     2
9     Car  2019-01-06T11:44:09Z     1
10  Train  2019-01-10T19:44:09Z     8
Answered By: tawab_shakeel
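
The answers in this thread measure days since each Type's earliest date, which is what the expected output in the question shows. If the gap between consecutive rows is wanted instead (the literal reading of the title, and what the asker's groupby(...).diff() attempt was reaching for), a minimal sketch is to sort first and then take the diff within each group. The column names ts and step_days are illustrative assumptions, not part of the original question, and the result intentionally differs from the expected output above.

```python
import pandas as pd

data = pd.DataFrame(
    [["Car", "2019-01-06T21:44:09Z"],
     ["Train", "2019-01-06T19:44:09Z"],
     ["Train", "2019-01-02T19:44:09Z"],
     ["Car", "2019-01-08T06:44:09Z"],
     ["Car", "2019-01-06T18:44:09Z"],
     ["Train", "2019-01-04T19:44:09Z"],
     ["Car", "2019-01-05T16:34:09Z"],
     ["Train", "2019-01-08T19:44:09Z"],
     ["Car", "2019-01-07T14:44:09Z"],
     ["Car", "2019-01-06T11:44:09Z"],
     ["Train", "2019-01-10T19:44:09Z"]],
    columns=["Type", "Date"],
)

# Parse the ISO strings into timezone-aware timestamps.
ts = pd.to_datetime(data["Date"], utc=True)

# Sort by Type and timestamp so diff() compares each row with the
# previous timestamp of the same Type.
ordered = data.assign(ts=ts).sort_values(["Type", "ts"])
step = ordered.groupby("Type")["ts"].diff()

# Express the gap in (fractional) days; the first row of each group
# has no predecessor, so its NaT becomes 0. Assignment aligns on the
# original index, so data keeps its original row order.
data["step_days"] = (step / pd.Timedelta(days=1)).fillna(0.0)
print(data)
```

Because the result is aligned on the index, there is no need to re-sort the frame afterwards. With this data, every Train row after the first is exactly 2 days from its predecessor.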

Subtract the per-group minimum directly, using transform:

s = pd.to_datetime(data['Date']).dt.date
data['diff'] = (s - s.groupby(data.Type).transform('min')).dt.days

Out[36]:
     Type                  Date  diff
0     Car  2019-01-06T21:44:09Z     1
1   Train  2019-01-06T19:44:09Z     4
2   Train  2019-01-02T19:44:09Z     0
3     Car  2019-01-08T06:44:09Z     3
4     Car  2019-01-06T18:44:09Z     1
5   Train  2019-01-04T19:44:09Z     2
6     Car  2019-01-05T16:34:09Z     0
7   Train  2019-01-08T19:44:09Z     6
8     Car  2019-01-07T14:44:09Z     2
9     Car  2019-01-06T11:44:09Z     1
10  Train  2019-01-10T19:44:09Z     8
Answered By: Andy L.
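
As a self-contained variant of the transform idea, the sketch below stays in datetime64/timedelta64 throughout by using dt.normalize() (which resets the time to midnight) instead of dt.date. It also addresses the asker's aside about including the time of day by computing a fractional-day column. The column name diff_frac is an assumption made for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    [["Car", "2019-01-06T21:44:09Z"],
     ["Train", "2019-01-06T19:44:09Z"],
     ["Train", "2019-01-02T19:44:09Z"],
     ["Car", "2019-01-08T06:44:09Z"],
     ["Car", "2019-01-06T18:44:09Z"],
     ["Train", "2019-01-04T19:44:09Z"],
     ["Car", "2019-01-05T16:34:09Z"],
     ["Train", "2019-01-08T19:44:09Z"],
     ["Car", "2019-01-07T14:44:09Z"],
     ["Car", "2019-01-06T11:44:09Z"],
     ["Train", "2019-01-10T19:44:09Z"]],
    columns=["Type", "Date"],
)

# Parse the ISO strings into timezone-aware timestamps.
ts = pd.to_datetime(data["Date"], utc=True)

# Calendar-day difference: drop the time of day, then subtract the
# earliest day of each Type (transform broadcasts it back per row).
days = ts.dt.normalize()
data["diff"] = (days - days.groupby(data["Type"]).transform("min")).dt.days

# Time-aware difference: same idea on the full timestamps, expressed
# as fractional days.
first_seen = ts.groupby(data["Type"]).transform("min")
data["diff_frac"] = (ts - first_seen) / pd.Timedelta(days=1)

print(data)
```

The diff column reproduces the integers from the question's expected output; diff_frac keeps the hours and minutes, so it is only a whole number when the time of day matches the group's earliest row.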

Just to add: I need help with similar data. How can we find the difference in months between successive dates?

Output:
| Type| Date | Month Diff|
|:---- |:------: | -----:|
| Car | 2019-01-06| 0 |
| Car | 2019-03-02| 2 |
| Car | 2019-07-06| 4 |
| Car | 2019-08-23| 1 |
| Car | 2019-11-23| 3 |
| Train | 2020-01-23| 0 |
| Train | 2019-03-23| 2 |
| Train | 2019-09-23| 6 |

Answered By: Hussain Madarwala
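
One possible way to get such a month difference, sketched under the assumption that only calendar months count (days are ignored): turn each date into a running month number (year * 12 + month), sort within each Type, and take the diff per group. The variable name month_index is illustrative. The rows are copied from the table above; note that the Train rows there are not in chronological order, so after sorting, the Train values produced here will not match the table.

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["Car", "Car", "Car", "Car", "Car", "Train", "Train", "Train"],
    "Date": ["2019-01-06", "2019-03-02", "2019-07-06", "2019-08-23",
             "2019-11-23", "2020-01-23", "2019-03-23", "2019-09-23"],
})
df["Date"] = pd.to_datetime(df["Date"])

# Order rows chronologically within each Type.
df = df.sort_values(["Type", "Date"])

# Running month number: consecutive calendar months differ by exactly 1,
# including across a year boundary.
month_index = df["Date"].dt.year * 12 + df["Date"].dt.month

# Difference to the previous row of the same Type; the first row of
# each group has no predecessor and is set to 0.
df["Month Diff"] = month_index.groupby(df["Type"]).diff().fillna(0).astype(int)
print(df)
```

For the Car rows this gives 0, 2, 4, 1, 3, matching the table.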