How to extract a sequence of rows in output data in pandas

Question:

I have a datetime-based dataframe, shown below:

                   timestamp       value  ...         metric        
36 2014-04-02 17:20:00  125.098263  ...           25.098263         
14 2014-04-06 16:25:00  140.072787  ...           265.171050        
10 2014-04-11 09:00:00  127.882020  ...           393.053070        
45 2014-04-11 09:05:00  115.705719  ...           508.758789        
24 2014-04-11 09:15:00  127.261178  ...           636.019967        
17 2014-04-11 09:20:00  121.157997  ...           757.177965        
49 2014-04-11 09:25:00  120.468468  ...           877.646433        
8  2014-04-11 09:45:00  135.642696  ...           1013.289128       
33 2014-04-11 09:55:00  125.210049  ...           1138.499178       
19 2014-04-11 10:05:00  159.259713  ...           1297.758890       
52 2014-04-11 10:20:00  150.082482  ...           1447.841373

I want to create a new column named 'diff_col' that contains either 'same' or 'diff'.
If a date is not continuous with the rows around it, it should be labelled 'diff'; otherwise it is 'same'.
In the above dataframe, 2014-04-02 17:20:00 and 2014-04-06 16:25:00 fall on different dates from the remaining datetime values.

How can I create diff_col?

I tried,
df['diff_col'] = df.groupby(pd.Grouper(key='timestamp', freq='1D'))

but it didn’t correctly create the expected column.
My required dataframe is as below,

         timestamp       value  ...         metric               diff_col
36 2014-04-02 17:20:00  125.098263  ...           25.098263         diff
14 2014-04-06 16:25:00  140.072787  ...           265.171050        diff
10 2014-04-11 09:00:00  127.882020  ...           393.053070        same
45 2014-04-11 09:05:00  115.705719  ...           508.758789        same
24 2014-04-11 09:15:00  127.261178  ...           636.019967        same
17 2014-04-11 09:20:00  121.157997  ...           757.177965        same
49 2014-04-11 09:25:00  120.468468  ...           877.646433        same
8  2014-04-11 09:45:00  135.642696  ...           1013.289128       same
33 2014-04-11 09:55:00  125.210049  ...           1138.499178       same
19 2014-04-11 10:05:00  159.259713  ...           1297.758890       same
52 2014-04-11 10:20:00  150.082482  ...           1447.841373       same

Please provide suggestions on this.

Thanks,
Kumar

Asked By: user680288


Answers:

You can compare successive rows to check whether they fall on the same date (extracted with dt.normalize) and use the result as a grouper to get the size of each run of consecutive same-date rows with groupby.transform('size'). If the size is greater than 1, set 'same', otherwise 'diff', with the help of numpy.where:

import numpy as np
import pandas as pd

# ensure datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# get day
s = df['timestamp'].dt.normalize()

# compare successive rows and identify group size
df['diff_col'] = np.where(df.groupby(s.ne(s.shift()).cumsum())
                            .transform('size').gt(1),
                          'same', 'diff')

Output:

             timestamp       value  ...       metric diff_col
36 2014-04-02 17:20:00  125.098263  ...    25.098263     diff
14 2014-04-06 16:25:00  140.072787  ...   265.171050     diff
10 2014-04-11 09:00:00  127.882020  ...   393.053070     same
45 2014-04-11 09:05:00  115.705719  ...   508.758789     same
24 2014-04-11 09:15:00  127.261178  ...   636.019967     same
17 2014-04-11 09:20:00  121.157997  ...   757.177965     same
49 2014-04-11 09:25:00  120.468468  ...   877.646433     same
8  2014-04-11 09:45:00  135.642696  ...  1013.289128     same
33 2014-04-11 09:55:00  125.210049  ...  1138.499178     same
19 2014-04-11 10:05:00  159.259713  ...  1297.758890     same
52 2014-04-11 10:20:00  150.082482  ...  1447.841373     same
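
To make the intermediate steps visible, here is a minimal self-contained sketch of the same idea, split into named pieces. The small frame and the names day, starts_new_run, run_id and run_size are made up for illustration and are not part of the original post; the frame is assumed to be already sorted by timestamp.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: a shortened stand-in for the asker's frame.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2014-04-02 17:20:00",
        "2014-04-06 16:25:00",
        "2014-04-11 09:00:00",
        "2014-04-11 09:05:00",
        "2014-04-11 10:20:00",
    ]),
    "value": [125.1, 140.1, 127.9, 115.7, 150.1],
})

# Midnight of each row's calendar day.
day = df["timestamp"].dt.normalize()

# True where the day differs from the previous row's day
# (the first row has no predecessor, so it always starts a run).
starts_new_run = day.ne(day.shift())

# Cumulative sum of run starts: one integer id per block of
# consecutive rows that share a day.
run_id = starts_new_run.cumsum()

# Number of rows in each run, broadcast back onto every row.
run_size = run_id.groupby(run_id).transform("size")

# Runs with more than one row are 'same', single-row runs are 'diff'.
df["diff_col"] = np.where(run_size.gt(1), "same", "diff")
print(df)
```

Because the runs are built from consecutive rows, two rows on the same date that are separated by a row from another date would each be labelled 'diff'. Sorting by timestamp first avoids that if it is not what you want.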
Answered By: mozway