Python Pandas: groupby.diff calculates difference between the last element of a group and the first element of the following group

Question:

I have the following – already sorted – pandas dataframe:

instrumentExtId Date                proxyMethod     isForceXS   xsValue     curveValue
.ID1            2008-03-28 00:00:00 CrossSectional  FALSE       6.86046681  6.86046681
.ID1            2008-03-31 00:00:00 CrossSectional  FALSE       6.97468855  6.97468855
.ID1            2008-04-01 00:00:00 CrossSectional  FALSE       6.83893432  6.83893432
.ID1            2008-04-02 00:00:00 CrossSectional  FALSE       6.70250452  6.70250452
.ID2            2008-03-28 00:00:00 CrossSectional  FALSE       3.10441877  3.10441877
.ID2            2008-03-31 00:00:00 CrossSectional  FALSE       3.5104612   3.5104612
.ID2            2008-04-01 00:00:00 CrossSectional  FALSE       3.52994089  3.52994089
.ID2            2008-04-02 00:00:00 CrossSectional  FALSE       3.24236585  3.24236585

For each ID and for each date, I want to apply the inverse hyperbolic sine function (np.arcsinh) on columns "xsValue" and "curveValue" and then, calculate the date to date difference for each ID. I want columns "proxyMethod" and "isForceXS" to be preserved.

I wrote the following code, but it seems that for the first row of the second ID, the difference is calculated between the first observation of the second ID and the last observation of the first ID. I was expecting a nan there (i.e. what I would like to see). What do I miss?

df = df.groupby(['instrumentExtId', 'Date', 'proxyMethod', 'isForceXS'], group_keys = True)[["xsValue","curveValue"]].
        apply(lambda x:np.arcsinh(x)). 
        diff(). 
        reset_index().
        drop(["level_4"], axis=1)
instrumentExtId Date                proxyMethod    isForceXS    xsValue      curveValue
.ID1            2008-03-28 00:00:00 CrossSectional  FALSE       nan          nan
.ID1            2008-03-31 00:00:00 CrossSectional  FALSE       0.016342295   0.016342295
.ID1            2008-04-01 00:00:00 CrossSectional  FALSE       -0.01945289  -0.01945289
.ID1            2008-04-02 00:00:00 CrossSectional  FALSE       -0.019934368 -0.019934368
.ID2            2008-03-28 00:00:00 CrossSectional  FALSE       -0.750188187 -0.75018818
.ID2            2008-03-31 00:00:00 CrossSectional  FALSE        0.117631041  0.117631041
.ID2            2008-04-01 00:00:00 CrossSectional  FALSE        0.005323083  0.005323083
.ID2            2008-04-02 00:00:00 CrossSectional  FALSE       -0.081488875 -0.081488875

Asked By: Whitebeard13

||

Answers:

I think if you break apart that line into separate steps and print the output after each step, you’ll figure out what’s going wrong.

If I understand your intention correctly:

  • It sounds like you should apply archsinh BEFORE grouping.
  • You should include only instrumentExtId in the groupby, nothing else.

I think this code will achieve what you want, unless I misunderstand your goal:

from io import StringIO
import pandas as pd

s = """
instrumentExtId   Date               proxyMethod     isForceXS   xsValue     curveValue
.ID1            2008-03-28 00:00:00  CrossSectional  FALSE       6.86046681  6.86046681
.ID1            2008-03-31 00:00:00  CrossSectional  FALSE       6.97468855  6.97468855
.ID1            2008-04-01 00:00:00  CrossSectional  FALSE       6.83893432  6.83893432
.ID1            2008-04-02 00:00:00  CrossSectional  FALSE       6.70250452  6.70250452
.ID2            2008-03-28 00:00:00  CrossSectional  FALSE       3.10441877  3.10441877
.ID2            2008-03-31 00:00:00  CrossSectional  FALSE       3.5104612   3.5104612
.ID2            2008-04-01 00:00:00  CrossSectional  FALSE       3.52994089  3.52994089
.ID2            2008-04-02 00:00:00  CrossSectional  FALSE       3.24236585  3.24236585
"""
df = pd.read_csv(StringIO(s), sep='ss+', engine='python')
df = df.sort_values(['instrumentExtId', 'Date'])
df['xs'] = np.arcsinh(df['xsValue'])
df['curve'] = np.arcsinh(df['curveValue'])

df[['xs_diff', 'curve_diff']] = df.groupby('instrumentExtId')[['xs', 'curve']].diff()

print(df)
  instrumentExtId                 Date     proxyMethod  isForceXS   xsValue  curveValue        xs     curve   xs_diff  curve_diff
0            .ID1  2008-03-28 00:00:00  CrossSectional      False  6.860467    6.860467  2.624193  2.624193       NaN         NaN
1            .ID1  2008-03-31 00:00:00  CrossSectional      False  6.974689    6.974689  2.640535  2.640535  0.016342    0.016342
2            .ID1  2008-04-01 00:00:00  CrossSectional      False  6.838934    6.838934  2.621082  2.621082 -0.019453   -0.019453
3            .ID1  2008-04-02 00:00:00  CrossSectional      False  6.702505    6.702505  2.601148  2.601148 -0.019934   -0.019934
4            .ID2  2008-03-28 00:00:00  CrossSectional      False  3.104419    3.104419  1.850959  1.850959       NaN         NaN
5            .ID2  2008-03-31 00:00:00  CrossSectional      False  3.510461    3.510461  1.968590  1.968590  0.117631    0.117631
6            .ID2  2008-04-01 00:00:00  CrossSectional      False  3.529941    3.529941  1.973914  1.973914  0.005323    0.005323
7            .ID2  2008-04-02 00:00:00  CrossSectional      False  3.242366    3.242366  1.892425  1.892425 -0.081489   -0.081489

PS — In your example data, xsValue and curveValue are identical. Was that intentional?

Answered By: Stuart Berg
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.