setting -t, +t around the event date, NaN giving a headache

Question:

I have a df

id   date   eventdate
A    2020Q1 2020Q3
A    2020Q2 2020Q3
A    2020Q3 2020Q3
A    2020Q4 2020Q3
B    2019Q1 2019Q2
B    2019Q2 2019Q2
B    2019Q3 2019Q2
B    2019Q4 2019Q2
C    2020Q1 NaN
C    2020Q2 NaN
C    2020Q3 NaN
C    2020Q4 NaN
D    2019Q2 NaN
D    2019Q3 NaN
D    2019Q4 NaN
...

I want to calculate the time index around the event date such that if both dates match, t=0, and -t and +t around the event date.

desired df

id   date   eventdate  t
A    2020Q1 2020Q3     -2
A    2020Q2 2020Q3     -1
A    2020Q3 2020Q3      0
A    2020Q4 2020Q3      1
B    2019Q1 2019Q2     -1
B    2019Q2 2019Q2      0
B    2019Q3 2019Q2      1
B    2019Q4 2019Q2      2
C    2020Q1 NaN         NaN
C    2020Q2 NaN         NaN
C    2020Q3 NaN         NaN
C    2020Q4 NaN         NaN
D    2019Q2 NaN         NaN
D    2019Q3 NaN         NaN
D    2019Q4 NaN         NaN

So i use the following code

df['t'] = (pd.PeriodIndex(df['date'], freq='Q').astype('int') - pd.PeriodIndex(df['eventdate'], freq='Q').astype('int'))

and i get strangely big negative numbers for which eventdates aren’t available

count    8.812475e+06
mean    -4.565033e+18
std      4.611450e+18
min     -9.223372e+18
25%     -9.223372e+18
50%     -8.000000e+00
75%      3.000000e+00
max      2.300000e+01
Name: t, dtype: float64

Any idea what’s causing this issue? thanks.

Asked By: Olive

||

Answers:

Problem is that pd.PeriodIndex converts NaN value to NaT. When you convert a NaT value to int with .astype('int'), it gives -9223372036854775808.

You can check the data is NaT when accessing the n attribute

df['t'] = ((pd.PeriodIndex(df['date'], freq='Q') -
            pd.PeriodIndex(df['eventdate'], freq='Q'))
           .map(lambda x: np.nan if x != x else x.n))
           # or use pd.isna
           # .map(lambda x: np.nan if pd.isna(x) else x.n)

Or mask the NaN value

df['t'] = ((pd.PeriodIndex(df['date'], freq='Q').astype('int') -
            pd.PeriodIndex(df['eventdate'], freq='Q').astype('int'))
           .to_series(range(len(df))).mask(df[['date', 'eventdate']].isna().any(axis=1)))
print(df)

   id    date eventdate    t
0   A  2020Q1    2020Q3 -2.0
1   A  2020Q2    2020Q3 -1.0
2   A  2020Q3    2020Q3  0.0
3   A  2020Q4    2020Q3  1.0
4   B  2019Q1    2019Q2 -1.0
5   B  2019Q2    2019Q2  0.0
6   B  2019Q3    2019Q2  1.0
7   B  2019Q4    2019Q2  2.0
8   C  2020Q1       NaN  NaN
9   C  2020Q2       NaN  NaN
10  C  2020Q3       NaN  NaN
11  C  2020Q4       NaN  NaN
12  D  2019Q2       NaN  NaN
13  D  2019Q3       NaN  NaN
14  D  2019Q4       NaN  NaN
Answered By: Ynjxsjmh
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.