How to keep NaN in pivot table?
Question:
Looking to preserve NaN values when changing the shape of the dataframe.
These two questions may be related:
- How to preserve NaN instead of filling with zeros in pivot table?
- How to make two NaN as NaN after the operation instead of making it zero?
but not been able to use the answers provided – can I set a min count for np.sum somehow?
import pandas as pd
import numpy as np
df = pd.DataFrame([['Y1', np.nan], ['Y2', np.nan], ['Y1', 6], ['Y2',8]], columns=['A', 'B'], index=['1988-01-01','1988-01-01', '1988-01-04', '1988-01-04'])
df.index.name = 'Date'
df
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],aggfunc=np.sum)
pivot_df
The output is:
A Y1 Y2
Date
1988-01-01 0.0 0.0
1988-01-04 6.0 8.0
and the desired output is:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
Answers:
In this case, I would resolve by groupby
:
(df.groupby(['Date', 'A']).B
.apply(lambda x: np.nan if x.isna().all() else x.sum())
.unstack('A')
)
output:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
Change isna().all()
to isna().any()
if needed.
From the helpful comments the following solution meets my requirements:
pivot_df_2 = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],aggfunc=min, dropna=False)
pivot_df_2
Values are supposed to be unique per slot so replacing the sum function with a min function shouldn’t make a difference (in my case)
It is possible to count the values, and drop when 0 (or less than the expected count):
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],
aggfunc=['sum','count'])
# build the mask from count
mask = (pivot_df.xs('count', axis=1) == 0) # or ...<min_limit
#build the actual pivot_df from sum
pivot_df = pivot_df.xs('sum', axis=1)
# and reset to NaN when not enough values
pivot_df[mask] = np.nan
It gives as expected:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
This one will give sensible result when you sum more than one value.
If you have no duplicate entries, use set_index
+ unstack
df.set_index('A', append=True)['B'].unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
If you have duplicates, use a groupby
with min_count
>> df
A B
Date
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
df.set_index('A', append=True).groupby(level=[0, 1])['B'].sum(min_count=1).unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 12.0 16.0
Try to add 'dropna= False'
to your pivot_table function?
I also wanted to keep NaN values, and I also wanted to keep using the pivot_table
function. So here is my solution:
pivot_df = pd.pivot_table(
df,
values='B',
index=['Date'],
columns=['A'],
aggfunc=lambda x: x.sum(min_count=1),
dropna=False
)
The downside is that this is less efficient in terms of computation time.
Looking to preserve NaN values when changing the shape of the dataframe.
These two questions may be related:
- How to preserve NaN instead of filling with zeros in pivot table?
- How to make two NaN as NaN after the operation instead of making it zero?
but not been able to use the answers provided – can I set a min count for np.sum somehow?
import pandas as pd
import numpy as np
df = pd.DataFrame([['Y1', np.nan], ['Y2', np.nan], ['Y1', 6], ['Y2',8]], columns=['A', 'B'], index=['1988-01-01','1988-01-01', '1988-01-04', '1988-01-04'])
df.index.name = 'Date'
df
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],aggfunc=np.sum)
pivot_df
The output is:
A Y1 Y2
Date
1988-01-01 0.0 0.0
1988-01-04 6.0 8.0
and the desired output is:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
In this case, I would resolve by groupby
:
(df.groupby(['Date', 'A']).B
.apply(lambda x: np.nan if x.isna().all() else x.sum())
.unstack('A')
)
output:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
Change isna().all()
to isna().any()
if needed.
From the helpful comments the following solution meets my requirements:
pivot_df_2 = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],aggfunc=min, dropna=False)
pivot_df_2
Values are supposed to be unique per slot so replacing the sum function with a min function shouldn’t make a difference (in my case)
It is possible to count the values, and drop when 0 (or less than the expected count):
pivot_df = pd.pivot_table(df, values='B', index=['Date'], columns=['A'],
aggfunc=['sum','count'])
# build the mask from count
mask = (pivot_df.xs('count', axis=1) == 0) # or ...<min_limit
#build the actual pivot_df from sum
pivot_df = pivot_df.xs('sum', axis=1)
# and reset to NaN when not enough values
pivot_df[mask] = np.nan
It gives as expected:
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
This one will give sensible result when you sum more than one value.
If you have no duplicate entries, use set_index
+ unstack
df.set_index('A', append=True)['B'].unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 6.0 8.0
If you have duplicates, use a groupby
with min_count
>> df
A B
Date
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
1988-01-01 Y1 NaN
1988-01-01 Y2 NaN
1988-01-04 Y1 6.0
1988-01-04 Y2 8.0
df.set_index('A', append=True).groupby(level=[0, 1])['B'].sum(min_count=1).unstack(-1)
A Y1 Y2
Date
1988-01-01 NaN NaN
1988-01-04 12.0 16.0
Try to add 'dropna= False'
to your pivot_table function?
I also wanted to keep NaN values, and I also wanted to keep using the pivot_table
function. So here is my solution:
pivot_df = pd.pivot_table(
df,
values='B',
index=['Date'],
columns=['A'],
aggfunc=lambda x: x.sum(min_count=1),
dropna=False
)
The downside is that this is less efficient in terms of computation time.