Pandas groupby transform yields Series instead of DataFrame on empty DataFrames
Question:
Running this:
import numpy
import pandas

for periods in [8, 4, 0]:
    print(f'--- periods {periods}')
    df = pandas.DataFrame(dict(
        v1=numpy.arange(periods),
        v2=numpy.arange(periods) * 2),
        index=pandas.date_range('2023-01-01', periods=periods, freq='6H'))
    dft = df.between_time('00:00', '06:00')
    dft = dft.reindex_like(df)
    dfc = dft['v1'] > 3
    df = df[dfc.groupby(dfc.index.date).transform(any)]
    print(df)
    print(df.dtypes)
    print(df.index)
    print()
results in:
--- periods 8
                     v1  v2
2023-01-02 00:00:00   4   8
2023-01-02 06:00:00   5  10
2023-01-02 12:00:00   6  12
2023-01-02 18:00:00   7  14
v1    int64
v2    int64
dtype: object
DatetimeIndex(['2023-01-02 00:00:00', '2023-01-02 06:00:00',
               '2023-01-02 12:00:00', '2023-01-02 18:00:00'],
              dtype='datetime64[ns]', freq='6H')

--- periods 4
Empty DataFrame
Columns: [v1, v2]
Index: []
v1    int64
v2    int64
dtype: object
DatetimeIndex([], dtype='datetime64[ns]', freq='6H')

--- periods 0
Empty DataFrame
Columns: []
Index: []
Series([], dtype: object)
DatetimeIndex([], dtype='datetime64[ns]', freq='6H')
Why is the result for periods = 0 (i.e. empty DataFrame) a Series and not a DataFrame with columns v1 and v2?
Aside from checking whether df is empty beforehand, is there a way to return a DataFrame with both v1 and v2?
Answers:
Why is the result for periods = 0 (i.e. empty DataFrame) a Series and not a DataFrame with columns v1 and v2?
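Because the mask is now an empty Series. With no groups to process, the transform result contains no values and evidently does not come back with boolean dtype, so df[mask] treats it as an empty list of column labels rather than as a row filter. The result is still a DataFrame, but one with no columns (note the "Columns: []" line); the "Series([], dtype: object)" line in the output is df.dtypes for that column-less frame.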
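To illustrate how the dtype of an empty mask can decide whether the columns survive, here is a minimal sketch. The two masks are hand-built stand-ins, not the actual transform output, and the float dtype is only an assumption about what a non-boolean empty result might look like; the behaviour may vary between pandas versions.

```python
import numpy
import pandas

# Empty frame with the same columns as in the question.
df = pandas.DataFrame({"v1": numpy.arange(0), "v2": numpy.arange(0) * 2})

# Hand-built stand-ins for the mask (assumption: not the real transform output).
bool_mask = pandas.Series([], dtype=bool, index=df.index)
float_mask = pandas.Series([], dtype=float, index=df.index)

# A boolean key is treated as a row filter, so the columns are kept.
kept = df[bool_mask]
print(list(kept.columns))

# A non-boolean key is read as a list of column labels;
# an empty list of labels selects no columns at all.
dropped = df[float_mask]
print(list(dropped.columns))
```

Printing the dtype of dfc.groupby(dfc.index.date).transform(any) for periods = 0 is a quick way to check which case applies in your own pandas version.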
Aside from checking whether df is empty beforehand, is there a way to return a DataFrame with both v1 and v2?
Here is one way to achieve what you are asking for:
df.loc[dfc.groupby(dfc.index.date).transform(any), ["v1", "v2"]]
This also works, without explicitly specifying the columns:
df = df.loc[dfc.groupby(dfc.index.date).transform(any), :]
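Another option, offered only as a sketch, is to force the mask to a boolean NumPy array before indexing, so that it is always read as a row filter. The helper names make_frame and keep_days_with_match are hypothetical, and the frequency is written as a Timedelta (an assumption made for portability across pandas versions) instead of the '6H' string used in the question. The approach relies on the transform result being in the same row order as df.

```python
import numpy
import pandas


def make_frame(periods):
    """Build the frame from the question (hypothetical helper)."""
    index = pandas.date_range(
        "2023-01-01", periods=periods, freq=pandas.Timedelta(hours=6)
    )
    return pandas.DataFrame(
        {"v1": numpy.arange(periods), "v2": numpy.arange(periods) * 2},
        index=index,
    )


def keep_days_with_match(df):
    """Keep every day that has v1 > 3 between 00:00 and 06:00 (hypothetical helper)."""
    dft = df.between_time("00:00", "06:00").reindex_like(df)
    dfc = dft["v1"] > 3
    mask = dfc.groupby(dfc.index.date).transform(any)
    # Cast to a boolean array so that an empty mask is still a row filter.
    return df[mask.to_numpy(dtype=bool)]


for periods in [8, 4, 0]:
    out = keep_days_with_match(make_frame(periods))
    print(periods, list(out.columns), len(out))
```

Because the mask is positional after the cast, this only makes sense when the mask was derived from df itself, as it is here.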