Pandas groupby for zero values
Question:
I have data like this in a csv file
Symbol Action Year
AAPL Buy 2001
AAPL Buy 2001
BAC Sell 2002
BAC Sell 2002
I am able to read it and groupby like this
df.groupby(['Symbol','Year']).count()
I get
Action
Symbol Year
AAPL 2001 2
BAC 2002 2
I desire this (order does not matter)
Action
Symbol Year
AAPL 2001 2
AAPL 2002 0
BAC 2001 0
BAC 2002 2
I want to know if its possible to count for zero occurances
Answers:
You can use pivot_table
with unstack
:
print df.pivot_table(index='Symbol',
columns='Year',
values='Action',
fill_value=0,
aggfunc='count').unstack()
Year Symbol
2001 AAPL 2
BAC 0
2002 AAPL 0
BAC 2
dtype: int64
If you need output as DataFrame
use to_frame
:
print df.pivot_table(index='Symbol',
columns='Year',
values='Action',
fill_value=0,
aggfunc='count').unstack()
.to_frame()
.rename(columns={0:'Action'})
Action
Year Symbol
2001 AAPL 2
BAC 0
2002 AAPL 0
BAC 2
If you want to do this without using pivot_table, you can try the below approach:
midx = pd.MultiIndex.from_product([ df['Symbol'].unique(), df['Year'].unique()], names=['Symbol', 'Year'])
df_grouped_by = df_grouped_by.reindex(midx, fill_value=0)
What we are essentially doing above is creating a multi-index of all the possible values multiplying the two columns and then using that multi-index to fill zeroes into our group-by dataframe.
Step 1: Create a dataframe that stores the count of each non-zero class in the column counts
count_df = df.groupby(['Symbol','Year']).size().reset_index(name='counts')
Step 2: Now use pivot_table to get the desired dataframe with counts for both existing and non-existing classes.
df_final = pd.pivot_table(count_df,
index=['Symbol','Year'],
values='counts',
fill_value = 0,
dropna=False,
aggfunc=np.sum)
Now the values of the counts can be extracted as a list with the command
list(df_final['counts'])
You can use this:
df = df.groupby(['Symbol','Year']).count().unstack(fill_value=0).stack()
print (df)
Output:
Action
Symbol Year
AAPL 2001 2
2002 0
BAC 2001 0
2002 2
Datatype category
Maybe this feature didn’t exist back when this thread was opened, however the datatype "category" can help here:
# create a dataframe with one combination of a,b missing
df = pd.DataFrame({"a":[0,1,1], "b": [0,1,0]})
df = df.astype({"a":"category", "b":"category"})
print(df)
Dataframe looks like this:
a b
0 0 0
1 1 1
2 1 0
And now, grouping by a and b
print(df.groupby(["a","b"]).size())
yields:
a b
0 0 1
1 0
1 0 1
1 1
Note the 0 in the rightmost column. This behavior is also documented in the pandas userguide (search on page for "groupby").
All the answers above are focusing on groupby or pivot table. However, as is well described in this article and in this question, this is a beautiful case for pandas’ crosstab
function:
import pandas as pd
df = pd.DataFrame({
"Symbol": 2*['AAPL', 'BAC'],
"Action": 2*['Buy', 'Sell'],
"Year": 2*[2001,2002]
})
pd.crosstab(df["Symbol"], df["Year"]).stack()
yielding:
Symbol Year
AAPL 2001 2
2002 0
BAC 2001 0
2002 2
I have data like this in a csv file
Symbol Action Year
AAPL Buy 2001
AAPL Buy 2001
BAC Sell 2002
BAC Sell 2002
I am able to read it and groupby like this
df.groupby(['Symbol','Year']).count()
I get
Action
Symbol Year
AAPL 2001 2
BAC 2002 2
I desire this (order does not matter)
Action
Symbol Year
AAPL 2001 2
AAPL 2002 0
BAC 2001 0
BAC 2002 2
I want to know if its possible to count for zero occurances
You can use pivot_table
with unstack
:
print df.pivot_table(index='Symbol',
columns='Year',
values='Action',
fill_value=0,
aggfunc='count').unstack()
Year Symbol
2001 AAPL 2
BAC 0
2002 AAPL 0
BAC 2
dtype: int64
If you need output as DataFrame
use to_frame
:
print df.pivot_table(index='Symbol',
columns='Year',
values='Action',
fill_value=0,
aggfunc='count').unstack()
.to_frame()
.rename(columns={0:'Action'})
Action
Year Symbol
2001 AAPL 2
BAC 0
2002 AAPL 0
BAC 2
If you want to do this without using pivot_table, you can try the below approach:
midx = pd.MultiIndex.from_product([ df['Symbol'].unique(), df['Year'].unique()], names=['Symbol', 'Year'])
df_grouped_by = df_grouped_by.reindex(midx, fill_value=0)
What we are essentially doing above is creating a multi-index of all the possible values multiplying the two columns and then using that multi-index to fill zeroes into our group-by dataframe.
Step 1: Create a dataframe that stores the count of each non-zero class in the column counts
count_df = df.groupby(['Symbol','Year']).size().reset_index(name='counts')
Step 2: Now use pivot_table to get the desired dataframe with counts for both existing and non-existing classes.
df_final = pd.pivot_table(count_df,
index=['Symbol','Year'],
values='counts',
fill_value = 0,
dropna=False,
aggfunc=np.sum)
Now the values of the counts can be extracted as a list with the command
list(df_final['counts'])
You can use this:
df = df.groupby(['Symbol','Year']).count().unstack(fill_value=0).stack()
print (df)
Output:
Action
Symbol Year
AAPL 2001 2
2002 0
BAC 2001 0
2002 2
Datatype category
Maybe this feature didn’t exist back when this thread was opened, however the datatype "category" can help here:
# create a dataframe with one combination of a,b missing
df = pd.DataFrame({"a":[0,1,1], "b": [0,1,0]})
df = df.astype({"a":"category", "b":"category"})
print(df)
Dataframe looks like this:
a b
0 0 0
1 1 1
2 1 0
And now, grouping by a and b
print(df.groupby(["a","b"]).size())
yields:
a b
0 0 1
1 0
1 0 1
1 1
Note the 0 in the rightmost column. This behavior is also documented in the pandas userguide (search on page for "groupby").
All the answers above are focusing on groupby or pivot table. However, as is well described in this article and in this question, this is a beautiful case for pandas’ crosstab
function:
import pandas as pd
df = pd.DataFrame({
"Symbol": 2*['AAPL', 'BAC'],
"Action": 2*['Buy', 'Sell'],
"Year": 2*[2001,2002]
})
pd.crosstab(df["Symbol"], df["Year"]).stack()
yielding:
Symbol Year
AAPL 2001 2
2002 0
BAC 2001 0
2002 2