Sum last n rows of df.count() results into one row
Question:
I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
The result sums the value counts of the n=3 last rows up, deletes them and then adds them as one row to the original value counts. Also it would be nice to have everything as percents. Any ideas how to implement this? Cheers!
Answers:
For DataFrame with percentages use Series.iloc
with indexing, crate DataFrame by Series.to_frame
, add new row and new column filled by percentages:
s = df['category'].value_counts()
n= 3
out = s.iloc[:-n].to_frame('count')
out.loc['Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print (out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
I would use tail(-3)
to get the last values except for the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
This snippet
df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
categegory_counts = pd.DataFrame([df['category'].value_counts(),df['category'].value_counts(normalize=True)],index=["Count","Percent"]).T.reset_index()
other_rows = categegory_counts[cutoff_index:].set_index("index")
categegory_counts = categegory_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([categegory_counts,pd.DataFrame(other_rows.sum(),columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)
will give you what you need. Also in a nice format;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%
I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>> df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>> df['category'].value_counts()
z 4
x 4
y 3
u 2
v 1
w 1
>> ??
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
The result sums the value counts of the n=3 last rows up, deletes them and then adds them as one row to the original value counts. Also it would be nice to have everything as percents. Any ideas how to implement this? Cheers!
For DataFrame with percentages use Series.iloc
with indexing, crate DataFrame by Series.to_frame
, add new row and new column filled by percentages:
s = df['category'].value_counts()
n= 3
out = s.iloc[:-n].to_frame('count')
out.loc['Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print (out)
count pct
z 4 27%
x 4 27%
y 3 20%
Other (3) 4 27%
I would use tail(-3)
to get the last values except for the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
count pct
z 4 27.0
x 4 27.0
y 3 20.0
Others (3) 4 27.0
This snippet
df = pd.DataFrame({"category":['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
categegory_counts = pd.DataFrame([df['category'].value_counts(),df['category'].value_counts(normalize=True)],index=["Count","Percent"]).T.reset_index()
other_rows = categegory_counts[cutoff_index:].set_index("index")
categegory_counts = categegory_counts[:cutoff_index].set_index("index")
summary_table = pd.concat([categegory_counts,pd.DataFrame(other_rows.sum(),columns=[f"Other ({len(other_rows)})"]).T])
summary_table = summary_table.astype({'Count':'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x*100))
print(summary_table)
will give you what you need. Also in a nice format;)
Count Percent
z 4 26.67%
x 4 26.67%
y 3 20.00%
Other (3) 4 26.67%