Pandas / Matplotlib bar plot with multi index dataframe

Question:

I have a sorted Multi-Index pandas data frame, which I need to plot in a bar chart. My data frame.

I either haven’t found the solution yet or a simple one doesn’t exist, but I need to plot a bar chart of this data with Category and Content on the x-axis and Installs as the bar height.

In simple terms, I need to show what each bar consists of, e.g. 20% of it from Everyone, 40% from Teen, etc. I’m not sure that is even possible: a mean of means would be wrong because the sample sizes differ. That is why I added an Uploads column to compute the mean from, but I haven’t gotten as far as plotting by mean.

I think plotting the cumulative values would give a wrong result, though.

I need a bar chart where the x-ticks are the Category values (preferably just the first 10), each x-tick has one bar per Content value (not always 3; it could be just "Everyone" and "Teen"), and the height of each bar is Installs.

Ideally, it should look like so: Bar Chart

but with each bar split into separate bars for the Content values of that specific Category.

I have tried flattening it with DataFrame.unstack(), but that ruins the sorting of the data frame, so I used Cat2 = Cat1.reset_index(level = [0,1]) instead. I still need help with the plotting.

So far I have:

Cat = Popular.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum"})
Uploads = Popular[["Category","Content"]].value_counts().rename_axis(["Category","Content"]).reset_index(name = "Uploads")
Cat = pd.merge(Cat, Uploads, on = ["Category","Content"])
Cat = Cat.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum", "Uploads": "sum"})

which gives this

result

Then I sort it like so

Cat1 = Cat.unstack() 
Cat1 = Cat1.sort_index(key = (Cat1["Installs"].sum(axis = 1)/Cat1["Uploads"].sum(axis = 1)).get, ascending = False).stack()

Thanks to one of those solutions

That’s all I have.

The data set is from Kaggle and is over 600 MB. I don’t expect anyone to download it, but I’d appreciate at least a simple guide towards a solution.

P.S.
This should also help me split each dot in the scatter plot below in the same way, but if not, that’s fine.

P.P.S.
I don’t have enough reputation to post pictures, so apologies for the links

Asked By: Dmitrii Ponomarev

Answers:

Edit: added the code to compute "Installs" percentage per "Category".

The dataset is big, but you should have provided mock data to easily reproduce the example, as follows:

import pandas as pd
import numpy as np


categories = ["Productivity", "Arcade", "Business", "Social"]
contents = ["Everyone", "Matute", "Teen"]

index = pd.MultiIndex.from_product(
    [categories, contents], names=["Category", "Content"]
)
# No seed is set, so the values differ on every run (and between the printouts below)
installs = np.random.randint(low=100, high=999, size=len(index))

df = pd.DataFrame({"Installs": installs}, index=index)
>>> df

                       Installs
Category     Content
Productivity Everyone       149
             Matute         564
             Teen           301
Arcade       Everyone       926
             Matute         542
             Teen           556
Business     Everyone       879
             Matute         921
             Teen           323
Social       Everyone       329
             Matute         320
             Teen           426

If you want to compute "Installs" percentage per "Category", use groupby().apply():

>>> df["Installs (%)"] = (
...     df["Installs"]
...     .groupby(by="Category", group_keys=False)
...     .apply(lambda s: s / s.sum() * 100)  # s is one category's Installs
... )
>>> df

                       Installs  Installs (%)
Category     Content
Productivity Everyone       513     22.246314
             Matute         839     36.383348
             Teen           954     41.370338
Arcade       Everyone       122     10.581093
             Matute         519     45.013010
             Teen           512     44.405898
Business     Everyone       412     31.164902
             Matute         698     52.798790
             Teen           212     16.036309
Social       Everyone       874     52.555622
             Matute         326     19.603127
             Teen           463     27.841251

Then you can just .unstack() once:

>>> df = df.unstack()
>>> df

             Installs             Installs (%)
Content      Everyone Matute Teen     Everyone     Matute       Teen
Category
Arcade            499    904  645    24.365234  44.140625  31.494141
Business          856    819  438    40.511122  38.760057  20.728822
Productivity      705    815  657    32.384015  37.436840  30.179146
Social            416    482  238    36.619718  42.429577  20.950704

And then bar plot the feature you want:

import matplotlib.pyplot as plt

fig, (ax, ax_percent) = plt.subplots(ncols=2, figsize=(14, 5))

# rot=0 keeps the category labels horizontal
df["Installs"].plot(kind="bar", rot=0, ax=ax)
ax.set_ylabel("Installs")

df["Installs (%)"].plot(kind="bar", rot=0, ax=ax_percent)
ax_percent.set_ylabel("Installs (%)")
ax_percent.set_ylim([0, 100])

plt.show()

grouped bar plot
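The question also asks for only the top categories, ranked by installs per upload, and notes that unstack() loses that order. One possible way to handle this is to compute the ranking separately and reapply it with .loc after unstacking. The sketch below assumes a hypothetical aggregated frame named cat with Installs and Uploads columns (all numbers are made up), and uses head(2) where the real data would use head(10).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import pandas as pd

# Hypothetical aggregated frame similar to the question's `Cat` (numbers made up).
cat = pd.DataFrame(
    {
        "Category": ["Arcade", "Arcade", "Business", "Social", "Social", "Social"],
        "Content": ["Everyone", "Teen", "Everyone", "Everyone", "Mature", "Teen"],
        "Installs": [900, 300, 500, 400, 100, 700],
        "Uploads": [30, 10, 50, 10, 5, 5],
    }
).set_index(["Category", "Content"])

# Pooled installs per upload for each category (sum / sum, not a mean of means).
totals = cat.groupby(level="Category")[["Installs", "Uploads"]].sum()
per_upload = totals["Installs"] / totals["Uploads"]
top = per_upload.sort_values(ascending=False).head(2).index  # head(10) on real data

# unstack() sorts categories alphabetically; .loc[top] restores the ranking.
wide = cat["Installs"].unstack("Content").loc[top]

# One group of bars per category; missing Content values simply have no bar.
ax = wide.plot(kind="bar", rot=0)
ax.set_ylabel("Installs")
```

In an interactive session, plt.show() (after importing matplotlib.pyplot as plt) or ax.figure.savefig(...) would display or save the figure.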

Answered By: paime

ChatGPT has answered my question

import pandas as pd
import matplotlib.pyplot as plt

# create a dictionary of data for the DataFrame
data = {
    'app_name': ['Google Maps', 'Uber', 'Waze', 'Spotify', 'Pandora'],
    'category': ['Navigation', 'Transportation', 'Navigation', 'Music', 'Music'],
    'rating': [4.5, 4.0, 4.5, 4.5, 4.0],
    'reviews': [1000000, 50000, 100000, 500000, 250000]
}

# create the DataFrame
df = pd.DataFrame(data)

# set the 'app_name' and 'category' columns as the index
df = df.set_index(['app_name', 'category'])

# add a new column called "content_rating" to the DataFrame, and assign a content rating to each app
df['content_rating'] = ['Everyone', 'Teen', 'Everyone', 'Everyone', 'Teen']

# Grouping the Data by category and content_rating and getting the mean of reviews
df_grouped = df.groupby(['category','content_rating']).agg({'reviews':'mean'})

# Reset the index to make it easier to plot
df_grouped = df_grouped.reset_index()

# Plotting the stacked bar chart
df_grouped.pivot(index='category', columns='content_rating', values='reviews').plot(kind='bar', stacked=True)

The code above uses a sample data set.

What I did was add a Sum column to the pivoted data and sort by it:

# Flatten the MultiIndex, then pivot to one row per Category and one column per Content
piv = qw1.reset_index()
piv = piv.pivot_table(index='Category', columns='Content', values='per')
# Row totals, used only for ranking the categories
piv["Sum"] = piv.sum(axis=1)
# Keep the 10 categories with the largest totals and drop the helper column
piv_10 = piv.sort_values(by="Sum", ascending=False)[["Adult", "Everyone", "Mature", "Teen"]].head(10)

where qw1 is the multi-index data frame.

Then all I had to do was plot it:

piv_10.plot.bar(stacked = True, logy = False)
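
For readers without the Kaggle data, the steps above can be tried on a small stand-in. This is a sketch under assumptions: the contents of qw1 and the meaning of its per column are not shown in this thread, so the frame below is invented, and the ranking is kept in a separate index instead of a Sum column so that the Content names do not have to be hard-coded.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import pandas as pd

# Invented stand-in for `qw1`: a MultiIndex frame with a numeric `per` column.
qw1 = pd.DataFrame(
    {
        "Category": ["Arcade", "Arcade", "Business", "Business", "Social"],
        "Content": ["Everyone", "Teen", "Everyone", "Mature", "Teen"],
        "per": [5.0, 3.0, 10.0, 2.0, 4.0],
    }
).set_index(["Category", "Content"])

# One row per Category, one column per Content; absent combinations become 0.
piv = (
    qw1.reset_index()
    .pivot_table(index="Category", columns="Content", values="per")
    .fillna(0)
)

# Rank categories by their row total without adding a helper column.
order = piv.sum(axis=1).sort_values(ascending=False).index
piv_top = piv.loc[order].head(2)  # head(10) on the real data

# Stacked bars: each bar is a category, each segment a Content value.
ax = piv_top.plot.bar(stacked=True, rot=0)
ax.set_ylabel("per")
```

Keeping the ranking outside the frame means the plotted table contains only Content columns, so nothing has to be dropped before calling plot.bar.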
Answered By: Dmitrii Ponomarev