Incorrect value and positioning of proportions

Question:

Why do some percentages not add up to 100% and how do I correct the positioning of the proportion value for the last bar?
The data is accessible from https://www.kaggle.com/datasets/mfaisalqureshi/hr-analytics-and-job-prediction?select=HR_comma_sep.csv.
Thanks!

[Image: stacked bar chart produced by the code below]


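import pandas as pd
import matplotlib.pyplot as plt

# df1 is assumed to hold the data from HR_comma_sep.csv
# Group the data by 'work_accident', 'salary', and 'left' columns and calculate the count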
grouped_data = df1.groupby(['work_accident', 'salary', 'left']).size().unstack().reset_index()

# Calculate the total count for each 'work_accident' and 'salary' combination
grouped_data['Total'] = grouped_data.sum(axis=1)

# Calculate the proportions by dividing the count by the total count
grouped_data['Stay'] = grouped_data[0] / grouped_data['Total']
grouped_data['Left'] = grouped_data[1] / grouped_data['Total']

# Plot the stacked bar chart
fig, ax = plt.subplots(figsize=(14, 4))
grouped_data[['Stay', 'Left']].plot(kind='bar', stacked=True, ax=ax)

# Set the labels and title
ax.set_xlabel('Work Accident and Salary')
ax.set_ylabel('Proportion')
ax.set_title('Relationship between Work Accident, Salary, and Employee Retention')

# Adding proportion values within each bar
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.annotate(f'{height:.4%}', (x + width / 2, y + height / 2), ha='center', va='center')

# Set the x-axis tick labels
x_labels = ['No Work Accident - Low', 'No Work Accident - Medium', 'No Work Accident - High',
            'Work Accident - Low', 'Work Accident - Medium', 'Work Accident - High']
ax.set_xticklabels(x_labels, rotation=45, ha='right')

# Set the legend
#ax.legend(['Stay', 'Left'], loc='upper right')
# Moving the legend outside the plot area to the top right
ax.legend(['Stay', 'Left'], loc='lower left', bbox_to_anchor=(1, 1.02))

# Show the plot
plt.show()
Asked By: gracenz

||

Answers:

Matplotlib 3.4 and later provide bar_label, which is the preferred way to label bars and replaces the manual annotations you are using. If you are on an older version, or simply want the existing code fixed, the following changes are enough:

  1. When grouped_data is created, one of the entries is NaN, so add .fillna(0) after unstacking. Replace the line that creates grouped_data with:
grouped_data = df1.groupby(['Work_accident', 'salary', 'left']).size().unstack().fillna(0).reset_index()
  2. grouped_data['Total'] is computed with sum(axis=1), which adds up every numeric column in each row. After reset_index(), that includes the work-accident column itself, so rows where it is 1 get a total that is too large by 1, and their proportions add up to less than 100%. Sum only the two count columns instead:
grouped_data['Total'] = grouped_data[[0,1]].sum(axis=1)
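
To see why the total is off, here is a minimal sketch on made-up data. The toy DataFrame is hypothetical and only stands in for HR_comma_sep.csv, and the names counts, naive_total and correct_total are introduced just for this illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for HR_comma_sep.csv (values are made up).
df = pd.DataFrame({
    "Work_accident": [0, 0, 0, 1, 1, 1],
    "salary": ["low", "low", "high", "low", "low", "high"],
    "left": [0, 1, 0, 0, 1, 0],
})

counts = (
    df.groupby(["Work_accident", "salary", "left"])
      .size()
      .unstack()      # columns 0 (stayed) and 1 (left)
      .fillna(0)      # a missing combination becomes a count of 0 instead of NaN
      .reset_index()  # Work_accident and salary become ordinary columns
)

# Summing every numeric column also adds the Work_accident flag itself.
naive_total = counts.sum(axis=1, numeric_only=True)

# Summing only the two count columns gives the real group size.
correct_total = counts[[0, 1]].sum(axis=1)

# The difference is exactly the Work_accident value of each row.
print((naive_total - correct_total).tolist())
```

The rows with Work_accident equal to 1 are the only ones whose naive total is too large, which matches the bars that fall short of 100% in the question.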

To hide labels that are 0%, add an if condition before adding the annotation, like this:

    if height > 0: ## New IF
        ax.annotate(f'{height:.4%}', (x + width / 2, y + height / 2), ha='center', va='center')

This gives the plot below.

[Image: corrected stacked bar chart]
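
For Matplotlib 3.4 or later, the bar_label approach mentioned at the start of this answer could look roughly like the sketch below. The props DataFrame holds hypothetical proportions in place of grouped_data[['Stay', 'Left']], and the two-decimal format is just a choice made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical proportions; in the question these would be grouped_data[['Stay', 'Left']].
props = pd.DataFrame({"Stay": [0.70, 0.75, 1.00], "Left": [0.30, 0.25, 0.00]})

fig, ax = plt.subplots(figsize=(6, 3))
props.plot(kind="bar", stacked=True, ax=ax)

# Each container holds the bar segments of one column ('Stay', then 'Left').
for container in ax.containers:
    # An empty string suppresses the label for zero-height segments.
    labels = [f"{h:.2%}" if h > 0 else "" for h in container.datavalues]
    ax.bar_label(container, labels=labels, label_type="center")

# Call plt.show() to display the figure.
```

With label_type="center", bar_label places each label in the middle of its segment, so no manual x and y arithmetic is needed.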

Answered By: Redox