Calculate the percent occurrence of values from summed rows and plot as bars

Question:

I have survey data based on 6 questions, where each column corresponds to one question (case), and each row corresponds to one survey respondent who answered all 6 questions. 1.0 means the respondent would give fluid, and 0.0 means they would not give fluid:

    case_1  case_2  case_3  case_4  case_5  case_6
0   0.0     0.0     0.0     0.0     0.0     1.0
1   1.0     1.0     1.0     0.0     0.0     1.0
2   1.0     1.0     0.0     1.0     1.0     1.0
3   0.0     0.0     1.0     0.0     1.0     1.0
4   0.0     0.0     1.0     1.0     1.0     0.0
... ... ... ... ... ... ...
517 1.0     1.0     0.0     1.0     0.0     1.0
518 0.0     0.0     1.0     1.0     1.0     1.0
519 1.0     1.0     1.0     0.0     1.0     1.0
520 1.0     0.0     0.0     0.0     1.0     0.0
521 0.0     1.0     0.0     1.0     1.0     1.0

I want to generate the following plot, which shows the percentage of respondents who never gave fluid, gave fluid in only one case, gave fluid in 2 cases, and so on.

[image: desired bar plot]

Asked By: hulio_entredas


Answers:

You can sum across the columns (axis=1) to get the number of cases in which each respondent gave fluid, then use value_counts with normalize=True to get the proportion of respondents for each total (multiply by 100 for a percentage), and finally use pandas.Series.plot with kind='bar'.

import pandas as pd
import numpy as np  # for sample data

# dummy data
np.random.seed(2)
df = pd.DataFrame(
    np.random.choice([0,1], size=(500,6)),
    columns=map(str,range(1,7))).add_prefix('case_')

print(df)
#      case_1  case_2  case_3  case_4  case_5  case_6
# 0         0       1       1       0       0       1
# 1         0       1       0       1       0       1
# 2         1       1       1       1       1       1
# 3         0       0       0       0       1       1
# 4         1       0       0       0       1       1
# 5         1       0       0       1       0       0
# 6         1       1       1       0       0       0


# sum each row, get the percent of respondents for each total, and sort by total
per = df.sum(axis=1).value_counts(normalize=True).mul(100).sort_index()

# plot the percents
ax = per.plot(kind='bar', rot=0, figsize=(10,6),
              xlabel='Number of cases where respondent gave fluid',
              ylabel='Percent of Respondents',
              title='Percentage of respondents')

# Update the xtick labels and catch the output with _ = so it's not printed
_ = ax.set_xticklabels([f'{i.get_text()} case(s)' for i in ax.get_xticklabels()])

[image: resulting bar plot]
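One thing to watch for with real data: value_counts only returns totals that actually occur, so a total nobody gave (for example, 0 cases) would be missing from the plot. Below is a minimal sketch on the first five rows shown in the question, assuming you want a bar for every total from 0 to 6. The astype(int) cast and the reindex step are my additions, not something the question asked for.

```python
import pandas as pd

# first five rows from the question (values are floats there)
df = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 1.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]],
    columns=[f'case_{i}' for i in range(1, 7)])

# cast the row totals to int so the labels read "3" rather than "3.0"
totals = df.sum(axis=1).astype(int)

# percent per total, with totals that never occur filled in as 0
per = (totals.value_counts(normalize=True)
             .mul(100)
             .reindex(range(df.shape[1] + 1), fill_value=0.0))

print(per)
```

per can then be passed to the same per.plot(kind='bar', ...) call as above. If you would rather not show empty bars, drop the reindex and keep sort_index().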

Answered By: Ben.T