Calculate the percent occurrence of values from summed rows and plot as bars
Question:
I have survey data based on 6 questions, where each column corresponds to one question (case), and each row corresponds to one survey respondent who answered all 6 questions. 1.0 means the respondent would give fluid, and 0.0 means they would not give fluid:
     case_1  case_2  case_3  case_4  case_5  case_6
0       0.0     0.0     0.0     0.0     0.0     1.0
1       1.0     1.0     1.0     0.0     0.0     1.0
2       1.0     1.0     0.0     1.0     1.0     1.0
3       0.0     0.0     1.0     0.0     1.0     1.0
4       0.0     0.0     1.0     1.0     1.0     0.0
..      ...     ...     ...     ...     ...     ...
517     1.0     1.0     0.0     1.0     0.0     1.0
518     0.0     0.0     1.0     1.0     1.0     1.0
519     1.0     1.0     1.0     0.0     1.0     1.0
520     1.0     0.0     0.0     0.0     1.0     0.0
521     0.0     1.0     0.0     1.0     1.0     1.0
I want to generate a bar plot that shows the percentage of respondents who never gave fluid, gave fluid in only one case, gave fluid in 2 cases, etc.
Answers:
You can sum across the columns (axis=1) to get the number of cases in which each respondent gave fluid, then use value_counts with normalize=True to get the proportion of respondents for each total (multiply by 100 for a percentage), and finally use pandas.Series.plot with kind='bar'.
import pandas as pd
import numpy as np # for sample data
# dummy data
np.random.seed(2)
df = pd.DataFrame(
    np.random.choice([0,1], size=(500,6)),
    columns=map(str,range(1,7))).add_prefix('case_')
print(df)
#    case_1  case_2  case_3  case_4  case_5  case_6
# 0       0       1       1       0       0       1
# 1       0       1       0       1       0       1
# 2       1       1       1       1       1       1
# 3       0       0       0       0       1       1
# 4       1       0       0       0       1       1
# 5       1       0       0       1       0       0
# 6       1       1       1       0       0       0
# sum each row, calculate the percent of respondents for each row total, and sort by total
per = df.sum(axis=1).value_counts(normalize=True).mul(100).sort_index()
# plot the percents
ax = per.plot(kind='bar', rot=0, figsize=(10,6),
              xlabel='Number of cases where respondent gave fluid',
              ylabel='Percent of Respondents', title='Percentage of respondents')
# Update the xtick labels and catch the output with _ = so it's not printed
_ = ax.set_xticklabels([f'{i.get_text()} case(s)' for i in ax.get_xticklabels()])
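Two details may matter with the real survey data, so here is a minimal sketch that handles them. Because the answers are stored as 1.0/0.0, the row totals come out as floats and the tick labels would read "3.0 case(s)"; casting the totals to int avoids that. Also, value_counts only returns totals that actually occur, so a total that no respondent reached would be missing from the plot; reindexing over 0 through the number of cases fills those in as 0 percent. The helper name percent_by_total is hypothetical, and the sample frame only holds the first five rows shown in the question.

```python
import pandas as pd


def percent_by_total(df: pd.DataFrame) -> pd.Series:
    """Percent of rows for every possible row total, from 0 to the number of columns."""
    n_cases = df.shape[1]
    # 1.0/0.0 answers sum to floats; cast so the index reads 0, 1, 2, ...
    totals = df.sum(axis=1).astype(int)
    per = totals.value_counts(normalize=True).mul(100)
    # value_counts omits totals that never occur; add them back as 0 percent
    return per.reindex(range(n_cases + 1), fill_value=0.0)


# first five rows shown in the question (illustration only)
sample = pd.DataFrame(
    [
        [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
        [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
        [1.0, 1.0, 0.0, 1.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
    ],
    columns=[f"case_{i}" for i in range(1, 7)],
)

per = percent_by_total(sample)
print(per)
```

The result can be passed to the same per.plot(kind='bar', ...) call as above; the bars then always cover 0 through 6 cases, even when some totals are empty.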