Calculation percentage and plot using pivot-table

Question

I need to plot a percentage graphic with a pivot table in pandas do you have any ideas? how I can calculate the percentage? thanks for your orientation!

This the code that I have :

df.pivot_table(index='obito', values=['asma', 'cardiopatia','diabetes','doenca_renal','obesidade']).T.plot(kind ='bar' , stacked = True)

This part of the dataframe that I have:

{'nome_munic': {66: 'Ferraz de Vasconcelos',
  97: 'São Paulo',
  100: 'São José dos Campos',
  207: 'Mauá',
  249: 'Cajamar',
  258: 'Votuporanga',
  285: 'Ferraz de Vasconcelos',
  290: 'São Paulo',
  345: 'São Pedro',
  378: 'São Paulo'},
 'codigo_ibge': {66: 3515707,
  97: 3550308,
  100: 3549904,
  207: 3529401,
  249: 3509205,
  258: 3557105,
  285: 3515707,
  290: 3550308,
  345: 3550407,
  378: 3550308},
 'idade': {66: 86,
  97: 62,
  100: 58,
  207: 54,
  249: 62,
  258: 37,
  285: 54,
  290: 71,
  345: 79,
  378: 61},
 'sexo': {66: 0,
  97: 0,
  100: 0,
  207: 1,
  249: 0,
  258: 1,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'obito': {66: 1,
  97: 0,
  100: 0,
  207: 1,
  249: 1,
  258: 1,
  285: 0,
  290: 1,
  345: 1,
  378: 0},
 'asma': {66: 0,
  97: 0,
  100: 0,
  207: 1,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'cardiopatia': {66: 1,
  97: 0,
  100: 1,
  207: 1,
  249: 1,
  258: 0,
  285: 1,
  290: 1,
  345: 0,
  378: 0},
 'diabetes': {66: 1,
  97: 1,
  100: 0,
  207: 0,
  249: 1,
  258: 1,
  285: 0,
  290: 0,
  345: 1,
  378: 0},
 'doenca_hematologica': {66: 0,
  97: 0,
  100: 0,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'doenca_hepatica': {66: 0,
  97: 0,
  100: 0,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'doenca_neurologica': {66: 0,
  97: 0,
  100: 0,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 1,
  345: 0,
  378: 0},
 'doenca_renal': {66: 0,
  97: 0,
  100: 0,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'imunodepressao': {66: 0,
  97: 0,
  100: 1,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'obesidade': {66: 0,
  97: 0,
  100: 0,
  207: 0,
  249: 1,
  258: 1,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'outros_fatores_de_risco': {66: 0,
  97: 0,
  100: 1,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 1},
 'pneumopatia': {66: 0,
  97: 1,
  100: 0,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'puerpera': {66: 0,
  97: 0,
  100: 0,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 0},
 'sindrome_de_down': {66: 0,
  97: 0,
  100: 0,
  207: 0,
  249: 0,
  258: 0,
  285: 0,
  290: 0,
  345: 0,
  378: 0}}

Asked By: Juan Carlos

||

Source

Answer 1

The default aggfunc is np.mean however it doesn’t compute the average per column or something like that, it computes the average of cells. In short it’s for every case the number of 1 divided by the number of 1 + the number of 0. This makes sense sometimes, but not for your data.

>>> df[['asma', 'cardiopatia','diabetes','doenca_renal','obesidade', 'obito']]
     asma  cardiopatia  diabetes  doenca_renal  obesidade  obito
66      0            1         1             0          0      1
97      0            0         1             0          0      0
100     0            1         0             0          0      0
207     1            1         0             0          0      1
249     0            1         1             0          1      1
258     0            0         1             0          1      1
285     0            1         0             0          0      0
290     0            1         0             0          0      1
345     0            0         1             0          0      1
378     0            0         0             0          0      0
>>> df.pivot_table(index='obito', values=['asma', 'cardiopatia','diabetes','doenca_renal','obesidade'])
           asma  cardiopatia  diabetes  doenca_renal  obesidade
obito                                                          
0      0.000000     0.500000  0.250000             0   0.000000
1      0.166667     0.666667  0.666667             0   0.333333

Instead you probably want to count the total of patients per cell, then divide by column totals:

>>> counts = df.pivot_table(index='obito', values=['asma', 'cardiopatia','diabetes','doenca_renal','obesidade'], aggfunc=np.sum)
>>> counts / counts.sum()
       asma  cardiopatia  diabetes  doenca_renal  obesidade
obito                                                      
0       0.0     0.333333       0.2           NaN        0.0
1       1.0     0.666667       0.8           NaN        1.0

Note how every column now sums to 1. doenca_renal is NaN because it has no patients at all in the sample, so the percentage is not defined. If you only want to plot the percentages that have obito=1, then you can do this:

>>> from matplotlib import ticker
>>> ax = (counts / counts.sum()).loc[1].plot.bar(rot=0)
>>> ax.yaxis.set_major_formatter(ticker.PercentFormatter(xmax=1))

If you want to annotate the percentages on the bar, see this other question

Answered By: Cimbali

Calculation percentage and plot using pivot-table

Question:

Answers: