# How to make a pandas crosstab with percentages?

## Question:

Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?

``````import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D': np.random.randn(24),
                   'E': np.random.randn(24)})

pd.crosstab(df.A,df.B)

B       A    B    C
A
one     4    4    4
three   2    2    2
two     2    2    2
``````

Using the `margins` option in `crosstab` to compute row and column totals gets us close enough to think that it should be possible using an `aggfunc` or `groupby`, but my meager brain can’t think it through. The output I’m after looks like this:

``````B       A     B    C
A
one     .33  .33  .33
three   .33  .33  .33
two     .33  .33  .33
``````

## Answers:

``````pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)
``````

The lambda computes `row / row.sum()`, and `apply` with `axis=1` runs it on each row.

(If doing this in Python 2, you should use `from __future__ import division` to make sure division always returns a float.)
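
To check that the `apply` approach does what it claims, here is a minimal, self-contained sketch. It assumes Python 3 and keeps only the question's `A` and `B` columns; the names `counts` and `row_share` are illustrative, not from the original answer.

```python
import pandas as pd

# Same categorical columns as in the question; the random D and E columns are omitted.
df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

counts = pd.crosstab(df.A, df.B)

# Divide every row by its own total, so each row becomes a set of shares.
row_share = counts.apply(lambda r: r / r.sum(), axis=1)

print(row_share)
# Sanity check: the shares in each row should add up to 1.
print(row_share.sum(axis=1))
```

Passing `axis=0` instead would apply the same lambda to each column and give column shares.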

Another option is to use `div` rather than `apply`:

``````In [11]: res = pd.crosstab(df.A, df.B)
``````

First compute the row totals to divide by:

``````In [12]: res.sum(axis=1)
Out[12]:
A
one      12
three     6
two       6
dtype: int64
``````

As above, you need to guard against integer division (I use `astype('float')`), then divide each row by its total:

``````In [13]: res.astype('float').div(res.sum(axis=1), axis=0)
Out[13]:
B             A         B         C
A
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333
``````
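
The `axis` argument of `div` decides how the totals are lined up with the table, which is easy to get backwards. A minimal sketch under Python 3, reusing the question's `A` and `B` columns (the names `by_row` and `by_col` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

res = pd.crosstab(df.A, df.B)

# Row shares: row totals are indexed by A, so align them with the index (axis=0).
by_row = res.div(res.sum(axis=1), axis=0)

# Column shares: column totals are indexed by B, so align them with the columns (axis=1).
by_col = res.div(res.sum(axis=0), axis=1)

print(by_row)
print(by_col)
```

With this data, `one` accounts for half of every column, so `by_col` differs from `by_row` even though the counts look symmetric.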

If you’re looking for a percentage of the total, you can divide by `len(df)` instead of the row sum:

``````pd.crosstab(df.A, df.B).apply(lambda r: r/len(df), axis=1)
``````
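
Dividing by `len(df)` relies on every row of `df` being counted in the table. If some rows were left out of the crosstab (for instance because of missing values), the two numbers would no longer match, so a slightly more defensive sketch divides by the table's own grand total. The names `counts`, `total` and `share_of_total` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

counts = pd.crosstab(df.A, df.B)

# Grand total of the table itself, rather than the length of the source frame.
total = counts.to_numpy().sum()

# Each cell as a share of all counted observations.
share_of_total = counts / total

print(share_of_total)
```

For the question's data both approaches agree, since all 24 rows are counted.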

From pandas 0.18.1 onwards, there’s a `normalize` option:

``````In [1]: pd.crosstab(df.A,df.B, normalize='index')
Out[1]:

B              A           B           C
A
one     0.333333    0.333333    0.333333
three   0.333333    0.333333    0.333333
two     0.333333    0.333333    0.333333
``````

You can normalise over `all` (the grand total), `index` (each row), or `columns` (each column).

More details are available in the documentation.
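
As a quick comparison of the three settings, the sketch below builds all three tables from the question's `A` and `B` columns. It assumes pandas 0.18.1 or later; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

# Every cell as a share of the grand total: the whole table sums to 1.
by_all = pd.crosstab(df.A, df.B, normalize="all")

# Every cell as a share of its row: each row sums to 1.
by_row = pd.crosstab(df.A, df.B, normalize="index")

# Every cell as a share of its column: each column sums to 1.
by_col = pd.crosstab(df.A, df.B, normalize="columns")

print(by_all, by_row, by_col, sep="\n\n")
```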

We can show it as percentages by multiplying by `100`:

``````pd.crosstab(df.A, df.B, normalize='index').round(4) * 100

B          A      B      C
A
one    33.33  33.33  33.33
three  33.33  33.33  33.33
two    33.33  33.33  33.33
``````

I’ve rounded to four decimal places before multiplying, so the percentages show two decimals.
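
If the table is only for display, another possibility is to format each cell as a percentage string. This is a sketch rather than part of the original answer: it uses Python's `"{:.2%}"` format spec, and the names `shares` and `pct_text` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "three"] * 6,
    "B": ["A", "B", "C"] * 8,
})

shares = pd.crosstab(df.A, df.B, normalize="index")

# Format every cell as a percentage string with two decimals.
# The result holds text, so use it for display only, not for further arithmetic.
pct_text = shares.apply(lambda col: col.map("{:.2%}".format))

print(pct_text)
```

The numeric table in `shares` stays untouched, so it can still be used for calculations.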

Normalizing over the index does the job: pass `normalize="index"` to `pd.crosstab()`.
