How can I pivot a dataframe?
Question:
- What is pivot?
- How do I pivot?
- Long format to wide format?
I’ve seen a lot of questions that ask about pivot tables, even if they don’t know it. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting… But I’m going to give it a go.
The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it’s a daunting task). Look at a few examples from my Google search:
- How to pivot a dataframe in Pandas? – Good question and answer. But the answer only answers the specific question with little explanation.
- pandas pivot table to data frame – OP is concerned with the output of the pivot, namely how the columns look. OP wanted it to look like R. This isn’t very helpful for pandas users.
- pandas pivoting a dataframe, duplicate rows – Another decent question but the answer focuses on one method, namely
pd.DataFrame.pivot
Setup
I conspicuously named my columns and relevant column values to correspond with how I’m going to pivot in the answers below.
import numpy as np
import pandas as pd
from numpy.core.defchararray import add
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
Questions
-
Why do I get ValueError: Index contains duplicate entries, cannot reshape
?
-
How do I pivot df
such that the col
values are columns, row
values are the index, and mean of val0
are the values?
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
-
How do I make it so that missing values are 0
?
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
-
Can I get something other than mean
, like maybe sum
?
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24
-
Can I do more that one aggregation at a time?
sum mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24
-
Can I aggregate over multiple value columns?
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
-
Can I subdivide by multiple columns?
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
-
Or
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
-
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
-
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,
np.random.seed([3, 1415])
df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
df2
A B
0 a 0
1 a 11
2 a 2
3 a 11
4 b 10
5 b 10
6 b 14
7 c 7
The expected should look something like
a b c
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
-
How do I flatten the multiple index to single index after pivot
?
From
1 2
1 1 2
a 2 1 1
b 2 1 0
c 1 0 0
To
1|1 2|1 2|2
a 2 1 1
b 2 1 0
c 1 0 0
Answers:
Here is a list of idioms we can use to pivot
-
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And it is the intended approach by the developers.
- Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
-
pd.DataFrame.groupby
+ pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
-
pd.DataFrame.set_index
+ pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
-
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index
, columns
, values
.
- Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
-
- This a specialized version of
pivot_table
and in its purest form is the most intuitive way to perform several tasks.
-
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
-
pd.get_dummies
+ pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
See also:
- Reshaping and pivot tables — pandas User Guide
Question 1
Why do I get ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys on which it is being asked to pivot. For example: Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Examples
What I’m going to do for each subsequent question is to answer it using pd.DataFrame.pivot_table
. Then I’ll provide alternatives to perform the same task.
Questions 2 and 3
How do I pivot df
such that the col
values are columns, row
values are the index, and mean of val0
are the values?
-
df.pivot_table(
values='val0', index='row', columns='col',
aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
aggfunc='mean'
is the default and I didn’t have to set it. I included it to be explicit.
How do I make it so that missing values are 0?
-
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to 0
.
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
-
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
-
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
Question 4
Can I get something other than mean
, like maybe sum
?
-
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24
-
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
-
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Question 5
Can I do more that one aggregation at a time?
Notice that for pivot_table
and crosstab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
-
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24
-
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
-
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
-
pd.DataFrame.pivot_table
we pass values=['val0', 'val1']
but we could’ve left that off completely
df.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
-
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
Can I subdivide by multiple columns?
-
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
-
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
Can I subdivide by multiple columns?
-
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
-
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
-
pd.DataFrame.set_index
because the set of keys are unique for both rows and columns
df.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
-
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
-
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
-
pd.crosstab(df['row'], df['col'])
-
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1
-
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
Question 10
How do I convert a DataFrame from long to wide by pivoting on ONLY two
columns?
-
The first step is to assign a number to each row – this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount
:
df2.insert(0, 'count', df2.groupby('A').cumcount())
df2
count A B
0 0 a 0
1 1 a 11
2 2 a 2
3 3 a 11
4 0 b 10
5 1 b 10
6 2 b 14
7 0 c 7
The second step is to use the newly created column as the index to call DataFrame.pivot
.
df2.pivot(*df2)
# df2.pivot(index='count', columns='A', values='B')
A a b c
count
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
-
Whereas DataFrame.pivot
only accepts columns, DataFrame.pivot_table
also accepts arrays, so the GroupBy.cumcount
can be passed directly as the index
without creating an explicit column.
df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')
A a b c
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
Question 11
How do I flatten the multiple index to single index after pivot
If columns
type object
with string join
df.columns = df.columns.map('|'.join)
else format
df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
To extend @piRSquared’s answer another version of Question 10
Question 10.1
DataFrame:
d = data = {'A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 5},
'B': {0: 'a', 1: 'b', 2: 'c', 3: 'a', 4: 'b', 5: 'a', 6: 'c'}}
df = pd.DataFrame(d)
A B
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 3 a
6 5 c
Output:
0 1 2
A
1 a b c
2 a b None
3 a None None
5 c None None
Using df.groupby
and pd.Series.tolist
t = df.groupby('A')['B'].apply(list)
out = pd.DataFrame(t.tolist(),index=t.index)
out
0 1 2
A
1 a b c
2 a b None
3 a None None
5 c None None
Or
A much better alternative using pd.pivot_table
with df.squeeze.
t = df.pivot_table(index='A',values='B',aggfunc=list).squeeze()
out = pd.DataFrame(t.tolist(),index=t.index)
To better understand how the function pivot works you can look at the example from Pandas documentation. However pivot
will fail if you have repeating index-columns (foo
–bar
) combinations (like df
in the second example):

In opposite to pivot
the function pivot_table supports data aggregation using the mean
function by default. Here is an example with the sum
aggregation function:
Call reset_index()
(along with add_suffix()
)
Oftentimes, reset_index()
is needed after you call pivot_table
or pivot
. For example, to make the following transformation (where one column became column names)
you use the following code, where after pivot
, you add prefix to the newly created column names and convert the index (in this case "movies"
) back into a column and remove the name of the axis name:
df.pivot(*df).add_prefix('week_').reset_index().rename_axis(columns=None)
As the other answers mentioned, "pivot" may refer to 2 different operations:
- Unstacked aggregation (i.e. make the results of
groupby.agg
wider.)
- Reshaping (similar to pivot in Excel,
reshape
in numpy or pivot_wider
in R)
1. Aggregation
pivot_table
or crosstab
are simply unstacked results of groupby.agg
operation. In fact, the source code shows that, under the hood, the following are true:
pivot_table
= groupby
+ unstack
(read here for more info.)
crosstab
= pivot_table
N.B. You can use list of column names as index
, columns
and values
arguments.
df.groupby(rows+cols)[vals].agg(aggfuncs).unstack(cols)
# equivalently,
df.pivot_table(vals, rows, cols, aggfuncs)
1.1. crosstab
is a special case of pivot_table
; thus of groupby
+ unstack
The following are equivalent:
pd.crosstab(df['colA'], df['colB'])
df.pivot_table(index='colA', columns='colB', aggfunc='size', fill_value=0)
df.groupby(['colA', 'colB']).size().unstack(fill_value=0)
Note that pd.crosstab
has a significantly larger overhead, so it’s significantly slower than both pivot_table
and groupby
+ unstack
. In fact, as noted here, pivot_table
is slower than groupby
+ unstack
as well.
2. Reshaping
pivot
is a more limited version of pivot_table
where its purpose is to reshape a long dataframe into a long one.
df.set_index(rows+cols)[vals].unstack(cols)
# equivalently,
df.pivot(rows, cols, vals)
2.1. Augment rows/columns as in Question 10
You can also apply the insight from Question 10 to multi-column pivot operation as well. There are two cases:
-
"long-to-long": reshape by augmenting the indices
Code:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [*'xxyyzz'],
'C': [*'CCDCDD'], 'E': [100, 200, 300, 400, 500, 600]})
rows, cols, vals = ['A', 'B'], ['C'], 'E'
# using pivot syntax
df1 = (
df.assign(ix=df.groupby(rows+cols).cumcount())
.pivot([*rows, 'ix'], cols, vals)
.fillna(0, downcast='infer')
.droplevel(-1).reset_index().rename_axis(columns=None)
)
# equivalently, using set_index + unstack syntax
df1 = (
df
.set_index([*rows, df.groupby(rows+cols).cumcount(), *cols])[vals]
.unstack(fill_value=0)
.droplevel(-1).reset_index().rename_axis(columns=None)
)
-
"long-to-wide": reshape by augmenting the columns
Code:
df1 = (
df.assign(ix=df.groupby(rows+cols).cumcount())
.pivot(rows, [*cols, 'ix'])[vals]
.fillna(0, downcast='infer')
)
df1 = df1.set_axis([f"{c[0]}_{c[1]}" for c in df1], axis=1).reset_index()
# equivalently, using the set_index + unstack syntax
df1 = (
df
.set_index([*rows, df.groupby(rows+cols).cumcount(), *cols])[vals]
.unstack([-1, *range(-2, -len(cols)-2, -1)], fill_value=0)
)
df1 = df1.set_axis([f"{c[0]}_{c[1]}" for c in df1], axis=1).reset_index()
-
minimum case using the set_index
+ unstack
syntax:
Code:
df1 = df.set_index(['A', df.groupby('A').cumcount()])['E'].unstack(fill_value=0).add_prefix('Col').reset_index()
1 pivot_table()
aggregates the values and unstacks it. Specifically, it creates a single flat list out of index and columns, calls groupby()
with this list as the grouper and aggregates using the passed aggregator methods (the default is mean
). Then after aggregation, it calls unstack()
by the list of columns. So internally, pivot_table = groupby + unstack. Moreover, if fill_value
is passed, fillna()
is called.
In other words, the method that produces pv_1
is the same as the method that produces gb_1
in the example below.
pv_1 = df.pivot_table(index=rows, columns=cols, values=vals, aggfunc=aggfuncs, fill_value=0)
# internal operation of `pivot_table()`
gb_1 = df.groupby(rows+cols)[vals].agg(aggfuncs).unstack(cols).fillna(0, downcast="infer")
pv_1.equals(gb_1) # True
2 crosstab()
calls pivot_table()
, i.e., crosstab = pivot_table. Specifically, it builds a DataFrame out of the passed arrays of values, filters it by the common indices and calls pivot_table()
. It’s more limited than pivot_table()
because it only allows a one-dimensional array-like as values
, unlike pivot_table()
that can have multiple columns as values
.
The pivot function in pandas has the same functionality as the pivot operation in excel. We can transform a dataset from a long format to a wide format.
Lets have a example
We want to convert the dataset into a form such that each country becomes a column and the new confirmed cases as values corresponding to the countries. We can perform this data manipulation using the pivot function.
Pivot the dataset
pivot_df = pd.pivot(df, index =['Date'], columns ='Country', values =['NewConfirmed'])
## renaming the columns
pivot_df.columns = df['Country'].sort_values().unique()
We can bring the new columns to the same level as the index column Data by resetting the index.
reset the index to modify the column levels
pivot_df = pivot_df.reset_index()
- What is pivot?
- How do I pivot?
- Long format to wide format?
I’ve seen a lot of questions that ask about pivot tables, even if they don’t know it. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting… But I’m going to give it a go.
The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it’s a daunting task). Look at a few examples from my Google search:
- How to pivot a dataframe in Pandas? – Good question and answer. But the answer only answers the specific question with little explanation.
- pandas pivot table to data frame – OP is concerned with the output of the pivot, namely how the columns look. OP wanted it to look like R. This isn’t very helpful for pandas users.
- pandas pivoting a dataframe, duplicate rows – Another decent question but the answer focuses on one method, namely
pd.DataFrame.pivot
Setup
I conspicuously named my columns and relevant column values to correspond with how I’m going to pivot in the answers below.
import numpy as np
import pandas as pd
from numpy.core.defchararray import add
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
Questions
-
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
? -
How do I pivot
df
such that thecol
values are columns,row
values are the index, and mean ofval0
are the values?col col0 col1 col2 col3 col4 row row0 0.77 0.605 NaN 0.860 0.65 row2 0.13 NaN 0.395 0.500 0.25 row3 NaN 0.310 NaN 0.545 NaN row4 NaN 0.100 0.395 0.760 0.24
-
How do I make it so that missing values are
0
?col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
-
Can I get something other than
mean
, like maybesum
?col col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 row2 0.13 0.00 0.79 0.50 0.50 row3 0.00 0.31 0.00 1.09 0.00 row4 0.00 0.10 0.79 1.52 0.24
-
Can I do more that one aggregation at a time?
sum mean col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24
-
Can I aggregate over multiple value columns?
val0 val1 col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02 row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79 row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00 row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
-
Can I subdivide by multiple columns?
item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 row row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65 row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00 row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
-
Or
item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 key row key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00 row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00 row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00 key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65 row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13 row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00 row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00 row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
-
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4 row row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
-
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,
np.random.seed([3, 1415]) df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)}) df2 A B 0 a 0 1 a 11 2 a 2 3 a 11 4 b 10 5 b 10 6 b 14 7 c 7
The expected should look something like
a b c 0 0.0 10.0 7.0 1 11.0 10.0 NaN 2 2.0 14.0 NaN 3 11.0 NaN NaN
-
How do I flatten the multiple index to single index after
pivot
?From
1 2 1 1 2 a 2 1 1 b 2 1 0 c 1 0 0
To
1|1 2|1 2|2 a 2 1 1 b 2 1 0 c 1 0 0
Here is a list of idioms we can use to pivot
-
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And it is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- A glorified version of
-
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
-
pd.DataFrame.set_index
+pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
-
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- Very similar to
-
- This a specialized version of
pivot_table
and in its purest form is the most intuitive way to perform several tasks.
- This a specialized version of
-
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
-
pd.get_dummies
+pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
See also:
- Reshaping and pivot tables — pandas User Guide
Question 1
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys on which it is being asked to pivot. For example: Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Examples
What I’m going to do for each subsequent question is to answer it using pd.DataFrame.pivot_table
. Then I’ll provide alternatives to perform the same task.
Questions 2 and 3
How do I pivot
df
such that thecol
values are columns,row
values are the index, and mean ofval0
are the values?
-
df.pivot_table( values='val0', index='row', columns='col', aggfunc='mean') col col0 col1 col2 col3 col4 row row0 0.77 0.605 NaN 0.860 0.65 row2 0.13 NaN 0.395 0.500 0.25 row3 NaN 0.310 NaN 0.545 NaN row4 NaN 0.100 0.395 0.760 0.24
aggfunc='mean'
is the default and I didn’t have to set it. I included it to be explicit.
How do I make it so that missing values are 0?
-
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
.
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='mean') col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
-
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
-
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc='mean').fillna(0)
Question 4
Can I get something other than
mean
, like maybesum
?
-
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='sum') col col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 row2 0.13 0.00 0.79 0.50 0.50 row3 0.00 0.31 0.00 1.09 0.00 row4 0.00 0.10 0.79 1.52 0.24
-
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
-
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc='sum').fillna(0)
Question 5
Can I do more that one aggregation at a time?
Notice that for pivot_table
and crosstab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
-
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc=[np.size, np.mean]) size mean col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65 row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25 row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00 row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24
-
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
-
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
-
pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could’ve left that off completelydf.pivot_table( values=['val0', 'val1'], index='row', columns='col', fill_value=0, aggfunc='mean') val0 val1 col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02 row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79 row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00 row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
-
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
Can I subdivide by multiple columns?
-
df.pivot_table( values='val0', index='row', columns=['item', 'col'], fill_value=0, aggfunc='mean') item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 row row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65 row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00 row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
-
df.groupby( ['row', 'item', 'col'] )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
Can I subdivide by multiple columns?
-
df.pivot_table( values='val0', index=['key', 'row'], columns=['item', 'col'], fill_value=0, aggfunc='mean') item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 key row key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00 row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00 row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00 key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65 row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13 row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00 row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00 row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
-
df.groupby( ['key', 'row', 'item', 'col'] )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
-
pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index( ['key', 'row', 'item', 'col'] )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
-
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size') col col0 col1 col2 col3 col4 row row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
-
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
-
pd.crosstab(df['row'], df['col'])
-
# get integer factorization `i` and unique values `r` # for column `'row'` i, r = pd.factorize(df['row'].values) # get integer factorization `j` and unique values `c` # for column `'col'` j, c = pd.factorize(df['col'].values) # `n` will be the number of rows # `m` will be the number of columns n, m = r.size, c.size # `i * m + j` is a clever way of counting the # factorization bins assuming a flat array of length # `n * m`. Which is why we subsequently reshape as `(n, m)` b = np.bincount(i * m + j, minlength=n * m).reshape(n, m) # BTW, whenever I read this, I think 'Bean, Rice, and Cheese' pd.DataFrame(b, r, c) col3 col2 col0 col1 col4 row3 2 0 0 1 0 row2 1 2 1 0 2 row0 1 0 1 2 1 row4 2 2 0 1 1
-
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col'])) col0 col1 col2 col3 col4 row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
Question 10
How do I convert a DataFrame from long to wide by pivoting on ONLY two
columns?
-
The first step is to assign a number to each row – this number will be the row index of that value in the pivoted result. This is done using
GroupBy.cumcount
:df2.insert(0, 'count', df2.groupby('A').cumcount()) df2 count A B 0 0 a 0 1 1 a 11 2 2 a 2 3 3 a 11 4 0 b 10 5 1 b 10 6 2 b 14 7 0 c 7
The second step is to use the newly created column as the index to call
DataFrame.pivot
.df2.pivot(*df2) # df2.pivot(index='count', columns='A', values='B') A a b c count 0 0.0 10.0 7.0 1 11.0 10.0 NaN 2 2.0 14.0 NaN 3 11.0 NaN NaN
-
Whereas
DataFrame.pivot
only accepts columns,DataFrame.pivot_table
also accepts arrays, so theGroupBy.cumcount
can be passed directly as theindex
without creating an explicit column.df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B') A a b c 0 0.0 10.0 7.0 1 11.0 10.0 NaN 2 2.0 14.0 NaN 3 11.0 NaN NaN
Question 11
How do I flatten the multiple index to single index after
pivot
If columns
type object
with string join
df.columns = df.columns.map('|'.join)
else format
df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
To extend @piRSquared’s answer another version of Question 10
Question 10.1
DataFrame:
d = data = {'A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 5},
'B': {0: 'a', 1: 'b', 2: 'c', 3: 'a', 4: 'b', 5: 'a', 6: 'c'}}
df = pd.DataFrame(d)
A B
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 3 a
6 5 c
Output:
0 1 2
A
1 a b c
2 a b None
3 a None None
5 c None None
Using df.groupby
and pd.Series.tolist
t = df.groupby('A')['B'].apply(list)
out = pd.DataFrame(t.tolist(),index=t.index)
out
0 1 2
A
1 a b c
2 a b None
3 a None None
5 c None None
Or
A much better alternative using pd.pivot_table
with df.squeeze.
t = df.pivot_table(index='A',values='B',aggfunc=list).squeeze()
out = pd.DataFrame(t.tolist(),index=t.index)
To better understand how the function pivot works you can look at the example from Pandas documentation. However pivot
will fail if you have repeating index-columns (foo
–bar
) combinations (like df
in the second example):
In opposite to pivot
the function pivot_table supports data aggregation using the mean
function by default. Here is an example with the sum
aggregation function:
Call reset_index()
(along with add_suffix()
)
Oftentimes, reset_index()
is needed after you call pivot_table
or pivot
. For example, to make the following transformation (where one column became column names)
you use the following code, where after pivot
, you add prefix to the newly created column names and convert the index (in this case "movies"
) back into a column and remove the name of the axis name:
df.pivot(*df).add_prefix('week_').reset_index().rename_axis(columns=None)
As the other answers mentioned, "pivot" may refer to 2 different operations:
- Unstacked aggregation (i.e. make the results of
groupby.agg
wider.) - Reshaping (similar to pivot in Excel,
reshape
in numpy orpivot_wider
in R)
1. Aggregation
pivot_table
or crosstab
are simply unstacked results of groupby.agg
operation. In fact, the source code shows that, under the hood, the following are true:
pivot_table
=groupby
+unstack
(read here for more info.)crosstab
=pivot_table
N.B. You can use list of column names as index
, columns
and values
arguments.
df.groupby(rows+cols)[vals].agg(aggfuncs).unstack(cols)
# equivalently,
df.pivot_table(vals, rows, cols, aggfuncs)
1.1. crosstab
is a special case of pivot_table
; thus of groupby
+ unstack
The following are equivalent:
pd.crosstab(df['colA'], df['colB'])
df.pivot_table(index='colA', columns='colB', aggfunc='size', fill_value=0)
df.groupby(['colA', 'colB']).size().unstack(fill_value=0)
Note that pd.crosstab
has a significantly larger overhead, so it’s significantly slower than both pivot_table
and groupby
+ unstack
. In fact, as noted here, pivot_table
is slower than groupby
+ unstack
as well.
2. Reshaping
pivot
is a more limited version of pivot_table
where its purpose is to reshape a long dataframe into a long one.
df.set_index(rows+cols)[vals].unstack(cols)
# equivalently,
df.pivot(rows, cols, vals)
2.1. Augment rows/columns as in Question 10
You can also apply the insight from Question 10 to multi-column pivot operation as well. There are two cases:
-
"long-to-long": reshape by augmenting the indices
Code:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [*'xxyyzz'], 'C': [*'CCDCDD'], 'E': [100, 200, 300, 400, 500, 600]}) rows, cols, vals = ['A', 'B'], ['C'], 'E' # using pivot syntax df1 = ( df.assign(ix=df.groupby(rows+cols).cumcount()) .pivot([*rows, 'ix'], cols, vals) .fillna(0, downcast='infer') .droplevel(-1).reset_index().rename_axis(columns=None) ) # equivalently, using set_index + unstack syntax df1 = ( df .set_index([*rows, df.groupby(rows+cols).cumcount(), *cols])[vals] .unstack(fill_value=0) .droplevel(-1).reset_index().rename_axis(columns=None) )
-
"long-to-wide": reshape by augmenting the columns
Code:
df1 = ( df.assign(ix=df.groupby(rows+cols).cumcount()) .pivot(rows, [*cols, 'ix'])[vals] .fillna(0, downcast='infer') ) df1 = df1.set_axis([f"{c[0]}_{c[1]}" for c in df1], axis=1).reset_index() # equivalently, using the set_index + unstack syntax df1 = ( df .set_index([*rows, df.groupby(rows+cols).cumcount(), *cols])[vals] .unstack([-1, *range(-2, -len(cols)-2, -1)], fill_value=0) ) df1 = df1.set_axis([f"{c[0]}_{c[1]}" for c in df1], axis=1).reset_index()
-
minimum case using the
set_index
+unstack
syntax:Code:
df1 = df.set_index(['A', df.groupby('A').cumcount()])['E'].unstack(fill_value=0).add_prefix('Col').reset_index()
1 pivot_table()
aggregates the values and unstacks it. Specifically, it creates a single flat list out of index and columns, calls groupby()
with this list as the grouper and aggregates using the passed aggregator methods (the default is mean
). Then after aggregation, it calls unstack()
by the list of columns. So internally, pivot_table = groupby + unstack. Moreover, if fill_value
is passed, fillna()
is called.
In other words, the method that produces pv_1
is the same as the method that produces gb_1
in the example below.
pv_1 = df.pivot_table(index=rows, columns=cols, values=vals, aggfunc=aggfuncs, fill_value=0)
# internal operation of `pivot_table()`
gb_1 = df.groupby(rows+cols)[vals].agg(aggfuncs).unstack(cols).fillna(0, downcast="infer")
pv_1.equals(gb_1) # True
2 crosstab()
calls pivot_table()
, i.e., crosstab = pivot_table. Specifically, it builds a DataFrame out of the passed arrays of values, filters it by the common indices and calls pivot_table()
. It’s more limited than pivot_table()
because it only allows a one-dimensional array-like as values
, unlike pivot_table()
that can have multiple columns as values
.
The pivot function in pandas has the same functionality as the pivot operation in excel. We can transform a dataset from a long format to a wide format.
Lets have a example
We want to convert the dataset into a form such that each country becomes a column and the new confirmed cases as values corresponding to the countries. We can perform this data manipulation using the pivot function.
Pivot the dataset
pivot_df = pd.pivot(df, index =['Date'], columns ='Country', values =['NewConfirmed'])
## renaming the columns
pivot_df.columns = df['Country'].sort_values().unique()
We can bring the new columns to the same level as the index column Data by resetting the index.
reset the index to modify the column levels
pivot_df = pivot_df.reset_index()