Pandas pivot warning about repeated entries on index
Question:
On Pandas documentation of the pivot
method, we have:
Examples
--------
>>> df
foo bar baz
0 one A 1.
1 one B 2.
2 one C 3.
3 two A 4.
4 two B 5.
5 two C 6.
>>> df.pivot('foo', 'bar', 'baz')
A B C
one 1 2 3
two 4 5 6
My DataFrame
is structured like this:
name id x
----------------------
0 john 1 0
1 john 2 0
2 mike 1 1
3 mike 2 0
And I want something like this:
1 2 # (this is the id as columns)
----------------------
mike 0 0 # (and this is the 'x' as values)
john 1 0
But when I run the pivot
method, it is saying:
*** ReshapeError: Index contains duplicate entries, cannot reshape
Which doesn’t makes sense, even in example there are repeated entries on the foo
column. I’m using the name
column as the index of the pivot, the first argument of the pivot
method call.
Answers:
Works fine for me? Can you post the exact pivot method call you’re using?
In [4]: df.pivot('name', 'id', 'x')
Out[4]:
id 1 2
name
john 0 0
mike 1 0
Try this,
#drop_duplicates removes entries which have same values for 'foo' and 'bar'
df = df.drop_duplicates(['foo','bar'])
df.pivot('foo','bar','baz')
As far as I can tell with updates to pandas, you have to use pivot_table() instead of pivot().
pandas.pivot_table(df,values='count',index='site_id',columns='week')
friends, I’ve had such problem. In my case problem was in data – my column ‘information’ contained 1 unique value and it caused error.
UPD: to correct work ‘pivot’ pairs (id_user,information) mustn’t have dublicates
It works:
df2 = pd.DataFrame({'id_user':[1,2,3,4,4,5,5],
'information':['phon','phon','phone','phone1','phone','phone1','phone'],
'value': [1, '01.01.00', '01.02.00', 2, '01.03.00', 3, '01.04.00']})
df2.pivot(index='id_user', columns='information', values='value')
it doesn’t work:
df2 = pd.DataFrame({'id_user':[1,2,3,4,4,5,5],
'information':['phone','phone','phone','phone','phone','phone','phone'],
'value': [1, '01.01.00', '01.02.00', 2, '01.03.00', 3, '01.04.00']})
df2.pivot(index='id_user', columns='information', values='value')
My data has no duplicated pivot pairs and still pivot_table throws a key error 🙁 drop_duplicates() truncates my data to the first value of the pair.
Input:
Well Reading Filter 4
0 A2 1 116041
1 B2 1 105191
2 C2 1 93942
3 D2 1 96821
4 E2 1 85622
5 F2 1 90227
6 G2 1 95801
7 H2 1 107833
8 A2 2 115765
9 B2 2 104395
10 C2 2 93986
...
1630 G2 204 388682
1631 H2 204 444708
1632 rows × 3 columns
df_X2.pivot_table(‘Reading’, ‘Well’, ‘Filter 4’)
throws: KeyError: ‘Reading’
df_X2_uniq=df_X2.drop_duplicates([‘Well’, ‘Reading’])
truncates the data to the first 8 rows:
Well Reading Filter 4
0 A2 1 116041
1 B2 1 105191
2 C2 1 93942
3 D2 1 96821
4 E2 1 85622
5 F2 1 90227
6 G2 1 95801
7 H2 1 107833
After 2 hours of combing through the posts I’m none the wiser… any hints of what I should try to get the pivot to work?
As pointed out by several answers, the problem isn’t that you have duplicate values in your index (the error message certainly does not help here), but rather that you have duplicates of (index, column)
. Some answers are suggesting that you just drop these duplicates but I’d be careful about doing so – in my experience this is rarely the correct choice. More often than not, you probably want to aggregate your data somehow, and then pivot.
I’m pulling some examples and quotes from this blog post, which I suggest you read for more details, below.
Given data like this:
df = pd.DataFrame([
['a', 'x', 1],
['a', 'x', 2],
['b', 'x', 3],
['b', 'y', 4]
], columns=['g1', 'g2', 'value'])
which prints like this:
>>> print(df)
g1 g2 value
0 a x 1
1 a x 2
2 b x 3
3 b y 4
we get a ValueError
when attempting to pivot with g1
as the index and g2
as the columns:
>>> df.pivot(index='g1', columns='g2', values='value')
...
ValueError: Index contains duplicate entries, cannot reshape
Notice that rows 0 and 1 have the same values for g1
and g2
: (a, x)
. So when pandas creates your pivoted dataframe, for the a
index, g1
column, how do pick just one value: 1 or 2? The answer is… we can’t! This is why dropping duplicates works, but it may not be you want, since you’re losing potentially useful data. So what can we do instead?
Solution 1: Aggregate
There won’t always be an aggregate function that makes sense for your use case, but if there is, there are several ways to accomplish this.
df.pivot_table(index='g1', columns='g2', values='value', aggfunc='sum')
df_agg = df.groupby(by=['g1', 'g2']).value.sum().reset_index()
df_agg.pivot(index='g1', columns='g2', values='value')
df.groupby(by=['g1', 'g2']).value.sum().unstack()
All of these yield the same result:
g2 x y
g1
a 3.0 NaN
b 3.0 4.0
But what if you don’t need the sum? Maybe comma separated values are useful in your case?
df.pivot_table(
index='g1',
columns='g2',
values='value',
aggfunc=lambda x: ','.join(x.astype('str'))
)
# we need to convert to strings before we can join
to get:
g2 x y
g1
a 1,2 NaN
b 3 4
or you can use list
as your aggfunc
:
pv = df.pivot_table(index='g1', columns='g2', values='value', aggfunc=list)
and then we can explode!
>>> pv.explode('x').explode('y')
g2 x y
g1
a 1 NaN
a 2 NaN
b 3 4
Solution 2: Give yourself another key
This is based on this answer
>>> df['key'] = df.groupby(['g1', 'g2']).cumcount()
>>> df
g1 g2 value key
0 a x 1 0
1 a x 2 1
2 b x 3 0
3 b y 4 0
and now we can pivot with a composite index:
>>> df.pivot(index=['key', 'g1'], columns='g2', values='value').reset_index().drop(columns='key')
g2 g1 x y
0 a 1.0 NaN
1 b 3.0 4.0
2 a 2.0 NaN
This is almost the same result as the exploded example above, just a set_index('g1')
away.
Hope this helps! I hit this problem quite often and usually forget all of this..
On Pandas documentation of the pivot
method, we have:
Examples
--------
>>> df
foo bar baz
0 one A 1.
1 one B 2.
2 one C 3.
3 two A 4.
4 two B 5.
5 two C 6.
>>> df.pivot('foo', 'bar', 'baz')
A B C
one 1 2 3
two 4 5 6
My DataFrame
is structured like this:
name id x
----------------------
0 john 1 0
1 john 2 0
2 mike 1 1
3 mike 2 0
And I want something like this:
1 2 # (this is the id as columns)
----------------------
mike 0 0 # (and this is the 'x' as values)
john 1 0
But when I run the pivot
method, it is saying:
*** ReshapeError: Index contains duplicate entries, cannot reshape
Which doesn’t makes sense, even in example there are repeated entries on the foo
column. I’m using the name
column as the index of the pivot, the first argument of the pivot
method call.
Works fine for me? Can you post the exact pivot method call you’re using?
In [4]: df.pivot('name', 'id', 'x')
Out[4]:
id 1 2
name
john 0 0
mike 1 0
Try this,
#drop_duplicates removes entries which have same values for 'foo' and 'bar'
df = df.drop_duplicates(['foo','bar'])
df.pivot('foo','bar','baz')
As far as I can tell with updates to pandas, you have to use pivot_table() instead of pivot().
pandas.pivot_table(df,values='count',index='site_id',columns='week')
friends, I’ve had such problem. In my case problem was in data – my column ‘information’ contained 1 unique value and it caused error.
UPD: to correct work ‘pivot’ pairs (id_user,information) mustn’t have dublicates
It works:
df2 = pd.DataFrame({'id_user':[1,2,3,4,4,5,5],
'information':['phon','phon','phone','phone1','phone','phone1','phone'],
'value': [1, '01.01.00', '01.02.00', 2, '01.03.00', 3, '01.04.00']})
df2.pivot(index='id_user', columns='information', values='value')
it doesn’t work:
df2 = pd.DataFrame({'id_user':[1,2,3,4,4,5,5],
'information':['phone','phone','phone','phone','phone','phone','phone'],
'value': [1, '01.01.00', '01.02.00', 2, '01.03.00', 3, '01.04.00']})
df2.pivot(index='id_user', columns='information', values='value')
My data has no duplicated pivot pairs and still pivot_table throws a key error 🙁 drop_duplicates() truncates my data to the first value of the pair.
Input:
Well Reading Filter 4
0 A2 1 116041
1 B2 1 105191
2 C2 1 93942
3 D2 1 96821
4 E2 1 85622
5 F2 1 90227
6 G2 1 95801
7 H2 1 107833
8 A2 2 115765
9 B2 2 104395
10 C2 2 93986
...
1630 G2 204 388682
1631 H2 204 444708
1632 rows × 3 columns
df_X2.pivot_table(‘Reading’, ‘Well’, ‘Filter 4’)
throws: KeyError: ‘Reading’
df_X2_uniq=df_X2.drop_duplicates([‘Well’, ‘Reading’])
truncates the data to the first 8 rows:
Well Reading Filter 4
0 A2 1 116041
1 B2 1 105191
2 C2 1 93942
3 D2 1 96821
4 E2 1 85622
5 F2 1 90227
6 G2 1 95801
7 H2 1 107833
After 2 hours of combing through the posts I’m none the wiser… any hints of what I should try to get the pivot to work?
As pointed out by several answers, the problem isn’t that you have duplicate values in your index (the error message certainly does not help here), but rather that you have duplicates of (index, column)
. Some answers are suggesting that you just drop these duplicates but I’d be careful about doing so – in my experience this is rarely the correct choice. More often than not, you probably want to aggregate your data somehow, and then pivot.
I’m pulling some examples and quotes from this blog post, which I suggest you read for more details, below.
Given data like this:
df = pd.DataFrame([
['a', 'x', 1],
['a', 'x', 2],
['b', 'x', 3],
['b', 'y', 4]
], columns=['g1', 'g2', 'value'])
which prints like this:
>>> print(df)
g1 g2 value
0 a x 1
1 a x 2
2 b x 3
3 b y 4
we get a ValueError
when attempting to pivot with g1
as the index and g2
as the columns:
>>> df.pivot(index='g1', columns='g2', values='value')
...
ValueError: Index contains duplicate entries, cannot reshape
Notice that rows 0 and 1 have the same values for g1
and g2
: (a, x)
. So when pandas creates your pivoted dataframe, for the a
index, g1
column, how do pick just one value: 1 or 2? The answer is… we can’t! This is why dropping duplicates works, but it may not be you want, since you’re losing potentially useful data. So what can we do instead?
Solution 1: Aggregate
There won’t always be an aggregate function that makes sense for your use case, but if there is, there are several ways to accomplish this.
df.pivot_table(index='g1', columns='g2', values='value', aggfunc='sum')
df_agg = df.groupby(by=['g1', 'g2']).value.sum().reset_index()
df_agg.pivot(index='g1', columns='g2', values='value')
df.groupby(by=['g1', 'g2']).value.sum().unstack()
All of these yield the same result:
g2 x y
g1
a 3.0 NaN
b 3.0 4.0
But what if you don’t need the sum? Maybe comma separated values are useful in your case?
df.pivot_table(
index='g1',
columns='g2',
values='value',
aggfunc=lambda x: ','.join(x.astype('str'))
)
# we need to convert to strings before we can join
to get:
g2 x y
g1
a 1,2 NaN
b 3 4
or you can use list
as your aggfunc
:
pv = df.pivot_table(index='g1', columns='g2', values='value', aggfunc=list)
and then we can explode!
>>> pv.explode('x').explode('y')
g2 x y
g1
a 1 NaN
a 2 NaN
b 3 4
Solution 2: Give yourself another key
This is based on this answer
>>> df['key'] = df.groupby(['g1', 'g2']).cumcount()
>>> df
g1 g2 value key
0 a x 1 0
1 a x 2 1
2 b x 3 0
3 b y 4 0
and now we can pivot with a composite index:
>>> df.pivot(index=['key', 'g1'], columns='g2', values='value').reset_index().drop(columns='key')
g2 g1 x y
0 a 1.0 NaN
1 b 3.0 4.0
2 a 2.0 NaN
This is almost the same result as the exploded example above, just a set_index('g1')
away.
Hope this helps! I hit this problem quite often and usually forget all of this..