pd.Series.explode and ValueError: cannot reindex from a duplicate axis

Question:

I consulted a lot of the posts on ValueError: cannot reindex from a duplicate axis (e.g., "What does `ValueError: cannot reindex from a duplicate axis` mean?" and other related posts). I understand that the error can arise with duplicate row indices or column names, but I still can’t figure out what exactly is throwing the error.

Below is my best attempt at reproducing the spirit of the dataframe, which does throw the error.

import pandas as pd

d = {"id" : [1,2,3,4,5],
     "cata" : [['aaa1','bbb2','ccc3'],['aaa4','bbb5','ccc6'],['aaa7','bbb8','ccc9'],['aaa10','bbb11','ccc12'],['aaa13','bbb14','ccc15']],
     "catb" : [['ddd1','eee2','fff3','ggg4'],['ddd5','eee6','fff7','ggg8'],['ddd9','eee10','fff11','ggg12'],['ddd13','eee14','fff15','ggg16'],['ddd17','eee18','fff19','ggg20']],
     "catc" : [['hhh1','iii2','jjj3', 'kkk4', 'lll5'],['hhh6','iii7','jjj8', 'kkk9', 'lll10'],['hhh11','iii12','jjj13', 'kkk14', 'lll15'],['hhh16','iii17','jjj18', 'kkk18', 'lll19'],['hhh20','iii21','jjj22', 'kkk23', 'lll24']]}

df = pd.DataFrame(d)

df.head()

    id  cata    catb    catc
0   1   [aaa1, bbb2, ccc3]  [ddd1, eee2, fff3, ggg4]    [hhh1, iii2, jjj3, kkk4, lll5]
1   2   [aaa4, bbb5, ccc6]  [ddd5, eee6, fff7, ggg8]    [hhh6, iii7, jjj8, kkk9, lll10]
2   3   [aaa7, bbb8, ccc9]  [ddd9, eee10, fff11, ggg12]     [hhh11, iii12, jjj13, kkk14, lll15]
3   4   [aaa10, bbb11, ccc12]   [ddd13, eee14, fff15, ggg16]    [hhh16, iii17, jjj18, kkk18, lll19]
4   5   [aaa13, bbb14, ccc15]   [ddd17, eee18, fff19, ggg20]    [hhh20, iii21, jjj22, kkk23, lll24]

df.set_index(['id']).apply(pd.Series.explode).reset_index()

Here is the error:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-63-17e7c29b180c> in <module>()
----> 1 df.set_index(['id']).apply(pd.Series.explode).reset_index()

14 frames

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

The dataset I’m using is a few hundred MBs and it’s a pain (lots of lists inside lists), but the example above is a fair representation of where I’m stuck. Even when I generate a fake dataframe with unique values, I still don’t understand why I’m getting the ValueError.

I have explored other ways to explode the lists, like using df.apply(lambda x: x.apply(pd.Series).stack()).reset_index().drop('level_1', 1), which doesn’t throw a ValueError. However, it’s definitely not as fast, and I’d probably have to reconsider how I’m processing the df. Still, I want to understand why I’m getting the ValueError when I don’t have any obvious duplicate values.

Thanks!!!!

Adding the desired output below, which I generated by chaining apply/stack and dropping levels.

    id  cata    catb    catc
0   1   aaa1    ddd1    hhh1
1   1   bbb2    eee2    iii2
2   1   ccc3    fff3    jjj3
3   1   NaN     ggg4    kkk4
4   1   NaN     NaN     lll5
5   2   aaa4    ddd5    hhh6
6   2   bbb5    eee6    iii7
7   2   ccc6    fff7    jjj8
8   2   NaN     ggg8    kkk9
9   2   NaN     NaN     lll10
10  3   aaa7    ddd9    hhh11
11  3   bbb8    eee10   iii12
12  3   ccc9    fff11   jjj13
13  3   NaN     ggg12   kkk14
14  3   NaN     NaN     lll15
15  4   aaa10   ddd13   hhh16
16  4   bbb11   eee14   iii17
17  4   ccc12   fff15   jjj18
18  4   NaN     ggg16   kkk18
19  4   NaN     NaN     lll19
20  5   aaa13   ddd17   hhh20
21  5   bbb14   eee18   iii21
22  5   ccc15   fff19   jjj22
23  5   NaN     ggg20   kkk23
24  5   NaN     NaN     lll24
Asked By: imstuck

||

Answers:

I could not resolve the error from pd.Series.explode(), but the code below produces a long-form table with an ‘id’ column instead.

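# Expand each list column into one column per list position (labels 0, 1, 2, ...)
tmp = pd.concat([df['id'],df['cata'].apply(pd.Series),df['catb'].apply(pd.Series),df['catc'].apply(pd.Series)],axis=1)
# Long form: one row per (column label, row label) pair
tmp2 = tmp.unstack().to_frame().reset_index()
tmp2 = tmp2[tmp2['level_0'] != 'id']
tmp2 = tmp2.drop('level_1', axis=1)
# Note: the renamed 'id' column holds the list position, not the original id
tmp2 = tmp2.rename(columns={'level_0':'id', 0:'value'})
tmp2 = tmp2.reset_index(drop=True)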

    id  value
0   0   aaa1
1   0   aaa4
2   0   aaa7
3   0   aaa10
4   0   aaa13
5   1   bbb2
6   1   bbb5
7   1   bbb8
8   1   bbb11
9   1   bbb14
10  2   ccc3
11  2   ccc6
12  2   ccc9
...
Answered By: r-beginners

I had to rethink how I was parsing the data. What I accidentally omitted from this post was that the unbalanced lists came from using .str.findall(regex_pattern).to_frame() on different columns: certain metadata fields (e.g., “name”) were missing over the years, so the lists ended up with different lengths. However, because I started with a column of lists of lists, I exploded that column first using df.explode and then used findall to extract the patterns into new columns, which meant that null values could be created too.

For a 500MB dataset of several hundred thousand rows of fields with string type data, the whole process took probably less than 5 min.
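
For anyone hitting the same thing, here is a minimal sketch of what the unbalanced lists do, assuming that is indeed the cause of the ValueError in the question. Each column explodes to a different number of rows under a repeated id index, so the exploded columns cannot be lined up on id alone. The helper explode_unbalanced and the "pos" level are hypothetical names introduced for illustration; this is not the code I ran on the real dataset.

```python
import pandas as pd

# Small stand-in for the question's frame: 'cata' has 3 items per row, 'catb' has 4.
df = pd.DataFrame({
    "id": [1, 2],
    "cata": [["aaa1", "bbb2", "ccc3"], ["aaa4", "bbb5", "ccc6"]],
    "catb": [["ddd1", "eee2", "fff3", "ggg4"], ["ddd5", "eee6", "fff7", "ggg8"]],
})

# Exploding each column on its own gives different lengths and a repeated 'id' index.
a = df.set_index("id")["cata"].explode()
b = df.set_index("id")["catb"].explode()
print(len(a), len(b), a.index.is_unique)


def explode_unbalanced(frame, id_col, list_cols):
    """Hypothetical helper: explode list columns of unequal length, padding with NaN."""
    parts = []
    for col in list_cols:
        s = frame.set_index(id_col)[col].explode()
        # Number the items within each id so that (id, pos) is a unique key.
        pos = s.groupby(level=0).cumcount().rename("pos")
        parts.append(s.to_frame().set_index(pos, append=True)[col])
    # Every part now has a unique MultiIndex, so concat can align them.
    out = pd.concat(parts, axis=1).sort_index()
    return out.reset_index().drop(columns="pos")


result = explode_unbalanced(df, "id", ["cata", "catb"])
print(result)
```

Shorter lists are padded with NaN, which matches the shape of the desired output in the question.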

Answered By: imstuck

import pandas as pd

# Every list in a given row has the same length (three items),
# so each column explodes to the same number of rows.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        0: [['x', 'y', 'z'], ['a', 'b', 'c'], ['a', 'b', 'c']],
        1: [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']],
        2: [['a', 'b', 'c'], ['x', 'y', 'z'], ['a', 'b', 'c']],
    }
)


print(df)

"""
   id          0          1          2
0   1  [x, y, z]  [a, b, c]  [a, b, c]
1   2  [a, b, c]  [a, b, c]  [x, y, z]
2   3  [a, b, c]  [a, b, c]  [a, b, c]

"""

# Long form: stack the list columns into one Series, then explode it
bb = (
    df.set_index('id').stack().explode()
    .reset_index(name='val')
    .drop(columns='level_1')
    )
print(bb)
"""

    id val
0    1   x
1    1   y
2    1   z
3    1   a
4    1   b
5    1   c
6    1   a
7    1   b
8    1   c
9    2   a
10   2   b
11   2   c
12   2   a
13   2   b
14   2   c
15   2   x
16   2   y
17   2   z
18   3   a
19   3   b
20   3   c
21   3   a
22   3   b
23   3   c
24   3   a
25   3   b
26   3   c

"""


aa = df.set_index('id').apply(pd.Series.explode).reset_index()
print(aa)
"""
   id  0  1  2
0   1  x  a  a
1   1  y  b  b
2   1  z  c  c
3   2  a  a  x
4   2  b  b  y
5   2  c  c  z
6   3  a  a  a
7   3  b  b  b
8   3  c  c  c

"""
Answered By: Soudipta Dutta