pd.Series.explode and ValueError: cannot reindex from a duplicate axis
Question:
I consulted many of the posts on ValueError: cannot reindex from a duplicate axis ("What does `ValueError: cannot reindex from a duplicate axis` mean?" and other related posts). I understand that the error can arise with duplicate row indices or column names, but I still can’t figure out exactly what is throwing the error.
Below is my best attempt at reproducing the spirit of the dataframe, which does throw the error.
import pandas as pd

d = {"id" : [1,2,3,4,5],
"cata" : [['aaa1','bbb2','ccc3'],['aaa4','bbb5','ccc6'],['aaa7','bbb8','ccc9'],['aaa10','bbb11','ccc12'],['aaa13','bbb14','ccc15']],
"catb" : [['ddd1','eee2','fff3','ggg4'],['ddd5','eee6','fff7','ggg8'],['ddd9','eee10','fff11','ggg12'],['ddd13','eee14','fff15','ggg16'],['ddd17','eee18','fff19','ggg20']],
"catc" : [['hhh1','iii2','jjj3', 'kkk4', 'lll5'],['hhh6','iii7','jjj8', 'kkk9', 'lll10'],['hhh11','iii12','jjj13', 'kkk14', 'lll15'],['hhh16','iii17','jjj18', 'kkk18', 'lll19'],['hhh20','iii21','jjj22', 'kkk23', 'lll24']]}
df = pd.DataFrame(d)
df.head()
id cata catb catc
0 1 [aaa1, bbb2, ccc3] [ddd1, eee2, fff3, ggg4] [hhh1, iii2, jjj3, kkk4, lll5]
1 2 [aaa4, bbb5, ccc6] [ddd5, eee6, fff7, ggg8] [hhh6, iii7, jjj8, kkk9, lll10]
2 3 [aaa7, bbb8, ccc9] [ddd9, eee10, fff11, ggg12] [hhh11, iii12, jjj13, kkk14, lll15]
3 4 [aaa10, bbb11, ccc12] [ddd13, eee14, fff15, ggg16] [hhh16, iii17, jjj18, kkk18, lll19]
4 5 [aaa13, bbb14, ccc15] [ddd17, eee18, fff19, ggg20] [hhh20, iii21, jjj22, kkk23, lll24]
df.set_index(['id']).apply(pd.Series.explode).reset_index()
Here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-63-17e7c29b180c> in <module>()
----> 1 df.set_index(['id']).apply(pd.Series.explode).reset_index()
14 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3097 # trying to reindex on an axis with duplicates
3098 if not self.is_unique and len(indexer):
-> 3099 raise ValueError("cannot reindex from a duplicate axis")
3100
3101 def reindex(self, target, method=None, level=None, limit=None, tolerance=None):
ValueError: cannot reindex from a duplicate axis
The dataset I’m using is a few hundred MBs and it’s a pain – lots of lists inside lists, but the example above is a fair representation of where I’m stuck. Even when I try to generate a fake dataframe with unique values, I still don’t understand why I’m getting the ValueError.
I have explored other ways to explode the lists, such as df.apply(lambda x: x.apply(pd.Series).stack()).reset_index().drop(columns='level_1'), which doesn’t throw a ValueError. However, it’s definitely not as fast, and I’d probably have to reconsider how I’m processing the df. Still, I want to understand why I’m getting the ValueError when I don’t have any obvious duplicate values.
Thanks!!!!
Adding the desired output below, which I generated by chaining apply, stack, and dropping levels.
id cata catb catc
0 1 aaa1 ddd1 hhh1
1 1 bbb2 eee2 iii2
2 1 ccc3 fff3 jjj3
3 1 NaN ggg4 kkk4
4 1 NaN NaN lll5
5 2 aaa4 ddd5 hhh6
6 2 bbb5 eee6 iii7
7 2 ccc6 fff7 jjj8
8 2 NaN ggg8 kkk9
9 2 NaN NaN lll10
10 3 aaa7 ddd9 hhh11
11 3 bbb8 eee10 iii12
12 3 ccc9 fff11 jjj13
13 3 NaN ggg12 kkk14
14 3 NaN NaN lll15
15 4 aaa10 ddd13 hhh16
16 4 bbb11 eee14 iii17
17 4 ccc12 fff15 jjj18
18 4 NaN ggg16 kkk18
19 4 NaN NaN lll19
20 5 aaa13 ddd17 hhh20
21 5 bbb14 eee18 iii21
22 5 ccc15 fff19 jjj22
23 5 NaN ggg20 kkk23
24 5 NaN NaN lll24
Answers:
I could not resolve the pd.Series.explode() error, but the following creates a long form with an ‘id’ column.
tmp = pd.concat([df['id'], df['cata'].apply(pd.Series), df['catb'].apply(pd.Series), df['catc'].apply(pd.Series)], axis=1)
tmp2 = tmp.unstack().to_frame().reset_index()
# drop the rows that came from the original 'id' column
tmp2 = tmp2[tmp2['level_0'] != 'id']
tmp2 = tmp2.drop('level_1', axis=1)
# rename returns a new frame, so assign it back; 'id' here is the position within each list
tmp2 = tmp2.rename(columns={'level_0': 'id', 0: 'value'})
tmp2 = tmp2.reset_index(drop=True)
id value
0 0 aaa1
1 0 aaa4
2 0 aaa7
3 0 aaa10
4 0 aaa13
5 1 bbb2
6 1 bbb5
7 1 bbb8
8 1 bbb11
9 1 bbb14
10 2 ccc3
11 2 ccc6
12 2 ccc9
...
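In the long form above, the ‘id’ column holds the position of each value within its list (0, 1, 2, ...) rather than the original id. If the original id is what you want to keep, one possible alternative (a sketch, not the method used in this answer) is to melt the frame and then call DataFrame.explode on the value column. The small frame and the column names "column" and "value" below are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the question's data: lists of unequal length per column.
df = pd.DataFrame({
    "id": [1, 2],
    "cata": [["a1", "a2"], ["a3", "a4"]],
    "catb": [["b1", "b2", "b3"], ["b4", "b5", "b6"]],
})

long_form = (
    # one row per (id, source column), with the list still intact in "value"
    df.melt(id_vars="id", var_name="column", value_name="value")
      # one row per list element; id and column are repeated as needed
      .explode("value")
      .reset_index(drop=True)
)
print(long_form)
```

Because each row is exploded on its own, the differing list lengths never have to be aligned against each other.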
I had to rethink how I was parsing the data. What I accidentally omitted from this post was that I ended up with unbalanced lists as a consequence of using .str.findall(regex_pattern).to_frame() on different columns. The lists were unbalanced because certain metadata fields (e.g., “name”) were missing over the years. However, because I started with a column of lists of lists, I exploded that using df.explode and then used findall to extract patterns into new columns, which meant that null values could be created too.
For a 500MB dataset of several hundred thousand rows of fields with string type data, the whole process took probably less than 5 min.
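The unbalanced lists described above are, as far as I can tell, what triggers the ValueError in the question. Each column is exploded on its own, so columns with different list lengths come back with different numbers of repeated id labels, and pandas then has to align those duplicated indexes to build one frame. The sketch below uses a made-up two-row frame to show the mismatch; the exact error text varies between pandas versions.

```python
import pandas as pd

# Hypothetical miniature of the question's data, indexed by id.
ragged = pd.DataFrame({
    "id": [1, 2],
    "cata": [["a1", "a2"], ["a3", "a4"]],
    "catb": [["b1", "b2", "b3"], ["b4", "b5", "b6"]],
}).set_index("id")

# Explode each column separately and look at the index it comes back with.
index_per_column = {
    name: ragged[name].explode().index.tolist() for name in ragged.columns
}
print(index_per_column)  # the two columns repeat each id a different number of times

# Combining the columns means aligning two different, non-unique indexes.
try:
    ragged.apply(pd.Series.explode)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

When every list in a row has the same length, the exploded indexes are identical and no alignment is needed, which would explain why balanced data does not hit the error.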
import pandas as pd
df = pd.DataFrame(
{"id" : [1,2,3],
0: [['x', 'y', 'z'], ['a', 'b', 'c'], ['a', 'b', 'c']],
1: [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']],
2: [['a', 'b', 'c'], ['x', 'y', 'z'], ['a', 'b', 'c']]},
)
print(df)
"""
id 0 1 2
0 1 [x, y, z] [a, b, c] [a, b, c]
1 2 [a, b, c] [a, b, c] [x, y, z]
2 3 [a, b, c] [a, b, c] [a, b, c]
"""
bb = (
df.set_index('id').stack().explode()
.reset_index(name='val')
.drop(columns='level_1')
)
print(bb)
"""
id val
0 1 x
1 1 y
2 1 z
3 1 a
4 1 b
5 1 c
6 1 a
7 1 b
8 1 c
9 2 a
10 2 b
11 2 c
12 2 a
13 2 b
14 2 c
15 2 x
16 2 y
17 2 z
18 3 a
19 3 b
20 3 c
21 3 a
22 3 b
23 3 c
24 3 a
25 3 b
26 3 c
"""
aa = df.set_index('id').apply(pd.Series.explode).reset_index()
print(aa)
"""
id 0 1 2
0 1 x a a
1 1 y b b
2 1 z c c
3 2 a a x
4 2 b b y
5 2 c c z
6 3 a a a
7 3 b b b
8 3 c c c
"""
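The example above works because every list in a row has the same length. For lists of unequal length, as in the question, one possible workaround is to explode each column separately, number the elements within each id so the index becomes unique, and only then join the columns. This is a sketch under that assumption; the helper name explode_ragged and the temporary "pos" level are made up for illustration.

```python
import pandas as pd

def explode_ragged(frame, id_col):
    """Explode every list column of `frame`, padding shorter lists with NaN."""
    indexed = frame.set_index(id_col)
    pieces = []
    for name in indexed.columns:
        s = indexed[name].explode()
        # position of each element within its id: 0, 1, 2, ...
        pos = s.groupby(level=0).cumcount().to_numpy()
        # (id, pos) is unique, so the columns can be aligned safely
        s.index = pd.MultiIndex.from_arrays([s.index, pos], names=[id_col, "pos"])
        pieces.append(s)
    # outer join on (id, pos); sort_index orders rows by id, then position
    wide = pd.concat(pieces, axis=1).sort_index()
    return wide.reset_index().drop(columns="pos")

# Hypothetical miniature of the question's data.
df = pd.DataFrame({
    "id": [1, 2],
    "cata": [["a1", "a2"], ["a3", "a4"]],
    "catb": [["b1", "b2", "b3"], ["b4", "b5", "b6"]],
})
out = explode_ragged(df, "id")
print(out)
```

Note that sort_index orders the result by id, so rows will be reordered if the ids were not already sorted in the input.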