ValueError due to duplicate axis when replacing values in a pandas DataFrame
Question:
I have one dataset, df, including nodes (N and T) and indicators assigned to nodes (IND_N and IND_T):
         N        T  IND_N  IND_T
0     John     Mark      1      0
1     Mike     John      2      1
2  Stephan    Simon      1      0
3    Laura  Stephan      1      1
4     Matt    Simon      3      0
5    Simon     Joey      0      2
I split the dataset into two, one (df1) with nodes that keep the indicators from df, the other one (df2) with indicators replaced by a dummy value.
df1 (keeps indicators from df)
         N      T  IND_N  IND_T
0     John   Mark      1      0
1  Stephan  Simon      1      0
2    Simon   Joey      0      2
df2 (please note that, after splitting, I assigned a dummy value -1 to all the indicators in df2)
       N        T  IND_N  IND_T
0  Laura  Stephan     -1     -1
1   Matt    Simon     -1     -1
2   Mike     John     -1     -1
A node in df2 may also appear in df1. To avoid a node ending up in both datasets (df1 and df2) with different indicators (e.g., Simon in the example above), I want nodes that appear in both df2 and df1 to keep their original indicator (i.e., the one from df1). I then want to recombine the two datasets to get the final output:
df_out
         N        T  IND_N  IND_T
0     John     Mark      1      0
1  Stephan    Simon      1      0
2    Simon     Joey      0      2
3    Laura  Stephan     -1      1
4     Matt    Simon     -1      0
5     Mike     John     -1      1
Following the solution proposed here, I got the following error:
ValueError: cannot reindex from a duplicate axis
I tried to fix it as follows:
temp = df_unlabel[values]
temp.update(df_label[values].set_index(col, inplace=True))
After checking the values in the final table (df_out), I found that no dummy values are assigned (they are replaced again by the original ones).
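A note on the attempted fix above: `set_index(..., inplace=True)` modifies the frame in place and returns `None`, so `update` is not given an indexed frame at all. The sketch below shows that behaviour on a small hypothetical frame (`labels` and its contents are made up for illustration and are not the asker's `df_label`):

```python
import pandas as pd

# Hypothetical stand-in for a labelled frame; not the asker's actual data.
labels = pd.DataFrame({"N": ["John", "Simon"], "IND_N": [1, 0]})

# With inplace=True the method mutates its caller and returns None,
# so the expression cannot be passed on to another call.
result = labels.copy().set_index("N", inplace=True)
print(result)

# Without inplace, set_index returns a new frame indexed by "N".
indexed = labels.set_index("N")
print(indexed)
```

Dropping `inplace=True` gives `update` a real frame to work with, although duplicate labels in the index can still cause the reindexing error quoted earlier.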
I’d appreciate your help to fix this error in order to get the final output.
Happy to provide more info if needed.
Answers:
You can use a mapping dict:
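import pandas as pd

# Build a name -> indicator mapping from df1 (later entries win on duplicate names)
dmap = pd.concat([df1.set_index('N')['IND_N'], df1.set_index('T')['IND_T']]).to_dict()
# Catch-all regex: any name not matched above gets the dummy value -1
dmap.update({'.*': -1})
# Replace names with indicators, then write the raw values into the indicator columns
df2[['IND_N', 'IND_T']] = df2[['N', 'T']].replace(dmap, regex=True).values
# Stack df1 and the updated df2 with a fresh index
out = pd.concat([df1, df2], axis=0, ignore_index=True)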
Output:
>>> out
         N        T  IND_N  IND_T
0     John     Mark      1      0
1  Stephan    Simon      1      0
2    Simon     Joey      0      2
3    Laura  Stephan     -1      1
4     Matt    Simon     -1      0
5     Mike     John     -1      1
>>> dmap
{'John': 1, 'Stephan': 1, 'Simon': 0, 'Mark': 0, 'Joey': 2, '.*': -1}
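Because the keys of `dmap` are treated as regular expressions, a name that merely contains another name could also match. A possible alternative, sketched here under the assumption that exact name matches are wanted, uses `Series.map` with `fillna` instead of regex replacement. The frames are rebuilt from the question's example, and `lookup` is a name introduced only for this illustration:

```python
import pandas as pd

# Example frames rebuilt from the question.
df1 = pd.DataFrame({
    "N": ["John", "Stephan", "Simon"],
    "T": ["Mark", "Simon", "Joey"],
    "IND_N": [1, 1, 0],
    "IND_T": [0, 0, 2],
})
df2 = pd.DataFrame({
    "N": ["Laura", "Matt", "Mike"],
    "T": ["Stephan", "Simon", "John"],
    "IND_N": [-1, -1, -1],
    "IND_T": [-1, -1, -1],
})

# One lookup Series from node name to its indicator in df1.
lookup = pd.concat([df1.set_index("N")["IND_N"], df1.set_index("T")["IND_T"]])
# map() needs a unique index; keep the last entry per name,
# which mirrors what to_dict() does with repeated keys.
lookup = lookup[~lookup.index.duplicated(keep="last")]

# Exact-match lookup per column; names unknown to df1 get the dummy value -1.
for name_col, ind_col in [("N", "IND_N"), ("T", "IND_T")]:
    df2[ind_col] = df2[name_col].map(lookup).fillna(-1).astype(int)

out = pd.concat([df1, df2], ignore_index=True)
print(out)
```

Removing duplicated names from the lookup before mapping also sidesteps the duplicate-axis problem, since the lookup index is unique by construction.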