Pandas dataframe masking error: cannot reindex on an axis with duplicate labels


I am trying to get some metrics on some data at my company.

Basically, I have this dataframe that I have titled rawData.
rawData contains a number of columns, mostly parameters I am interested in. I don't think the specifics are too important, so we can just think of these as parameter1, parameter2, and so on.

There is an additional column, which I have titled overallResult. This column will always contain either the string PASS, or FAIL. I am trying to extract a sub-dataframe from my raw data based on the overallResult. It sounds simple enough, but I am messing up my implementation somehow.

I make my mask like this:
mask = rawData[overallResult].eq(truthyVal)
where in this case truthyVal is the string PASS.

The mask is created successfully, but the problem comes when I apply it, like this:
filteredData = rawData[mask]
I would like filteredData to contain everything that rawData does, but only the rows where overallResult equals truthyVal.

Instead, it always gives me this error: cannot reindex on an axis with duplicate labels.

From what I understand, the mask contains a boolean list of my overallResult column, true if truthyVal is found on that row, and false if not. I am pretty sure that I am not applying my mask correctly here. There must be some small extra step I am overlooking, and at this point I am frustrated because it seems so simple, so IDK, any ideas?

Asked By: creosean



You have the principle correct, as the following basic example shows:

import pandas as pd

df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6],
                   'test': ['pass', 'fail', 'pass', 'fail', 'pass', 'fail']})

# Boolean Series: True where 'test' equals 'pass'
mask = df['test'].eq('pass')

# Keep only the rows where the mask is True
print(df[mask])

To decipher your error message it would be necessary to see a data sample which produces it; you might get some useful insights here
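
Since the message complains about duplicate labels on an axis, one thing worth checking in the meantime is whether rawData has repeated index labels or repeated column names. The following is only a sketch: the frame raw below is a made-up stand-in for rawData (its values, its repeated index label 0, and its duplicated overallResult column are all assumptions for illustration), and it may not match whatever is actually wrong with your data.

```python
import pandas as pd

# Hypothetical stand-in for rawData: index label 0 appears twice and the
# column name 'overallResult' appears twice (for instance after a concat).
raw = pd.DataFrame(
    [[1.0, "PASS", "PASS"], [2.0, "FAIL", "FAIL"], [3.0, "PASS", "PASS"]],
    columns=["parameter1", "overallResult", "overallResult"],
    index=[0, 0, 1],
)

# Diagnostics: are the labels on each axis unique?
print(raw.index.is_unique)
print(raw.columns.is_unique)
print(raw.columns[raw.columns.duplicated()].tolist())

# One possible cleanup: keep the first copy of each repeated column,
# then renumber the rows so that the index labels are unique again.
clean = raw.loc[:, ~raw.columns.duplicated()].reset_index(drop=True)

# With unique labels on both axes, the mask is a plain boolean Series
# and can be used to select rows in the usual way.
mask = clean["overallResult"].eq("PASS")
filtered = clean[mask]
print(filtered)
```

If either is_unique check comes back False on your real data, that is a likely place to start looking. Whether dropping the repeated columns or resetting the index is the right fix depends on how the duplicates got there.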

Answered By: user19077881