Remove duplicate rows from a pandas dataframe: Case Insenstive comparison

Question:

I want to remove duplicate rows from the dataframe based on values in two columns: Column1 and Column2

If dataframe is:

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

On using:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

I get:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

But using same code for dataframe (Cat and Bat cases changed)

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

I get:

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Expected Output:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

How can this comparison be done case insensitively?

Asked By: Ashu Grover

||

Answers:

You could convert the dataframe to lower case and then apply your solution.

Your dataframe.

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                   'Column2': ["'Bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

print(df)

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then apply lower string.

result_df = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['Column1', 'Column2'], keep='first')

print(result_df)
  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

Then filter df for upper case.

df.loc[result_df.index]

  Column1   Column2 Column3
0   'Cat'     'Bat'   'xyz'
1   'toy'  'flower'   'abc'
Answered By: run-out

First, convert all the string values to lowercase to make them case insensitive using the following line:

df[['Column1', 'Column2']] = df[['Column1', 'Column2']].applymap(lambda x: x.lower())

You will get the output as follows.

    Column1   Column2  Column3 
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Now apply the drop duplicates function.

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'

reference: here

Answered By: Saurabh Kansal

I figured it out. Create new uppercase columns and then use them to remove the duplicates. Once done, drop the uppercase columns.

df = pd.DataFrame({'Column1': ["'Cat'", "'toy'", "'cat'"],
                       'Column2': ["'Bat'", "'flower'", "'bat'"],
                       'Column3': ["'xyz'", "'abc'", "'lmn'"]})

df['Column1_Upper'] = df['Column1'].astype(str).str.upper()
df['Column2_Upper'] = df['Column2'].astype(str).str.upper()

This gives:

+---+---------+----------+---------+---------------+---------------+
|   | Column1 | Column2  | Column3 | Column1_Upper | Column2_Upper |
+---+---------+----------+---------+---------------+---------------+
| 0 | 'Cat'   | 'Bat'    | 'xyz'   | 'CAT'         | 'BAT'         |
| 1 | 'toy'   | 'flower' | 'abc'   | 'TOY'         | 'FLOWER'      |
| 2 | 'cat'   | 'bat'    | 'lmn'   | 'CAT'         | 'BAT'         |
+---+---------+----------+---------+---------------+---------------+

Finally, run the below to drop the duplicates and created columns.

result_df = df.drop_duplicates(subset=['Column1_Upper', 'Column2_Upper'], keep='first')
result_df.drop(['Column1_Upper', 'Column2_Upper'], axis=1, inplace=True)
print(result_df)

This gives:

+-----------------------------+
|   Column1   Column2 Column3 |
+-----------------------------+
| 0   'Cat'     'Bat'   'xyz' |
| 1   'toy'  'flower'   'abc' |
+-----------------------------+
Answered By: Ashu Grover
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.