How to get the difference of 2 lists in a Pandas DataFrame?

Question:

I'm new to pandas in Python, and I'm having trouble finding the difference between two lists within a pandas DataFrame.

Example input (columns separated by ;):

ColA; ColB  
A,B,C,D; B,C,D  
A,C,E,F; A,C,F  

Expected Output:

ColA; ColB; ColC  
A,B,C,D; B,C,D; A  
A,C,E,F; A,C,F; E  

What I want to do is similar to:

df['ColC'] = np.setdiff1d(df['ColA'].str.split(','), df['ColB'].str.split(','))

But it raises an error:

raise ValueError('Length of values does not match length of index',data,index,len(data),len(index))

Kindly advise

Asked By: Shin CY

Answers:

You can apply a lambda function to each row of the DataFrame to find the difference, like this:

import pandas as pd

# creating DataFrame (can also be loaded from a file)
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])

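# apply a lambda function to each row to get the difference
# (access the columns by label; positional access such as x[0] on a label-indexed row is deprecated in recent pandas)
df['ColC'] = df[['ColA','ColB']].apply(lambda x: [i for i in x['ColA'] if i not in x['ColB']], axis=1)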

Note that this computes the asymmetric difference ColA - ColB: the items of ColA that do not appear in ColB, in their original order.
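
The question stores each cell as a comma-separated string rather than a list, so the strings have to be split first. Here is a minimal sketch of the same asymmetric difference applied to the question's sample data. The helper name `list_difference` is made up for illustration, and the result is joined back into a string on the assumption that ColC should look like the other columns:

```python
import pandas as pd

# Sample data from the question: each cell is a comma-separated string.
df = pd.DataFrame({
    "ColA": ["A,B,C,D", "A,C,E,F"],
    "ColB": ["B,C,D", "A,C,F"],
})

def list_difference(a, b):
    """Return the items of `a` that are not in `b`, keeping the order of `a`.

    Hypothetical helper: both arguments are comma-separated strings.
    """
    exclude = set(b.split(","))
    return ",".join(item for item in a.split(",") if item not in exclude)

# Walk the two columns in parallel and build the new column.
df["ColC"] = [list_difference(a, b) for a, b in zip(df["ColA"], df["ColB"])]

print(df)
```

For the two sample rows this should give A and E in ColC, matching the expected output in the question.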

Answered By: Abdulrahman Bres

A much faster way to do this is a simple set subtraction:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])

# Find the difference
df['ColC'] = df['ColA'].map(set) - df['ColB'].map(set)

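As the DataFrame grows in number of rows, row-by-row operations such as apply with axis=1 become computationally expensive. Note that the result here is a set, so the original order of the items and any duplicates are not preserved.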
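
Here is a rough sketch of the set-based approach applied to the comma-separated strings from the question. It assumes that order and duplicates within a cell do not matter, and it sorts each result before joining it back into a string, which is a choice made here for illustration rather than something the question asks for:

```python
import pandas as pd

# Sample data from the question: each cell is a comma-separated string.
df = pd.DataFrame({
    "ColA": ["A,B,C,D", "A,C,E,F"],
    "ColB": ["B,C,D", "A,C,F"],
})

# Split each string into a list, then convert each list to a set.
set_a = df["ColA"].str.split(",").map(set)
set_b = df["ColB"].str.split(",").map(set)

# Subtracting two object Series applies the set difference element by element.
diff = set_a - set_b

# Sets are unordered, so sort the items before joining them into a string.
df["ColC"] = diff.map(lambda s: ",".join(sorted(s)))

print(df)
```

If the order of the items in ColA needs to be preserved, a list comprehension that filters ColA against a set built from ColB is the safer choice.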

Answered By: Strayhorn