Iterating over 2 large pandas DataFrames to remove duplicates

Question:

I have 2 dataframes with rather large amounts of data that I need to iterate through to check for bad cases. One frame has 100k cases and the other has 6.5m. I need to check the dfll dataframe (100k) against the wdnc dataframe (6.5m) and remove the rows where the number in dfll shows up ANYWHERE in wdnc.

Here I am simply trying to count how many times duplicates appear. The problem is that this takes EXTREMELY long. Is there a better way to perform this specific operation? I'm not set on using only pandas if this is a task too large for pandas, but I can't seem to find the solution elsewhere.

dfll = df.loc[df['Cell'] == 'N'].copy().reset_index().drop('index', axis=1)
wdnc = pd.read_fwf(path, names=['phone'])

counter = 0
for item in wdnc['phone']:
    for i in range(len(dfll)):
        if dfll['phone'][i] == item:
            counter += 1
print(f'Cases removed: {counter}')
Asked By: MotoMatt5040


Answers:

IIUC, this takes each value from dfll and looks throughout all of wdnc; if the value exists anywhere in any of the columns, it will keep that row, otherwise it will not.

check_list = df1['Column1'].to_numpy()
df2.loc[df2.apply(lambda c: c.isin(check_list)).any(axis=1)]
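For the specific case in the question, where only the single phone column matters on both sides, the same isin idea can be applied directly; a minimal sketch, assuming dfll and wdnc are the frames defined in the question and both phone columns share the same dtype:

# Values that must not appear (from the 6.5m-row frame).
bad_numbers = wdnc['phone'].to_numpy()

# Boolean mask: True where a dfll number also appears in wdnc.
mask = dfll['phone'].isin(bad_numbers)

print(f'Cases removed: {mask.sum()}')

# Keep only the rows whose number never shows up in wdnc.
dfll_clean = dfll.loc[~mask]

Note that read_fwf may infer one phone column as integers while the other frame holds strings; if so, cast both to a common type first (e.g. with .astype(str)), or isin will report no matches.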
Answered By: ArchAngelPwn