How to compare two text files in Python and delete duplicates?

Question:

I am new to Python. I have two text files, each containing a list of URLs. I want to compare the text1 file with the text2 file and remove from text1 every URL that also appears in text2.

My text files look like this:

text2

https://www.basketbal.vlaanderen/clubs/detail/bbc-wervik
https://www.basketbal.vlaanderen/clubs/detail/bbc-alsemberg
https://www.basketbal.vlaanderen/clubs/detail/koninklijk-basket-team-ion-waregem
https://www.basketbal.vlaanderen/clubs/detail/basket-poperinge

text1

https://www.basketbal.vlaanderen/clubs/detail/bbc-erembodegem
https://www.basketbal.vlaanderen/clubs/detail/dbc-osiris-okapi-aalst
https://www.basketbal.vlaanderen/clubs/detail/the-tower-aalst
https://www.basketbal.vlaanderen/clubs/detail/gsg-aarschot
https://www.basketbal.vlaanderen/clubs/detail/bbc-wervik # duplicate URL from text2
https://www.basketbal.vlaanderen/clubs/detail/bbc-alsemberg # duplicate URL from text2

After searching Google I found a few solutions, but they only remove duplicates within a single file.

pandas solution for removing duplicates:

df.drop_duplicates(subset ="link", keep ='first', inplace = True)  

Python regex:

import re
re.sub('<.*?>', '', string)  # this does not remove duplicates; it just replaces each match with an empty string ('')

I didn't find a better solution for comparing two text files in Python and removing the duplicates. If any URL in the text1 file matches a URL in the text2 file, the matching URL should be deleted from the text1 file. Any idea how to do this in Python?

Asked By: boyenec


Answers:

If the order of the lines in the file doesn't matter, you can do this:

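# Read each file into a set of lines; splitlines() drops the line endings,
# so a missing newline at the end of a file cannot hide a match.
with open("file1.txt") as f1:
    set1 = set(f1.read().splitlines())
with open("file2.txt") as f2:
    set2 = set(f2.read().splitlines())

# Lines that are in file1.txt but not in file2.txt (order is not preserved).
nondups = set1 - set2

with open("file1.txt", "w") as out:
    out.writelines(line + "\n" for line in nondups)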

This converts the contents of each file to a set of lines. Then it removes the common elements from the first set, and writes that result back to the first file.
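
If the order of the lines in file1.txt does matter, a minimal sketch along the same lines might keep file1.txt as a list and use a set only for the lookup. The helper name `remove_matching_lines` is hypothetical, and the sketch assumes that blank lines and surrounding whitespace can be discarded.

```python
def remove_matching_lines(target_path, reference_path):
    """Rewrite target_path without any line that also appears in reference_path."""
    # Assumption: surrounding whitespace and blank lines are not significant.
    with open(reference_path) as ref:
        seen = {line.strip() for line in ref if line.strip()}

    with open(target_path) as target:
        lines = [line.strip() for line in target if line.strip()]

    # Keep the original order; drop every line found in the reference file.
    kept = [line for line in lines if line not in seen]

    with open(target_path, "w") as out:
        out.writelines(line + "\n" for line in kept)
    return kept
```

Calling `remove_matching_lines("file1.txt", "file2.txt")` would rewrite file1.txt in place. Unlike the set difference, this keeps repeated lines inside file1.txt unless they also appear in file2.txt.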

Answered By: Barmar

If you want to use pandas to solve this, you can do it like this:

    import pandas as pd 
    
    df1 = pd.read_csv('file1.txt',names=['link'])
    df2 = pd.read_csv('file2.txt',names=['link'])
    df1[~df1.link.isin(df2.link)].to_csv('file1_clean.csv',index=False)

This block of code writes a new file, file1_clean.csv, containing the data from file1.txt in CSV format (with a `link` header row) but without the links that appear in file2.txt.

Answered By: bido_Boy