How do I delete duplicate lines and create a new file without duplicates?

Question:

I searched on here and found many postings, but none that I could work into the following code:

with open('TEST.txt') as f:
    seen = set()
    for line in f:
        line_lower = line.lower()
        # report a duplicate only if we saw the line before and it is not blank
        if line_lower in seen and line_lower.strip():
            print(line.strip())
        else:
            seen.add(line_lower)

I can find the duplicate lines inside my TEST.txt file, which contains hundreds of URLs.

However, I need to remove these duplicates and create a new text file with the duplicates removed and all the other URLs intact.

I will be checking this newly created file for 404 errors using r.status_code.

In a nutshell, I basically need help getting rid of duplicates so I can check for dead links. Thanks for your help.
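
For reference, the 404 check I have in mind looks roughly like this (a sketch using the requests library, which is where r.status_code comes from; TEST_no_dups.txt stands in for whatever the deduplicated file ends up being called):

    import requests

    # walk the deduplicated file and flag dead links
    with open('TEST_no_dups.txt') as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            try:
                r = requests.get(url, timeout=10)
            except requests.RequestException as exc:
                print(url, '-> failed:', exc)
                continue
            if r.status_code == 404:
                print(url, '-> 404')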

Asked By: user1719826

Answers:

This is something you could use:

import linecache

# count the lines in the input file
with open('pizza11.txt') as f:
    for i, l in enumerate(f):
        pass
    x = i + 1

# open the output file once instead of re-opening it for every write
with open('clean.txt', 'a') as clean:
    # the first line has nothing before it, so write it straight away
    clean.write(linecache.getline('pizza11.txt', 1))
    for i in range(2, x + 1):
        a = linecache.getline('pizza11.txt', i)
        # count how many of the earlier lines match line i
        k = 0
        for j in range(1, i):
            if linecache.getline('pizza11.txt', j) == a:
                k += 1
        # no earlier match means this is the first occurrence, so keep it
        if k == 0:
            clean.write(a)

With this you go through every line and check it against the lines before it; if there is no match among the previously written lines, the line gets added to the document.

pizza11.txt is the name of a text file on my computer with a ton of stuff in a list that I use to try things like this out; you would just need to change that to whatever your starting file is. Your output file with no duplicates would be clean.txt.

Answered By: Daniel López

Sounds simple enough, but what you did looks overcomplicated. I think the following should be enough:

with open('TEST.txt', 'r') as f:
    unique_lines = set(f.readlines())
with open('TEST_no_dups.txt', 'w') as f:
    f.writelines(unique_lines)

A couple things to note:

  • If you are going to use a set, you might as well dump all the lines at creation, and f.readlines(), which returns the list of all the lines in your file, is perfect for that.
  • f.writelines() will write a sequence of lines to your file, but using a set breaks the order of the lines. So if that matters to you, I suggest replacing the last line with f.writelines(sorted(unique_lines, key=...)), using whatever key you need (see the order-preserving sketch after this list).
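
If keeping the original file order matters more than sorting, here is a minimal order-preserving sketch, assuming Python 3.7+ where dict preserves insertion order; the file names are placeholders:

    with open('TEST.txt', 'r') as f:
        # dict.fromkeys keeps only the first occurrence of each line, in file order
        unique_lines = dict.fromkeys(f.readlines())
    with open('TEST_no_dups.txt', 'w') as f:
        f.writelines(unique_lines)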
Answered By: ursan

Simpler than linecache, and it doesn't shuffle the order like a set does:

unique_lines = []
with open('file_in.txt', 'r') as f:
    for line in f:
        # skip lines that were already kept; first occurrences stay in order
        if line in unique_lines:
            continue
        unique_lines.append(line)
with open('file_out.txt', 'w') as f:
    f.writelines(unique_lines)
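
One note on the design: the membership test (if line in unique_lines) scans the whole list for every line, which slows down on big files. Here is a sketch of the same approach with a set alongside the list for constant-time lookups; the file names are placeholders:

    seen = set()
    unique_lines = []
    with open('file_in.txt', 'r') as f:
        for line in f:
            # the set gives fast membership tests; the list preserves order
            if line in seen:
                continue
            seen.add(line)
            unique_lines.append(line)
    with open('file_out.txt', 'w') as f:
        f.writelines(unique_lines)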

Old post, but I just had this question too, and this page was the first result.

Answered By: Aiden Vrenna