Comparing two text files to EXCLUDE duplicates, and not line-by-line. I want to output the exclusion of any duplicate strings, specifically

Question:

I feel like this shouldn’t be that difficult, but for some reason it is, and I’m sleep deprived… so yeah. I’ve been able to neatly format and isolate the words of interest from two long .txt files. I’ve searched around StackOverflow and can only find line-by-line comparisons (which specifically seek out duplicate strings, and I’m trying to do the exact opposite), so that is not at all what I’m looking for. My objective is to check whether the same string appears ANYWHERE (as in, is duplicated) in either txt file (I’m comparing just two); the resulting output should exclude any and all duplicates and be written to a .txt file, or at least printed to the console. I’ve read the Python documentation and am aware of set(). I don’t mind tips on that, but is there another way to go about it?

Edit: each string is solely five numeric characters (and there are numerous such strings), if that helps. Thank you in advance!

Both .txt files I’m comparing essentially look like this (I have had to change it a bit, but it is the same exact idea).

1-94823 Words Words a numeric percentage time lapsed

2-84729 Words Words a numeric percentage time lapsed

The whole document is like that, line by line. However, there is some overlap between these two txt files, and I am solely interested in the five-digit number after the dash. I apologize if my title is/was unclear. I want to compare every instance of these five-digit numbers from both txt files, not just line by line, exclude any number that is duplicated anywhere in either of the two files, and output the rest (there are a fair number of duplicates).

Thanks,
Amelia

Asked By: Kiley

||

Answers:

Once you have a list of those five-digit numbers from each file, you can do this:

List of numbers:

list1 = [12345, 67890, 13579]
list2 = [54321, 67890, 11235]
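If the numbers still have to be pulled out of the files, something like the sketch below could build list1 and list2. It assumes every record starts with an index, a dash, and exactly five digits (as in "1-94823 Words ..."); the names `extract_ids` and `read_ids` and the file names are made up for illustration. The numbers are kept as strings so that leading zeros survive; the set operations below work the same way on strings.

```python
import re

# Assumption: each record starts with "<index>-<five digits>", e.g. "1-94823 ...".
ID_PATTERN = re.compile(r"^\s*\d+-(\d{5})\b")

def extract_ids(lines):
    """Return the five-digit numbers (as strings) found at the start of each line."""
    ids = []
    for line in lines:
        match = ID_PATTERN.match(line)
        if match:  # blank or differently shaped lines are skipped
            ids.append(match.group(1))
    return ids

def read_ids(path):
    """Read a text file and extract its five-digit numbers."""
    with open(path, encoding="utf-8") as handle:
        return extract_ids(handle)

# Hypothetical file names; replace with the real paths.
# list1 = read_ids("file1.txt")
# list2 = read_ids("file2.txt")
```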

Create the sets:

set1 = set(list1)
set2 = set(list2)

Get the union without the intersection (the symmetric difference):

non_duplicates_list = list(set1.symmetric_difference(set2))

and the result is (the order may differ, since sets are unordered):

[11235, 13579, 54321, 12345]
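Since the question also asks for a way other than set(), here is a rough sketch using collections.Counter. It keeps the numbers in first-seen order and drops anything that occurs more than once anywhere. Note that this is slightly different from the symmetric difference: a number repeated inside one file but missing from the other is kept by the set version and dropped here. The names `unique_across` and `write_values` and the output file name are hypothetical.

```python
from collections import Counter

def unique_across(list1, list2):
    """Keep values that occur exactly once across both lists, in first-seen order."""
    counts = Counter(list1)
    counts.update(list2)
    return [value for value in list1 + list2 if counts[value] == 1]

def write_values(path, values):
    """Write one value per line to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for value in values:
            handle.write(f"{value}\n")

list1 = [12345, 67890, 13579]
list2 = [54321, 67890, 11235]
non_duplicates = unique_across(list1, list2)
print(non_duplicates)  # [12345, 13579, 54321, 11235]

# Hypothetical output file name.
# write_values("non_duplicates.txt", non_duplicates)
```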

If there is any problem let me know 🙂

Answered By: itogaston