Anagram from large file

Question:

I have a file having 10,000 word on it. I wrote a program to find anagram word from that file but its taking too much time to get to output. For small file program works well. Try to optimize the code.

count=0
i=0
j=0
with open('file.txt') as file:
  lines = [i.strip() for i in file]
  for i in range(len(lines)):
      for j in range(i):
          if sorted(lines[i]) == sorted(lines[j]):
              #print(lines[i])
              count=count+1
              j=j+1
              i=i+1
print('There are ',count,'anagram words')
Asked By: Nabeinz kc

||

Answers:

Well it is unclear whether you account for duplicates or not, however if you don’t you can remove duplicates from your list of words and that will spare you a huge amount of runtime in my opinion. You can check for anagrams and then use sum() to get the their total number. This should do it:

def get_unique_words(lines):
    unique = [] 
    for word in " ".join(lines).split(" "): 
        if word not in unique:
            unique.append(word)
    return unique 

def check_for_anagrams(test_word, words):
    return sum([1 for word in words if (sorted(test_word) == sorted(word) and word != test_word)])

with open('file.txt') as file:
  lines = [line.strip() for line in file]


unique = get_unique_words(lines)
count  = sum([check_for_anagrams(word, unique) for word in unique])

print('There are ', count,'unique anagram words aka', int(count/2), 'unique anagram couples')
Answered By: SuperKogito

I don’t fully understand your code (for example, why do you increment i and j inside the loop?). But the main problem is that you have a nested loop, which makes the runtime of the algorithm O(n^2), i.e. if the file becomes 10 times as large, the execution time will become (approximately) 100 times as long.

So you need a way to avoid that. One possible way is to store the lines in a smarter way, so that you don’t have to walk through all lines every time. Then the runtime becomes O(n). In this case you can use the fact that anagrams consist of the same characters (only in a different order). So you can use the “sorted” variant as a key in a dictionary to store all lines that can be made from the same letters in a list under the same dictionary key. There are other possibilities of course, but in this case I think it works out quite nice 🙂

So, fully working example code:

#!/usr/bin/env python3

from collections import defaultdict

d = defaultdict(list)
with open('file.txt') as file:
    lines = [line.strip() for line in file]
    for line in lines:
        sorted_line = ''.join(sorted(line))
        d[sorted_line].append(line)

anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams

# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example your not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')

UPDATE

Without duplicates, and without using collections (although I strongly recommend to use it):

#!/usr/bin/env python3

d = {}
with open('file.txt') as file:
    lines = [line.strip() for line in file]
    lines = set(lines)  # remove duplicates
    for line in lines:
        sorted_line = ''.join(sorted(line))
        if sorted_line in d:
            d[sorted_line].append(line)
        else:
            d[sorted_line] = [line]

anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams

# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example your not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')
Answered By: wovano