Decrease time of a function (Python)

Question:

I’m trying to create a function in python that from a list of strings will return me a dict where the key(index) shows the most repetitive character for each index between all the strings. for example a list1 = [‘one’, ‘two’, ‘twin’, ‘who’] should return index 0=t index 1=w index 2=o index 3=n in fact the most frequent character at the index 1 between all the string is ‘w’.
I found a solution but if I have lists with thousands of strings inside it will require too much time to perform. I would like to know if you can give me some help to decrease the time of execution.

Here is what I tried to do but seems too slow to perform with lists of thousands strings inside

list1 = ['one', 'two', 'twin', 'who']

width = len(max(list1, key=len))

chars = {}

for i, item in enumerate(zip(*[s.ljust(width) for s in list1])):
    set1 = set(item)
    if ' ' in set1:
        set1.remove(' ')
    chars[i] = max(set1, key=item.count)
print(chars)
Asked By: terrier99uk

||

Answers:

Whether something is quick enough is a matter of use case, but this solution uses a couple of seconds to go through the default wordlist available under OS X.

Python’s collections.Counter implements a counter object for you, so you don’t have keep track of the counts of multiple possible values yourself.

I’ve paired it with defaultdict, which intializes a key with a function if the key is undefined – so that if we haven’t already seen the index we’re updating the count for, it gets initialized to a Counter object that we then update.

from collections import defaultdict, Counter


with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()
    letters = defaultdict(Counter)

    for word in words:
        for idx, letter in enumerate(word):
            letters[idx].update((letter, ))

for idx, counter in letters.items():
    print(idx, counter.most_common(1))

Whether this is quick enough depends on your use case as mentioned; it can be done a lot quicker if necessary, but it’s probably quick enough. For 235 886 words the runtime is:

python3 letterfreq.py  2.67s user 0.04s system 99% cpu 2.734 total

This assumes that every word is lowercased, if not, lowercase it before adding it to your Counter object.

If you want to implement it without using the Counter or defaultdict parts of the standard library (which are just helper functionality to avoid reimplementing the same small code repeatedly), you can do the exact thing yourself manually:

with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()
    letter_positions = {}

    for word in words:
        for idx, letter in enumerate(word):
            if idx not in letter_positions:
                letter_positions[idx] = {}

            if letter not in letter_positions[idx]:
                letter_positions[idx][letter] = 0

            letter_positions[idx][letter] += 1

final_dict = {}

for idx, counts in letter_positions.items():
    most_popular = sorted(counts.items(), key=lambda v: v[1], reverse=True)
    print(idx, most_popular)
    final_dict[idx] = most_popular[0][0]

print(final_dict)

Then pick as many entries as necessary from most_popular when going through the list afterwards.

Since we’re no longer using the defaultdict and Counter abstractions, our running time is now about a third of the previous one:

python3 letterfreq2.py  1.08s user 0.03s system 98% cpu 1.124 total

It’s usually a good idea to go through what you’re trying to do and formulate a strategy – i.e. "ok, I need to keep track of how many times a letter has appeared in this location .. so for that I need some way to keep values for each index .. and then for each letter ..".

Answered By: MatsLindh

I just make some improvements based on your algorithm.

First, you can use itertools.zip_longest() instead of zip() to remove the need of ljust() and the width variable:

from itertools import zip_longest

list1 = ['one', 'two', 'twin', 'who']

chars = {}

for i, item in enumerate(zip_longest(*list1)):
    set1 = set(item)
    if None in set1:
        set1.remove(None)
    chars[i] = max(set1, key=item.count)
print(chars)

Then, replace max(set1, key=item.count) with a more efficent way Counter(item).most_common(1)[0][0], combined with or set1.most_common(2)[1][0] to filter None values

from itertools import zip_longest
from collections import Counter

list1 = ['one', 'two', 'twin', 'who']

chars = {}

for i, item in enumerate(zip_longest(*list1)):
    set1 = Counter(item)
    chars[i] = set1.most_common(1)[0][0] or set1.most_common(2)[1][0]
print(chars)

As itertools and collections are Python built-in modules, you can import them directly without pip install them.

Answered By: Thaiminhpv
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.