Efficiently build a string from characters most frequent at i-th index of all the strings in a list

Question

I need to define a function that, given a list of strings, returns a string composed by the characters that are most frequent at the i-th position of every string. If multiple characters appear at the maximum frequency, the one which comes first alphabetically is chosen. External libraries are not allowed.

Example: ['hello', 'train', 'house', 'tank', 'car'] -> 'haaie'

h: at index 0 we have the characters ['h', 't', 'h', 't', 'c']. 'h' and 't' appear with the maximum frequency (2), but 'h' comes first in the alphabet, so the first character is 'h'.
a: at index 1 we have the characters ['e', 'r', 'o', 'a', 'a']. 'a' appears with the maximum frequency (2), so the second character is 'a'.
a: at index 2 we have the characters ['l', 'a', 'u', 'n', 'r']. All characters appear with the maximum frequency (1), but 'a' comes before the others in the alphabet, so the third character is 'a'.

This continues until the final string is as long as the longest string in the list.
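To make the remaining positions concrete, here is a minimal sketch that lists the characters found at each index of the example. The helper name `columns` is only for illustration and is not part of the code in the question; strings shorter than the current index simply contribute nothing.

```python
def columns(words):
    """Group the characters of all words by index (hypothetical helper)."""
    longest = max((len(w) for w in words), default=0)
    # A word only contributes to index i if it is long enough.
    return [[w[i] for w in words if i < len(w)] for i in range(longest)]


for i, col in enumerate(columns(['hello', 'train', 'house', 'tank', 'car'])):
    print(i, col)
# 0 ['h', 't', 'h', 't', 'c']
# 1 ['e', 'r', 'o', 'a', 'a']
# 2 ['l', 'a', 'u', 'n', 'r']
# 3 ['l', 'i', 's', 'k']
# 4 ['o', 'n', 'e']
```

At indices 3 and 4 every character appears once, so the alphabetically first ones ('i' and 'e') are chosen, which gives the final 'haaie'.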

My current approach is to iterate through each character of each word and append it to a list holding all the characters found at index i across the strings. These lists are the values of a dictionary (chars) whose keys are the indices. The function then returns a string built by joining the most frequent character of each list in chars.values().

def f(words: list) -> str:
    chars = dict()
    for word in words:
        for i, char in enumerate(word):
            chars.setdefault(i, list()).append(char)
    return ''.join([max(sorted(value), key = lambda x: value.count(x)) for value in chars.values()])

This code works, but it is extremely slow (I'm working with very large lists, 100k+ strings). I know the problem is the nested for loop, but I can't figure out another approach; I've tried everything I could come up with. Hope you can help me. Thanks in advance and have a nice day.

Asked By: 7c88


Answer 1

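Sorting is O(n*log(n)), and calling value.count for every element repeats a full scan of the list each time. You can modify your code to run in linear time by computing the counts during iteration and then using min with the key (-count, char): negating the count puts the most frequent character first, and ties are broken in alphabetical order: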

def f(words: list) -> str:
    chars = {}
    for word in words:
        for i, char in enumerate(word):
            d = chars.setdefault(i, dict())
            d[char] = d.get(char, 0)+1
    return ''.join([min(d.items(), key=lambda x: (-x[1], x[0]))[0] for d in chars.values()])


f(['hello', 'train', 'house', 'tank', 'car'])

Output:

'haaie'
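As a sanity check, here is a small sketch that compares the question's version with the counting version on random input. The names `f_slow` and `f_fast`, the seed, and the input sizes are arbitrary choices made for this illustration only.

```python
import random
import string


def f_slow(words):
    # The question's version: sorted() plus list.count() for every character.
    chars = {}
    for word in words:
        for i, char in enumerate(word):
            chars.setdefault(i, []).append(char)
    return ''.join(max(sorted(v), key=v.count) for v in chars.values())


def f_fast(words):
    # The counting version: one dict of character counts per index.
    chars = {}
    for word in words:
        for i, char in enumerate(word):
            d = chars.setdefault(i, {})
            d[char] = d.get(char, 0) + 1
    # Highest count wins; ties go to the alphabetically smaller character.
    return ''.join(min(d.items(), key=lambda x: (-x[1], x[0]))[0]
                   for d in chars.values())


random.seed(0)  # arbitrary seed so the run is repeatable
words = [''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 8)))
         for _ in range(500)]
assert f_slow(words) == f_fast(words)
```

Both functions rely on dictionaries preserving insertion order (guaranteed since Python 3.7), so the values come back in index order.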

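For completeness, here is a pythonic solution using itertools.zip_longest and collections.Counter (both are part of the standard library). zip_longest pads shorter strings with None, which the "if x" filter skips: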

l = ['hello', 'train', 'house', 'tank', 'car']

from itertools import zip_longest
from collections import Counter

''.join([min(Counter(x for x in z if x).items(), key=lambda x: (-x[1], x[0]))[0] for z in zip_longest(*l)])
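The one-liner can be hard to read, so here is the same idea unrolled into a function. This is only a sketch: the name `most_frequent_per_index` is made up for this example, and it tests `is not None` explicitly instead of relying on truthiness.

```python
from collections import Counter
from itertools import zip_longest


def most_frequent_per_index(words):
    result = []
    # zip_longest yields one tuple per index, padding short words with None.
    for column in zip_longest(*words):
        counts = Counter(c for c in column if c is not None)
        # Highest count first, then the alphabetically smallest character.
        best, _ = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result.append(best)
    return ''.join(result)


print(most_frequent_per_index(['hello', 'train', 'house', 'tank', 'car']))  # haaie
```

Note that `zip_longest(*words)` unpacks the whole list into arguments, which is fine for a list of strings but does mean the list must fit in memory.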

Answered By: mozway

Answer 2

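You can also try this approach, which pads every string to the same length and counts each letter per column. Note that it assumes the strings contain only lowercase ASCII letters; any other character is never counted.

l = ['hello', 'train', 'house', 'tank', 'car']
n = len(max(l, key=len))  # length of the longest string
l_ = [i+' '*(n-len(i)) for i in l]  # pad every string with spaces up to length n
# for each column, count every lowercase letter; on ties the smaller letter wins
''.join(max([(k.count(r), r) for r in 'zyxwvutsrqponmlkjihgfedcba'], key=lambda x: (x[0], -ord(x[1])))[1] for k in zip(*l_))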

Answered By: assume_irrational_is_rational
