Finding the maximum difference between each string in a list

Question:

This was an interview question given to me:

Given a list of n strings, each of length l, where every position holds either the character ‘a’ or ‘b’, how would I compute, for each of the n strings, its maximum difference to any other string in the list (the difference being the number of positions at which two strings differ), in the lowest possible time complexity?

An example of this would be ["abaab", "abbbb","babba"]

The output would be [5, 3, 5] as the difference between the first and third strings is 5 (none of the characters are the same), so the maximum difference between string 1 and any other string is 5. Similarly, the difference between string 2 and string 3 is 3, which is greater than the difference between string 2 and 1. And the difference between string 3 and string 1 is 5.

The limit for l is 20

I tried the following code (where l and n are named accordingly and strings is the list of strings). The variables in this code are those of the sample case, but I would like a generalized solution.

n = 3
l = 5
abmap = {}
strings = ["abaab", "abbbb","babba"]

for i in range(l):
    abmap[i] = {"a": [], "b": []}

for i in range(n):
    for j in range(l):
        if strings[i][j] == "a":
            abmap[j]["a"].append(i)

        else:
            abmap[j]["b"].append(i)

for string in strings:
    differences = n * [0]
    for i in range(l):
        if string[i] == "a":
            for index in abmap[i]["b"]:
                differences[index] += 1

        else:
            for index in abmap[i]["a"]:
                differences[index] += 1

    print(max(differences))

However, this solution is O(n²·l). The interviewer asked me to optimize it further (for example to O(l·n·log(n))). How can I accomplish this?

The time limit for this is 15 seconds, and n is less than 100000.

Asked By: Arjun Senthil

||

Answers:

(With the now added limits for n and time, this is no longer viable. See my newer answer instead.)

The limit 20 for L (renamed for readability) suggests that they wanted you to convert each string to an L-bit number, and then compute the difference of two of them by xoring them and asking for the popcount, to have O(1) for each difference and thus O(nL+n²) overall.

strings = ["abaab", "abbbb", "babba"]

table = str.maketrans('ab', '01')
numbers = [int(s.translate(table), 2)
           for s in strings]

for x in numbers:
    print(max((x ^ y).bit_count()
              for y in numbers))
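
To make the encoding concrete, here is a small worked sketch on two of the sample strings. It assumes the same mapping as above ('a' to 0, 'b' to 1) and uses `bin(...).count("1")` as the popcount, since `int.bit_count()` is only available from Python 3.10 onward.

```python
# Worked example of the encoding: 'a' -> 0, 'b' -> 1.
table = str.maketrans("ab", "01")
x = int("abaab".translate(table), 2)   # 0b01001 == 9
y = int("babba".translate(table), 2)   # 0b10110 == 22

diff = x ^ y                           # 0b11111: a 1 bit wherever the strings differ
distance = bin(diff).count("1")        # popcount that also works before Python 3.10
print(x, y, bin(diff), distance)       # 9 22 0b11111 5
```

The XOR has a 1 bit at exactly the positions where the two strings differ, so counting those bits gives the difference of 5 stated in the question.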

Answered By: Kelly Bundy

Use a bitset to represent each team, where each bit corresponds to the breed of a cow (0 for Guernsey and 1 for Holstein). We can then calculate the difference between two teams by computing the bitwise XOR of their bitsets and counting the number of set bits (i.e., the number of positions where the two teams differ).

To find the maximum difference for each team, we can iterate over all other teams and compute the difference using the above method. We keep track of the maximum difference seen so far and update it if we find a larger difference.

To make the code faster, we can use the fact that the number of possible bitsets for a team of size C is 2^C, which is at most 2^18 = 262144. This means we can precompute the differences between all pairs of bitsets and store them in a lookup table.

We can then use this lookup table to quickly compute the difference between any two teams by simply looking up the stored value. Note that the code below still builds the table by visiting every pair of teams, so its time complexity remains O(N^2) (and so does its memory use), which is not an asymptotic improvement over the previous approach.
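
One possible reading of the lookup-table idea is a table of popcounts indexed by XOR value, which needs only 2^C entries. The sketch below is an illustration under that assumption, not the answer's own code; the name `build_popcount_table` is hypothetical.

```python
def build_popcount_table(C):
    """Return a list where table[v] is the number of 1 bits in v, for 0 <= v < 2**C."""
    table = [0] * (1 << C)
    for v in range(1, 1 << C):
        # v has the same 1 bits as v >> 1, plus its own lowest bit.
        table[v] = table[v >> 1] + (v & 1)
    return table

popcount = build_popcount_table(5)
print(popcount[0b01001 ^ 0b10110])  # 5
print(popcount[0b01111 ^ 0b10110])  # 3
```

Such a table makes each individual difference a single list lookup, but it does not reduce the number of pairs that have to be compared.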

Python Code:

from collections import defaultdict

C, N = map(int, input().split())

# read in the teams and convert them to bitsets
teams = []
for i in range(N):
    team_str = input().strip()
    team_bits = int(''.join(['0' if c == 'G' else '1' for c in team_str]), 2)
    teams.append(team_bits)

# precompute the differences between all pairs of bitsets
lookup = defaultdict(dict)
for i in range(N):
    for j in range(i + 1, N):
        diff = bin(teams[i] ^ teams[j]).count('1')
        lookup[i][j] = diff
        lookup[j][i] = diff

# compute the maximum difference for each team
for i in range(N):
    max_diff = max(lookup[i].values())
    print(max_diff)

Answered By: Jarvan Lormand

As in my earlier answer (not viable anymore now with the limits), I convert each string to an L-bit number. They’re a subset of the 2^L ≤ 2^20 possible L-bit numbers. Think of it as a graph of 2^L nodes, with an edge between two nodes if they differ in exactly one bit; the input numbers are a subset of those nodes.

Now… for any input number, what is the farthest input number? We can solve that by instead looking at the inverted number (all L bits flipped, i.e., distance L away) and asking what is the closest input number to that. This works because every bit in which some number y agrees with the inverted number is a bit in which y differs from the original, so the distance from the original to y is L minus the distance from the inverted number to y. So we run a parallel BFS (breadth-first search) from the input numbers. We mark them as having distance 0. Then we mark all numbers one bit change away with distance 1. Then mark all numbers one further bit change away with distance 2. And so on. At the end, for each input number, we look at the distance of the inverted number, and subtract that from L.
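
The idea can also be sketched in a more explicit form with a queue. This is a minimal illustration that assumes the same 'a' to 0, 'b' to 1 encoding; the function name `max_distances` and the explicit `mask` are choices made for this sketch.

```python
from collections import deque

def max_distances(strings):
    """For each string, return the largest Hamming distance to any string in the list."""
    L = len(strings[0])
    mask = (1 << L) - 1                      # L one-bits, used to flip all bits of a number
    table = str.maketrans("ab", "01")
    numbers = [int(s.translate(table), 2) for s in strings]

    # Multi-source BFS over the L-bit hypercube: dist[v] ends up as the
    # Hamming distance from v to the nearest input number.
    dist = [-1] * (1 << L)
    queue = deque()
    for x in set(numbers):
        dist[x] = 0
        queue.append(x)
    while queue:
        x = queue.popleft()
        for i in range(L):
            y = x ^ (1 << i)                 # neighbor differing in bit i
            if dist[y] == -1:
                dist[y] = dist[x] + 1
                queue.append(y)

    # Farthest input from x = L minus the nearest input to the complement of x.
    return [L - dist[x ^ mask] for x in numbers]

print(max_distances(["abaab", "abbbb", "babba"]))  # [5, 3, 5]
```

Each of the 2^L nodes is dequeued once and tries L neighbors, so the work is O(2^L · L) plus O(n · L) for the conversion, roughly 21 million neighbor checks at L = 20 regardless of n.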

Benchmark results: the worst case takes about 3.5 seconds, well under the 15-second limit:

n=1000 L=20:
 0.14 s  solution1
 2.63 s  solution2

n=10000 L=20:
15.01 s  solution1
 3.01 s  solution2

n=100000 L=20:
 3.47 s  solution2

Full code (solution1 is my old one, solution2 is the one I presented above):

def solution1(strings):
    table = str.maketrans('ab', '01')
    numbers = [
        int(s.translate(table), 2)
        for s in strings
    ]
    return [
        max((x ^ y).bit_count() for y in numbers)
        for x in numbers
    ]

def solution2(strings):
    table = str.maketrans('ab', '01')
    numbers = [
        int(s.translate(table), 2)
        for s in strings
    ]
    L = len(strings[0])
    bits = [2**i for i in range(L)]
    dist = [None] * 2**L
    for x in numbers:
        dist[x] = 0
    horizon = numbers
    d = 1
    while horizon:
        horizon = [
            y
            for x in horizon
            for bit in bits
            for y in [x ^ bit]
            if dist[y] is None
            for dist[y] in [d]
        ]
        d += 1
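    # ~x == -x - 1, so as a list index it selects position 2**L - 1 - x,
    # which is x with all L bits flipped.
    return [L - dist[~x] for x in numbers]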

funcs = solution1, solution2

import random
from time import time

# Generate random input
def gen(n, L):
    return [
        ''.join(random.choices('ab', k=L))
        for _ in range(n)
    ]

# Correctness
for _ in range(100):
    strings = gen(100, 10)
    expect = funcs[0](strings)
    for f in funcs:
        result = f(strings)
        assert result == expect

# Speed
def test(n, L, funcs):
    print(f'{n=} {L=}:')

    for _ in range(1):
        strings = gen(n, L)
        expect = None
        for f in funcs:
            t = time()
            result = f(strings)
            print(f'{time()-t:5.2f} s ', f.__name__)
            if expect is None:
                expect = result
            else:
                assert result == expect
            del result
    print()

test(1000, 20, [solution1, solution2])
test(10000, 20, [solution1, solution2])
test(100000, 20, [solution2])

Answered By: Kelly Bundy