Replace two adjacent duplicate characters in string with the next character in alphabet

Question:

I want to write a function that finds the first occurrence of two adjacent identical characters, replaces them with the single character that comes next in the alphabet, and repeats this until there are no adjacent duplicates left. In the case of "zz" it should wrap around to "a". The string can only include the characters a-z, that is, no capital letters or non-alphabetical characters. I have written a function that does this, but it is not efficient enough for a very long string.

def solve(s):
    i = 1
    while i < len(s):
        if s[i] == s[i-1]:
            r = s[i+1:]
            l = s[:i-1]
            if s[i] == "z":
                x = "a"
            else:
                x = chr(ord(s[i])+1)
            i = 1
            s = l+x+r
        else:
            i += 1
    return s

So, for example, if s = 'aabbbc', the function should work like aabbbc –> bbbbc –> cbbc –> ccc –> dc and finally return dc. How can I make it more efficient?

Edit: for example if s = 'ab'*10**4 + 'cc'*10**4 + 'dd'*10**4 this function is taking a lot of time.

Asked By: user932895


Answers:

As a trivial optimisation: instead of the hard reset i = 1, you can use a softer reset i = i-1. Indeed, if there was no duplicate between 1 and i before the transformation, then there still won’t be any duplicate between 1 and i-1 after the transformation.

def solve(s):
    i = 1
    while i < len(s):
        if s[i] == s[i-1]:
            l = s[:i-1]
            r = s[i+1:]  # I swapped these two lines because I read left-to-right
            if s[i] == "z":
                x = "a"
            else:
                x = chr(ord(s[i])+1)
            s = l+x+r
            i = max(1, i-1)  # important change is here
        else:
            i += 1
    return s

s = 'ab'*10**4 + 'cc'*10**4 + 'dd'*10**4
t = solve(s)

print(t[2*10**4:])
# rqpnlih

print(t == 'ab'*10**4 + 'rqpnlih')
# True
Answered By: Stef

I think that in this case it is easier to think of appending new letters to the right of the existing string one by one, and if at any point we encounter two identical characters at the end, then we remove both of them and add the next character in alphabetical order.

This way you avoid indexing errors, and you avoid copying the entire string after each modification.

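Also, notice that each character of s is pushed onto the stack once, and every merge pops one previously pushed element, so the number of pops can never exceed the number of pushes. The complexity is therefore O(len(s)).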

def solve(s):
    def next_char(char):
        if char == "z":
            return "a"
        else:
            return chr(ord(char)+1)
        
    stack = []
    for char in s:
        while len(stack) > 0 and stack[-1] == char:
            stack.pop()
            char = next_char(char)
        stack.append(char)        

    return ''.join(stack)
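
To make the linear bound concrete, here is a minimal sketch that instruments the same loop with counters. The name solve_counting and the two counters are illustrative additions, not part of the answer's code. It assumes the input contains only a-z.

```python
def solve_counting(s):
    """Stack-based reduction that also reports how many pushes and pops it did.

    Hypothetical helper for illustration; the logic mirrors solve() above.
    """
    def next_char(char):
        return "a" if char == "z" else chr(ord(char) + 1)

    stack = []
    pushes = pops = 0
    for char in s:
        # Merge with the top of the stack for as long as it matches.
        while stack and stack[-1] == char:
            stack.pop()
            pops += 1
            char = next_char(char)
        stack.append(char)
        pushes += 1
    return "".join(stack), pushes, pops


print(solve_counting("aabbbc"))
# ('dc', 6, 4)
```

There is exactly one push per input character, and the number of pops equals the number of pushes minus the length of the result, so the total work stays proportional to len(s).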
Answered By: Fly_37

You can build the result in a list that you use like a stack, adding letters that are different from the last one and popping back with the next letter when they match:

def compact(s):
    result = []
    nextChar = str.maketrans({i+96:i%26+97 for i in range(1,27)})
    for c in s:
        while result and result[-1] == c:
            c = result.pop(-1).translate(nextChar)
        result.append(c)
    return "".join(result)
    

output:

print(compact("aabbbc"))
# dc

s = 'ab'*10**4 + 'cc'*10**4 + 'dd'*10**4
print(compact(s)[-11:])
# ababrqpnlih
Answered By: Alain T.

The main cause of the slowdown here is O(N^2) behaviour, which results from:

  • Repeated slicing and re-creation of s (a new string has to be allocated and copied every time, because Python’s strings are immutable).

  • Resetting iteration to the beginning of the newly-created s whenever a duplicate letter is found, such that it must read past all the non-duplicates again. For example, when reducing 'abcdeffhh', after combining the 'ff' into 'g', the next iteration will scan past all of that before considering the 'hh'. (This is already pointed out in Stef’s answer.)

I thought of the same basic approach as Alain T., although I can offer some micro-optimizations. I first want to explain why this stack-based approach is so much faster:

  • .pop from the end of a list is O(1); the list is mutable, so it just needs to clear out a reference to the last object and update its element count. (However, removing elements from anywhere else involves shifting all the following elements down to fill the gap.) Similarly, because the new elements are going on to a separate data structure, there’s never a need to re-create the s string.

  • Iterating this way allows us to use natural, Pythonic iteration without having to index back in through an auxiliary range object, since only one element from the input is needed at a time (it will be compared to the top of the stack). (Normally, iterating over overlapping pairs of a list can also be done much more elegantly, but standard approaches assume that the input won’t be modified. In general, algorithms that try to modify an input sequence while iterating over it are a bad idea and hard to get right.)

  • Finally, the actual replacement with "the next letter, looping around at z" can be simplified by preparing a lookup. The string type provides a static method maketrans which can create such a lookup easily; it’s suitable for use by the translate method of strings, which applies the lookup to each character of the string. However, since we are only working with single-character strings for this part, it would equally work to build a dictionary that directly maps letters to letters, and use it normally. (str.maketrans produces a dictionary, but it maps integers – the results of ord, basically – to other integers or None.)
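
As a small illustration of the last point, the following sketch builds both kinds of lookup side by side. The variable names table and next_letter are chosen here for illustration only.

```python
import string

letters = string.ascii_lowercase
shifted = letters[1:] + letters[:1]  # 'bcdefghijklmnopqrstuvwxyza'

# Option 1: str.maketrans builds a dict keyed by code points (integers),
# intended for use with str.translate.
table = str.maketrans(letters, shifted)
print(table[ord("a")] == ord("b"))
# True
print("z".translate(table))
# a

# Option 2: a plain dict that maps letters directly to letters.
next_letter = dict(zip(letters, shifted))
print(next_letter["z"])
# a
```

For single characters the plain dict avoids the method call to translate, which is the micro-optimization used in solve_karl below.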


Of course, "practicality beats purity", so none of this matters without testing. I made a file q75690334.py to test the performance of all these approaches:

def solve_user932895(s):
    i = 1
    while i < len(s):
        if s[i] == s[i-1]:
            r = s[i+1:]
            l = s[:i-1]
            if s[i] == "z":
                x = "a"
            else:
                x = chr(ord(s[i])+1)
            i = 1
            s = l+x+r
        else:
            i += 1
    return s

def solve_stef(s):
    i = 1
    while i < len(s):
        if s[i] == s[i-1]:
            l = s[:i-1]
            r = s[i+1:]
            if s[i] == "z":
                x = "a"
            else:
                x = chr(ord(s[i])+1)
            s = l+x+r
            i = max(1, i-1)
        else:
            i += 1
    return s

def solve_alain(s):
    result = []
    nextChar = str.maketrans({i+96:i%26+97 for i in range(1,27)})
    for c in s:
        while result and result[-1] == c:
            c = result.pop(-1).translate(nextChar)
        result.append(c)
    return "".join(result)

# with my suggestion about the lookup, more readable initialization,
# and some other micro-optimizations
abc = 'abcdefghijklmnopqrstuvwxyz'
g_lookup = dict(zip(abc, abc[1:] + abc[:1]))
def solve_karl(s):
    # put a dummy element in the list to simplify the while loop logic
    result = [None]
    # faster attribute access with a local
    lookup = g_lookup 
    for c in s:
        while result[-1] == c:
            c = lookup[result.pop(-1)]
        result.append(c)
    return "".join(result[1:])

def make_test_string(n):
    return 'ab'*n + 'cc'*n + 'dd'*n

if __name__ == '__main__':
    s = make_test_string(10**3)
    assert solve_user932895(s) == solve_alain(s) == solve_karl(s) == solve_stef(s)

Verifying correctness:

$ python q75690334.py

(no assertion was raised)

Timing with n == 10**3 at the command line:

$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**3)' -- 'q75690334.solve_user932895(s)'
1 loop, best of 5: 1.34 sec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**3)' -- 'q75690334.solve_stef(s)'
50 loops, best of 5: 5.56 msec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**3)' -- 'q75690334.solve_alain(s)'
200 loops, best of 5: 1.42 msec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**3)' -- 'q75690334.solve_karl(s)'
500 loops, best of 5: 901 usec per loop

However, an important note I want to make here is that Stef’s approach still involves O(N^2) behaviour (because of the slicing) – it just takes longer to show up:

$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**4)' -- 'q75690334.solve_stef(s)'
1 loop, best of 5: 215 msec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**4)' -- 'q75690334.solve_alain(s)'
20 loops, best of 5: 14.3 msec per loop
$ python -m timeit -s 'import q75690334' -s 's = q75690334.make_test_string(10**4)' -- 'q75690334.solve_karl(s)'
50 loops, best of 5: 9.14 msec per loop

Here we see the stack-based approaches taking roughly 10 times as long for 10 times as much input, but the improved slicing approach taking nearly 40 times as long. (Asymptotically, it should end up taking 100 times as long; but inputs long enough to make that clear would take far too long to test.)
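
The same comparison can also be scripted instead of run by hand. The sketch below is an assumption-laden illustration: solve_stack is a stand-in with the same behaviour as the stack-based functions above, and growth_ratio is a hypothetical helper that reports how much longer a function takes when the input is 10 times larger. Actual numbers depend on the machine, so no expected output is shown.

```python
import timeit


def solve_stack(s):
    """Stack-based reduction (stand-in for the stack answers; assumes a-z input)."""
    result = []
    for c in s:
        while result and result[-1] == c:
            result.pop()
            c = "a" if c == "z" else chr(ord(c) + 1)
        result.append(c)
    return "".join(result)


def make_test_string(n):
    return 'ab'*n + 'cc'*n + 'dd'*n


def growth_ratio(func, n, factor=10, repeat=5):
    """Return time(func, input size factor*n) / time(func, input size n).

    Uses the best of `repeat` single runs for each size.
    """
    small = make_test_string(n)
    large = make_test_string(n * factor)
    t_small = min(timeit.repeat(lambda: func(small), number=1, repeat=repeat))
    t_large = min(timeit.repeat(lambda: func(large), number=1, repeat=repeat))
    return t_large / t_small


if __name__ == "__main__":
    # A ratio near 10 suggests linear growth; a ratio heading
    # towards 100 suggests quadratic growth.
    print(growth_ratio(solve_stack, 10**3))
```

Passing one of the slicing-based functions instead of solve_stack should show the larger ratio described above.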

Answered By: Karl Knechtel

I wish to offer a solution in Ruby. Considering that Ruby and Python have many similarities, most readers should be able to at least get the gist of what I am doing. I am posting this answer on the chance that a reader might look upon it favourably and post an equivalent, if not improved-upon, solution in Python.

First create a constant holding a regular expression.

RGX = /(.)\1/

This expression matches a pair of the same character (e.g., "aa"). . matches any character (other than a line terminator), which is saved to capture group 1. \1 matches the content of capture group 1.

Next I will define a constant H that holds a hash that maps pairs of letters into the "successor" of both, such as "aa"->"b" and "zz"->"a".

a = ('a'..'z').to_a
  #=> ["a", "b",..., "y", "z"]
H = a.map { |c| c*2 }.zip(a.rotate).to_h
  #=> {"aa"=>"b", "bb"=>"c",..., "yy"=>"z", "zz"=>"a"}

We may now define a method that takes as its argument the string and returns the desired string.

def doit(str)
  loop do
    break str if str.sub!(RGX, H).nil?
  end
end

Let’s try it.

doit("barrsxffzzdggh")
  #=> "batxgadi"

The value of str after each iteration is as follows.

bassxffzzdggh
batxffzzdggh
batxgzzdggh
batxgadggh
batxgadhh
batxgadi

I will now break down each step of the method.

First create an array containing the letters of the alphabet.

a = ('a'..'z').to_a
   #=> ["a", "b",..., "y", "z"]

'a'..'z' denotes a range and the method Range#to_a maps it into an array.

The hash H is constructed in four steps.

x = a.map { |c| c*2 }
  #=> ["aa", "bb", "cc",..., "yy", "zz"]
b = a.rotate
  #=> ["b", "c",..., "z", "a"]
c = x.zip(b)
  #=> [["aa", "b"], ["bb", "c"],..., ["yy", "z"], ["zz", "a"]]
H = c.to_h
  #=> {"aa"=>"b", "bb"=>"c",..., "yy"=>"z", "zz"=>"a"}

These steps use the methods Array#map, String#*, Array#rotate, Array#zip and Array#to_h.

Now assign the given string to the variable str.

str = "barrsxffzzdggh"

Ruby’s method Kernel#loop more-or-less loops to its end statement until the break keyword is encountered. The single statement within the loop is the following.

    str.sub!(RGX, H)
      #=> "bassxffzzdggh"

This uses the form of Ruby’s method String#sub! which matches substrings with its first argument (a string or a regular expression), and uses its second argument, a hash, for making substitutions. This method modifies the string in place. If a replacement is made, the string is returned; otherwise nil is returned. (Ruby has a non-destructive counterpart to this method, sub, and methods gsub! and gsub for making multiple replacements.)

Initially an attempt is made to match str with the regular expression RGX. It matches "rr". As H["rr"] #=> "s", "s" is substituted for "rr" and the modified string is returned (as well as changed in place), so we repeat the loop. This continues until sub! returns nil, at which time we are finished, so we break out of the loop and return the current contents of str.
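
For readers who prefer Python, here is a rough translation sketch of the same idea using the standard re module. The names PAIR, SUCCESSOR and reduce_pairs are chosen here for illustration and do not come from the Ruby code. It assumes the input contains only a-z; any other doubled character would raise a KeyError.

```python
import re
import string

letters = string.ascii_lowercase

# Counterpart of RGX: any character followed by the same character.
PAIR = re.compile(r"(.)\1")

# Counterpart of H: {"aa": "b", "bb": "c", ..., "yy": "z", "zz": "a"}
SUCCESSOR = {c * 2: n for c, n in zip(letters, letters[1:] + letters[:1])}


def reduce_pairs(s):
    """Replace the leftmost doubled letter until none is left."""
    while True:
        # count=1 replaces only the first match, like Ruby's sub!.
        s, replaced = PAIR.subn(lambda m: SUCCESSOR[m.group(0)], s, count=1)
        if replaced == 0:
            return s


print(reduce_pairs("barrsxffzzdggh"))
# batxgadi
```

Note that each pass rescans the string from the beginning and builds a new string, so this version is likely to show the same quadratic behaviour on long inputs as the function in the question.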

Answered By: Cary Swoveland