Replace and += are abysmally slow

Question:

I’ve made the following code that deciphers some byte arrays into "readable" text for a translation project.

from pathlib import Path      # for Path(cur_file)
from string import printable  # for the printable check below

with open(Path(cur_file), mode="rb") as file:
    contents = file.read()
    file.close()  # redundant: the "with" block already closes the file

text = ""
for i in range(0, len(contents), 2): # Since it's encoded in UTF16 or similar, there should always be pairs of 2 bytes
    byte = contents[i]
    byte_2 = contents[i+1]
    if byte == 0x00 and byte_2 == 0x00:
        text+="[0x00 0x00]"
    elif byte != 0x00 and byte_2 == 0x00:
        #print("Normal byte")
        if chr(byte) in printable:
            text+=chr(byte)
        elif byte == 0x00:
            pass
        else:
            text+="[" + "0x{:02x}".format(byte) + "]"
    else:
        #print("Special byte")
        text+="[" + "0x{:02x}".format(byte) + " " + "0x{:02x}".format(byte_2) + "]"
# Some dirty replaces - Probably slow but what do I know - It works
text = text.replace("[0x0e]n[0x01]","[USERNAME_1]") # Your name
text = text.replace("[0x0e]n[0x03]","[USERNAME_3]") # Your name
text = text.replace("[0x0e]n[0x08]","[TOWNNAME_8]") # Town name
text = text.replace("[0x0e]n[0x09]","[TOWNNAME_9]") # Town name
text = text.replace("[0x0e]n[0x0a]","[CHARNAME_A]") # Character name

text = text.replace("[0x0a]","[ENTER]") # Generic enter

lang_dict[emsbt_key_name] = text

While this code does work and produce output like:

Cancel[0x00 0x00]

And more complex ones, I’ve stumbled upon a performance problem when I run it over 60,000 files.

I’ve read a couple of questions regarding += with large strings, and people say that join is preferred for large strings. However, even with strings of just under 1,000 characters, a single file takes about 5 seconds to store, which is a lot.
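
For reference, my understanding of the join advice is to collect all the pieces in a list and join them once at the end, roughly like this (a sketch of the same loop, not my actual code):

parts = []
for i in range(0, len(contents), 2):
    byte = contents[i]
    byte_2 = contents[i+1]
    if byte == 0x00 and byte_2 == 0x00:
        parts.append("[0x00 0x00]")
    elif byte_2 == 0x00:
        if chr(byte) in printable:
            parts.append(chr(byte))
        else:
            parts.append("[0x{:02x}]".format(byte))
    else:
        parts.append("[0x{:02x} 0x{:02x}]".format(byte, byte_2))
text = "".join(parts)  # one allocation at the end instead of one per +=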

I almost feel like it starts fast and gets progressively slower and slower.

What would be a way to optimize this code? I feel the code itself is pretty abysmal too.

Any help or clue is greatly appreciated.

EDIT: Added cProfile output:

         261207623 function calls (261180607 primitive calls) in 95.364 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    284/1    0.002    0.000   95.365   95.365 {built-in method builtins.exec}
        1    0.000    0.000   95.365   95.365 start.py:1(<module>)
        1    0.610    0.610   94.917   94.917 emsbt_to_json.py:21(to_json)
    11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}
 62501129   49.127    0.000   74.146    0.000 pathlib.py:578(__eq__)
125048857   18.401    0.000   18.863    0.000 pathlib.py:569(_cparts)
 63734640    6.822    0.000    6.828    0.000 {built-in method builtins.isinstance}
   160958    0.183    0.000    4.170    0.000 pathlib.py:504(_from_parts)
   160958    0.713    0.000    3.942    0.000 pathlib.py:484(_parse_args)
    68959    0.110    0.000    3.769    0.000 pathlib.py:971(absolute)
   160959    1.600    0.000    2.924    0.000 pathlib.py:56(parse_parts)
    91999    0.081    0.000    1.624    0.000 pathlib.py:868(__new__)
    68960    0.028    0.000    1.547    0.000 pathlib.py:956(rglob)
    68960    0.090    0.000    1.518    0.000 pathlib.py:402(_select_from)
    68959    0.067    0.000    1.015    0.000 pathlib.py:902(cwd)
       37    0.001    0.000    0.831    0.022 __init__.py:1(<module>)
   937462    0.766    0.000    0.798    0.000 pathlib.py:147(splitroot)
    11810    0.745    0.000    0.745    0.000 {method '__exit__' of '_io._IOBase' objects}
   137918    0.143    0.000    0.658    0.000 pathlib.py:583(__hash__)

EDIT: Upon further inspection with line_profiler, it turns out that the culprit isn’t even in the above code. It’s well outside that code, where I search over the indexes to see if there is a +1 file (looking ahead of the current index). This apparently consumes a whole lot of CPU time.

Asked By: Fusseldieb


Answers:

Just in case it gives you some avenues to explore: if I were in your case, I’d run two separate timings over, say, 100 files:

  • How much time it takes to execute only the for loop.
  • How much time it takes to do only the six replaces.

If either one takes most of the total time, I’d try to find a solution just for that part (a rough timing sketch follows below).
For raw replacements there is specialized software designed for massive replacements.
I hope this helps in some way.
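
One way to do that timing with just the standard library might look like this (a sketch; sample_files, decode_bytes and apply_replaces are placeholder names for the file list, the decoding for loop and the six replaces you already have):

import time

decode_time = 0.0
replace_time = 0.0
for cur_file in sample_files:          # e.g. the first 100 files you process
    with open(cur_file, mode="rb") as file:
        contents = file.read()

    t0 = time.perf_counter()
    text = decode_bytes(contents)      # the byte-pair for loop, as a function
    t1 = time.perf_counter()
    text = apply_replaces(text)        # the six .replace calls, as a function
    t2 = time.perf_counter()

    decode_time += t1 - t0
    replace_time += t2 - t1

print("decode: {:.2f}s, replace: {:.2f}s".format(decode_time, replace_time))

Whichever number dominates tells you where to concentrate the optimization effort.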

Answered By: Fernando

You might use .format to replace += and + in the following way. Let’s say you have code like this:

text = ""
for i in range(10):
    text += "[" + "{}".format(i) + "]"
print(text)  # [0][1][2][3][4][5][6][7][8][9]

which is equivalent to:

text = ""
for i in range(10):
    text = "{}[{}]".format(text, i)
print(text)  # [0][1][2][3][4][5][6][7][8][9]

Note that other string-formatting approaches could be used in the same way; I chose .format since you are already using it.

Answered By: Daweo

Turns out that, prior to this code, I was looking up an entry in my list on each iteration with the index method (and then +1, to see if there was a path change ahead), which really did bog down the performance.

In the cProfile we can clearly see it:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}

It wasn’t .replace! It wasn’t even included in my question.

What really made me understand what this call was (other than that it somehow called index) was another profiler:

I believe that’s what Robert Kern’s line_profiler is intended for.

Source: https://stackoverflow.com/a/3927671/3525780

It showed me neatly, line by line, how much CPU time each line of code consumed, much more clearly than cProfile.
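
For illustration, the slow lookahead looked roughly like this (a reconstructed sketch; the exact loop wasn’t included in the question):

for cur_file in to_write:
    next_file = None
    # .index() scans to_write from the start on every single iteration, and each
    # comparison goes through Path.__eq__, which is the pathlib time in the profile
    ind = to_write.index(cur_file)
    if ind + 1 < len(to_write):
        next_file = to_write[ind + 1]  # look ahead to detect a path change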

Once I found out, I replaced it with:

for ind, cur_file in enumerate(to_write):
    next_file = None
    if ind < len(to_write) - 1:
        next_file = to_write[ind + 1]  # constant-time lookahead instead of .index

This answer probably doesn’t make much sense without the actual code, but I will leave it here nonetheless.

Answered By: Fusseldieb