Replace and += are abysmally slow

Question:

I’ve made the following code that deciphers some byte arrays into "readable" text for a translation project.

from pathlib import Path      # for Path(cur_file)
from string import printable  # for the printable check below

with open(Path(cur_file), mode="rb") as file:
    contents = file.read()
    file.close()  # redundant: the "with" block already closes the file

text = ""
for i in range(0, len(contents), 2): # Since it's encoded in UTF16 or similar, there should always be pairs of 2 bytes
    byte = contents[i]
    byte_2 = contents[i+1]
    if byte == 0x00 and byte_2 == 0x00:
        text+="[0x00 0x00]"
    elif byte != 0x00 and byte_2 == 0x00:
        #print("Normal byte")
        if chr(byte) in printable:
            text+=chr(byte)
        elif byte == 0x00:
            pass
        else:
            text+="[" + "0x{:02x}".format(byte) + "]"
    else:
        #print("Special byte")
        text+="[" + "0x{:02x}".format(byte) + " " + "0x{:02x}".format(byte_2) + "]"
# Some dirty replaces - Probably slow but what do I know - It works
text = text.replace("[0x0e]n[0x01]","[USERNAME_1]") # Your name
text = text.replace("[0x0e]n[0x03]","[USERNAME_3]") # Your name
text = text.replace("[0x0e]n[0x08]","[TOWNNAME_8]") # Town name
text = text.replace("[0x0e]n[0x09]","[TOWNNAME_9]") # Town name
text = text.replace("[0x0e]n[0x0a]","[CHARNAME_A]") # Character name

text = text.replace("[0x0a]","[ENTER]") # Generic enter

lang_dict[emsbt_key_name] = text

While this code does work and produce output like:

Cancel[0x00 0x00]

And more complex ones, I’ve stumbled upon a performance problem when I run it over 60,000 files.

I’ve read a couple of questions regarding += with large strings, and people say that join is preferred for large strings. However, even with strings of just under 1,000 characters, a single file takes about 5 seconds to store, which is a lot.
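
For reference, my understanding of the join advice is to collect all the pieces in a list and join them once at the end, roughly like this (a sketch of the same loop, not my actual code):

parts = []
for i in range(0, len(contents), 2):
    byte = contents[i]
    byte_2 = contents[i+1]
    if byte == 0x00 and byte_2 == 0x00:
        parts.append("[0x00 0x00]")
    elif byte_2 == 0x00:
        if chr(byte) in printable:
            parts.append(chr(byte))
        else:
            parts.append("[0x{:02x}]".format(byte))
    else:
        parts.append("[0x{:02x} 0x{:02x}]".format(byte, byte_2))
text = "".join(parts)  # one allocation at the end instead of one per +=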

I almost feel like it starts fast and gets progressively slower and slower.

What would be a way to optimize this code? I feel the code itself is pretty abysmal too.

Any help or clue is greatly appreciated.

EDIT: Added cProfile output:

         261207623 function calls (261180607 primitive calls) in 95.364 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    284/1    0.002    0.000   95.365   95.365 {built-in method builtins.exec}
        1    0.000    0.000   95.365   95.365 start.py:1(<module>)
        1    0.610    0.610   94.917   94.917 emsbt_to_json.py:21(to_json)
    11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}
 62501129   49.127    0.000   74.146    0.000 pathlib.py:578(__eq__)
125048857   18.401    0.000   18.863    0.000 pathlib.py:569(_cparts)
 63734640    6.822    0.000    6.828    0.000 {built-in method builtins.isinstance}
   160958    0.183    0.000    4.170    0.000 pathlib.py:504(_from_parts)
   160958    0.713    0.000    3.942    0.000 pathlib.py:484(_parse_args)
    68959    0.110    0.000    3.769    0.000 pathlib.py:971(absolute)
   160959    1.600    0.000    2.924    0.000 pathlib.py:56(parse_parts)
    91999    0.081    0.000    1.624    0.000 pathlib.py:868(__new__)
    68960    0.028    0.000    1.547    0.000 pathlib.py:956(rglob)
    68960    0.090    0.000    1.518    0.000 pathlib.py:402(_select_from)
    68959    0.067    0.000    1.015    0.000 pathlib.py:902(cwd)
       37    0.001    0.000    0.831    0.022 __init__.py:1(<module>)
   937462    0.766    0.000    0.798    0.000 pathlib.py:147(splitroot)
    11810    0.745    0.000    0.745    0.000 {method '__exit__' of '_io._IOBase' objects}
   137918    0.143    0.000    0.658    0.000 pathlib.py:583(__hash__)

EDIT: Upon further inspection with line_profiler, it turns out that the culprit isn’t even in the above code. It’s well outside that code, where I search over the indexes to see if there is a +1 file (looking ahead of the current index). This apparently consumes a whole lot of CPU time.

Asked By: Fusseldieb


Answers:

Just in case it gives you some avenues to explore: if I were in your case, I’d run two separate timings over, say, 100 files:

  • How much time it takes to execute only the for loop.
  • How much time it takes to do only the six replaces.

If either one takes most of the total time, I’d try to find a solution just for that part (a rough timing sketch follows below).
For raw replacements there is specialized software designed for massive replacements.
I hope this helps in some way.
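
One way to do that timing with just the standard library might look like this (a sketch; sample_files, decode_bytes and apply_replaces are placeholder names for the file list, the decoding for loop and the six replaces you already have):

import time

decode_time = 0.0
replace_time = 0.0
for cur_file in sample_files:          # e.g. the first 100 files you process
    with open(cur_file, mode="rb") as file:
        contents = file.read()

    t0 = time.perf_counter()
    text = decode_bytes(contents)      # the byte-pair for loop, as a function
    t1 = time.perf_counter()
    text = apply_replaces(text)        # the six .replace calls, as a function
    t2 = time.perf_counter()

    decode_time += t1 - t0
    replace_time += t2 - t1

print("decode: {:.2f}s, replace: {:.2f}s".format(decode_time, replace_time))

Whichever number dominates tells you where to concentrate the optimization effort.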

Answered By: Fernando

You might use .format to replace += and + in the following way. Let’s say you have code like this:

text = ""
for i in range(10):
    text += "[" + "{}".format(i) + "]"
print(text)  # [0][1][2][3][4][5][6][7][8][9]

which is equivalent to:

text = ""
for i in range(10):
    text = "{}[{}]".format(text, i)
print(text)  # [0][1][2][3][4][5][6][7][8][9]

Note that other string-formatting approaches could be used in the same way; I chose .format since you are already using it.

Answered By: Daweo

Turns out that, prior to this code, I was looking up an entry in my list on each iteration with the index method (and then +1, to see if there was a path change ahead), which really did bog down the performance.

In the cProfile we can clearly see it:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
11179   11.807    0.001   85.829    0.008 {method 'index' of 'list' objects}

It wasn’t .replace! It wasn’t even included in my question.

What really made me understand what this call was (other than that it somehow called index) was another profiler:

I believe that’s what Robert Kern’s line_profiler is intended for.

Source: https://stackoverflow.com/a/3927671/3525780

It showed me neatly, line by line, how much CPU time each line of code consumed, much more clearly than cProfile.
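
For illustration, the slow lookahead looked roughly like this (a reconstructed sketch; the exact loop wasn’t included in the question):

for cur_file in to_write:
    next_file = None
    # .index() scans to_write from the start on every single iteration, and each
    # comparison goes through Path.__eq__, which is the pathlib time in the profile
    ind = to_write.index(cur_file)
    if ind + 1 < len(to_write):
        next_file = to_write[ind + 1]  # look ahead to detect a path change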

Once I found out, I replaced it with:

for ind, cur_file in enumerate(to_write):
    next_file = None
    if ind < len(to_write) - 1:
        next_file = to_write[ind + 1]  # constant-time lookahead instead of .index

This answer probably doesn’t make much sense without the actual code, but I will leave it here nonetheless.

Answered By: Fusseldieb