Speed up re.sub() on large strings representing large files in python?

Question:

Hi, I am running this Python code to reduce repeated multi-line patterns to single occurrences. However, I am doing this on extremely large files of 200,000+ lines.

Here is my current code:

import sys
import re

with open('largefile.txt', 'r+') as file:
    string = file.read()
    string = re.sub(r"((?:^.*n)+)(?=1)", "", string, flags=re.MULTILINE)
    file.seek(0)
    file.write(string)
    file.truncate()

The problem is that the re.sub() call takes ages (10+ minutes) on my large files. Is it possible to speed this up in any way?

Example input file:

hello
mister
hello
mister
goomba
bananas
goomba
bananas
chocolate
hello
mister

Example output:

hello
mister
goomba
bananas
chocolate
hello
mister

These patterns can be bigger than 2 lines as well.

Asked By: kipchak


Answers:

Nesting a quantifier within a quantifier is expensive and in this case unnecessary.

You can use the following regex without nesting instead:

string = re.sub(r"(^.*n)(?=1)", "", string, flags=re.M | re.S)

In the following test it more than cuts the time in half compared to your approach:

https://replit.com/@blhsing/HugeTrivialExperiment
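
For reference, a minimal self-contained sketch of the same one-liner applied to the question's sample input (the inline text literal is simply that sample, assembled here for convenience):

import re

# The question's sample input, assembled inline for a quick check.
text = ("hello\nmister\nhello\nmister\ngoomba\nbananas\n"
        "goomba\nbananas\nchocolate\nhello\nmister\n")

# A single, non-nested capture group: re.S lets .* span several lines,
# and the lookahead (?=\1) requires the captured block to repeat immediately.
deduped = re.sub(r"(^.*\n)(?=\1)", "", text, flags=re.M | re.S)
print(deduped, end="")
# hello
# mister
# goomba
# bananas
# chocolate
# hello
# mister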

Answered By: blhsing

Regexps are compact here, but will never be speedy. For one reason, you have an inherently line-based problem, but regexps are inherently character-based. The regexp engine has to deduce, over & over & over again, where "lines" are by searching for newline characters, one at a time. For a more fundamental reason, everything here is brute-force character-at-a-time search, remembering nothing from one phase to the next.

So here’s an alternative. Split the giant string into a list of lines, just once at the start. Then that work never needs to be done again. And then build a dict, mapping a line to a list of the indices at which that line appears. That takes linear time. Then, given a line, we don’t have to search for it at all: the list of indices tells us at once every place it appears.

Worst-case time can still be poor, but I expect it will be at least a hundred times faster on "typical" inputs.

def dedup(s):
    from collections import defaultdict

    lines = s.splitlines(keepends=True)
    line2ix = defaultdict(list)
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    out = []
    n = len(lines)
    i = 0
    while i < n:
        line = lines[i]
        # Look for longest adjacent match between i:j and j:j+(j-i).
        # j must be > i, and j+(j-i) <= n so that j <= (n+i)/2.
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            # Lines at i and j match.
            if all(lines[i + k] == lines[j + k]
                   for k in range(1, j - i)):
                searching = False
                break
        if searching:
            out.append(line)
            i += 1
        else: # skip the repeated block at i:j
            i = j
    return "".join(out)

EDIT

This incorporates Kelly’s idea of incrementally updating line2ix using a deque so that the candidates looked at are always in range(i+1, maxj+1). Then the innermost loop doesn’t need to check for those conditions.

It’s a mixed bag, losing a little when there are very few duplicates, because in such cases the line2ix sequences are very short (or even singletons for unique lines).

Here’s timing for a case where it really pays off: a file containing about 30,000 lines of Python code. Many lines are unique, but a few kinds of lines are very common (for example, the empty "\n" line). Cutting the work in the innermost loop can pay for those common lines. dedup_nuts was picked for the name because this level of micro-optimization is, well, nuts 😉

71.67997950001154 dedup_original
48.948923900024965 dedup_blhsing
2.204853900009766 dedup_Tim
9.623824400012381 dedup_Kelly
1.0341253000078723 dedup_blhsingTimKelly
0.8434303000103682 dedup_nuts

And the code:

def dedup_nuts(s):
    from array import array
    from collections import deque

    encode = {}
    decode = []
    lines = array('L')
    for line in s.splitlines(keepends=True):
        if (code := encode.get(line)) is None:
            code = encode[line] = len(encode)
            decode.append(line)
        lines.append(code)
    del encode
    line2ix = [deque() for line in lines]
    view = memoryview(lines)
    out = []
    n = len(lines)
    i = 0
    last_maxj = -1
    while i < n:
        maxj = (n + i) // 2
        for j in range(last_maxj + 1, maxj + 1):
            line2ix[lines[j]].appendleft(j)
        last_maxj = maxj
        line = lines[i]
        js = line2ix[line]
        assert js[-1] == i, (i, n, js)
        js.pop()
        for j in js:
            #assert i < j <= maxj
            if view[i : j] == view[j : j + j - i]:
                for k in range(i + 1, j):
                    js = line2ix[lines[k]]
                    assert js[-1] == k, (i, k, js)
                    js.pop()
                i = j
                break
        else:
            out.append(line)
            i += 1
    #assert all(not d for d in line2ix)
    return "".join(map(decode.__getitem__, out))

Some key invariants are checked by asserts there, but the expensive ones are commented out for speed. Season to taste.

Answered By: Tim Peters

Another idea: You’re talking about "200,000+ lines", so we can encode each unique line as one of the 1,114,112 possible characters and simplify the regex to r"(.+)(?=\1)". And after the deduplication, decode the characters back to lines.

def dedup(s):
    encode = {}
    decode = {}
    lines = s.split('\n')
    for line in lines:
        if line not in encode:
            c = chr(len(encode))
            encode[line] = c
            decode[c] = line
    s = ''.join(map(encode.get, lines))
    s = re.sub(r"(.+)(?=1)", "", s, flags=re.S)
    return 'n'.join(map(decode.get, s))
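
To make the encoding step concrete, here is a tiny illustration (the input string and variable names are illustrative only) of the intermediate single-character string that the regex then operates on:

# Each distinct line becomes one character: "a\nb\na\nb\nc\n" splits into
# ['a', 'b', 'a', 'b', 'c', ''] and encodes to the code points [0, 1, 0, 1, 2, 3].
s = "a\nb\na\nb\nc\n"
encode = {}
for line in s.split('\n'):
    if line not in encode:
        encode[line] = chr(len(encode))
encoded = ''.join(encode[line] for line in s.split('\n'))
print([ord(c) for c in encoded])  # [0, 1, 0, 1, 2, 3]

The repeated two-line block now shows up as a repeated two-character substring, which is why the pattern no longer needs ^, re.MULTILINE, or a nested quantifier.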

A little benchmark based on blhsing’s but with some repeating lines (times in seconds):

2.5934535119995417 dedup_original
1.2498892020012136 dedup_blhsing
0.5043159520009795 dedup_Tim
0.9235864399997809 dedup_Kelly

I built a pool of 50 lines of 10 random letters, then joined 5000 random lines from that pool.

The two fastest with 10,000 lines instead:

2.0905018440007552 dedup_Tim
3.220036650000111 dedup_Kelly

Code (Try it online!):

import re
import random
import string
from timeit import timeit

strings = [''.join((*random.choices(string.ascii_letters, k=10), '\n')) for _ in range(50)]
s = ''.join(random.choices(strings, k=5000))

def dedup_original(s):
    return re.sub(r"((?:^.*n)+)(?=1)", "", s, flags=re.MULTILINE)

def dedup_blhsing(s):
    return re.sub(r"(^.*n)(?=1)", "", s, flags=re.M | re.S)

def dedup_Tim(s):
    from collections import defaultdict

    lines = s.splitlines(keepends=True)
    line2ix = defaultdict(list)
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    out = []
    n = len(lines)
    i = 0
    while i < n:
        line = lines[i]
        # Look for longest adjacent match between i:j and j:j+(j-i).
        # j must be > i, and j+(j-i) <= n so that j <= (n+i)/2.
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            # Lines at i and j match.
            if all(lines[i + k] == lines[j + k]
                   for k in range(1, j - i)):
                searching = False
                break
        if searching:
            out.append(line)
            i += 1
        else: # skip the repeated block at i:j
            i = j
    return "".join(out)

def dedup_Kelly(s):
    encode = {}
    decode = {}
    lines = s.split('\n')
    for line in lines:
        if line not in encode:
            c = chr(len(encode))
            encode[line] = c
            decode[c] = line
    s = ''.join(map(encode.get, lines))
    s = re.sub(r"(.+)(?=1)", "", s, flags=re.S)
    return 'n'.join(map(decode.get, s))

funcs = dedup_original, dedup_blhsing, dedup_Tim, dedup_Kelly
expect = funcs[0](s)
for f in funcs[1:]:
    print(f(s) == expect)

for _ in range(3):
    for f in funcs:
        t = timeit(lambda: f(s), number=1)
        print(t, f.__name__)
    print()

Answered By: Kelly Bundy

@TimPeters’ line-based comparison approach is good but wastes time in repeated comparisons of the same lines. @KellyBundy’s encoding idea is good but wastes time in the overhead of a regex engine and text encoding.

A more efficient approach would be to adopt @KellyBundy’s encoding idea in @TimPeters’ algorithm, but instead of encoding lines into characters, encode them into an array.array of 32-bit integers to avoid the overhead of text encoding, and then use a memoryview of the array for quick slice-based comparisons:

from array import array

def dedup_blhsingTimKelly2(s):
    encode = {}
    decode = []
    lines = s.splitlines(keepends=True)
    n = len(lines)
    for line in lines:
        if line not in encode:
            encode[line] = len(decode)
            decode.append(line)
    lines = array('L', map(encode.get, lines))
    del encode
    line2ix = [[] for _ in range(n)]
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    view = memoryview(lines)
    out = []
    i = 0
    while i < n:
        line = lines[i]
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            if view[i: j] == view[j: j + j - i]:
                searching = False
                break
        if searching:
            out.append(decode[line])
            i += 1
        else:
            i = j
    return "".join(out)

A run of @KellyBundy’s benchmark code with this approach added. The first version was named dedup_blhsingTimKelly; the version above, amended following Tim’s and Kelly’s comments, is dedup_blhsingTimKelly2:

2.6650364249944687 dedup_original
1.3109814710041974 dedup_blhsing
0.5598453340062406 dedup_Tim
0.9783012029947713 dedup_Kelly
0.24442325498966966 dedup_blhsingTimKelly
0.21991234300367068 dedup_blhsingTimKelly2

Try it online!

Answered By: blhsing