Speed up re.sub() on large strings representing large files in python?
Question:
Hi, I am running this Python code to reduce multi-line patterns to singletons; however, I am doing this on extremely large files of 200,000+ lines.
Here is my current code:
import sys
import re

with open('largefile.txt', 'r+') as file:
    string = file.read()
    string = re.sub(r"((?:^.*\n)+)(?=\1)", "", string, flags=re.MULTILINE)
    file.seek(0)
    file.write(string)
    file.truncate()
The problem is that the re.sub() takes ages (10+ minutes) on my large files. Is it possible to speed this up in any way?
Example input file:
hello
mister
hello
mister
goomba
bananas
goomba
bananas
chocolate
hello
mister
Example output:
hello
mister
goomba
bananas
chocolate
hello
mister
These patterns can be bigger than 2 lines as well.
Answers:
Nesting a quantifier within a quantifier is expensive and in this case unnecessary.
You can use the following regex without nesting instead:
string = re.sub(r"(^.*n)(?=1)", "", string, flags=re.M | re.S)
In testing, it more than cuts the time in half compared to your approach.
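(The benchmark behind that claim isn't reproduced here. As a rough sketch only, the two patterns could be timed against each other on synthetic input; the pool-of-random-lines setup below simply mirrors @KellyBundy's benchmark further down.)

import re
import random
import string
from timeit import timeit

# Synthetic input: many repeats drawn from a small pool of random lines.
pool = [''.join(random.choices(string.ascii_letters, k=10)) + '\n' for _ in range(50)]
s = ''.join(random.choices(pool, k=5000))

def nested():
    return re.sub(r"((?:^.*\n)+)(?=\1)", "", s, flags=re.MULTILINE)

def flat():
    return re.sub(r"(^.*\n)(?=\1)", "", s, flags=re.M | re.S)

print(nested() == flat())        # sanity check: both remove the same repeats here
print(timeit(nested, number=1))  # original pattern (nested quantifier)
print(timeit(flat, number=1))    # pattern without nesting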
Regexps are compact here, but will never be speedy. For one reason, you have an inherently line-based problem, but regexps are inherently character-based. The regexp engine has to deduce, over & over & over again, where "lines" are by searching for newline characters, one at a time. For a more fundamental reason, everything here is brute-force character-at-a-time search, remembering nothing from one phase to the next.
So here’s an alternative. Split the giant string into a list of lines, just once at the start. Then that work never needs to be done again. And then build a dict, mapping a line to a list of the indices at which that line appears. That takes linear time. Then, given a line, we don’t have to search for it at all: the list of indices tells us at once every place it appears.
Worst-case time can still be poor, but I expect it will be at least a hundred times faster on "typical" inputs.
def dedup(s):
    from collections import defaultdict

    lines = s.splitlines(keepends=True)
    line2ix = defaultdict(list)
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    out = []
    n = len(lines)
    i = 0
    while i < n:
        line = lines[i]
        # Look for longest adjacent match between i:j and j:j+(j-i).
        # j must be > i, and j+(j-i) <= n so that j <= (n+i)/2.
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            # Lines at i and j match.
            if all(lines[i + k] == lines[j + k]
                   for k in range(1, j - i)):
                searching = False
                break
        if searching:
            out.append(line)
            i += 1
        else:  # skip the repeated block at i:j
            i = j
    return "".join(out)
EDIT
This incorporates Kelly's idea of incrementally updating line2ix using a deque, so that the candidates looked at are always in range(i+1, maxj+1). Then the innermost loop doesn't need to check for those conditions.
It's a mixed bag, losing a little when there are very few duplicates, because in such cases the line2ix sequences are very short (or even singletons for unique lines).
Here's timing for a case where it really pays off: a file containing about 30,000 lines of Python code. Many lines are unique, but a few kinds of lines are very common (for example, the empty "\n" line). Cutting the work in the innermost loop can pay off for those common lines. dedup_nuts was picked for the name because this level of micro-optimization is, well, nuts 😉
71.67997950001154 dedup_original
48.948923900024965 dedup_blhsing
2.204853900009766 dedup_Tim
9.623824400012381 dedup_Kelly
1.0341253000078723 dedup_blhsingTimKelly
0.8434303000103682 dedup_nuts
And the code:
def dedup_nuts(s):
    from array import array
    from collections import deque

    encode = {}
    decode = []
    lines = array('L')
    for line in s.splitlines(keepends=True):
        if (code := encode.get(line)) is None:
            code = encode[line] = len(encode)
            decode.append(line)
        lines.append(code)
    del encode
    line2ix = [deque() for line in lines]
    view = memoryview(lines)
    out = []
    n = len(lines)
    i = 0
    last_maxj = -1
    while i < n:
        maxj = (n + i) // 2
        for j in range(last_maxj + 1, maxj + 1):
            line2ix[lines[j]].appendleft(j)
        last_maxj = maxj
        line = lines[i]
        js = line2ix[line]
        assert js[-1] == i, (i, n, js)
        js.pop()
        for j in js:
            #assert i < j <= maxj
            if view[i : j] == view[j : j + j - i]:
                for k in range(i + 1, j):
                    js = line2ix[lines[k]]
                    assert js[-1] == k, (i, k, js)
                    js.pop()
                i = j
                break
        else:
            out.append(line)
            i += 1
    #assert all(not d for d in line2ix)
    return "".join(map(decode.__getitem__, out))
Some key invariants are checked by asserts there, but the expensive ones are commented out for speed. Season to taste.
Another idea: You're talking about "200,000+ lines", so we can encode each unique line as one of the 1,114,112 possible characters and simplify the regex to r"(.+)(?=\1)". And after the deduplication, decode the characters back to lines.
def dedup(s):
    encode = {}
    decode = {}
    lines = s.split('\n')
    for line in lines:
        if line not in encode:
            c = chr(len(encode))
            encode[line] = c
            decode[c] = line
    s = ''.join(map(encode.get, lines))
    s = re.sub(r"(.+)(?=\1)", "", s, flags=re.S)
    return '\n'.join(map(decode.get, s))
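As a quick sanity check (input and expected output are the example from the question):

sample = ("hello\nmister\nhello\nmister\ngoomba\nbananas\n"
          "goomba\nbananas\nchocolate\nhello\nmister\n")
expected = "hello\nmister\ngoomba\nbananas\nchocolate\nhello\nmister\n"
print(dedup(sample) == expected)  # True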
A little benchmark based on blhsing’s but with some repeating lines (times in seconds):
2.5934535119995417 dedup_original
1.2498892020012136 dedup_blhsing
0.5043159520009795 dedup_Tim
0.9235864399997809 dedup_Kelly
I built a pool of 50 lines of 10 random letters, then joined 5000 random lines from that pool.
The two fastest with 10,000 lines instead:
2.0905018440007552 dedup_Tim
3.220036650000111 dedup_Kelly
Code (Try it online!):
import re
import random
import string
from timeit import timeit

strings = [''.join((*random.choices(string.ascii_letters, k=10), '\n')) for _ in range(50)]
s = ''.join(random.choices(strings, k=5000))

def dedup_original(s):
    return re.sub(r"((?:^.*\n)+)(?=\1)", "", s, flags=re.MULTILINE)

def dedup_blhsing(s):
    return re.sub(r"(^.*\n)(?=\1)", "", s, flags=re.M | re.S)

def dedup_Tim(s):
    from collections import defaultdict

    lines = s.splitlines(keepends=True)
    line2ix = defaultdict(list)
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    out = []
    n = len(lines)
    i = 0
    while i < n:
        line = lines[i]
        # Look for longest adjacent match between i:j and j:j+(j-i).
        # j must be > i, and j+(j-i) <= n so that j <= (n+i)/2.
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            # Lines at i and j match.
            if all(lines[i + k] == lines[j + k]
                   for k in range(1, j - i)):
                searching = False
                break
        if searching:
            out.append(line)
            i += 1
        else:  # skip the repeated block at i:j
            i = j
    return "".join(out)

def dedup_Kelly(s):
    encode = {}
    decode = {}
    lines = s.split('\n')
    for line in lines:
        if line not in encode:
            c = chr(len(encode))
            encode[line] = c
            decode[c] = line
    s = ''.join(map(encode.get, lines))
    s = re.sub(r"(.+)(?=\1)", "", s, flags=re.S)
    return '\n'.join(map(decode.get, s))

funcs = dedup_original, dedup_blhsing, dedup_Tim, dedup_Kelly
expect = funcs[0](s)
for f in funcs[1:]:
    print(f(s) == expect)
for _ in range(3):
    for f in funcs:
        t = timeit(lambda: f(s), number=1)
        print(t, f.__name__)
    print()
@TimPeters’ line-based comparison approach is good but wastes time in repeated comparisons of the same lines. @KellyBundy’s encoding idea is good but wastes time in the overhead of a regex engine and text encoding.
A more efficient approach would be to adopt @KellyBundy's encoding idea in @TimPeters' algorithm, but instead of encoding lines into characters, encode them into an array.array of 32-bit integers to avoid the overhead of text encoding, and then use a memoryview of the array for quick slice-based comparisons:
from array import array

def dedup_blhsingTimKelly2(s):
    encode = {}
    decode = []
    lines = s.splitlines(keepends=True)
    n = len(lines)
    for line in lines:
        if line not in encode:
            encode[line] = len(decode)
            decode.append(line)
    lines = array('L', map(encode.get, lines))
    del encode
    line2ix = [[] for _ in range(n)]
    for i, line in enumerate(lines):
        line2ix[line].append(i)
    view = memoryview(lines)
    out = []
    i = 0
    while i < n:
        line = lines[i]
        maxj = (n + i) // 2
        searching = True
        for j in reversed(line2ix[line]):
            if j > maxj:
                continue
            if j <= i:
                break
            if view[i : j] == view[j : j + j - i]:
                searching = False
                break
        if searching:
            out.append(decode[line])
            i += 1
        else:
            i = j
    return "".join(out)
A run of @KellyBundy's benchmark code with this approach added (originally named dedup_blhsingTimKelly, now amended with Tim's and Kelly's comments and named dedup_blhsingTimKelly2):
2.6650364249944687 dedup_original
1.3109814710041974 dedup_blhsing
0.5598453340062406 dedup_Tim
0.9783012029947713 dedup_Kelly
0.24442325498966966 dedup_blhsingTimKelly
0.21991234300367068 dedup_blhsingTimKelly2