Ways to speed up a regex

Question

Is there a way to speed up this regex code? The file is really large and will not open in Excel because of its size.

import regex as re

path = "C:/Users/.../CDPH/"
with open(path + 'Thefile.tab') as file:
    data = file.read()
    # remove runs of spaces that come right before a newline or a tab
    data = re.sub('( )*(?=\n)|( )*(?=\t)', '', data)
with open(path + 'Data.csv', 'w') as file:
    file.write(data)

Asked By: Shane S


Answer 1

Not knowing the exact dialect of the tab-separated file, I’m having to take a guess. You’ll find a lot more options in the csv module documentation.

Here’s what I would try to speed up the right trimming of the fields:

#!/usr/bin/python

import csv

# newline='' lets the csv module handle line endings itself
with open('Data.csv', 'w', newline='') as outfile:
    with open('Thefile.tab', newline='') as infile:
        rd = csv.reader(infile, delimiter='\t')
        wr = csv.writer(outfile, delimiter='\t')
        for row in rd:
            # strip trailing whitespace from every field
            row = [field.rstrip() for field in row]
            wr.writerow(row)
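
As a quick sanity check before running this on the full file, the same right-trimming can be tried on a small in-memory sample. This is only a sketch: the helper name `rstrip_fields` and the sample rows are made up for illustration, and `lineterminator='\n'` is set just to keep the output easy to compare.

```python
import csv
import io


def rstrip_fields(infile, outfile, delimiter='\t'):
    """Copy delimited rows from infile to outfile, right-trimming every field."""
    rd = csv.reader(infile, delimiter=delimiter)
    # lineterminator='\n' is an assumption for this demo; the csv default is '\r\n'
    wr = csv.writer(outfile, delimiter=delimiter, lineterminator='\n')
    for row in rd:
        wr.writerow([field.rstrip() for field in row])


# Hypothetical sample: trailing spaces before the tabs and at the end of each line.
sample = "a  \tb \tc   \n1 \t2\t3  \n"
result = io.StringIO()
rstrip_fields(io.StringIO(sample, newline=''), result)
print(repr(result.getvalue()))
```

Because the helper takes file objects, the same function could be called with the two real files opened with `newline=''`.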

Answered By: Ted Lyngmo

Answer 2

Since you expressed interest in my comment, this is what I had in mind:

import os


dirpath = "C:/Users/.../CDPH/"
infilepath = os.path.join(dirpath, 'Thefile.tab')
outfilepath = os.path.join(dirpath, 'Thefile.out.tab')
with open(infilepath) as infile, open(outfilepath, 'w') as outfile:
    # trim each line: leading spaces, the trailing newline, trailing tabs, and any leading '='
    for line in infile:
        line = line.lstrip(' ').rstrip('\n').rstrip('\t').lstrip('=')
        if not line: continue  # skip lines that end up empty
        outfile.write(line)
        outfile.write('\n')
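
If the goal is to match the question's pattern exactly (drop spaces that sit directly before a tab or a newline) without reading the whole file into memory, a precompiled pattern applied line by line is one possible sketch. It assumes the standard-library `re` module rather than the third-party `regex` package, and the function name and sample text are hypothetical.

```python
import io
import re

# One or more spaces immediately followed by a tab or a newline.
# Assumption: this mirrors the intent of the question's pattern.
TRAILING_SPACES = re.compile(r' +(?=[\t\n])')


def strip_spaces_line_by_line(infile, outfile):
    """Stream infile to outfile, removing spaces that directly precede a tab or newline."""
    for line in infile:
        outfile.write(TRAILING_SPACES.sub('', line))


# Hypothetical sample: spaces inside a field ("b c") and leading spaces (" d") are kept.
sample = "a  \tb c \t d   \nx\ty  \n"
result = io.StringIO()
strip_spaces_line_by_line(io.StringIO(sample), result)
print(repr(result.getvalue()))
```

Processing one line at a time keeps memory use roughly constant regardless of file size; whether it is actually faster than a single `re.sub` over the whole file would need to be measured on the real data.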

Answered By: inspectorG4dget
