Ways to speed up a regex
Question:
Is there a way to speed up this regex code? The file is really large and will not open in Excel because of its size.
import regex as re
path = "C:/Users/.../CDPH/"
with open(path + 'Thefile.tab') as file:
    data = file.read()
# remove all spaces that come right before a newline or a tab character
data = re.sub('( )*(?=\n)|( )*(?=\t)', '', data)
with open(path + 'Data.csv', 'w') as file:
    file.write(data)
Answers:
Not knowing the exact dialect of the tab-separated file, I have to take a guess. You’ll find a lot more options in the csv
library documentation.
Here’s what I would try to speed up the right trimming of the fields:
#!/usr/bin/python
import csv
# newline='' lets the csv module handle line endings itself
with open('Data.csv', 'w', newline='') as outfile:
    with open('Thefile.tab', newline='') as infile:
        rd = csv.reader(infile, delimiter='\t')
        wr = csv.writer(outfile, delimiter='\t')
        for row in rd:
            # strip trailing whitespace from every field
            row = [field.rstrip() for field in row]
            wr.writerow(row)
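Before running this on the full file, it may be worth checking the trimming on a few rows in memory. The following is only a sketch: the helper name rstrip_fields and the sample rows are made up for illustration, and lineterminator='\n' is set just so the output is easy to compare (the csv writer uses '\r\n' by default).

```python
import csv
import io

def rstrip_fields(infile, outfile):
    """Copy tab-separated rows from infile to outfile, right-trimming each field."""
    reader = csv.reader(infile, delimiter='\t')
    # lineterminator='\n' is an assumption for this demo; the default is '\r\n'
    writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')
    for row in reader:
        writer.writerow([field.rstrip() for field in row])

# hypothetical sample data standing in for a few lines of the real file
sample = io.StringIO("a  \tb \tc\n1 \t2\t3   \n", newline='')
result = io.StringIO(newline='')
rstrip_fields(sample, result)
print(repr(result.getvalue()))
```

Because the function only needs file-like objects, the same helper can be handed the two real files opened with newline='' once the sample output looks right.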
Since you expressed interest in my comment, this is what I had in mind:
import os
dirpath = "C:/Users/.../CDPH/"
infilepath = os.path.join(dirpath, 'Thefile.tab')
outfilepath = os.path.join(dirpath, 'Thefile.out.tab')
with open(infilepath) as infile, open(outfilepath, 'w') as outfile:
    # read one line at a time so the whole file is never held in memory
    for line in infile:
        # drop leading spaces, the trailing newline, trailing tabs and any leading '='
        line = line.lstrip(' ').rstrip('\n').rstrip('\t').lstrip('=')
        if not line:
            continue  # skip lines that end up empty
        outfile.write(line)
        outfile.write('\n')
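If you would rather keep a regex, the same line-by-line idea works with a pattern compiled once up front. This is a sketch under a few assumptions: it uses the standard library re module instead of the third-party regex package, the names TRAILING_SPACES and strip_trailing_spaces are made up here, and the pattern ' +(?=[\t\n])' is my simplification of the one in the question (one or more spaces, no capture groups, a single lookahead). I have not timed it against your file.

```python
import re

# One or more spaces that sit directly before a tab or a newline.
TRAILING_SPACES = re.compile(r' +(?=[\t\n])')

def strip_trailing_spaces(lines):
    """Yield each line with the spaces before tabs and newlines removed."""
    for line in lines:
        yield TRAILING_SPACES.sub('', line)

# hypothetical sample lines; a real run would pass the open input file instead
sample = ["a  \tb \tc  \n", "x y\t z \n"]
cleaned = list(strip_trailing_spaces(sample))
print(cleaned)
```

Since an open text file is already an iterable of lines, outfile.writelines(strip_trailing_spaces(infile)) would stream the real file through the same generator. As with the pattern in the question, spaces at the very end of a last line that has no newline are left alone.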