Replace carriage returns in a list in Python
Question:
I have a list of values and need to remove errant carriage returns wherever they occur in those values.
The format of the file I want to clean up is as follows:
field1|field2|field3|field4|field5
value 1|value 2|value 3|value 4|value 5
value 1|value 2|value 3|value 4|value 5
value 1|value 2|val
ue 3|value 4|value 5
value 1|value 2|value 3|va
lue 4|value 5
I am looking to address a situation like the one above where there are errant carriage returns in the 3rd and 4th values for the last 2 rows of data.
I have seen a few posts for how to address this but so far nothing has worked for this situation. I have pasted the code I have attempted so far.
import os
import sys

# Raw strings, so that "\t" in the path is not read as a tab
filetoread = r'C:\temp\test.dat'
filetowrite = r'C:\temp\test_updated.dat'

'''
Attempt 1
'''
with open(filetoread, "r+b") as inf:
    with open(filetowrite, "w") as fixed:
        for line in inf:
            fixed.write(line)

'''
Attempt 2
'''
for line in filetoread:
    line = line.replace("\n", "")

'''
Attempt 3
'''
with open(filetoread, "r") as inf:
    for line in inf:
        if "\n" in line:
            line = line.replace("\n", "")
Answers:
The \n character is a line feed; \r is the carriage return:
http://en.cppreference.com/w/cpp/language/escape
So,
> line.replace("\n", "")
should be
line.replace("\r", "")
Do check whether it's really \r alone or the \r\n pair. Windows/DOS uses \r\n,
classic Mac OS (before OS X) used \r, and Linux and modern macOS use \n alone.
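To check which of these the file actually contains, one option is to read it in binary mode and count each kind of line ending. This is only a sketch: `count_line_endings` is a hypothetical helper and the sample bytes are made up for illustration, not taken from the asker's file.

```python
def count_line_endings(raw: bytes) -> dict:
    """Count CRLF pairs, lone CRs, and lone LFs in raw bytes."""
    crlf = raw.count(b"\r\n")
    return {
        "crlf": crlf,
        "cr": raw.count(b"\r") - crlf,  # carriage returns not followed by a line feed
        "lf": raw.count(b"\n") - crlf,  # line feeds not preceded by a carriage return
    }

# Made-up sample: two CRLF endings, one bare LF, one stray CR inside a field
sample = b"a|b\r\nc|d\ne|\rf\r\n"
print(count_line_endings(sample))  # {'crlf': 2, 'cr': 1, 'lf': 1}
```

For a real file, the bytes would come from something like `open(filetoread, "rb").read()`, so that Python does not translate the line endings before they are counted.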
Note: I'm assuming you have extra newlines ('\n'), not carriage returns ('\r').
def remove_newlines_in_fields(data, ncols, sep):
    sep_count = 0
    for c in data:
        if c == sep:
            sep_count += 1
        if c == '\n':
            # Keep the newline only once the row has all of its separators
            if sep_count == ncols - 1:
                yield c
                sep_count = 0
        else:
            yield c
Also note that if you have newlines in your rightmost column this won’t work properly. (The partial column will be prepended to the next row.)
Here it is in action:
>>> s = '''field1|field2|field3|field4|field5
... value 1|value 2|value 3|value 4|value 5
... value 1|value 2|value 3|value 4|value 5
... value 1|value 2|val
... ue 3|value 4|value 5
... value 1|value 2|value 3|va
... lue 4|value 5'''
>>> print(''.join(remove_newlines_in_fields(s, 5, '|')))
field1|field2|field3|field4|field5
value 1|value 2|value 3|value 4|value 5
value 1|value 2|value 3|value 4|value 5
value 1|value 2|value 3|value 4|value 5
value 1|value 2|value 3|value 4|value 5
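The same counting idea can also be applied line by line, which may be more convenient when iterating over a file object instead of holding the whole text in one string. The following is a sketch under the same assumption (stray '\n' characters, never in the last column); `merge_broken_rows` is a hypothetical name, not something from the answer above.

```python
def merge_broken_rows(lines, ncols, sep="|"):
    """Yield logical rows, joining physical lines until a row has ncols fields."""
    pending = ""
    for line in lines:
        pending += line.rstrip("\r\n")  # drop the line ending, keep everything else
        if pending.count(sep) >= ncols - 1:
            yield pending
            pending = ""
    if pending:
        # Incomplete trailing row: emit it as-is instead of silently dropping it
        yield pending

sample = (
    "field1|field2|field3|field4|field5\n"
    "value 1|value 2|val\n"
    "ue 3|value 4|value 5\n"
    "value 1|value 2|value 3|va\n"
    "lue 4|value 5\n"
)
for row in merge_broken_rows(sample.splitlines(keepends=True), 5):
    print(row)
```

Because it consumes any iterable of lines, an open file can be passed in directly, for example `merge_broken_rows(inf, 5)`.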
You have to count the number of fields, to match 5 per line:
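import re

# Text mode ("r"), so that read() returns str, as the str pattern requires in Python 3
with open(filetoread, "r") as inf:
    with open(filetowrite, "w") as fixed:
        # Four pipe-terminated fields, then a fifth that ends at a newline or at end of file.
        # re.DOTALL lets '.' cross the stray newlines inside a field.
        for l in re.finditer(r'(?:.*?\|){4}(?:.*?)(?:\n|\Z)', inf.read(), re.DOTALL):
            fixed.write(l.group(0).replace('\n', '') + '\n')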
The following will remove any carriage return characters embedded in each field:
# newline="\n" splits lines only at "\n", so any "\r" stays in the line (Python 3)
with open(filetoread, "r", newline="\n") as inf:
    with open(filetowrite, "w") as fixed:
        for line in (line.rstrip() for line in inf):
            fields = (field.replace('\r', '') for field in line.split('|'))
            fixed.write('|'.join(fields) + '\n')
If a line read from a text file is empty except for a ^M (carriage return) at the end, then in that case only, Python reads it as two empty lines.
infile:
Cookie: login=admin; session=oNvChuTLIyFhParkQ0c4UswT^M
^M
{"order":["descending","time"],"where":{"access_logs":{"time":{"<=":1675900799,">=":1673308800}},"users":{},"groups":{},"time_zones":{}},"object":"access_logs","fields":["COUNT(*)"],"join":"LEFT"}
Output of for line in infile: print('LINE:' + line + '!'):
LINE:Cookie: login=admin; session=oNvChuTLIyFhParkQ0c4UswT!
LINE:!
LINE:!
LINE:!
LINE:{"order":["descending","time"],"where":{"access_logs":{"time":{"<=":1675900799,">=":1673308800}},"users":{},"groups":{},"time_zones":{}},"object":"access_logs","fields":["COUNT(*)"],"join":"LEFT"}!
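One way to see what Python actually receives is to print the lines with repr-style output under different newline settings. The sketch below assumes the ^M characters are carriage returns ('\r'); the sample bytes and the `read_lines` helper are invented for illustration and are not the file from this answer. In the default text mode, a lone '\r' and a '\r\n' pair are each translated to '\n', so a '\r' sitting in front of a '\r\n' shows up as an extra empty line.

```python
import io

# Made-up bytes: a CRLF line, then a stray CR followed by CRLF, then an LF line
raw = b"first\r\n\r\r\nlast\n"

def read_lines(data: bytes, newline):
    """Decode bytes as text with the given newline mode and return the lines."""
    with io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline=newline) as f:
        return list(f)

# Default mode: every '\r', '\n', and '\r\n' becomes '\n'
print(read_lines(raw, None))   # ['first\n', '\n', '\n', 'last\n']

# newline="\n": lines end only at '\n' and nothing is translated
print(read_lines(raw, "\n"))   # ['first\r\n', '\r\r\n', 'last\n']
```

The built-in `open()` accepts the same `newline` argument, so the second form can be used on a real file to keep the carriage returns visible and strip them explicitly.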