Remove linebreak in csv
Question:
I have a CSV file that has errors. The most common one is a premature line break.
I don't know the best way to remove it. If I read the file line by line
with open("test.csv", "r") as reader:
    test = reader.read().splitlines()
the wrong structure is already in my variable. Is this still the right approach? Should I use a for loop over test and build a copy, or can I modify test directly while iterating over it?
I can identify the corrupt lines by the semicolon: some rows end with a ; and others start with one. So maybe counting separators would be an alternative way to solve it?
EDIT:
I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows which end with a ;
for line in lines:
    if "Foobar" in line:
        line = line.replace("Foobar", "")
    if ";\n" in line:
        line = line.replace(";\n", ";")
The only thing that remains are rows that begin with a ;
because for those I need to go back one entry in the list.
Example:
Col_a;Col_b;Col_c;Col_d
2021;Foobar;Bla
;Blub
Blub belongs in the row above.
Answers:
This is how I deal with this. This function fixes the line if there are more columns than needed or if there is a line break in the middle.
Parameters of the function are:
- message – content of the file – reader.read() in your case
- columns – number of expected columns
- filename – filename (I use it for logging)
def pre_parse(message, columns, filename):
    parsed_message = []
    i = 0
    temp_line = ''  # holds an incomplete row until the rest of it arrives
    for line in message.splitlines():
        #print(line)
        split = line.split(',')
        if len(split) == columns:
            parsed_message.append(line)
        elif len(split) > columns:
            print(f'Line {i} has been truncated in file {filename} - too many columns')
            split = split[:columns]
            line = ','.join(split)
            parsed_message.append(line)
        elif len(split) < columns and temp_line == '':
            temp_line = line.replace('\n', '')
            print(temp_line)
        elif temp_line != '':
            # glue the continuation onto the stored fragment
            line = temp_line + line
            if line.count(',') == columns - 1:
                print(f'Line {i} has been fixed in file {filename} - extra line feed')
                parsed_message.append(line)
                temp_line = ''
            else:
                temp_line = line.replace('\n', '')
        i += 1
    return parsed_message
Make sure you use the proper split character and the proper line feed character for your file.
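Since the file in the question is separated by ; rather than by a comma, here is a minimal sketch of the same idea with the separator passed in as a parameter. It is a simplified variant, not the function above: the name merge_broken_rows and the sep parameter are made up for illustration, and it assumes a row is complete as soon as it has at least the expected number of fields.

```python
def merge_broken_rows(text, columns, sep=";"):
    """Join consecutive lines until each row has `columns` fields.

    Hypothetical helper for illustration; extra fields are cut off.
    """
    rows = []
    pending = ""  # fragment of a row that is still too short
    for line in text.splitlines():
        candidate = pending + line
        parts = candidate.split(sep)
        if len(parts) < columns:
            # Still incomplete: keep it and wait for the next line.
            pending = candidate
        else:
            rows.append(sep.join(parts[:columns]))
            pending = ""
    return rows


sample = "Col_a;Col_b;Col_c;Col_d\n2021;Foobar;Bla\n;Blub\n"
print(merge_broken_rows(sample, columns=4))
# ['Col_a;Col_b;Col_c;Col_d', '2021;Foobar;Bla;Blub']
```

With the example from the question, the line starting with ; is appended to the row above it.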
Here’s a simple Python script to merge lines until you have the desired number of fields.
import sys

sep = ';'
fields = 4
collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    print(';'.join(collected))
    collected = []
This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost.
The separator and the number of fields can be edited in the variables at the top; exposing these as command-line parameters is left as an exercise.
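For anyone who wants to try the exercise, one possible sketch uses argparse. The option names --sep and --fields, the functions merge_lines and main, and the stream parameter are all my own choices for illustration, not part of the script above. The merging logic is the same as in the script.

```python
import argparse
import io
import sys


def merge_lines(lines, sep, fields):
    """Yield complete rows, joining lines until `fields` fields are collected."""
    collected = []
    for line in lines:
        new = line.rstrip("\n").split(sep)
        if collected:
            # The first piece continues the field that the line break cut off.
            collected[-1] += new[0]
            collected.extend(new[1:])
        else:
            collected = new
        if len(collected) >= fields:
            yield sep.join(collected)
            collected = []


def main(argv=None, stream=None):
    parser = argparse.ArgumentParser(description="Merge prematurely broken CSV lines.")
    parser.add_argument("--sep", default=";", help="field separator")
    parser.add_argument("--fields", type=int, default=4, help="expected fields per row")
    args = parser.parse_args(argv)
    # Read from standard input unless another stream is supplied.
    source = sys.stdin if stream is None else stream
    for row in merge_lines(source, args.sep, args.fields):
        print(row)


# Demonstration with an in-memory stream instead of real standard input.
demo = io.StringIO("Col_a;Col_b;Col_c;Col_d\n2021;Foobar;Bla\n;Blub\n")
main(["--sep", ";", "--fields", "4"], stream=demo)
```

Calling main() with no arguments makes it read the options from the command line and the data from standard input, like the original script. As before, an incomplete last row is dropped.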
If you wanted to keep the newlines, it would not be too hard to strip the newline only from the last field of each completed row, and use csv.writer to write the fields back out as properly quoted CSV.
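A rough sketch of that variant could look like this. The function name merge_keep_newlines is made up, and the in-memory io.StringIO objects stand in for real files; with real files you would open them with newline="" as the csv documentation recommends.

```python
import csv
import io


def merge_keep_newlines(lines, sep=";", fields=4):
    """Collect fields across lines; embedded line breaks stay inside the field."""
    collected = []
    for line in lines:
        new = line.split(sep)  # the trailing "\n" is kept on the last piece
        if collected:
            collected[-1] += new[0]
            collected.extend(new[1:])
        else:
            collected = new
        if len(collected) >= fields:
            # Only the newline that terminates the complete row is stripped.
            collected[-1] = collected[-1].rstrip("\n")
            yield collected
            collected = []


broken = io.StringIO("Col_a;Col_b;Col_c;Col_d\n2021;Foobar;Bla\n;Blub\n")
out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")
writer.writerows(merge_keep_newlines(broken))
print(out.getvalue())
```

The field that contained the premature line break is written in quotes, so a CSV reader that honours quoting reads it back as a single field containing a newline.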
I ended up using this post to create a solution: Replace CRLF with LF in Python 3.6. It helped me get over the hump and gave me an understanding of what was happening under the hood.
OldFile = r"C:\Test\input.csv"
NewFile = r"C:\Test\output.csv"
# reading it in as binary keeps the CR LF on Windows as is
# (the parenthesized with statement needs Python 3.10 or newer)
with (
    open(OldFile, 'rb') as f_in,
    open(NewFile, 'wb') as f_out,
):
    FileContent = f_in.read()
    # remove all line feeds, including the ones after a carriage return
    oldLineFeed = b'\n'
    newLineFeed = b''
    FileContent = FileContent.replace(oldLineFeed, newLineFeed)
    # only a carriage return is left at the end of each true line, so add the line feed back
    oldLineFeed = b'\r'
    newLineFeed = b'\r\n'
    FileContent = FileContent.replace(oldLineFeed, newLineFeed)
    f_out.write(FileContent)
# no explicit close() is needed: the with block closes both files
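The two replacements can also be tried on a bytes value without touching any files. This is only a sketch, and it relies on the same assumption as the code above: real row endings in the file are CR LF, while the premature breaks are bare LF. The function name drop_bare_line_feeds is made up for illustration.

```python
def drop_bare_line_feeds(data: bytes) -> bytes:
    """Remove every LF, then put an LF back after each remaining CR."""
    without_lf = data.replace(b"\n", b"")
    return without_lf.replace(b"\r", b"\r\n")


raw = b"Col_a;Col_b;Col_c;Col_d\r\n2021;Foobar;Bla\n;Blub\r\n"
print(drop_bare_line_feeds(raw))
# b'Col_a;Col_b;Col_c;Col_d\r\n2021;Foobar;Bla;Blub\r\n'
```

If the file uses plain LF for real row endings as well, this approach would join every row into one line, so it only fits files with that mix of line endings.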