Remove linebreak in csv

Question

I have a CSV file that has errors. The most common one is a too early linebreak.

But now I don’t know how to remove it ideally. If I read the line by line

with open("test.csv", "r") as reader:
    test = reader.read().splitlines()

the wrong structure is already in my variable. Is this still the right approach and do I use a for loop over test and create a copy or can I manipulate directly in the test variable while iterating over it?

I can identify the corrupt lines by the semikolon, some rows end with a ; others start with it. So maybe counting would be an alternative way to solve it?

EDIT:
I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows which end with a ;

for line in lines:
    if("Foobar" in line):
        line = line.replace("Foobar", "")
    if(";n" in line):
        line = line.replace(";n", ";")

The only thing that remains are rows that beginn with a ;
Since I need to go back one entry in the list

Example:

Col_a;Col_b;Col_c;Col_d 
2021;Foobar;Bla 
;Blub

Blub belongs in the row above.

Asked By: sa-biene

||

Source

Answer 1

This is how I deal with this. This function fixes the line if there are more columns than needed or if there is a line break in the middle.

Parameters of the function are:

message – content of the file – reader.read() in your case
columns – number of expected columns
filename – filename (I use it for logging)

def pre_parse(message, columns, filename):
    parsed_message=[]
    i =0
    temp_line =''
    for line in message.splitlines():
        #print(line)
        split = line.split(',')
        if len(split) == columns:
            parsed_message.append(line)
        elif len(split) > columns:
            print(f'Line {i} has been truncated in file {filename} - too much columns'))
            split = split[:columns]
            line = ','.join(split)
            parsed_message.append(line)
        elif len(split) < columns and temp_line =='':
            temp_line = line.replace('n','')
            print(temp_line)
        elif temp_line !='':
            line = temp_line+line
            if line.count(',') == columns-1:
                print((f'Line {i} has been fixed in file {filename} - extra line feed'))
                parsed_message.append(line)
                temp_line =''
            else:
                temp_line=line.replace('n', '')
        i+=1
    return parsed_message

make sure you use proper split character and proper line feed characer.

Answered By: Hubert

Answer 2

Here’s a simple Python script to merge lines until you have the desired number of fields.

import sys

sep = ';'
fields = 4

collected = []
for line in sys.stdin:
    new = line.rstrip('n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
       continue
    print(';'.join(collected))
    collected = []

This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost.
The separator and the number of fields can be edited into the variables at the top; exposing these as command-line parameters left as an exercise.

If you wanted to keep the newlines, it would not be too hard to only strip a newline from the last fields, and use csv.writer to write the fields back out as properly quoted CSV.

Answered By: tripleee

Answer 3

I ended up using this post to create a solution: Replace CRLF with LF in Python 3.6 it also helped me get over the hump and provided an understanding of what was happening underneath the hood.

OldFile=r"c:Testinput.csv"
NewFile=r"C:Testoutput.csv"

#reading it in as binary keeps the cr lf in windows as is
with (
    open(OldFile, 'rb') as f_in,
    open(NewFile, 'wb') as f_out,
):


FileContent = f_in.read()

#removing all line breaks including the ones after the carriage return
oldLineFeed = b'n'
newLineFeed = b''

FileContent = FileContent.replace(oldLineFeed, newLineFeed)

#only have a carriage return now at the end of each true line, added back in the line break 
oldLineFeed = b'r'
newLineFeed = b'rn'
FileContent = FileContent.replace(oldLineFeed, newLineFeed)

f_out.write(FileContent)

f_in.close()
f_out.close()

Answered By: Breadtruck

Remove linebreak in csv

Question:

Answers: