Memory Error when parsing a large number of files

Question:

I am parsing 6k CSV files to merge them into one for joint analysis and training of an ML model. There are too many files, and my computer runs out of memory when I simply concatenate them.


s = ''

for f in csv_files:
    # read the csv file
    # df = df.append(pd.read_csv(f))
    s = s + open(f, mode='r').read()[32:]
    print(f)

file = open('bigdata.csv', mode='w')
file.write(s)
file.close()


I need a way to create a single dataset (about 60 GB) from all the files so I can train my ML model.

Asked By: Kost1k


Answers:

I believe this may help:

file = open('bigdata.csv', mode='w')

for f in csv_files:
    # read one file at a time, skip its first 32 characters, and write it
    # out immediately, so only one file is ever held in memory
    s = open(f, mode='r').read()[32:]
    file.write(s)

file.close()

In contrast, your original code needs at least as much memory as the size of the output file, which is 60 GB and likely more than your computer's RAM.

However, if a single input file is large enough to exhaust your memory on its own, this method will also fail; in that case you would need to read each CSV file line by line and write the lines into the output file. I didn't write that version because I'm not sure what your magic number 32 refers to.
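
For completeness, here is a minimal sketch of that line-by-line variant. It assumes the 32 characters you slice off are a fixed-length preamble at the start of each file; if they are actually meant to drop a header row, skip the first line instead.

# Sketch only: assumes the first 32 characters of every file are a
# fixed-length preamble to drop (your magic number 32 - adjust as needed).
with open('bigdata.csv', mode='w') as out:
    for f in csv_files:
        with open(f, mode='r') as src:
            src.read(32)          # skip the 32-character preamble
            for line in src:      # stream line by line, never the whole file
                out.write(line)

Because only one line is held in memory at a time, this works even if an individual CSV file is itself larger than your RAM.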

Answered By: HolderRoy