Writing large Pandas Dataframes to CSV file in chunks
Question:
How do I write large data files out to a CSV file in chunks?
I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of the data files are of interest to me.
I want to make things easier by making copies of these files with only the columns of interest so I have smaller files to work with for post-processing. So I plan to read the file into a dataframe, then write to csv file.
I’ve been looking into reading large data files in chunks into a dataframe. However, I haven’t been able to find anything on how to write out the data to a csv file in chunks.
Here is what I’m trying now, but this doesn’t append to the CSV file:
with open(os.path.join(folder, filename), 'r') as src:
    df = pd.read_csv(src, sep='\t', skiprows=(0,1,2), header=(0), chunksize=1000)
    for chunk in df:
        chunk.to_csv(os.path.join(folder, new_folder,
                                  "new_file_" + filename),
                     columns = [['TIME','STUFF']])
Answers:
Solution:
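header = True
# chunks is the iterator returned by pd.read_csv(..., chunksize=1000)
for chunk in chunks:
    chunk.to_csv(os.path.join(folder, new_folder, "new_file_" + filename),
                 header=header, columns=['TIME', 'STUFF'], mode='a')
    header = False  # write the column header only once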
Notes:
- mode='a' tells pandas to append to the file instead of overwriting it.
- We only write a column header on the first chunk.
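Putting the read and write sides together, here is a minimal sketch of the whole copy step. The function name copy_columns_in_chunks and its parameters are made up for illustration, and it assumes a tab-separated input with three junk lines before the header row, as in the question. It also passes usecols to read_csv so that unwanted columns are never loaded, and it opens the output with mode='w' for the first chunk so that re-running it does not append to an old file.

```python
import pandas as pd


def copy_columns_in_chunks(src_path, dst_path, columns, chunksize=1000,
                           sep="\t", skiprows=(0, 1, 2)):
    """Copy only `columns` from src_path to dst_path, one chunk at a time.

    Hypothetical helper: the name and defaults are assumptions, not part
    of pandas. Returns the number of data rows written.
    """
    reader = pd.read_csv(src_path, sep=sep, skiprows=skiprows, header=0,
                         usecols=columns, chunksize=chunksize)
    rows_written = 0
    for i, chunk in enumerate(reader):
        first = (i == 0)
        chunk.to_csv(dst_path,
                     columns=columns,             # fixes the output column order
                     header=first,                # header only on the first chunk
                     mode="w" if first else "a",  # overwrite once, then append
                     index=False)
        rows_written += len(chunk)
    return rows_written
```

It could be called as copy_columns_in_chunks(file_in, file_out, ['TIME', 'STUFF']); only one chunk of the selected columns is held in memory at any time.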
Check out the chunksize argument of the to_csv method; it is described in the pandas docs for DataFrame.to_csv.
Writing to file would look like:
df.to_csv("path/to/save/file.csv", chunksize=1000, columns=['TIME', 'STUFF'])
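One caveat with this approach: the chunksize argument of to_csv only controls how many rows are written at a time, so the full DataFrame still has to fit in memory first. A small sketch, where write_columns is a hypothetical wrapper name and index=False is an added choice to keep the row index out of the file:

```python
import pandas as pd


def write_columns(df, out_path, columns, chunksize=1000):
    """Write only `columns` of an in-memory DataFrame, `chunksize` rows at a time.

    Hypothetical wrapper around DataFrame.to_csv; the DataFrame itself is
    already fully loaded, so this does not reduce peak memory on the read side.
    """
    df.to_csv(out_path, columns=columns, chunksize=chunksize, index=False)
```

This is the simpler option when the input can be loaded in one go; when it cannot, the input has to be read in chunks as well.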
Why don’t you read only the columns of interest and then save them?
file_in = os.path.join(folder, filename)
file_out = os.path.join(folder, new_folder, 'new_file' + filename)
# usecols selects columns by name; names= would instead relabel all 20 columns
df = pd.read_csv(file_in, sep='\t', skiprows=(0, 1, 2), header=0, usecols=['TIME', 'STUFF'])
df.to_csv(file_out)