Refactoring pandas using an iterator via chunksize
Question:
I am looking for advice on using a pandas iterator.
I performed a parsing operation using Python pandas; the size of the input files (output from a bioinformatics program called eggNOG) is causing a RAM bottleneck. The script simply cannot process the file.
The obvious solution is to switch to an iterator, which for pandas is the chunksize option:
import pandas as pd
import numpy as np
df = pd.read_csv('myinfile.csv', sep='\t', chunksize=100)
What has changed from the original code is the chunksize=100 bit, which forces read_csv to return an iterator of DataFrames rather than a single DataFrame.
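A quick way to see what chunksize changes, as a minimal sketch on a small synthetic in-memory file (the column names here are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical tab-separated data standing in for myinfile.csv.
raw = "name\tscore\tevalue\nA\t1\t-\nB\t2\t0.5\nC\t3\t-\n"

# With chunksize set, read_csv returns a TextFileReader,
# an iterator yielding DataFrames of at most `chunksize` rows.
reader = pd.read_csv(io.StringIO(raw), sep="\t", chunksize=2)
chunks = list(reader)
print(len(chunks))               # 3 data rows at chunksize=2 -> 2 chunks
print(type(chunks[0]).__name__)  # each chunk is an ordinary DataFrame
```

Each chunk supports the usual DataFrame operations (drop, replace, to_csv), so the per-chunk processing looks just like the non-chunked code.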
The next step is a simple operation: drop a few columns, convert all '-' characters to np.nan, and write the whole file back out.
df = df.drop(['score', 'evalue', 'Description', 'EC', 'PFAMs'], axis=1).replace('-', np.nan)
df.to_csv('my.csv', sep='\t', index=False)
How is this done under a pandas iterator?
Answers:
IIUC, you can do:
cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']
data = []
for chunk in pd.read_csv('myinfile.csv', sep='\t', na_values='-', chunksize=100):
    chunk = chunk.drop(columns=cols_to_drop)
    data.append(chunk)
pd.concat(data).to_csv('my.csv', sep='\t', index=False)
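Note that pd.concat(data) still assembles the full result in memory before writing. If that is the bottleneck, each chunk can be written out as soon as it is produced, emitting the header only once. A minimal sketch, using an in-memory buffer and hypothetical column names in place of the real files:

```python
import io
import pandas as pd

cols_to_drop = ['score', 'evalue']

# Hypothetical input standing in for myinfile.csv.
raw = "name\tscore\tevalue\tkeep\nA\t1\t-\tx\nB\t2\t0.5\t-\n"
out = io.StringIO()  # stands in for open('my.csv', 'w')

for i, chunk in enumerate(pd.read_csv(io.StringIO(raw), sep='\t',
                                      na_values='-', chunksize=1)):
    chunk = chunk.drop(columns=cols_to_drop)
    # Write the header with the first chunk only; later chunks append rows.
    chunk.to_csv(out, sep='\t', index=False, header=(i == 0))

print(out.getvalue())
```

With a real file, replacing the StringIO objects with `open(...)` calls (or passing `mode='a'` to to_csv with a filename) gives the same effect: peak memory stays at one chunk.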
If you know the columns you want to keep instead of which ones you want to drop, use:
cols_to_keep = ['col1', 'col2', 'col3']
data = []
for chunk in pd.read_csv('myinfile.csv', usecols=cols_to_keep, sep='\t', na_values='-', chunksize=100):
    data.append(chunk)
pd.concat(data).to_csv('my.csv', sep='\t', index=False)
Alternative inspired by @el_oso:
cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']
with (open('myinfile.csv') as inp,
      open('my.csv', 'w') as out):
    headers = inp.readline().strip().split('\t')
    out.write('\t'.join([col for col in headers if col not in cols_to_drop]) + '\n')
    for chunk in pd.read_csv(inp, header=None, names=headers, sep='\t', na_values='-', chunksize=100):
        chunk = chunk.drop(columns=cols_to_drop)
        chunk.to_csv(out, sep='\t', index=False, header=False)
If you're already having memory issues reading the file, the code below reads and writes to the new file in chunks, never holding the full table in memory.
cols_to_keep = ['col1', 'col2', 'col3']  # a list, so the header order is deterministic
df = pd.read_csv('myinfile.csv', sep='\t', na_values='-', chunksize=10, usecols=cols_to_keep)
with open('my.csv', 'w') as f:
    f.write('\t'.join(cols_to_keep) + '\n')
    for chunk in df:
        # do your processing here, appending that chunk to the file
        chunk.to_csv(f, header=None, sep='\t', index=False)
f is an _io.TextIOWrapper, i.e. a stream you can keep writing to in small memory chunks.