MemoryError when I merge two Pandas data frames

Question:

I have searched almost all over the internet, and somehow none of the approaches seem to work in my case.

I have two large CSV files (each with a million+ rows and about 300-400 MB in size). They load fine into data frames with the read_csv function, without needing the chunksize parameter.
I have even performed minor operations on this data, such as generating new columns and filtering.

However, when I try to merge these two frames, I get a MemoryError. I have even tried using SQLite to do the merge, but in vain: the operation takes forever.

Mine is a Windows 7 PC with 8 GB of RAM, and the Python version is 2.7.

Thank you.

Edit: I tried chunking methods too. When I do this, I don't get a MemoryError, but the RAM usage explodes and my system crashes.

Asked By: Ronit Chidara


Answers:

When you merge data using pandas.merge, it keeps df1, df2, and the merged result in memory at the same time. I believe that is why you get the memory error. You should export df2 to a CSV file, read it back with the chunksize option, and merge it piece by piece.

There might be a better way, but you can try this. For large data sets you can use the chunksize option in pandas.read_csv.

import pandas as pd

df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")

# Create an empty frame with the combined set of columns and write it out
# so that df3.csv starts with the right header.
df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
df_result.to_csv("df3.csv", index=False)

# Delete df2 to save memory; it will be re-read in chunks below.
del df2

def preprocess(chunk):
    # Merge one chunk of the second file against df1 and append it to df3.csv.
    merged = pd.merge(df1, chunk, left_on="Colname1", right_on="Colname2")
    merged.to_csv("df3.csv", mode="a", header=False, index=False)

# Pick the chunksize based on your row width and available memory.
reader = pd.read_csv("yourdata2.csv", chunksize=1000)

for chunk in reader:
    preprocess(chunk)

This will save the merged data to df3.csv.

Answered By: T_cat

In general, the chunked version suggested by @T_cat works great.

However, the memory explosion might be caused by joining on columns that contain NaN values, so you may want to exclude those rows from the join.

see: https://github.com/pandas-dev/pandas/issues/24698#issuecomment-614347153
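A minimal sketch of that idea, assuming the join column is called key_col (a placeholder name):

# NaN keys can match each other in a pandas merge, multiplying rows
# and blowing up the result; drop them from both frames before merging.
df1 = df1.dropna(subset=["key_col"])
df2 = df2.dropna(subset=["key_col"])
merged = pd.merge(df1, df2, on="key_col")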

Answered By: Neil

The reason you might be getting MemoryError: Unable to allocate... could be duplicates or blanks in your dataframe. Check the column you are joining on (when using merge) and see whether it has duplicates or blanks. If so, get rid of them with this command:

df.drop_duplicates(subset='column_name', keep=False, inplace=True)

Then re-run your Python/pandas code. This worked for me.
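If you want to confirm that duplicates or blanks are actually present before dropping anything, a quick check might look like this (column_name is a placeholder):

# Count repeated and missing values in the join column.
print(df['column_name'].duplicated().sum())  # duplicate key values
print(df['column_name'].isna().sum())        # blank / NaN key values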

Answered By: user3062459

Maybe the left data frame has NaN values in the merge columns, which causes the final merged dataframe to bloat.

If acceptable, fill the merge column in the left data frame with zeros:

df['left_column'] = df['left_column'].fillna(0)

Then do the merge. See what you get.
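Putting the two steps together, a rough sketch (the frame and column names are assumed):

# Fill missing keys in the left frame, then merge on the key columns.
df_left['left_column'] = df_left['left_column'].fillna(0)
merged = pd.merge(df_left, df_right, left_on='left_column', right_on='right_column')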

Answered By: Pranay Aryal