Kernel stopping and restarting when merging two huge databases

Question:

I know this might be quite a general question, but I'll try.
I have three huge databases (around 5 million observations each) that I have to merge together, but when I do so using

db_cpc_id = pd.merge(df_id_appended, df_cpc_appended, how='left', on='docdb_family_id')

the kernel stops working. Any suggestions on how to avoid the kernel restarting? Maybe using pd.concat() might solve the issue?

Thank you

Asked By: Nutarelli Federico


Answers:

The first thing you should consider is that merge is memory-intensive, and you may simply not have enough RAM for this operation. Have a look at Vaex (https://vaex.io/), a fast and easy way to manipulate massive amounts of data out of core. The syntax is not identical to pandas but very similar. The example below assumes you have two CSV files that you can load, join, and then store.

import vaex

# convert=True writes each CSV to HDF5 so the data is memory-mapped
# rather than loaded fully into RAM; chunk_size controls the conversion batches
vaex_df1 = vaex.from_csv(file1, convert=True, chunk_size=5_000)
vaex_df2 = vaex.from_csv(file2, convert=True, chunk_size=5_000)

# Left join on the family id, mirroring the original pandas merge
joined_df = vaex_df1.join(vaex_df2, how='left', on='docdb_family_id')
joined_df.export_hdf5('joined.hdf5')  # store the result on disk

Please check your system resources when running your code to get a better understanding of why your kernel is failing 🙂
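
Alternatively, if you want to stay in pandas, you can stream one side of the merge in chunks so that only a slice of the big table is in memory at a time, and then combine the pieces with pd.concat(). Below is a minimal sketch, assuming both tables can be re-read from CSV files; the file names and chunk size are illustrative, not from the original question.

import pandas as pd

# Hypothetical file names: adjust to wherever your appended tables live.
# The lookup table (df_cpc_appended) is kept in memory; the left table is streamed.
df_cpc_appended = pd.read_csv('cpc_appended.csv')

pieces = []
# Stream the left table so only one chunk plus the lookup table is in RAM
for chunk in pd.read_csv('id_appended.csv', chunksize=500_000):
    pieces.append(chunk.merge(df_cpc_appended, how='left', on='docdb_family_id'))

db_cpc_id = pd.concat(pieces, ignore_index=True)

Note that the final pd.concat() still has to hold the full result, so if even that exceeds your RAM, write each merged chunk to disk (e.g. with to_csv in append mode) instead of collecting the pieces in a list.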

Answered By: Pieter Geelen