Kernel stopping and restarting when merging two huge databases
Question:
I know this might be quite a general question, but I'll try.
I have 3 huge databases (around 5 million observations each) that I have to merge together, but when I do so using
db_cpc_id = pd.merge(df_id_appended, df_cpc_appended, how='left', on='docdb_family_id')
the kernel stops working. Any suggestions on how to avoid the kernel restarting? Maybe using pd.concat() might solve the issue?
Thank you
Answers:
The first thing you should consider is that merge is memory-intensive, and you may simply not have enough RAM for this operation. Have a look at Vaex (https://vaex.io/), a fast and easy way to manipulate massive amounts of data out of core. Its syntax is not identical to pandas, but it is very similar. The example below assumes you have two CSVs that you can load, merge, and then store.
import vaex
vaex_df1 = vaex.from_csv(file1, convert=True, chunk_size=5_000)
vaex_df2 = vaex.from_csv(file2, convert=True, chunk_size=5_000)
joined_df = vaex_df1.join(vaex_df2, how='left', on='docdb_family_id')
Please check your system resources when running your code to get a better understanding of why your kernel is failing 🙂
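If you would rather stay in pandas, another option is to merge the left table slice by slice, so the full intermediate result is never materialised at once. Here is a minimal sketch with tiny stand-in frames; the column names follow the question, but the data and the chunk size are illustrative (for 5M-row tables you would use something like 500_000 per chunk and read from your real sources):

```python
import pandas as pd

# Tiny stand-in tables; in practice these would be your 5M-row frames.
df_id_appended = pd.DataFrame({
    "docdb_family_id": [1, 2, 3, 4],
    "id_val": ["a", "b", "c", "d"],
})
df_cpc_appended = pd.DataFrame({
    "docdb_family_id": [1, 2, 4],
    "cpc": ["X1", "Y2", "Z3"],
})

# Merge the left table in slices so only one chunk's join result
# lives in memory at a time; tune chunk_size to your available RAM.
chunk_size = 2  # illustrative; use e.g. 500_000 for real data
pieces = []
for start in range(0, len(df_id_appended), chunk_size):
    chunk = df_id_appended.iloc[start:start + chunk_size]
    pieces.append(chunk.merge(df_cpc_appended, how="left", on="docdb_family_id"))

# Stitch the per-chunk results back together.
db_cpc_id = pd.concat(pieces, ignore_index=True)
```

This keeps the peak memory closer to one chunk's merge result plus the right-hand table, at the cost of some extra bookkeeping. Note that pd.concat alone would not replace the merge here: concat stacks frames, while the join-on-key logic still has to come from merge.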