Maximum size of pandas dataframe

Question:

I’m trying to read in a somewhat large dataset using pandas’ read_csv or read_stata functions, but I keep running into MemoryErrors. What is the maximum size of a dataframe? My understanding is that dataframes should be okay as long as the data fits into memory, which shouldn’t be a problem for me. What else could cause the memory error?

For context, I’m trying to read in the Survey of Consumer Finances 2007, both in ASCII format (using read_csv) and in Stata format (using read_stata). The file is around 200 MB as a .dta file and around 1.2 GB as ASCII, and opening it in Stata tells me that there are 5,800 variables/columns for 22,000 observations/rows.

Asked By: Nils Gudat


Answers:

I’m going to post this answer as it was discussed in the comments. I’ve seen this question come up numerous times without an accepted answer.

The MemoryError is intuitive – you are out of memory. But sometimes solving or debugging this error is frustrating, because you seem to have enough memory and the error remains.

1) Check for code errors

This may be a “dumb step” but that’s why it’s first. Make sure there are no infinite loops or things that will knowingly take a long time (like using something from the os module that will search your entire computer and put the output in an Excel file).

2) Make your code more efficient

This goes along the lines of step 1. But if something simple is taking a long time, there’s usually a module or a better way of doing it that is faster and more memory efficient. That’s the beauty of Python and/or open source languages!
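For example, with read_csv you can often cut memory use just by being explicit about what you load. This is a minimal sketch, not something from the question itself; the file name, column names, and dtypes are placeholders for whatever your data actually contains:

import pandas as pd

# Placeholder file name, column names, and dtypes -
# only load the columns you need and use smaller dtypes where possible
df = pd.read_csv(
    "scf2007.csv",
    usecols=["var1", "var2", "weight"],   # a subset of the ~5,800 columns
    dtype={"var1": "int32", "var2": "int32", "weight": "float32"},
)
df.info(memory_usage="deep")              # reports the actual memory used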

3) Check the total memory of the object

Start by checking the memory of the object. There are a ton of threads on Stack Overflow about this, so you can search for them. Popular answers are here and here.

To find the size of an object in bytes you can always use sys.getsizeof():

import sys

# prints the size of the object in bytes
print(sys.getsizeof(OBJECT_NAME_HERE))

Now, the error might happen before anything is even created, but if you read the CSV in chunks you can see how much memory is being used per chunk.
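As a rough sketch (the file name and chunk size below are placeholders; adjust them to your data), you could read the CSV in chunks and print how much memory each chunk takes:

import pandas as pd

# Placeholder file name and chunk size
for i, chunk in enumerate(pd.read_csv("scf2007.csv", chunksize=1000)):
    # deep=True also counts string (object) columns properly
    mem_mib = chunk.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"chunk {i}: {len(chunk)} rows, {mem_mib:.1f} MiB")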

4) Check the memory while running

Sometimes you have enough memory, but the function you are running consumes a lot of memory at runtime. This causes memory usage to spike beyond the actual size of the finished object, causing the code/process to error. Checking memory in real time is lengthy, but it can be done. IPython is good for that; check their documentation.

Use the code below to see the documentation straight in a Jupyter Notebook (these magics come from the memory_profiler extension):

%mprun?
%memit?

Sample use:

%load_ext memory_profiler

def lol(x):
    return x

%memit lol(500)
# output --- peak memory: 48.31 MiB, increment: 0.00 MiB
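For instance, you can point %memit at the read itself to see the peak memory used while parsing. This is just a sketch; the file name is a placeholder:

%load_ext memory_profiler
import pandas as pd

# placeholder file name; reports peak memory during the read
%memit pd.read_csv("scf2007.csv")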

If you need help on magic functions, this is a great post.

5) This one may be first… but check for simple things like the bit version

As in your case, simply switching the bit version of Python you were running solved the issue.
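A quick way to check (a small sketch using only the standard library) whether you are on a 32-bit or 64-bit Python build:

import struct

# prints 32 for a 32-bit build (which caps usable memory per process)
# and 64 for a 64-bit build
print(struct.calcsize("P") * 8)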

Usually the above steps solve my issues.

Answered By: MattR