Creating large Pandas DataFrames: preallocation vs append vs concat
Question:
I am confused by the performance in Pandas when building a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is because Numpy grabs all the memory it needs at once instead of having to reallocate memory with every append operation.
In Pandas, I seem to be getting better performance by using the df = df.append(temp) pattern.
Here is an example with timing. The definition of the Timer class follows. As you can see, I find that preallocating is roughly 10x slower than using append! Preallocating a dataframe with np.empty values of the appropriate dtype helps a great deal, but the append method is still the fastest.
import numpy as np
from numpy.random import rand
import pandas as pd
from timer import Timer
# Some constants
num_dfs = 10 # Number of random dataframes to generate
n_rows = 2500
n_cols = 40
n_reps = 100 # Number of repetitions for timing
# Generate a list of num_dfs dataframes of random values
df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)), columns=np.arange(n_cols)) for i in np.arange(num_dfs)]
##
# Define three methods of growing a large dataframe
##
# Method 1 - append dataframes
def method1():
    out_df1 = pd.DataFrame(columns=np.arange(n_cols))
    for df in df_list:
        out_df1 = out_df1.append(df, ignore_index=True)
    return out_df1
def method2():
    # Create an empty dataframe that is big enough to hold all the dataframes in df_list
    out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows))
    # EDIT_1: Set the dtypes of each column
    for ix, col in enumerate(out_df2.columns):
        out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])
    # Fill in the values
    for ix, df in enumerate(df_list):
        out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df2
# EDIT_2:
# Method 3 - preallocate dataframe with np.empty data of appropriate type
def method3():
    # Create fake data array
    data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt) for dt in df_list[0].dtypes]))
    # Create placeholder dataframe
    out_df3 = pd.DataFrame(data)
    # Fill in the real values
    for ix, df in enumerate(df_list):
        out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df3
##
# Time all three methods
##
# Time Method 1
times_1 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df1 = method1()
    times_1[i] = t.secs
print 'Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1))
print 'Best time: %f' % (np.min(times_1))
print 'Mean time: %f' % (np.mean(times_1))
#>> Total time for 100 repetitions of Method 1: 2.928296 [sec]
#>> Best time: 0.028532
#>> Mean time: 0.029283
# Time Method 2
times_2 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df2 = method2()
    times_2[i] = t.secs
print 'Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2))
print 'Best time: %f' % (np.min(times_2))
print 'Mean time: %f' % (np.mean(times_2))
#>> Total time for 100 repetitions of Method 2: 32.143247 [sec]
#>> Best time: 0.315075
#>> Mean time: 0.321432
# Time Method 3
times_3 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df3 = method3()
    times_3[i] = t.secs
print 'Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3))
print 'Best time: %f' % (np.min(times_3))
print 'Mean time: %f' % (np.mean(times_3))
#>> Total time for 100 repetitions of Method 3: 6.577038 [sec]
#>> Best time: 0.063437
#>> Mean time: 0.065770
I use a nice Timer class courtesy of Huy Nguyen:
# credit: http://www.huyng.com/posts/python-performance-analysis/
import time
class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs
If you are still following, I have two questions:
1) Why is the append method faster? (NOTE: for very small dataframes, i.e. n_rows = 40, it is actually slower.)
2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large csv files).
Thanks for your help!
EDIT_1:
In my real-world project, the columns have different dtypes, so I cannot use the pd.DataFrame(..., dtype=some_type) trick to improve the performance of preallocation, per BrenBarn’s recommendation: the dtype parameter forces all the columns to share the same dtype [Ref. issue 4464].
I added some lines to method2() in my code to change the dtypes column by column to match those of the input dataframes. This operation is expensive and negates the benefit of having the appropriate dtypes when writing blocks of rows.
EDIT_2: Per @Joris’s suggestion, try preallocating a dataframe using a placeholder array created with np.empty(..., dtype=some_type).
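For the mixed-dtype case, here is a minimal sketch of an alternative preallocation, assuming made-up column names and dtypes: build the frame from a dict of per-column np.empty arrays, so every column starts out with its own dtype and no astype pass is needed.
import numpy as np
import pandas as pd

# Hypothetical per-column dtypes standing in for the real-world project
col_dtypes = {'a': np.float64, 'b': np.int64, 'c': object}
n_total = 25000  # e.g. num_dfs * n_rows

# Each column gets its own dtype from the start, so no astype loop is required
out_df = pd.DataFrame({col: np.empty(n_total, dtype=dt)
                       for col, dt in col_dtypes.items()})
print(out_df.dtypes)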
Answers:
You didn’t specify any data or dtype for out_df2, so it ends up with the “object” dtype. This makes assigning values to it very slow. Specify a float64 dtype instead:
out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows), dtype=np.float64)
You will see a dramatic speedup. When I tried it, method2 with this change is about twice as fast as method1.
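As a minimal sketch of that effect in isolation (sizes arbitrary, timings will differ by machine), assign the same block of floats into an object-dtype frame and into a float64 frame and time both:
import time
import numpy as np
import pandas as pd

n_rows, n_cols = 25000, 40
block = np.random.rand(n_rows, n_cols)

# One preallocated frame with the default object dtype, one with float64
df_obj = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(n_rows))
df_f64 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(n_rows), dtype=np.float64)

for name, target in [('object', df_obj), ('float64', df_f64)]:
    start = time.time()
    target.iloc[:, :] = block  # same assignment, only the destination dtype differs
    print('%s dtype: %.4f sec' % (name, time.time() - start))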
Your benchmark is actually too small to show the real difference.
Appending copies EACH time: every append rebuilds the accumulated result, so building a frame from N chunks copies on the order of N*(N-1)/2 chunk-sized blocks of memory in total. This is horribly inefficient as the size of your dataframe grows. It might not matter for a very small frame, but with any real size it matters a lot. This is specifically noted in the docs here, though kind of a small warning.
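A rough, pandas-free sketch of the arithmetic behind that claim, using the question’s sizes of 10 chunks of 2500 rows: each append rewrites the whole accumulated result, so the total number of rows written grows quadratically with the number of chunks, while a single concat writes each chunk once.
def rows_written_by_append(num_chunks, rows_per_chunk):
    # Each append rebuilds the accumulated frame, rewriting everything so far
    total, accumulated = 0, 0
    for _ in range(num_chunks):
        accumulated += rows_per_chunk
        total += accumulated
    return total

def rows_written_by_concat(num_chunks, rows_per_chunk):
    # A single concat writes every chunk into the result exactly once
    return num_chunks * rows_per_chunk

print(rows_written_by_append(10, 2500))  # 137500 rows written
print(rows_written_by_concat(10, 2500))  # 25000 rows written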
In [97]: df = DataFrame(np.random.randn(100000,20))
In [98]: df['B'] = 'foo'
In [99]: df['C'] = pd.Timestamp('20130101')
In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0 100000 non-null float64
1 100000 non-null float64
2 100000 non-null float64
3 100000 non-null float64
4 100000 non-null float64
5 100000 non-null float64
6 100000 non-null float64
7 100000 non-null float64
8 100000 non-null float64
9 100000 non-null float64
10 100000 non-null float64
11 100000 non-null float64
12 100000 non-null float64
13 100000 non-null float64
14 100000 non-null float64
15 100000 non-null float64
16 100000 non-null float64
17 100000 non-null float64
18 100000 non-null float64
19 100000 non-null float64
B 100000 non-null object
C 100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB
Appending
In [85]: def f1():
   ....:     result = df
   ....:     for i in range(9):
   ....:         result = result.append(df)
   ....:     return result
   ....:
Concat
In [86]: def f2():
   ....:     result = []
   ....:     for i in range(10):
   ....:         result.append(df)
   ....:     return pd.concat(result)
   ....:
In [100]: f1().equals(f2())
Out[100]: True
In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop
In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop
Note that I wouldn’t even bother trying to pre-allocate. It’s somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you could make a giant frame and simply .loc into it and it would work). But pd.concat is just dead simple, works reliably, and is fast.
And the timing with your sizes from above:
In [104]: df = DataFrame(np.random.randn(2500,40))
In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop
In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop
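For the CSV use case in the question, the same pattern applies: read each file into its own frame, collect the frames in a list, and concatenate once at the end. A minimal sketch (the file glob is hypothetical):
import glob
import pandas as pd

# Hypothetical chunked CSV files, e.g. data_000.csv, data_001.csv, ...
csv_files = sorted(glob.glob('data_*.csv'))

# Read each chunk, then pay the concatenation cost exactly once
chunks = [pd.read_csv(f) for f in csv_files]
big_df = pd.concat(chunks, ignore_index=True)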
@Jeff, pd.concat wins by a mile! I benchmarked a fourth method using pd.concat with num_dfs = 500. The results are unequivocal.
The method4() definition:
# Method 4 - use pd.concat on df_list
def method4():
    return pd.concat(df_list, ignore_index=True)
Profiling results, using the same Timer as in my original question:
Total time for 100 repetitions of Method 1: 3679.334655 [sec]
Best time: 35.570036
Mean time: 36.793347
Total time for 100 repetitions of Method 2: 1569.917425 [sec]
Best time: 15.457102
Mean time: 15.699174
Total time for 100 repetitions of Method 3: 325.730455 [sec]
Best time: 3.192702
Mean time: 3.257305
Total time for 100 repetitions of Method 4: 25.448473 [sec]
Best time: 0.244309
Mean time: 0.254485
The pd.concat method is 13x faster than preallocating with an np.empty(..., dtype) placeholder.
Jeff’s answer is correct, but I found that for my type of data another solution worked better.
def df_():
    return pd.DataFrame(['foo'] * np.random.randint(100)).transpose()

k = 100
frames = [df_() for x in range(0, k)]

def f1():
    result = frames[0]
    for i in range(k-1):
        result = result.append(frames[i+1])
    return result

def f2():
    result = []
    for i in range(k):
        result.append(frames[i])
    return pd.concat(result)

def f3():
    result = []
    for i in range(k):
        result.append(frames[i])
    n = 2
    while len(result) > 1:
        _result = []
        for i in range(0, len(result), n):
            _result.append(pd.concat(result[i:i+n]))
        result = _result
    return result[0]
My dataframes are each a single row of varying length – the null entries introduced when combining them must have something to do with why f3() succeeds.
In [33]: f1().equals(f2())
Out[33]: True
In [34]: f1().equals(f3())
Out[34]: True
In [35]: %timeit f1()
357 ms ± 192 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [36]: %timeit f2()
562 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]: %timeit f3()
215 ms ± 58.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The above results are still for k = 100, but for larger k the difference is even more significant.
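If that pairwise strategy also helps for your data, the same idea can be written as a small reusable helper. This is only a sketch generalizing f3() above, with the group size as a parameter instead of hard-coded to 2:
import pandas as pd

def concat_in_groups(frames, group_size=2):
    # Repeatedly concatenate small groups of frames until a single frame remains
    frames = list(frames)
    while len(frames) > 1:
        frames = [pd.concat(frames[i:i + group_size])
                  for i in range(0, len(frames), group_size)]
    return frames[0]

# result = concat_in_groups(frames)  # equivalent to f3() above with group_size=2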