Improve Row Append Performance On Pandas DataFrames

Question:

I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:

data = {"SomeCity": {"Date1": {record1, record2, record3, ...}, "Date2": {}, ...}, ...}

In total it has a few million records. The script itself looks like this:

from pandas import DataFrame, Series

cities = ["SomeCity"]
df = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            recSeries = Series([record['Timestamp'],
                                record['Id'],
                                record['Price']],
                               index=['Date', 'HouseID', 'Price'])
            df = df.append(recSeries, ignore_index=True)

This runs painfully slowly, however. Before I look for a way to parallelize it, I just want to make sure I’m not missing something obvious that would make it perform faster as it is, as I’m still quite new to Pandas.

Asked By: Brideau


Answers:

I ran into a similar problem where I had to append to a DataFrame many times but did not know the values in advance of the appends. I wrote a lightweight, DataFrame-like data structure that is just blists under the hood. I use it to accumulate all of the data and then, when it is complete, transform the output into a Pandas DataFrame. Here is a link to my project; it is all open source, so I hope it helps others:

https://pypi.python.org/pypi/raccoon
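
A plain-Python sketch of the same accumulate-then-convert idea (this only illustrates the pattern; it is not raccoon's actual API, and the column names are made up):

import pandas as pd

class ListFrame:
    """Tiny accumulator: one Python list per column, converted to a
    DataFrame only once at the end."""
    def __init__(self, columns):
        self.columns = list(columns)
        self.data = {col: [] for col in self.columns}

    def append(self, row):
        # row is a dict keyed by column name; appending to lists is cheap
        for col in self.columns:
            self.data[col].append(row.get(col))

    def to_dataframe(self):
        return pd.DataFrame(self.data, columns=self.columns)

lf = ListFrame(['Date', 'HouseID', 'Price'])
lf.append({'Date': '2015-01-01', 'HouseID': 1, 'Price': 100.0})
lf.append({'Date': '2015-01-02', 'HouseID': 2, 'Price': 120.0})
df = lf.to_dataframe()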

Answered By: Ryan Sheftel

I also used the DataFrame’s append function inside a loop and was perplexed by how slowly it ran.

Here is a useful example for those who are struggling, based on the correct answer on this page.

Python version: 3

Pandas version: 0.20.3

from pandas import DataFrame

# the dictionary to pass to the pandas DataFrame
d = {}

# a counter used as the key for each new entry in "d"
i = 0

# example data to loop over and append to a dataframe
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]

# the loop
for entry in data:

    # add a dictionary entry to the final dictionary
    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}

    # increment the counter
    i = i + 1

# create the dataframe using 'from_dict'
# important to set the 'orient' parameter to "index" so the keys become rows
df = DataFrame.from_dict(d, orient="index")

The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html

Answered By: P-S

Appending rows to a list is far more efficient than appending them to a DataFrame.
Hence you would want to:

  1. Append the rows to a list.
  2. Convert the list into a DataFrame.
  3. Set the index as required.
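
A minimal sketch of that pattern, reusing the cities and data variables and the record fields from the question:

import pandas as pd

rows = []
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            # appending to a plain Python list is cheap; no per-row DataFrame copy
            rows.append({'Date': record['Timestamp'],
                         'HouseID': record['Id'],
                         'Price': record['Price']})

# build the DataFrame once, then set the index as required
df = pd.DataFrame(rows, columns=['Date', 'HouseID', 'Price'])
df = df.set_index('Date')
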
Answered By: Mahidhar Surapaneni

I think the best way to do it, if you know the data you are going to receive, is to allocate beforehand.

import numpy as np
import pandas as pd

random_matrix = np.random.randn(100, 100)
insert_df = pd.DataFrame(random_matrix)

df = pd.DataFrame(columns=range(100), index=range(200))
df.loc[range(100), df.columns] = random_matrix
df.loc[range(100, 200), df.columns] = random_matrix

This is the pattern that I think makes the most sense. append will be faster if
you have a very small dataframe, but it doesn’t scale.

In [1]: import numpy as np; import pandas as pd

In [2]: random_matrix = np.random.randn(100, 100)
   ...: insert_df = pd.DataFrame(random_matrix)
   ...: df = pd.DataFrame(np.random.randn(100, 100))

In [3]: %timeit df.append(insert_df)
272 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(100), df.columns] = random_matrix
493 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(100), df.columns] = insert_df
821 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When we run this with a 100,000 row dataframe, we see much more dramatic results.

In [1]: df = pd.DataFrame(np.random.randn(100_000, 100))

In [2]: %timeit df.append(insert_df)
17.9 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df.loc[range(100), df.columns] = random_matrix
465 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(99_900, 100_000), df.columns] = random_matrix
465 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(99_900, 100_000), df.columns] = insert_df
1.02 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So we can see that an append is about 17 times slower than an insert with a DataFrame, and roughly 38 times slower than an insert with a NumPy array.

Answered By: Rob

Another way is to collect the pieces in a list and then use pd.concat.

import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

def append(df):
    df_out = df.copy()
    for i in range(1000):
        df_out = df_out.append(df)
    return df_out

def concat(df):
    df_list = []
    for i in range(1001):
        df_list.append(df)

    return pd.concat(df_list)


# some testing
df2 = concat(df)
df3 = append(df)

pd.testing.assert_frame_equal(df2,df3)

%timeit concat(df)
20.2 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit append(df)
275 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is now the recommended way to append rows in pandas:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once. link
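
Applied to row-by-row growth, that advice looks roughly like this (a sketch; new_rows simply stands in for whatever records you are generating):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# collect the new rows in a plain Python list first...
new_rows = []
for i in range(5):
    new_rows.append({'a': i, 'b': i * 10})

# ...then concatenate them with the original DataFrame in a single call
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)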

Answered By: libertasT

In my case I was loading a large number of dataframes with the same columns from different files and wanted to append them to create one large data frame.

My solution was to first load all the dataframes into a list and then concatenate them in one call:

all_dfs = []
for i in all_files:
    all_dfs.append(...)  # load the dataframe from file i here

master_df = pd.concat(all_dfs, ignore_index=True)

Answered By: wfbarksdale

A quick timing comparison of three approaches: building from a list of records, building from a dict of dicts, and appending to a DataFrame row by row.

import time
import pandas as pd

N = 100000

t0 = time.time()
d = []
for i in range(N):
    d.append([i, i+1, i+2, i+3, i+0.1, 1+0.2])
testdf = pd.DataFrame.from_records(d, columns=["x1", "x2", "x3", "x4", "x5", "x6"])
print(time.time() - t0)

t0 = time.time()
d = {}
for i in range(N):
    d[len(d)+1] = {"x1": i, "x2": i+1, "x3": i+2, "x4": i+3, "x5": i+0.1, "x6": 1+0.2}
testdf = pd.DataFrame.from_dict(d, orient="index")
print(time.time() - t0)

t0 = time.time()
testdf = pd.DataFrame()
for i in range(N):
    testdf = testdf.append({"x1": i, "x2": i+1, "x3": i+2, "x4": i+3, "x5": i+0.1, "x6": 1+0.2}, ignore_index=True)
print(time.time() - t0)


=== result for N=10000 ===
list: 0.016329050064086914
dict: 0.03952217102050781
DataFrame: 10.598219871520996

=== result for N=100000 ===
list: 0.4076499938964844
dict: 0.45696187019348145
DataFrame: 187.6609809398651

Answered By: Qinghua

In my case I didn’t see any improvement:

From:

for box, score, label in zip(boxes, scores, labels):

    box = [round(i, 2) for i in box.tolist()]

    if score >= score_threshold:
        global_dframe.append(
            {
                'Image': images,
                'Label': text[label],
                'Confidence': round(score.item(), 3),
                'Bounding box': box
            }
        )

To:

global_dict = {}
dict_index = 0

for box, score, label in zip(boxes, scores, labels):

    box = [round(i, 2) for i in box.tolist()]

    if score >= score_threshold:
        global_dict[dict_index] = {'Image': images,
                                   'Label': text[label],
                                   'Confidence': round(score.item(), 3),
                                   'Bounding box': box}
        dict_index = dict_index + 1

I’m using Python 3.8 and pandas 1.5.2. Am I doing something wrong? Thanks.

Answered By: unrue