Loop gets slower after each iteration

Question:

I have a Python script which does the following:

  1. I have a list of jsons
  2. I create an empty pandas DataFrame
  3. I run a for loop over this list
  4. At every iteration I create an empty dictionary with the (same) keys that interest me
  5. At every iteration I parse the json to retrieve the values of these keys
  6. At every iteration I append the dictionary to the pandas DataFrame

The issue is that the processing time increases at every iteration.
Specifically:

0-1000 documents -> 5 seconds
1000-2000 documents -> 6 seconds
2000-3000 documents -> 7 seconds
...
10000-11000 documents -> 18 seconds
11000-12000 documents -> 19 seconds
...
22000-23000 documents -> 39 seconds
23000-24000 documents -> 42 seconds
...
34000-35000 documents -> 69 seconds
35000-36000 documents -> 72 seconds

Why is this happening?

My code looks like this:

# 'documents' is the list of jsons

columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']

df_documents = pd.DataFrame(columns=columns)

for index, document in enumerate(documents):

    dict_document = dict.fromkeys(columns)

    ...
    (parse the json, retrieve the values of the keys and assign them to the dictionary)
    ...

    df_documents = df_documents.append(dict_document, ignore_index=True)

P.S.

After applying @eumiro’s suggestion below the times are the following:

    0-1000 documents -> 0.06 seconds
    1000-2000 documents -> 0.05 seconds
    2000-3000 documents -> 0.05 seconds
    ...
    10000-11000 documents -> 0.05 seconds
    11000-12000 documents -> 0.05 seconds
    ...
    22000-23000 documents -> 0.05 seconds
    23000-24000 documents -> 0.05 seconds
    ...
    34000-35000 documents -> 0.05 seconds
    35000-36000 documents -> 0.05 seconds

After applying @DariuszKrynicki’s suggestion below the times are the following:

0-1000 documents -> 0.56 seconds
1000-2000 documents -> 0.54 seconds
2000-3000 documents -> 0.53 seconds
...
10000-11000 documents -> 0.51 seconds
11000-12000 documents -> 0.51 seconds
...
22000-23000 documents -> 0.51 seconds
23000-24000 documents -> 0.51 seconds
...
34000-35000 documents -> 0.51 seconds
35000-36000 documents -> 0.51 seconds
...
Asked By: Outcast


Answers:

Yes, appending to a DataFrame gets slower with every new row, because pandas has to copy the whole (growing) contents again on each append, so the total work grows quadratically with the number of rows.

Create a simple list, append to it, and then build one DataFrame in a single step:

records = []

for index, document in enumerate(documents):
    …
    records.append(dict_document)

df_documents = pd.DataFrame.from_records(records)
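
For scale, here is a minimal, self-contained sketch contrasting the two approaches. The documents list is a hypothetical stand-in for the parsed jsons from the question, and the slow branch requires a pandas version older than 2.0, where DataFrame.append still exists:

import time

import pandas as pd

# hypothetical stand-in for the parsed jsons from the question
documents = [{'column_1': i, 'column_2': i * 2} for i in range(5000)]

# slow: append to the DataFrame on every iteration (copies all existing rows each time)
start = time.time()
df_slow = pd.DataFrame(columns=['column_1', 'column_2'])
for document in documents:
    df_slow = df_slow.append(document, ignore_index=True)  # removed in pandas 2.0
print('append per row:   ', time.time() - start)

# fast: collect plain dicts in a list and build the DataFrame once
start = time.time()
records = []
for document in documents:
    records.append(document)
df_fast = pd.DataFrame.from_records(records)
print('from_records once:', time.time() - start)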
Answered By: eumiro

The answer most likely lies in the pandas.DataFrame.append method that you call on every iteration. It is very inefficient, because it has to allocate new memory and copy the existing data each time, which explains your results. See also the official pandas.DataFrame.append docs:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

with the two examples:

Less efficient:

>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
...     df = df.append({'A': i}, ignore_index=True)
>>> df
   A
0  0
1  1
2  2
3  3
4  4

More efficient:

>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4

You can apply the same strategy: create a list of DataFrames instead of appending to the same DataFrame on each iteration, then concatenate them once your for loop is finished, as sketched below.
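
A minimal sketch of that strategy applied to the loop from the question. The column list is shortened, the documents list is a hypothetical stand-in, and the parsing step is left as a placeholder since the original parsing code is not shown:

import pandas as pd

columns = ['column_1', 'column_2']   # shortened for the sketch
documents = [{} for _ in range(5)]   # hypothetical stand-in for the list of jsons

frames = []
for document in documents:
    dict_document = dict.fromkeys(columns)
    # ... parse the json and fill dict_document, as in the question ...
    frames.append(pd.DataFrame([dict_document], columns=columns))

# one concatenation after the loop instead of one append per iteration
df_documents = pd.concat(frames, ignore_index=True)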

Answered By: FlyingTeller

I suspect your DataFrame is growing with each iteration.
How about using iterators?

# documents = # the list of jsons
def get_df_from_json(document):
    columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']
    # parse the json, retrieve the values of the keys and assign them to the dictionary
    # dict_document = ...  # use document to parse it and create the dictionary
    return pd.DataFrame(list(dict_document.values()), index=dict_document)

res = (get_df_from_json(document) for document in documents)
res = pd.concat(res).reset_index()

EDIT:
I have made a quick comparison on the example below, and it turns out that using an iterator (generator expression) does not speed up the code compared to a list comprehension:

import json
import time

import pandas as pd


def get_df_from_json():
    dd = {'a': [1, 1], 'b': [2, 2]}
    app_json = json.dumps(dd)
    return pd.DataFrame(list(dd.values()), index=dd)

start = time.time()
res = pd.concat((get_df_from_json() for x in range(1,20000))).reset_index()
print(time.time() - start)


start = time.time()
res = pd.concat([get_df_from_json() for x in range(1,20000)]).reset_index()
print(time.time() - start)

iterator: 9.425999879837036
list comprehension: 8.934999942779541

Answered By: Dariusz Krynicki

This may get deleted by the good people at Stack Overflow, but every time I see a question about "why is my loop slowing down", no one actually gives an answer. Yes, you can always speed things up with different code, using lists instead of dataframes etc., but in my experience the loop will still slow down, even when there is no object that you can see growing in size. I can't find an answer to that. For long jobs I find myself resetting the variables every x iterations to get them done faster, roughly as sketched below.
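
For what it's worth, a minimal sketch of that chunked workaround as applied to the pattern in the question; the documents list and the chunk size of 1000 are hypothetical stand-ins:

import pandas as pd

# hypothetical stand-in for the parsed documents from the question
documents = [{'column_1': i, 'column_2': i * 2} for i in range(5000)]

chunk_frames = []   # completed DataFrame chunks
records = []        # rows collected since the last reset

for index, dict_document in enumerate(documents):
    records.append(dict_document)
    if (index + 1) % 1000 == 0:                     # every 1000 iterations...
        chunk_frames.append(pd.DataFrame.from_records(records))
        records = []                                # ...reset the growing list

if records:                                         # leftover rows after the loop
    chunk_frames.append(pd.DataFrame.from_records(records))

df_documents = pd.concat(chunk_frames, ignore_index=True)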

Answered By: Andrew Martin