Loop gets slower after each iteration
Question:
I have a Python script that does the following:
- I have a list of JSON documents
- I create an empty pandas DataFrame
- I run a for loop over this list
- At every iteration I create an empty dictionary with the (same) keys that are interesting to me
- At every iteration I parse the JSON to retrieve the values of those keys
- At every iteration I append the dictionary to the pandas DataFrame
The issue with this is that at every iteration the processing time is increasing.
Specifically:
0-1000 documents -> 5 seconds
1000-2000 documents -> 6 seconds
2000-3000 documents -> 7 seconds
...
10000-11000 documents -> 18 seconds
11000-12000 documents -> 19 seconds
...
22000-23000 documents -> 39 seconds
23000-24000 documents -> 42 seconds
...
34000-35000 documents -> 69 seconds
35000-36000 documents -> 72 seconds
Why is this happening?
My code looks like this:
# 'documents' is the list of jsons
columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']
df_documents = pd.DataFrame(columns=columns)
for index, document in enumerate(documents):
    dict_document = dict.fromkeys(columns)
    ...
    # (parse the json, retrieve the values of the keys and assign them to the dictionary)
    ...
    df_documents = df_documents.append(dict_document, ignore_index=True)
P.S.
After applying @eumiro’s suggestion below the times are the following:
0-1000 documents -> 0.06 seconds
1000-2000 documents -> 0.05 seconds
2000-3000 documents -> 0.05 seconds
...
10000-11000 documents -> 0.05 seconds
11000-12000 documents -> 0.05 seconds
...
22000-23000 documents -> 0.05 seconds
23000-24000 documents -> 0.05 seconds
...
34000-35000 documents -> 0.05 seconds
35000-36000 documents -> 0.05 seconds
After applying @DariuszKrynicki’s suggestion below the times are the following:
0-1000 documents -> 0.56 seconds
1000-2000 documents -> 0.54 seconds
2000-3000 documents -> 0.53 seconds
...
10000-11000 documents -> 0.51 seconds
11000-12000 documents -> 0.51 seconds
...
22000-23000 documents -> 0.51 seconds
23000-24000 documents -> 0.51 seconds
...
34000-35000 documents -> 0.51 seconds
35000-36000 documents -> 0.51 seconds
...
Answers:
Yes, appending to a DataFrame gets slower with each new row, because it has to copy the whole (growing) contents again and again.
Create a simple list, append to it and then create one DataFrame in one step:
records = []
for index, document in enumerate(documents):
    ...
    records.append(dict_document)
df_documents = pd.DataFrame.from_records(records)
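A self-contained sketch of this pattern (the documents, keys, and parsing below are made up for illustration):

```python
import json

import pandas as pd

# Hypothetical stand-ins for the question's 'documents' list and columns.
documents = [json.dumps({"column_1": i, "column_2": i * 2}) for i in range(5)]
columns = ["column_1", "column_2"]

records = []
for document in documents:
    parsed = json.loads(document)
    # Keep only the keys of interest.
    dict_document = {key: parsed.get(key) for key in columns}
    records.append(dict_document)

# One DataFrame construction at the end, instead of one copy per appended row.
df_documents = pd.DataFrame.from_records(records, columns=columns)
```

Each loop iteration is now a cheap list append; the only DataFrame is built once, so the per-batch time stays flat.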
The answer may already lie in the pandas.DataFrame.append method, which you are calling repeatedly. This is very inefficient, since it needs to allocate new memory and copy the old contents on every call, which would explain your results. See also the official pandas.DataFrame.append docs:
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
with the two examples:
Less efficient:
>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
...     df = df.append({'A': i}, ignore_index=True)
>>> df
   A
0  0
1  1
2  2
3  3
4  4
More efficient:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4
You can apply the same strategy: create a list of DataFrames instead of appending to the same DataFrame on each iteration, then concat once your for loop is finished.
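A minimal sketch of that strategy, with illustrative data in place of the question's parsed documents:

```python
import pandas as pd

# Build one small DataFrame per iteration and collect them in a list,
# instead of appending to a single growing DataFrame.
frames = []
for i in range(5):
    frames.append(pd.DataFrame({"A": [i]}))

# Concatenate exactly once, after the loop.
df = pd.concat(frames, ignore_index=True)
```

The single `pd.concat` copies each row once, rather than recopying the entire accumulated DataFrame on every iteration.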
I suspect your DataFrame is growing with each iteration.
How about using iterators?
# documents = # list of jsons
def get_df_from_json(document):
    columns = ['column_1', 'column_2', ..., 'column_19', 'column_20']
    # parse the json, retrieve the values of the keys and assign them to the dictionary
    # dict_document = # use document to parse it and create the dictionary
    return pd.DataFrame(list(dict_document.values()), index=dict_document)

res = (get_df_from_json(document) for document in documents)
res = pd.concat(res).reset_index()
EDIT:
I made a quick comparison on the example below, and it turns out that using a generator does not speed up the code compared to a list comprehension:
import json
import time

import pandas as pd

def get_df_from_json():
    dd = {'a': [1, 1], 'b': [2, 2]}
    app_json = json.dumps(dd)
    return pd.DataFrame(list(dd.values()), index=dd)

start = time.time()
res = pd.concat((get_df_from_json() for x in range(1, 20000))).reset_index()
print(time.time() - start)

start = time.time()
res = pd.concat([get_df_from_json() for x in range(1, 20000)]).reset_index()
print(time.time() - start)
iterator: 9.425999879837036
list comprehension: 8.934999942779541
This may get deleted by the good people at Stack Overflow, but every time I see a question about "why is my loop slowing down", no one actually gives an answer. Yes, you can always speed loops up by using different code, using lists instead of DataFrames, etc., but in my experience they still slow down even when there is no object you can see growing in size. I can't find an answer to that. I find myself resetting the variables every X iterations to get long jobs done faster.