How to make Python for loops faster

Question:

I have a list of dictionaries, like this:

[{'user': '123456', 'db': 'db1', 'size': '8628'},
{'user': '123456', 'db': 'db1', 'size': '7168'},
{'user': '123456', 'db': 'db1', 'size': '38160'},
{'user': '222345', 'db': 'db3', 'size': '8628'},
{'user': '222345', 'db': 'db3', 'size': '8628'},
{'user': '222345', 'db': 'db5', 'size': '840'},
{'user': '34521', 'db': 'db6', 'size': '12288'},
{'user': '34521', 'db': 'db6', 'size': '476'},
{'user': '2345156', 'db': 'db7', 'size': '5120'}.....]

This list contains millions of entries. Each user can appear in multiple dbs, and each user can have multiple entries in the same db. I want to sum up the size occupied by each user, per db. I don’t want to use pandas.
At the moment I do it this way:

  • I create 2 lists of unique users and unique dbs
  • Use those lists to iterate through the big list and sum up where user and db are the same
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = 0
        for i in big_list:
            if (i['user'] == user and i['db'] == db):
                total_size += float(i['size'])
        if(total_size) > 0:
            row = {}
            row['user'] = user
            row['db'] = db
            row['size'] = total_size
            result.append(row)

The problem is that this triple for loop amounts to something very large (hundreds of billions of iterations), which takes forever to sum up the result.
If the big_list is small, this works very well.

How should I approach this in order to keep it fast and simple?
Thanks a lot!

Asked By: Ioan Fulgeanu


Answers:

Try mapping user to db to the total size in a dictionary. It will require additional memory, but it should be faster to access and requires only a single pass through the data:

user_to_db_to_size = {}
for entry in big_list:
    user = entry['user']
    db = entry['db']
    size = int(entry['size'])
    if user not in user_to_db_to_size:
        user_to_db_to_size[user] = {}
    if db not in user_to_db_to_size[user]:
        user_to_db_to_size[user][db] = 0
    user_to_db_to_size[user][db] += size

print(user_to_db_to_size)

For your sample data it produces:

{'123456': {'db1': 53956}, '222345': {'db3': 17256, 'db5': 840}, '34521': {'db6': 12764}, '2345156': {'db7': 5120}}

And now you can access the total size per user/db with:

print(user_to_db_to_size['123456']['db1'])  # 53956
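
If you also need the flat list of rows that the original code produced, the nested dictionary can be converted back in a single pass. A minimal sketch, reusing user_to_db_to_size from above (the result name mirrors the question):

# Flatten the nested dict back into the original list-of-dicts format
result = [
    {'user': user, 'db': db, 'size': size}
    for user, sizes_per_db in user_to_db_to_size.items()
    for db, size in sizes_per_db.items()
]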
Answered By: rdas

There are two main issues with the current approach: an inefficient algorithm and an inefficient data structure.

The first issue is that the algorithm is clearly inefficient, as it iterates over the big list many times. There is no need to iterate over the whole list for each unique user and db. You can iterate over the big list once and aggregate the data using a dictionary. The key of the target dictionary is simply a (user, db) tuple, and the value is the running total_size. Here is an untested example:

# Aggregation part
# Note: a default dict can be used instead to make the code possibly simpler
aggregate_dict = dict()
for i in big_list:
    key = (i['user'], i['db'])
    value = float(i['size'])
    if key in aggregate_dict:
        aggregate_dict[key] += value
    else:
        aggregate_dict[key] = value

# Fast creation of `result`
result = []
for user in unique_users:
    for db in unique_dbs:
        total_size = aggregate_dict.get((user, db))
        if total_size is not None and total_size > 0:
            result.append({'user': user, 'db': db, 'size': total_size})
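
As the comment above suggests, the same single-pass aggregation can also be written with collections.defaultdict, which removes the explicit key check. A minimal sketch under the same assumptions (big_list as defined in the question):

from collections import defaultdict

# One pass over the data; missing (user, db) keys start at 0.0
aggregate_dict = defaultdict(float)
for i in big_list:
    aggregate_dict[(i['user'], i['db'])] += float(i['size'])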

The other issue is the inefficient data structure: in each row the keys are replicated, while tuples could be used instead. In fact, a better data structure is a dictionary of (column, items) key-value pairs, where items is the list of values for that column. This way of storing data is called a dataframe, and it is roughly what Pandas uses internally (except that each column is a Numpy array, which is even more compact and generally more efficient than a list for most operations). Using this data structure for both the input and the output should result in a significant speed-up (especially if combined with Numpy) and a lower memory footprint.
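
To make the idea concrete, here is a rough sketch of such a column-oriented layout using plain lists (big_frame is a hypothetical name; Numpy arrays could replace the size column):

# Column-oriented storage: one list per column instead of one dict per row
big_frame = {
    'user': [i['user'] for i in big_list],
    'db':   [i['db'] for i in big_list],
    'size': [float(i['size']) for i in big_list],
}

# The aggregation still makes a single pass, but the column names are stored only once
aggregate_dict = {}
for user, db, size in zip(big_frame['user'], big_frame['db'], big_frame['size']):
    key = (user, db)
    aggregate_dict[key] = aggregate_dict.get(key, 0.0) + size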

Answered By: Jérôme Richard

If you use Counter and make tuples of value pairs (user, db) as keys, then:

from collections import Counter

data = [{'user': '123456', 'db': 'db1', 'size': '8628'},
        {'user': '123456', 'db': 'db1', 'size': '7168'},
        {'user': '123456', 'db': 'db1', 'size': '38160'},
        {'user': '222345', 'db': 'db3', 'size': '8628'},
        {'user': '222345', 'db': 'db3', 'size': '8628'},
        {'user': '222345', 'db': 'db5', 'size': '840'},
        {'user': '34521', 'db': 'db6', 'size': '12288'},
        {'user': '34521', 'db': 'db6', 'size': '476'},
        {'user': '2345156', 'db': 'db7', 'size': '5120'}]

print(sum((Counter({(d['user'], d['db']): int(d['size'])}) for d in data), start=Counter()))

Counter({('123456', 'db1'): 53956, ('222345', 'db3'): 17256, ('34521', 'db6'): 12764, ('2345156', 'db7'): 5120, ('222345', 'db5'): 840})
Answered By: Сергей Кох

In summary, looking at the three suggested answers to this problem, the approach from rdas is probably the winner. With a couple of modifications, it is 57% faster than Jérôme’s solution and 180 times faster than the original code. Сергей’s solution was roughly 350 times slower on a much-trimmed subset of the data (1000 entries); it also seemed to scale very badly, and I did not have time to wait for results from the full dataset.

The timings are as follows:

| Method                | Time                | Relative           |
|-----------------------|---------------------|--------------------|
| Original              | 102.83720700000413  | 180.4580695623077  |
| Jérôme                | 0.8722491000080481  | 1.5306171118096032 |
| Jérôme (updated)      | 0.9000219000154175  | 1.5793526426731528 |
| rdas                  | 0.6866130999987945  | 1.2048642527015463 |
| rdas (updated)        | 0.6580462999991141  | 1.1547354157572032 |
| rdas with defaultdict | 0.5698675999883562  | 1.0                |

In Jérôme’s answer, the final results calculation can be improved by swapping the final list construction for a list comprehension. Looking at the list comprehension change in isolation, the updated code is roughly 38% faster. It is worth noting this isn’t a major contributor to the overall time.

[{'user': user, 'db': db, 'size': total_size} for user in unique_users for db in unique_dbs if (total_size := aggregate_dict.get((user, db)))]

Comparing before (Method1) and after (Method2) over a few iterations gives:

Method1: 6.447344800049905
Method2: 4.851344000024255
Method1: 6.546811900043394
Method2: 4.816081999975722
Method1: 6.863985500007402
Method2: 4.755402299982961
Method1: 6.8024070000392385
Method2: 4.738170499971602
Method1: 6.524162200046703
Method2: 4.755566900013946

The first update to rdas’ answer was to avoid doing multiple dictionary lookups by caching the result of the first lookup:

if not (user_dict := user_to_db_to_size.get(user)):
    user_to_db_to_size[user] = { db: size }
elif not db in user_dict:
    user_dict[db] = size
else:
    user_dict[db] += size

This was subsequently improved again by swapping to collections.defaultdict, which removes additional costly dictionary lookups:

user_to_db_to_size = defaultdict(lambda: defaultdict(int))
for entry in big_list:
    user = entry['user']
    db = entry['db']
    size = int(entry['size'])
    user_to_db_to_size[user][db] += size

This is the code used to run the benchmarks:

import random
import timeit
from collections import defaultdict

random.seed(1)

test_iterations = 100
big_list = [{'user': random.randint(0, 100), 'db': f'db{random.randint(1, 10)}', 'size': f'{random.randint(100, 90000)}' } for i in range(10000)]
unique_users = { i['user'] for i in big_list }
unique_dbs = { i['db'] for i in big_list }

def method0():
    result = []
    for user in unique_users:
        for db in unique_dbs:
            total_size = 0
            for i in big_list:
                if (i['user'] == user and i['db'] == db):
                    total_size += float(i['size'])
            if(total_size) > 0:
                row = {}
                row['user'] = user
                row['db'] = db
                row['size'] = total_size
                result.append(row)
    return result

def method1():
    aggregate_dict = dict()
    for i in big_list:
        key = (i['user'], i['db'])
        value = float(i['size'])
        if key in aggregate_dict:
            aggregate_dict[key] += value
        else:
            aggregate_dict[key] = value
    result = []
    for user in unique_users:
        for db in unique_dbs:
            total_size = aggregate_dict.get((user, db))
            if total_size is not None and total_size > 0:
                result.append({'user': user, 'db': db, 'size': total_size})
    return result

def method2():
    aggregate_dict = dict()
    for i in big_list:
        key = (i['user'], i['db'])
        value = float(i['size'])
        if key in aggregate_dict:
            aggregate_dict[key] += value
        else:
            aggregate_dict[key] = value
    return [{'user': user, 'db': db, 'size': total_size} for user in unique_users for db in unique_dbs if (total_size := aggregate_dict.get((user, db)))]

def method3():
    user_to_db_to_size = {}
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        if user not in user_to_db_to_size:
            user_to_db_to_size[user] = {}
        if db not in user_to_db_to_size[user]:
            user_to_db_to_size[user][db] = 0
        user_to_db_to_size[user][db] += size
    return user_to_db_to_size

def method4():
    user_to_db_to_size = {}
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        if not (user_dict := user_to_db_to_size.get(user)):
            user_to_db_to_size[user] = { db: size }
        elif not db in user_dict:
            user_dict[db] = size
        else:
            user_dict[db] += size
    return user_to_db_to_size

def method5():
    user_to_db_to_size = defaultdict(lambda: defaultdict(int))
    for entry in big_list:
        user = entry['user']
        db = entry['db']
        size = int(entry['size'])
        user_to_db_to_size[user][db] += size
    return user_to_db_to_size

assert (m0 := method0()) == method1()
print('Method1 OK')
assert m0 == method2()
print('Method2 OK')
assert all(v['size'] == method3()[v['user']][v['db']] for v in m0)
print('Method3 OK')
assert all(v['size'] == method4()[v['user']][v['db']] for v in m0)
print('Method4 OK')
assert all(v['size'] == method5()[v['user']][v['db']] for v in m0)
print('Method5 OK')

t0 = timeit.timeit(method0, number=test_iterations)
t1 = timeit.timeit(method1, number=test_iterations)
t2 = timeit.timeit(method2, number=test_iterations)
t3 = timeit.timeit(method3, number=test_iterations)
t4 = timeit.timeit(method4, number=test_iterations)
t5 = timeit.timeit(method5, number=test_iterations)
tmin = min((t0, t1, t2, t3, t4, t5))

print(f'| Method           | Time | Relative      |')
print(f'|------------------|----------------------|')
print(f'| Original         | {t0} | {t0 / tmin}   |')
print(f'| Jérôme           | {t1} | {t1 / tmin}   |')
print(f'| Jérôme (updated) | {t2} | {t2 / tmin}   |')
print(f'| rdas             | {t3} | {t3 / tmin}   |')
print(f'| rdas (updated)   | {t4} | {t4 / tmin}   |')
print(f'| defaultdict      | {t5} | {t5 / tmin}   |')

results in:

Method1 OK
Method2 OK
Method3 OK
Method4 OK
Method5 OK
| Method           | Time | Relative      |
|------------------|----------------------|
| Original         | 102.83720700000413 | 180.4580695623077   |
| Jérôme           | 0.8722491000080481 | 1.5306171118096032   |
| Jérôme (updated) | 0.9000219000154175 | 1.5793526426731528   |
| rdas             | 0.6866130999987945 | 1.2048642527015463   |
| rdas (updated)   | 0.6580462999991141 | 1.1547354157572032   |
| defaultdict      | 0.5698675999883562 | 1.0   |

Note Сергей’s method is roughly equivalent to:

def c():
    tmp = Counter()
    for d in big_list:
        tmp = tmp + Counter({(d['user'], d['db']): int(d['size'])})
    return tmp

so it is effectively creating a new dictionary on every iteration of the loop, and each new dictionary is bigger than the last. I suspect this scales as N ** 2 or worse.
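
A linear-time variant of the same Counter idea updates a single Counter in place instead of summing a new one per row; a small sketch (not part of the benchmark above):

from collections import Counter

totals = Counter()
for d in big_list:
    # In-place accumulation: no intermediate Counter objects are created and merged
    totals[(d['user'], d['db'])] += int(d['size'])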

Answered By: John M.