Optimizing 3x nested loop to avoid MemoryError working with big datasets

Question:

I need your help rewriting this so that I won't get a MemoryError.

I have two DataFrames containing laptops/PCs with their configurations.

DataFrame one called df_original:

processorName  GraphicsCardname  ProcessorBrand
5950x          Rtx 3060 ti       i7
3600           Rtx 3090          i7
1165g7         Rtx 3050          i5

DataFrame two, called df_compare:

processorName  GraphicsCardname  ProcessorBrand
5950x          Rtx 3090          i7
1165g7         Rtx 3060 ti       i7
1165g7         Rtx 3050          i5
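
The sample frames above can be built with:

import pandas as pd

df_original = pd.DataFrame({
    'processorName':    ['5950x', '3600', '1165g7'],
    'GraphicsCardname': ['Rtx 3060 ti', 'Rtx 3090', 'Rtx 3050'],
    'ProcessorBrand':   ['i7', 'i7', 'i5'],
})

df_compare = pd.DataFrame({
    'processorName':    ['5950x', '1165g7', '1165g7'],
    'GraphicsCardname': ['Rtx 3090', 'Rtx 3060 ti', 'Rtx 3050'],
    'ProcessorBrand':   ['i7', 'i7', 'i5'],
})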

What I would like to do is calculate how similar they are. By similar I mean: check each value in a column and compare it to the value in the same column of the other DataFrame, for example comparing 5950x to 1165g7 (processorName). These features have values (weights); for example, processorName has a weight of 2.

So for each row of df_original I want to check whether it has the same config as each row of df_compare. If a value matches, do nothing; if not, add that column's value to a variable called weight. For example, if two rows are identical except for the processorName, then the weight is 2, because processorName has a value of 2.
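
columns and get_weight are defined elsewhere in my code; roughly like this (only the processorName weight of 2 matters here, the other values are just examples):

columns = list(df_original.columns)  # ['processorName', 'GraphicsCardname', 'ProcessorBrand']

# example weight table -- only processorName = 2 is fixed
column_weights = {'processorName': 2, 'GraphicsCardname': 2, 'ProcessorBrand': 2}

def get_weight(column_name):
    return column_weights[column_name]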

This is what I am doing:

values = []
for i, orig_row in enumerate(df_original.values):
    values.append([])
    for comp_row in df_compare.values:
        values[i].append(calculate_values(orig_row, comp_row, columns))


def calculate_values(orig_row, comp_row, columns):
    weight = 0
    for i, c in enumerate(orig_row):
        if comp_row[i] != c:
            # get_weight just returns the column's weight, e.g. 2 if they don't have the same processorName
            weight += get_weight(columns[i])
    return weight

The output for values would look like values = [[2, 2, 6], [2, 4, 6], ...].

Here values[0] corresponds to the first row of df_original: values[0][0] is the weight from comparing the first row of df_original with the first row of df_compare, values[0][1] is the weight from comparing the first row of df_original with the second row of df_compare, and so on.

These three nested for loops are very slow and give me a MemoryError. I am working with around 200k rows in each DataFrame.

Would you mind helping me rewrite this in a faster way?

Thanks

Asked By: My name is jeff


Answers:

Memory issue

Your desired output for such big DataFrames can't fit into regular PC memory.

With two DataFrames of ~200k rows you are calculating a 2-D array (matrix) of shape 200k × 200k. That's 40 billion values, i.e. 160 GB assuming a 4-byte data type.
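
The arithmetic behind that estimate, with the 4-byte assumption spelled out:

rows_original = 200_000
rows_compare = 200_000
bytes_per_value = 4  # e.g. int32 weights

matrix_values = rows_original * rows_compare    # 40_000_000_000 values
matrix_bytes = matrix_values * bytes_per_value  # 160_000_000_000 bytes ~= 160 GB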

I would suggest you reconsider whether you need the whole similarity matrix. The structure you might want to replace it with depends on what you want to do with the matrix:

  • Do you want to find devices with similarity higher than some threshold x? Store their index pairs in a list.
  • Do you want to find the N most similar devices? Use some kind of priority queue ordered by similarity score and drop the lower-ranked candidates during the calculation (a sketch of this follows below).
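
A minimal sketch of the priority-queue idea with heapq; row_weights stands for the per-row weight vector computed in the loop further down:

import heapq

N = 10  # keep only the N most similar compare rows instead of a full 200k-long matrix row

def keep_n_most_similar(row_weights, n=N):
    # row_weights: 1-D array of weights for one original row vs. every compare row;
    # returns the n (weight, compare_index) pairs with the lowest (most similar) weight
    return heapq.nsmallest(n, zip(row_weights, range(len(row_weights))))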

Speed issue

Comparing strings is expensive, as every single character has to be compared. Since the only operation performed on the data is an equality check, the strings can either be hashed:

pd.util.hash_array(df.to_numpy().flatten()).reshape(df.shape)

or converted into the category type:

df.astype('category')

However, this improvement alone won't make enough of a difference.
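
One more caveat if you go the category route: codes from two independently categorised frames do not line up, so both frames have to share per-column categories. A rough sketch, assuming both DataFrames have identical column names:

import numpy as np
import pandas as pd

def to_shared_codes(df_a, df_b):
    # returns integer code arrays for both frames, built from shared per-column categories
    codes_a, codes_b = [], []
    for col in df_a.columns:
        # one category index covering the values seen in either frame
        cats = pd.Index(pd.concat([df_a[col], df_b[col]]).unique())
        codes_a.append(pd.Categorical(df_a[col], categories=cats).codes)
        codes_b.append(pd.Categorical(df_b[col], categories=cats).codes)
    return np.column_stack(codes_a), np.column_stack(codes_b)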

As stated in this answer, the fastest way to iterate over rows is df.itertuples. But again, your dataset is too large: merely iterating, without any computation, would already take more than 5 hours. Python for loops are unfit for processing data at this scale.
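
For completeness, the itertuples version of your loops would look roughly like this (the weights are assumed to all be 2 here); it is still quadratic, so it is not a real fix for 200k × 200k:

col_weights = [2, 2, 2]  # assumed per-column weights, in column order

values = []
for orig in df_original.itertuples(index=False):
    row_values = []
    for comp in df_compare.itertuples(index=False):
        # add the weight of every column where the two rows differ
        row_values.append(sum(w for a, b, w in zip(orig, comp, col_weights) if a != b))
    values.append(row_values)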

A much faster approach is to use the vectorised operations provided by pandas or numpy. This would be one way to do it, using hashing and numpy:

import pandas as pd
import numpy as np
from tqdm import tqdm

def df_to_hashed_array(df):
    return pd.util.hash_array(df.to_numpy().flatten()).reshape(df.shape)

original_array = df_to_hashed_array(df_original)
compare_array = df_to_hashed_array(df_compare)

weights = np.array([1, 2, 3])  # vector of weights for each column instead of function

values = []

for i, row in enumerate(tqdm(original_array)):  # for each hashed row of the original
    # Compare the row to the whole compare_array.
    # Output is True where compare_array differs from row
    diff_values_array = compare_array != row

    # Multiply by the weights
    weighted_diff = diff_values_array * weights

    # Sum the weights on each row of compare_array;
    # the output is a vector with one weight per compare row
    row_weights = weighted_diff.sum(axis=1)

    # EXAMPLE: find compare rows with weight less than 1 (i.e. identical rows)
    filtered_indices = np.where(row_weights < 1)[0]
    values.extend((i, j) for j in filtered_indices)

The output here is a list of tuples (df_original index, df_compare index) satisfying the condition. Both DataFrames need a plain index running from 0 to N so that these pairs can be mapped back to their original rows.
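
Mapping a pair back to the underlying rows is then just positional lookup. For example, assuming values is non-empty:

i, j = values[0]                     # one (df_original index, df_compare index) pair
matched_original = df_original.iloc[i]
matched_compare = df_compare.iloc[j]

If the frames were filtered or re-ordered beforehand, reset_index(drop=True) gives them the required 0..N index.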

This code still runs for ~20 minutes. I don't think it can get much better for a problem with quadratic complexity and data this large.

Answered By: jrudolf

IIUC, try using a list comprehension – note that this assumes every column has a weight of 2:

lst = [list(((val != df_compare)*2).sum(axis=1)) for val in df_original.values]

# -> [[2, 2, 6], [2, 4, 6], [6, 4, 0]]

If you have custom weights, try:

weights = {'processorName': {True: 1},
           'GraphicsCardname': {True: 2},
           'ProcessorBrand': {True: 3}}

lst = [list((val != df_compare).replace(weights).astype(int).sum(axis=1)) 
       for val in df_original.values]

# -> [[2, 1, 6], [1, 3, 6], [6, 5, 0]]
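
A numpy-only variant of the same idea, with a per-column weight vector instead of the replace mapping (a sketch; it assumes the weight order matches the column order):

import numpy as np

col_weights = np.array([1, 2, 3])  # processorName, GraphicsCardname, ProcessorBrand

lst = [list(((val != df_compare.values) * col_weights).sum(axis=1))
       for val in df_original.values]

# -> [[2, 1, 6], [1, 3, 6], [6, 5, 0]]
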
Answered By: It_is_Chris