Does anyone know a better way of doing column calculations?

Question

Has been rewritten!

I'm currently trying to compute bitwise overlaps between the columns of a pandas DataFrame. The function I use works, but it is rather slow and I would like to speed it up. Unfortunately, I don't have any good ideas for how to do that.

This is my current function:

from itertools import product

import pandas as pd


def get_simple_overlap(dataframe, events_x, events_y):
    df_dict = dict()

    for evt_x, evt_y in product(events_x, events_y):
        # Rows where both events are 1, and rows where at least one is 1.
        overlap = (dataframe[evt_x] & dataframe[evt_y]).tolist()
        total = (dataframe[evt_x] | dataframe[evt_y]).tolist()
        try:
            percentage = sum(overlap) / sum(total)
        except ZeroDivisionError:
            # Neither event ever occurs.
            percentage = 0

        if df_dict.get(str(evt_x)) is None:
            df_dict[str(evt_x)] = dict()

        df_dict[str(evt_x)][str(evt_y)] = percentage

    # Columns come from events_x, the index from events_y.
    df = pd.DataFrame(df_dict)

    return df

matrix = pd.DataFrame({
    "evt_x": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
     ...
    "evt_y": [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
     ...
})

event_x = ['evt_x']
event_y = ['evt_y']

overlaps = get_simple_overlap(matrix, event_x, event_y)

This is a simple way of doing it, but it is rather slow. It returns a matrix whose columns are the events in event_x and whose index holds the events in event_y, so there is a percentage for each (evt_x, evt_y) pair.

Here I expect overlaps['evt_x']['evt_y'] to be 0.75: there are 8 rows where at least one of the two events has a 1, and 6 rows where both of them have a 1, giving 6/8.
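The expected value can be checked by hand with a short sketch. It uses only the two columns shown above (anything elided in the example is ignored), and the variable names are purely illustrative:

```python
import numpy as np

# The two example columns from the question.
evt_x = np.array([0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
evt_y = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1])

both = int((evt_x & evt_y).sum())    # rows where both events are 1
either = int((evt_x | evt_y).sum())  # rows where at least one event is 1

print(both, either, both / either)   # prints: 6 8 0.75
```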

Since I have hundreds of thousands of rows and several hundred columns, I would like to avoid iterating through the dataframe like this and use a smarter approach instead.

I hope the rewritten version explains the problem more simply and clearly.

Asked By: Martin Lange

Answer 1

This uses a dot product to count the number of times each pair of events occurs concurrently. We then negate the matrix and use the dot product again to count the number of times neither event of a pair occurs, and subtract that from the total number of rows to get the number of times at least one of the two events occurs. That gives us the required numerator and denominator.
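The second step rests on the identity "rows where at least one event is 1" = "all rows" minus "rows where both are 0". A minimal sketch on a single pair of columns, assuming 0/1 integer data (the names a, b and n are illustrative only):

```python
import numpy as np

a = np.array([0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
b = np.array([0, 1, 1, 1, 1, 1, 1, 1, 0, 1])
n = a.shape[0]

both = a @ b                 # dot product counts rows where a and b are both 1
neither = (1 - a) @ (1 - b)  # same trick on the negated columns: both are 0
either = n - neither         # rows where at least one of them is 1

# The count obtained this way matches a direct bitwise OR.
assert either == (a | b).sum()
print(both, either)          # prints: 6 8
```

Applying the same two dot products to the whole matrix at once yields these counts for every pair of columns in one go.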

import pandas as pd
import numpy as np

matrix = pd.DataFrame({
    "evt_w": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "evt_x": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "evt_y": [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    "evt_z": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})


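mat = matrix.to_numpy()
neg_mat = 1 - mat
# Entry (i, j): number of rows where events i and j are both 1.
sim_events = np.dot(mat.T, mat)
# Entry (i, j): number of rows where at least one of events i and j is 1.
tot_events = neg_mat.shape[0] - np.dot(neg_mat.T, neg_mat)
# Divide element-wise, leaving 0 wherever the denominator is 0.
overlaps_mat = np.divide(sim_events, tot_events, out=np.zeros_like(sim_events, dtype=float), where=tot_events != 0)
overlaps_df = pd.DataFrame(overlaps_mat, index=matrix.columns, columns=matrix.columns)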

This gives the output:

        evt_w       evt_x       evt_y   evt_z
evt_w   1.000000    0.666667    0.50    0.00
evt_x   0.666667    1.000000    0.75    0.00
evt_y   0.500000    0.750000    1.00    0.25
evt_z   0.000000    0.000000    0.25    1.00
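If only some pairs are needed, as with the events_x and events_y arguments in the question, the same idea can be restricted to two column subsets. The following is a sketch, not a drop-in replacement: get_overlap_dot is a hypothetical name, it assumes the columns hold only 0 and 1, and unlike get_simple_overlap it keeps the column labels as they are instead of converting them to strings.

```python
import numpy as np
import pandas as pd


def get_overlap_dot(dataframe, events_x, events_y):
    """Hypothetical helper: overlap ratio for every (evt_x, evt_y) pair."""
    events_x, events_y = list(events_x), list(events_y)
    # Cast to int64 so the counts cannot overflow a small integer dtype.
    x = dataframe[events_x].to_numpy().astype(np.int64)
    y = dataframe[events_y].to_numpy().astype(np.int64)

    both = y.T @ x                                  # shape (len(events_y), len(events_x))
    either = len(dataframe) - (1 - y).T @ (1 - x)   # rows where at least one event is 1
    ratio = np.divide(both, either, out=np.zeros(both.shape, dtype=float), where=either != 0)

    # Same orientation as get_simple_overlap: columns from events_x, index from events_y.
    return pd.DataFrame(ratio, index=events_y, columns=events_x)


matrix = pd.DataFrame({
    "evt_w": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
    "evt_x": [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "evt_y": [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    "evt_z": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})

overlaps = get_overlap_dot(matrix, ["evt_x"], ["evt_w", "evt_y", "evt_z"])
print(overlaps)
```

With this orientation, overlaps["evt_x"]["evt_y"] is read the same way as in the question.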

As this question is about speed, I ran a %%timeit comparison on my machine:

# dummy data
matrix = pd.DataFrame(np.random.randint(0,2,size=(100,10)))

This method returned 99.4 µs ± 803 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Using the original get_simple_overlap function as follows:

list1, list2 = zip(*product(matrix.columns,matrix.columns))
overlaps = get_simple_overlap(matrix, list1, list2)
overlaps

gives 1.77 s ± 46.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for the same matrix.
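To repeat the comparison outside IPython, and to confirm that both approaches return the same numbers, something along these lines could be used. It is only a sketch: loop_overlap and dot_overlap are illustrative names, loop_overlap is a compact restatement of the loop-based idea that visits each column pair once, and the seed and repeat count are arbitrary. Timings will differ from machine to machine.

```python
import timeit
from itertools import product

import numpy as np
import pandas as pd


def loop_overlap(df, cols):
    """Loop-based approach: one bitwise AND/OR per column pair."""
    out = pd.DataFrame(0.0, index=cols, columns=cols)
    for x, y in product(cols, cols):
        union = int((df[x] | df[y]).sum())
        out.loc[y, x] = int((df[x] & df[y]).sum()) / union if union else 0.0
    return out


def dot_overlap(df):
    """Dot-product approach: all column pairs at once."""
    mat = df.to_numpy().astype(np.int64)
    neg = 1 - mat
    both = mat.T @ mat
    either = mat.shape[0] - neg.T @ neg
    ratio = np.divide(both, either, out=np.zeros(both.shape, dtype=float), where=either != 0)
    return pd.DataFrame(ratio, index=df.columns, columns=df.columns)


rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
matrix = pd.DataFrame(rng.integers(0, 2, size=(100, 10)))

# Both approaches should produce the same matrix of ratios.
same = np.allclose(
    loop_overlap(matrix, matrix.columns).to_numpy(),
    dot_overlap(matrix).to_numpy(),
)
print("results agree:", same)

# Total seconds for 10 calls of each approach.
print("loop:", timeit.timeit(lambda: loop_overlap(matrix, matrix.columns), number=10))
print("dot: ", timeit.timeit(lambda: dot_overlap(matrix), number=10))
```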

Answered By: s_pike
