Most efficient way to run matching table (500 x 3) on (1m x 10) pandas dataset

Question:

The problem in one sentence: I am currently using a pandas apply function to replicate an Excel VLOOKUP, but I suspect this is not the most efficient way, as it takes a long time to process (about 10 minutes for a dataset with 600k rows and 10-15 columns).

I have a dataset with Google Analytics data. I want to assign a specific campaignId to each unique URL in the dataset, and am currently using a matching table plus an apply/lambda function to do this:

df_ftr: the dataset with Google Analytics data
df_match: the matching table
campaignId: a unique key generated by our CRM system that has to be assigned to a given page
pagePath: the URL that I use as the lookup value in the matching table
property: the dataset contains data from 3 different websites; this is an additional condition that must be satisfied before a given campaignId can be assigned
df_ftr['campaignId'] = df_ftr['pagePath'].progress_apply(
    lambda x: df_match['campaignId'][
        (df_match['pagePath'] == x) & (df_match['property'] == i)
    ].values[0]
    if len(df_match[(df_match['pagePath'] == x) & (df_match['property'] == i)]) > 0
    else np.nan
)

Snippet of Google Analytics dataset (example):

Date      pagePath   campaignId  property  Pageviews, etc
06/03/23  /                      0         34
06/03/23  /about-us              0         12
06/03/23  /about-us              1         32

Snippet of Matching table (example):

pagePath   property  campaignId
/          0         POE-3732-CNS
/about-us  0         EHE-7648-FHD
/about-us  1         OWS-2739-WJS

What is the most efficient way to approach this?

I have already tried to speed up the process with pandarallel (multi-core processing), and while this cuts about 25% off the total time required, it still takes too long to run at frequent intervals.

Asked By: Jeroen Vester


Answers:

Welcome to StackOverflow.

Given that your matching table contains unique entries for each combination of pagePath and property, you could do the following:

pd.merge(df_google_analytics, df_matching, on=['pagePath', 'property'])
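Note that by default `pd.merge` performs an inner join, which silently drops any analytics rows whose URL is missing from the matching table. A small sketch (using the column names from the question, with a hypothetical unmatched row added for illustration) of a left merge that keeps every row and flags missing matches as NaN:

```python
import pandas as pd

df_ftr = pd.DataFrame({
    'pagePath': ['/', '/about-us', '/no-match-here'],
    'property': [0, 0, 0],
})
df_match = pd.DataFrame({
    'pagePath': ['/', '/about-us'],
    'property': [0, 0],
    'campaignId': ['POE-3732-CNS', 'EHE-7648-FHD'],
})

# how='left' keeps every analytics row; unmatched keys get NaN in campaignId.
# validate='many_to_one' raises an error if df_match has duplicate
# (pagePath, property) keys, which would otherwise silently duplicate rows.
result = pd.merge(df_ftr, df_match, on=['pagePath', 'property'],
                  how='left', validate='many_to_one')
```

Rows with NaN in `campaignId` can then be inspected to find URLs missing from the matching table.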

Hope that helps.

Answered By: Lukas Hestermeyer

It is a little difficult to compare your solution to mine, since you did not provide the function you applied, but you can check:

import pandas as pd
import numpy as np
import time

df_ftr = pd.DataFrame({
    'Date': ['06/03/23', '06/03/23', '06/03/23'],
    'pagePath': ['/', '/about-us', '/about-us'],
    'property': [0, 0, 1],
    'Pageviews': [34, 12, 32]
})
df_match = pd.DataFrame({
    'pagePath': ['/', '/about-us', '/about-us'],
    'property': [0, 0, 1],
    'campaignId': ['POE-3732-CNS', 'EHE-7648-FHD', 'OWS-2739-WJS']
})

# Original solution using apply and lambda
start_time = time.time()

i = 0  # property value to match; the original lambda only handles one property at a time
# plain apply here (progress_apply would require tqdm.pandas() to be registered)
df_ftr['campaignId'] = df_ftr['pagePath'].apply(
    lambda x: df_match['campaignId'][
        (df_match['pagePath'] == x) & (df_match['property'] == i)
    ].values[0]
    if len(df_match[(df_match['pagePath'] == x) & (df_match['property'] == i)]) > 0
    else np.nan
)

print("Original solution execution time: %.4f seconds" % (time.time() - start_time))

# Solution using merge
start_time = time.time()

# A single left merge on both keys keeps every row of df_ftr and fills
# campaignId with NaN where no (pagePath, property) match exists.
# The campaignId column produced by the apply above is dropped first,
# so the merge does not create campaignId_x / campaignId_y suffixes.
df_ftr = pd.merge(df_ftr.drop(columns='campaignId', errors='ignore'),
                  df_match, on=['pagePath', 'property'], how='left')

print("New solution execution time: %.4f seconds" % (time.time() - start_time))

which for my solution returns

New solution execution time: 0.0174 seconds
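As a side note, a merge-free alternative (a sketch, assuming the (pagePath, property) pairs in df_match are unique) is to build a lookup Series keyed by both columns and reindex it against the analytics rows, which avoids filtering the matching table once per row:

```python
import pandas as pd

df_ftr = pd.DataFrame({
    'Date': ['06/03/23', '06/03/23', '06/03/23'],
    'pagePath': ['/', '/about-us', '/about-us'],
    'property': [0, 0, 1],
    'Pageviews': [34, 12, 32]
})
df_match = pd.DataFrame({
    'pagePath': ['/', '/about-us', '/about-us'],
    'property': [0, 0, 1],
    'campaignId': ['POE-3732-CNS', 'EHE-7648-FHD', 'OWS-2739-WJS']
})

# Lookup Series indexed by (pagePath, property); requires unique keys
lookup = df_match.set_index(['pagePath', 'property'])['campaignId']
# Build the same (pagePath, property) keys for every analytics row
keys = pd.MultiIndex.from_frame(df_ftr[['pagePath', 'property']])
# reindex aligns by key and fills NaN for keys absent from df_match
df_ftr['campaignId'] = lookup.reindex(keys).to_numpy()
```

This also preserves the original row order and count of df_ftr by construction, which can be convenient when the analytics table carries columns you do not want a merge to touch.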

If mine is less efficient, I’ll delete it.