Most efficient way to run matching table (500 x 3) on (1m x 10) pandas dataset

Question:

The problem in one sentence: I am currently using a pandas apply function to replicate an Excel VLOOKUP, but I suspect this is not the most efficient way, as it takes a long time to process (about 10 minutes for a dataset with 600k rows and 10-15 columns).

I have a dataset with Google Analytics data. I want to assign a specific campaignId to each unique URL in the dataset, and am currently using a matching table plus an apply/lambda function to do this:

df_ftr: the dataset with Google Analytics data
df_match: the matching table
campaignId: a unique key generated by our CRM system that has to be assigned to a given page
pagePath: the URL that I use as the lookup value in the matching table
property: the dataset contains data from 3 different websites; this is an additional condition that must be satisfied before a given campaignId can be assigned
df_ftr['campaignId'] = df_ftr['pagePath'].progress_apply(
    lambda x: df_match['campaignId'][
        (df_match['pagePath'] == x) & (df_match['property'] == i)
    ].values[0]
    if len(df_match[(df_match['pagePath'] == x) & (df_match['property'] == i)]) > 0
    else np.nan
)

Snippet of Google Analytics dataset (example):

Date      pagePath   campaignId  property  Pageviews, etc
06/03/23  /                      0         34
06/03/23  /about-us              0         12
06/03/23  /about-us              1         32

Snippet of Matching table (example):

pagePath   property  campaignId
/          0         POE-3732-CNS
/about-us  0         EHE-7648-FHD
/about-us  1         OWS-2739-WJS

What is the most efficient way to approach this?

I have already tried to speed up the process with pandarallel (multi-core processing), and while this cuts about 25% off the total time required, it still takes too long to run at frequent intervals.

Asked By: Jeroen Vester


Answers:

Welcome to StackOverflow.

Given that your matching table contains unique entries for each combination of pagePath and property, you could do the following:

pd.merge(df_google_analytics, df_matching, on=['pagePath', 'property'])
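Note that by default `pd.merge` performs an inner join, which silently drops any analytics rows whose URL is missing from the matching table. A small sketch (using the column names from the question, with a hypothetical unmatched row added for illustration) of a left merge that keeps every row and flags missing matches as NaN:

```python
import pandas as pd

df_ftr = pd.DataFrame({
    'pagePath': ['/', '/about-us', '/no-match-here'],
    'property': [0, 0, 0],
})
df_match = pd.DataFrame({
    'pagePath': ['/', '/about-us'],
    'property': [0, 0],
    'campaignId': ['POE-3732-CNS', 'EHE-7648-FHD'],
})

# how='left' keeps every analytics row; unmatched keys get NaN in campaignId.
# validate='many_to_one' raises an error if df_match has duplicate
# (pagePath, property) keys, which would otherwise silently duplicate rows.
result = pd.merge(df_ftr, df_match, on=['pagePath', 'property'],
                  how='left', validate='many_to_one')
```

Rows with NaN in `campaignId` can then be inspected to find URLs missing from the matching table.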

Hope that helps.

Answered By: Lukas Hestermeyer

It is a little difficult to compare your solution to mine, since you did not provide the function you applied, but you can check:

import pandas as pd
import numpy as np
import time

df_ftr = pd.DataFrame({
    'Date': ['06/03/23', '06/03/23', '06/03/23'],
    'pagePath': ['/', '/about-us', '/about-us'],
    'property': [0, 0, 1],
    'Pageviews': [34, 12, 32]
})
df_match = pd.DataFrame({
    'pagePath': ['/', '/about-us', '/about-us'],
    'property': [0, 0, 1],
    'campaignId': ['POE-3732-CNS', 'EHE-7648-FHD', 'OWS-2739-WJS']
})

# Original solution using apply and lambda
start_time = time.time()

i = 0  # property value to match; the original lambda only handles one property at a time
# plain apply here (progress_apply would require tqdm.pandas() to be registered)
df_ftr['campaignId'] = df_ftr['pagePath'].apply(
    lambda x: df_match['campaignId'][
        (df_match['pagePath'] == x) & (df_match['property'] == i)
    ].values[0]
    if len(df_match[(df_match['pagePath'] == x) & (df_match['property'] == i)]) > 0
    else np.nan
)

print("Original solution execution time: %.4f seconds" % (time.time() - start_time))

# Solution using merge
start_time = time.time()

# A single left merge on both keys keeps every row of df_ftr and fills
# campaignId with NaN where no (pagePath, property) match exists.
# The campaignId column produced by the apply above is dropped first,
# so the merge does not create campaignId_x / campaignId_y suffixes.
df_ftr = pd.merge(df_ftr.drop(columns='campaignId', errors='ignore'),
                  df_match, on=['pagePath', 'property'], how='left')

print("New solution execution time: %.4f seconds" % (time.time() - start_time))

which for my solution returns

New solution execution time: 0.0174 seconds
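As a side note, a merge-free alternative (a sketch, assuming the (pagePath, property) pairs in df_match are unique) is to build a lookup Series keyed by both columns and reindex it against the analytics rows, which avoids filtering the matching table once per row:

```python
import pandas as pd

df_ftr = pd.DataFrame({
    'Date': ['06/03/23', '06/03/23', '06/03/23'],
    'pagePath': ['/', '/about-us', '/about-us'],
    'property': [0, 0, 1],
    'Pageviews': [34, 12, 32]
})
df_match = pd.DataFrame({
    'pagePath': ['/', '/about-us', '/about-us'],
    'property': [0, 0, 1],
    'campaignId': ['POE-3732-CNS', 'EHE-7648-FHD', 'OWS-2739-WJS']
})

# Lookup Series indexed by (pagePath, property); requires unique keys
lookup = df_match.set_index(['pagePath', 'property'])['campaignId']
# Build the same (pagePath, property) keys for every analytics row
keys = pd.MultiIndex.from_frame(df_ftr[['pagePath', 'property']])
# reindex aligns by key and fills NaN for keys absent from df_match
df_ftr['campaignId'] = lookup.reindex(keys).to_numpy()
```

This also preserves the original row order and count of df_ftr by construction, which can be convenient when the analytics table carries columns you do not want a merge to touch.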

If mine is less efficient, I’ll delete it.