Is there a fast way to merge overlapping columns between two pandas dataframes?
Question:
I have a DataFrame with employee information that is missing some records, and I want to fill these in using another DataFrame. The way I’m doing it now is below, but it takes far too long because there are a lot of rows.
df_missing = df_cleaned.loc[(df_cleaned["HOURLY_BASE_RATE"] <= 0) | (df_cleaned["HOURLY_BASE_RATE"].isna())]
df_missing_in_integration = df_missing[["ASSOCIATE_ID", "COUNTRY"]].merge(df_integration_wages, on=["ASSOCIATE_ID", "COUNTRY"])
for index, row in df_missing_in_integration.iterrows():
    associate_id = row["ASSOCIATE_ID"]
    associate_country = row["COUNTRY"]
    associate_index = df_cleaned.index[(df_cleaned["ASSOCIATE_ID"] == associate_id) & (df_cleaned["COUNTRY"] == associate_country)]
    df_cleaned.loc[associate_index, "HOURLY_BASE_RATE"] = row["HOURLY_BASE_RATE"]
    df_cleaned.loc[associate_index, "CURRENCY"] = row["CURRENCY"]
    df_cleaned.loc[associate_index, "PAY_COMPONENT"] = row["PAY_COMPONENT"]
    df_cleaned.loc[associate_index, "FTE"] = row["FTE"]
Is there a faster way to fill in those missing values based on a unique combination of the ASSOCIATE_ID and COUNTRY columns? I’ve tried merge, but this gives me extra columns instead of filling in the values in the existing columns. I’ve also tried combine_first, but for some reason I still have the NaN values when I try this.
Here’s some example DataFrames:
import numpy as np
import pandas as pd

df_cleaned = pd.DataFrame({
    "ASSOCIATE_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "COUNTRY": ["USA", "USA", "BEL", "GER", "BEL", "USA", "GER", "GER", "NLD", "NLD"],
    "HOURLY_BASE_RATE": [15, np.nan, 20, 18, np.nan, np.nan, 43, 38, np.nan, 13],
    "CURRENCY": ["USD", "USD", "EUR", "EUR", "EUR", "USD", "EUR", "EUR", "EUR", "EUR"],
    "PAY_COMPONENT": ["Hourly", np.nan, "Hourly", "Hourly", np.nan, np.nan, "Hourly", "Hourly", np.nan, "Hourly"],
    "FTE": [1, 1, 0.8, 1, np.nan, np.nan, 0.75, 0.75, np.nan, 1],
    "LOCATION_TYPE": ["Stores", "Stores", "Distribution Center", "Stores", "Headquarters", "Headquarters", "Headquarters", "Distribution Center", "Stores", "Stores"],
})
df_integration_wages = pd.DataFrame({
    "ASSOCIATE_ID": [2, 5, 6, 9, 11, 12],
    "COUNTRY": ["USA", "USA", "USA", "NLD", "BEL", "BEL"],
    "HOURLY_BASE_RATE": [2500, 23, 37, 20, 32, 16],
    "CURRENCY": ["USD", "USD", "USD", "EUR", "EUR", "EUR"],
    "PAY_COMPONENT": ["Monthly", "Hourly", "Hourly", "Hourly", "Hourly", "Hourly"],
    "FTE": [1, 0.6, 1, 1, 0.8, 1],
})
I only want to replace the rows where the wage was missing. There are more rows in the integration file, but I don’t want to include those. Is there a faster way to achieve what I want?
Answers:
Use a merge and combine_first:
keys = ['ASSOCIATE_ID', 'COUNTRY']
out = df_cleaned.combine_first(df_integration_wages.merge(df_cleaned[keys], how='right'))
Output:
   ASSOCIATE_ID COUNTRY  HOURLY_BASE_RATE CURRENCY PAY_COMPONENT   FTE
0             1     USA              15.0      USD        Hourly  1.00
1             2     USA            2500.0      USD       Monthly  1.00
2             3     BEL              20.0      EUR        Hourly  0.80
3             4     GER              18.0      EUR        Hourly  1.00
4             5     BEL               NaN      EUR           NaN   NaN
5             6     USA              37.0      USD        Hourly  1.00
6             7     GER              43.0      EUR        Hourly  0.75
7             8     GER              38.0      EUR        Hourly  0.75
8             9     NLD              20.0      EUR        Hourly  1.00
9            10     NLD              13.0      EUR        Hourly  1.00
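The right-merge is what makes this work: merging df_integration_wages onto df_cleaned[keys] with how='right' yields one row per row of df_cleaned, in the same order and (assuming df_cleaned keeps its default RangeIndex and the integration file has at most one row per key pair) with a matching index, so combine_first can fill the NaN cells row by row. If the (ASSOCIATE_ID, COUNTRY) pairs are unique in both frames, a minimal alternative sketch is to align on those keys directly and let DataFrame.update fill only the missing cells; this assumes any non-positive wages have already been converted to NaN and uses the example frames from the question:
keys = ["ASSOCIATE_ID", "COUNTRY"]
out = df_cleaned.set_index(keys)
# update() aligns on the index (the key pairs here); overwrite=False only fills
# cells that are NaN, and rows that exist only in the lookup frame are ignored
out.update(df_integration_wages.set_index(keys), overwrite=False)
out = out.reset_index()
Both approaches avoid the row-by-row loop entirely.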
- Convert your negative wages to NaN (only in the wage column, so the ASSOCIATE_ID and COUNTRY join keys stay intact):
df_cleaned.loc[df_cleaned["HOURLY_BASE_RATE"] < 0, "HOURLY_BASE_RATE"] = np.nan
- Merge with df_integration_wages, keeping every row of df_cleaned:
df_cleaned = df_cleaned.merge(df_integration_wages, on=["ASSOCIATE_ID", "COUNTRY"], how="left", suffixes=["_old", "_new"])
- Update the missing values (the .to_numpy() stops pandas from re-aligning on the mismatched "_old"/"_new" column names):
mask = df_cleaned["HOURLY_BASE_RATE_old"].isna()
old_cols = ["HOURLY_BASE_RATE_old", "CURRENCY_old", "PAY_COMPONENT_old", "FTE_old"]
new_cols = ["HOURLY_BASE_RATE_new", "CURRENCY_new", "PAY_COMPONENT_new", "FTE_new"]
df_cleaned.loc[mask, old_cols] = df_cleaned.loc[mask, new_cols].to_numpy()
- Drop the "_new" columns and rename the "_old" columns back to their original names (see the sketch below).
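For that last step, a minimal sketch, reusing the old_cols/new_cols lists introduced above (names that are assumptions of this write-up rather than part of the original answer):
# Drop the "_new" helper columns and strip the "_old" suffix from the kept columns
df_cleaned = df_cleaned.drop(columns=new_cols).rename(
    columns=dict(zip(old_cols, ["HOURLY_BASE_RATE", "CURRENCY", "PAY_COMPONENT", "FTE"]))
)
After this, df_cleaned has its original column names back, with the previously missing wages filled in wherever the integration file had a matching (ASSOCIATE_ID, COUNTRY) pair.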