Is there a fast way to merge overlapping columns between two pandas dataframes?
Question:
I have a DataFrame with employee information that is missing some records, and I want to fill these in using another DataFrame. The way I’m doing it now is below, but it takes far too long because there are a lot of rows.
df_missing = df_cleaned.loc[(df_cleaned["HOURLY_BASE_RATE"] <= 0) | (df_cleaned["HOURLY_BASE_RATE"].isna())]
df_missing_in_integration = df_missing[["ASSOCIATE_ID", "COUNTRY"]].merge(df_integration_wages, on=["ASSOCIATE_ID", "COUNTRY"])
for index, row in df_missing_in_integration.iterrows():
    associate_id = row["ASSOCIATE_ID"]
    associate_country = row["COUNTRY"]
    associate_index = df_cleaned.index[(df_cleaned["ASSOCIATE_ID"] == associate_id) & (df_cleaned["COUNTRY"] == associate_country)]
    df_cleaned.loc[associate_index, "HOURLY_BASE_RATE"] = row["HOURLY_BASE_RATE"]
    df_cleaned.loc[associate_index, "CURRENCY"] = row["CURRENCY"]
    df_cleaned.loc[associate_index, "PAY_COMPONENT"] = row["PAY_COMPONENT"]
    df_cleaned.loc[associate_index, "FTE"] = row["FTE"]
Is there a faster way to fill in those missing values based on a unique combination of the ASSOCIATE_ID and COUNTRY columns? I’ve tried merge, but this gives me extra columns instead of filling in the values in the existing columns. I’ve also tried combine_first, but for some reason I still have the NaN values when I try this.
Here’s some example DataFrames:
import numpy as np
import pandas as pd

df_cleaned = pd.DataFrame({
    "ASSOCIATE_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "COUNTRY": ["USA", "USA", "BEL", "GER", "BEL", "USA", "GER", "GER", "NLD", "NLD"],
    "HOURLY_BASE_RATE": [15, np.nan, 20, 18, np.nan, np.nan, 43, 38, np.nan, 13],
    "CURRENCY": ["USD", "USD", "EUR", "EUR", "EUR", "USD", "EUR", "EUR", "EUR", "EUR"],
    "PAY_COMPONENT": ["Hourly", np.nan, "Hourly", "Hourly", np.nan, np.nan, "Hourly", "Hourly", np.nan, "Hourly"],
    "FTE": [1, 1, 0.8, 1, np.nan, np.nan, 0.75, 0.75, np.nan, 1],
    "LOCATION_TYPE": ["Stores", "Stores", "Distribution Center", "Stores", "Headquarters", "Headquarters", "Headquarters", "Distribution Center", "Stores", "Stores"],
})
df_integration_wages = pd.DataFrame({
    "ASSOCIATE_ID": [2, 5, 6, 9, 11, 12],
    "COUNTRY": ["USA", "USA", "USA", "NLD", "BEL", "BEL"],
    "HOURLY_BASE_RATE": [2500, 23, 37, 20, 32, 16],
    "CURRENCY": ["USD", "USD", "USD", "EUR", "EUR", "EUR"],
    "PAY_COMPONENT": ["Monthly", "Hourly", "Hourly", "Hourly", "Hourly", "Hourly"],
    "FTE": [1, 0.6, 1, 1, 0.8, 1],
})
I only want to replace the rows where the wage was missing. There are more rows in the integration file, but I don’t want to include those. Is there a faster way to achieve what I want?
Answers:
Use a merge and combine_first:
keys = ['ASSOCIATE_ID', 'COUNTRY']
out = df_cleaned.combine_first(df_integration_wages.merge(df_cleaned[keys], how='right'))
Output:
   ASSOCIATE_ID COUNTRY  HOURLY_BASE_RATE CURRENCY PAY_COMPONENT   FTE
0             1     USA              15.0      USD        Hourly  1.00
1             2     USA            2500.0      USD       Monthly  1.00
2             3     BEL              20.0      EUR        Hourly  0.80
3             4     GER              18.0      EUR        Hourly  1.00
4             5     BEL               NaN      EUR           NaN   NaN
5             6     USA              37.0      USD        Hourly  1.00
6             7     GER              43.0      EUR        Hourly  0.75
7             8     GER              38.0      EUR        Hourly  0.75
8             9     NLD              20.0      EUR        Hourly  1.00
9            10     NLD              13.0      EUR        Hourly  1.00
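The right-merge is what makes this work: merging df_integration_wages onto df_cleaned[keys] with how='right' yields one row per row of df_cleaned, in the same order and (assuming df_cleaned keeps its default RangeIndex and the integration file has at most one row per key pair) with a matching index, so combine_first can fill the NaN cells row by row. If the (ASSOCIATE_ID, COUNTRY) pairs are unique in both frames, a minimal alternative sketch is to align on those keys directly and let DataFrame.update fill only the missing cells; this assumes any non-positive wages have already been converted to NaN and uses the example frames from the question:
keys = ["ASSOCIATE_ID", "COUNTRY"]
out = df_cleaned.set_index(keys)
# update() aligns on the index (the key pairs here); overwrite=False only fills
# cells that are NaN, and rows that exist only in the lookup frame are ignored
out.update(df_integration_wages.set_index(keys), overwrite=False)
out = out.reset_index()
Both approaches avoid the row-by-row loop entirely.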
- Convert your negative wages to NaN (only in the wage column, so the ASSOCIATE_ID and COUNTRY join keys stay intact):
df_cleaned.loc[df_cleaned["HOURLY_BASE_RATE"] < 0, "HOURLY_BASE_RATE"] = np.nan
- Merge with df_integration_wages, keeping every row of df_cleaned:
df_cleaned = df_cleaned.merge(df_integration_wages, on=["ASSOCIATE_ID", "COUNTRY"], how="left", suffixes=["_old", "_new"])
- Update the missing values (the .to_numpy() stops pandas from re-aligning on the mismatched "_old"/"_new" column names):
mask = df_cleaned["HOURLY_BASE_RATE_old"].isna()
old_cols = ["HOURLY_BASE_RATE_old", "CURRENCY_old", "PAY_COMPONENT_old", "FTE_old"]
new_cols = ["HOURLY_BASE_RATE_new", "CURRENCY_new", "PAY_COMPONENT_new", "FTE_new"]
df_cleaned.loc[mask, old_cols] = df_cleaned.loc[mask, new_cols].to_numpy()
- Drop the "_new" columns and rename the "_old" columns back to their original names (see the sketch below).
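For that last step, a minimal sketch, reusing the old_cols/new_cols lists introduced above (names that are assumptions of this write-up rather than part of the original answer):
# Drop the "_new" helper columns and strip the "_old" suffix from the kept columns
df_cleaned = df_cleaned.drop(columns=new_cols).rename(
    columns=dict(zip(old_cols, ["HOURLY_BASE_RATE", "CURRENCY", "PAY_COMPONENT", "FTE"]))
)
After this, df_cleaned has its original column names back, with the previously missing wages filled in wherever the integration file had a matching (ASSOCIATE_ID, COUNTRY) pair.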