Is there a fast way to merge overlapping columns between two pandas dataframes?

Question:

I have a DataFrame with employee information that is missing some values, and I want to fill these in from another DataFrame. The way I'm doing it now is shown below, but it takes far too long because there are a lot of rows.

df_missing = df_cleaned.loc[(df_cleaned["HOURLY_BASE_RATE"] <= 0) | (df_cleaned["HOURLY_BASE_RATE"].isna())]
df_missing_in_integration = df_missing[["ASSOCIATE_ID", "COUNTRY"]].merge(df_integration_wages, on=["ASSOCIATE_ID", "COUNTRY"])

for index, row in df_missing_in_integration.iterrows():
  associate_id = row["ASSOCIATE_ID"]
  associate_country = row["COUNTRY"]
  associate_index = df_cleaned.index[(df_cleaned["ASSOCIATE_ID"] == associate_id) & (df_cleaned["COUNTRY"] == associate_country)]
  df_cleaned.loc[associate_index, "HOURLY_BASE_RATE"] = row["HOURLY_BASE_RATE"]
  df_cleaned.loc[associate_index, "CURRENCY"] = row["CURRENCY"]
  df_cleaned.loc[associate_index, "PAY_COMPONENT"] = row["PAY_COMPONENT"]
  df_cleaned.loc[associate_index, "FTE"] = row["FTE"]

Is there a faster way to fill in those missing values based on a unique combination of the ASSOCIATE_ID and COUNTRY columns? I’ve tried merge, but this gives me extra columns instead of filling in the values in the existing columns. I’ve also tried combine_first, but for some reason I still have the NaN values when I try this.

Here’s some example DataFrames:

import numpy as np
import pandas as pd

df_cleaned = pd.DataFrame({
    "ASSOCIATE_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "COUNTRY": ["USA", "USA", "BEL", "GER", "BEL", "USA", "GER", "GER", "NLD", "NLD"],
    "HOURLY_BASE_RATE": [15, np.nan, 20, 18, np.nan, np.nan, 43, 38, np.nan, 13],
    "CURRENCY": ["USD", "USD", "EUR", "EUR", "EUR", "USD", "EUR", "EUR", "EUR", "EUR"],
    "PAY_COMPONENT": ["Hourly", np.nan, "Hourly", "Hourly", np.nan, np.nan, "Hourly", "Hourly", np.nan, "Hourly"],
    "FTE": [1, 1, 0.8, 1, np.nan, np.nan, 0.75, 0.75, np.nan, 1],
    "LOCATION_TYPE": ["Stores", "Stores", "Distribution Center", "Stores", "Headquarters", "Headquarters", "Headquarters", "Distribution Center", "Stores", "Stores"],
})
df_integration_wages = pd.DataFrame({
    "ASSOCIATE_ID": [2, 5, 6, 9, 11, 12],
    "COUNTRY": ["USA", "USA", "USA", "NLD", "BEL", "BEL"],
    "HOURLY_BASE_RATE": [2500, 23, 37, 20, 32, 16],
    "CURRENCY": ["USD", "USD", "USD", "EUR", "EUR", "EUR"],
    "PAY_COMPONENT": ["Monthly", "Hourly", "Hourly", "Hourly", "Hourly", "Hourly"],
    "FTE": [1, 0.6, 1, 1, 0.8, 1],
})

I only want to replace the rows where the wage was missing. There are more rows in the integration file, but I don’t want to include those. Is there a faster way to achieve what I want?

Asked By: MKJ


Answers:

Use a merge and combine_first:

keys = ['ASSOCIATE_ID', 'COUNTRY']

# the right-merge aligns the integration wages to df_cleaned's rows (one row per
# df_cleaned row, in the same order, dropping integration rows with no match),
# then combine_first fills the NaNs in df_cleaned from that aligned frame
out = df_cleaned.combine_first(df_integration_wages.merge(df_cleaned[keys],
                                                          on=keys, how='right'))

Output:

   ASSOCIATE_ID COUNTRY  HOURLY_BASE_RATE CURRENCY PAY_COMPONENT   FTE
0             1     USA              15.0      USD        Hourly  1.00
1             2     USA            2500.0      USD       Monthly  1.00
2             3     BEL              20.0      EUR        Hourly  0.80
3             4     GER              18.0      EUR        Hourly  1.00
4             5     BEL               NaN      EUR           NaN   NaN
5             6     USA              37.0      USD        Hourly  1.00
6             7     GER              43.0      EUR        Hourly  0.75
7             8     GER              38.0      EUR        Hourly  0.75
8             9     NLD              20.0      EUR        Hourly  1.00
9            10     NLD              13.0      EUR        Hourly  1.00
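
An equivalent way to get the same fill, which keeps the key alignment explicit instead of relying on the merged frame's index, is to index both frames by the keys and use fillna. This is a sketch under the assumption that ASSOCIATE_ID/COUNTRY pairs are unique in both frames:

keys = ['ASSOCIATE_ID', 'COUNTRY']

# align both frames on the key columns; fillna only fills NaNs in df_cleaned,
# so rows that only exist in the integration file are not added
out = (df_cleaned.set_index(keys)
                 .fillna(df_integration_wages.set_index(keys))
                 .reset_index())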
Answered By: mozway
  1. Convert all non-positive wages to NaN:

    df_cleaned.loc[df_cleaned["HOURLY_BASE_RATE"] <= 0, "HOURLY_BASE_RATE"] = np.nan

  2. Merge with df_integration_wages:

    df_cleaned = df_cleaned.merge(df_integration_wages, on=["ASSOCIATE_ID", "COUNTRY"], how="left", suffixes=("_old", "_new"))

  3. Update missing values:

    mask = df_cleaned["HOURLY_BASE_RATE_old"].isna()
    df_cleaned.loc[mask, ["HOURLY_BASE_RATE_old", "CURRENCY_old", "PAY_COMPONENT_old", "FTE_old"]] = df_cleaned.loc[mask, ["HOURLY_BASE_RATE_new", "CURRENCY_new", "PAY_COMPONENT_new", "FTE_new"]].to_numpy()

  4. Drop the '_new' columns and rename the '_old' columns back (see the sketch below).
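
A minimal sketch of step 4, assuming the ("_old", "_new") suffixes from step 2:

    # drop the helper "_new" columns and strip the "_old" suffix from the kept columns
    df_cleaned = df_cleaned.drop(columns=[c for c in df_cleaned.columns if c.endswith("_new")])
    df_cleaned = df_cleaned.rename(columns=lambda c: c[:-len("_old")] if c.endswith("_old") else c)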
Answered By: Bigga