How can I add a new column to a dataframe (df1) that is the sum of multiple lookup values from df1 in another dataframe (df2)

Question:

Say I have 2 dataframes:

df1

     id       guid               name      item1        item2        item3        item4        item5         item6       item7        item8       item9
0  3031958124  85558-261955282  Alonso  85558-57439  85558-54608  85558-91361  85558-40647  85558-41305  85558-79979  85558-33076  85558-89956  85558-12554
1  3031958127  85558-261955282  Jeff    85558-57439  85558-39280  85558-91361  85558-55987  85558-83083  85558-79979  85558-33076  85558-41872  85558-12554
2  3031958129  85558-261955282  Mike    85558-57439  85558-39280  85558-91361  85558-55987  85558-40647  85558-79979  85558-33076  85558-88297  85558-12534
...

df2 where item_lookup is the index

             item_type   cost  value  target 
item_lookup
85558-57439  item1       9500   25.1   1.9
85558-54608  item2       8000   18.7   0.0 
85558-91361  item3       7000   16.5   0.9
...

I want to add the sum of cost, value, and target for each item1 through item9 using item_lookup (df2) and store that as a column on df1.

So the result should look like:
df1

     id       guid               name      item1        item2        item3        item4        item5         item6       item7        item8       item9       cost   value  target
0  3031958124  85558-261955282  Alonso  85558-57439  85558-54608  85558-91361  85558-40647  85558-41305  85558-79979  85558-33076  85558-89956  85558-12554  58000   192.5   38.3
1  3031958127  85558-261955282  Jeff    85558-57439  85558-39280  85558-91361  85558-55987  85558-83083  85558-79979  85558-33076  85558-41872  85558-12554  59400   183.2   87.7
2  3031958129  85558-261955282  Mike    85558-57439  85558-39280  85558-91361  85558-55987  85558-40647  85558-79979  85558-33076  85558-88297  85558-12534  58000   101.5   18.1
...

I’ve tried following similar solutions online that use .map, however these examples are only for single columns whereas I am trying to sum values for 9 columns.

Asked By: Sean Sailer

||

Answers:

You can do this by using df.apply, basically looping through the rows and then looping through the items in the row and calculating the sum


Since i couldn’t use your dfs because they are incomplete, i made mine.

given df1:

  item1 item2 item3
0     b     e     j
1     d     a     d
2     j     b     a
3     c     j     f
4     e     f     c
5     a     d     b
6     f     c     e

and df2

             cost  value  target
item_lookup                     
a              19     20      12
b              16     14      14
c              20     18      18
d              17     12      14
e              20     15      17
f              19     20      12
j              11     17      12

you can use the following function to get what you need

def add_items(row):
     row["cost"] = row["target"] = row["value"] = 0
     # get the columns that have item in the name
     cols = [col for col in df1.columns if "item" in col]
     # get each of the columns look it up in df2 and add it to our new cols
     for col in cols:
         item_lookup = row[col]
         lookup_result = df2.loc[item_lookup]
         row["cost"] += lookup_result["cost"]
         row["target"] += lookup_result["target"]
         row["value"] += lookup_result["value"]
     return row

and then apply it

>>> df1.apply(add_items, axis=1)
  item1 item2 item3  cost  target  value
0     b     e     j    47      43     46
1     d     a     d    53      40     44
2     j     b     a    46      38     51
3     c     j     f    50      42     55
4     e     f     c    59      47     53
5     a     d     b    52      40     46
6     f     c     e    59      47     53
Answered By: zaki98

I got a simpler solution here. First, save the item_lookup values that correspond with cost, target and value to a dictionary. and then use .map() and .sum() to create columns:

df2.reset_index(drop=False, inplace=True)

map_cost = dict(zip(df2['item_lookup'], df2['cost']))
map_value = dict(zip(df2['item_lookup'], df2['value']))
map_target = dict(zip(df2['item_lookup'], df2['target']))


df1['cost'] = df1.apply(lambda x: x.map(map_cost)).sum(axis=1)
df1['value'] = df1.apply(lambda x: x.map(map_value)).sum(axis=1)
df1['target'] = df1.apply(lambda x: x.map(map_target)).sum(axis=1)


df1

Output:

           id             guid    name        item1        item2        item3     cost  value  target
0  3031958124  85558-261955282  Alonso  85558-57439  85558-54608  85558-91361  24500.0   60.3     2.8
1  3031958127  85558-261955282    Jeff  85558-57439  85558-39280  85558-91361  16500.0   41.6     2.8
2  3031958129  85558-261955282    Mike  85558-57439  85558-39280  85558-91361  16500.0   41.6     2.8
Answered By: ali bakhtiari

I prefer the following solution that has elements of the solutions proposed by @ali bakhtiari and @zaki98, but is more explicit, performant, and flexible. Use applymap as the item lookup is the same for all item cols, assuming item_lookup in df2 uniquely identifies each row (all solutions assume this); however, the solution I propose also handles the case of an item_lookup in df1 that is not present in df2.
For df1 and `df2 as follows,

df1

df2

define columns to sum, sum_cols, and item columns in df1, item_cols, then append each summed column to df1 as follows:

sum_cols = ['cost', 'value', 'target']
item_cols = [col for col in df1.columns if 'item' in col]

df2.set_index('item_lookup', inplace=True)
for sc in sum_cols:
    df1[sc] = df1[item_cols] 
                    .applymap(lambda x: df2.at[x, sc] if x in df2.index else 0) 
                    .sum(axis=1)

This problem also seems like a good use case for the more performant at over loc as only a single, numeric value is looked up each time (see this SO post). It’s not necessary to set item_lookup as the index on df2 but doing so should improve performance on large datasets. If an item_lookup in df1 is not present in df2, you could also replace with NaN, e.g. so that the number of missing item_lookup values could be counted, with minimal additional effort.

Output df1:

enter image description here