sum rows from two different data frames based on the value of columns

Question

I have two data frames

df1

            ID  Year Primary_Location Secondary_Location  Sales
0           11  2023          NewYork            Chicago    100
1           11  2023             Lyon      Chicago,Paris    200
2           11  2023           Berlin              Paris    300
3           12  2022          Newyork            Chicago    150
4           12  2022             Lyon      Chicago,Paris    250
5           12  2022           Berlin              Paris    400

df2

            ID  Year Primary_Location  Sales
0           11  2023          Chicago    150
1           11  2023            Paris    200
2           12  2022          Chicago    300
3           12  2022            Paris    350

I would like for each group having the same ID & Year:
to add the column Sales from df2 to Sales in df1 where Primary_Location in df2 appear (contained) in Secondary_Location in df1.

For example: For ID=11 & Year=2023, Sales for Lyon would be added to Sales for Chicago & Sales for Paris of df_2.

New Sales of Paris for that row would be 200+150+200=550.

The expected output would be :

df_primary_output



            ID  Year Primary_Location Secondary_Location  Sales
0           11  2023          NewYork            Chicago    250
1           11  2023             Lyon      Chicago,Paris    550
2           11  2023           Berlin              Paris    500
3           12  2022          Newyork            Chicago    400
4           12  2022             Lyon      Chicago,Paris    900
5           12  2022           Berlin              Paris    750

Here are the dataframes to start with :

import pandas as pd

df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
                   'Year': [2023, 2023, 2023, 2022, 2022, 2022],
                   'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
                   'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
                   'Sales': [100, 200, 300, 150, 250, 400]
                   })

df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
                   'Year': [2023, 2023, 2022, 2022],
                   'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
                   'Sales': [150, 200, 300, 350]
                   })

EDIT: pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Would be great if the solution could work for these inputs as well:

df1

       Day  ID  Year Primary_Location Secondary_Location  Sales
0       1   11  2023          NewYork            Chicago    100
1       1   11  2023           Berlin            Chicago    300
2       1   11  2022          Newyork            Chicago    150
3       1   11  2022           Berlin            Chicago    400

df2

     Day    ID  Year Primary_Location  Sales
0     1     11  2023          Chicago    150
1     1     11  2022          Chicago    300

The expected output would be :

df_primary_output



       Day  ID  Year Primary_Location Secondary_Location  Sales
0       1   11  2023          NewYork            Chicago    250
1       1   11  2023           Berlin            Chicago    450
2       1   11  2022          Newyork            Chicago    450
3       1   11  2022           Berlin            Chicago    700

Asked By: the phoenix

||

Source

Answer 1

Not so easy your question…

Proposed script

import pandas as pd

df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
                   'Year': [2023, 2023, 2023, 2022, 2022, 2022],
                   'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
                   'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
                   'Sales': [100, 200, 300, 150, 250, 400]
                   })

df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
                   'Year': [2023, 2023, 2022, 2022],
                   'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
                   'Sales': [150, 200, 300, 350]
                   })

tot = []
def func(g, iterdf, len_df1, i = 0):
    global tot
    kv = {g['Primary_Location'].iloc[i]:g['Sales'].iloc[i] for i in range(len(g))}
    while i < len_df1:
        row = next(iterdf)[1]
        # Select specific df1 rows to modify by ID and Year criteria
        if g['ID'][g.index[0]]==row['ID'] and g['Year'][g.index[0]]==row['Year']:
            tot.append(row['Sales'] + sum([kv[town] for town in row['Secondary_Location'].split(',') if town in kv]))
        i+=1

df2.groupby(['ID', 'Year'], sort=False).apply(lambda g: func(g, df1.iterrows(), len(df1)))
df1['Sales'] = tot
print(df1)

Result :

   ID  Year Primary_Location Secondary_Location  Sales
0  11  2023          NewYork            Chicago    250
1  11  2023             Lyon      Chicago,Paris    550
2  11  2023           Berlin              Paris    500
3  12  2022          Newyork            Chicago    450
4  12  2022             Lyon      Chicago,Paris    900
5  12  2022           Berlin              Paris    750

Are you sure of the result in line 3, my script found 450 and not 400 ?

Explanation :

1 – group(...).apply(...) sends two groups from df2 one by one to func() :

   ID  Year Primary_Location  Sales
0  11  2023          Chicago    150
1  11  2023            Paris    200

   ID  Year Primary_Location  Sales
2  12  2022          Chicago    300
3  12  2022            Paris    350

2 – kv returns dictionnaries from df2 like this :

(each iteration corresponds to a group ie ID + Year)

call 1 - {'Chicago': 100, 'Paris': 200}
call 2 - {'Chicago': 300, 'Paris': 350}

3 – Function while followed by used of next(iterator) allows to explore rows in g (group) one by one :

while i < len_df1:
    row = next(iterdf)[1]
...
    i+=1

4 – The if condition in while loop allows to filter df1 rows in order that ID and Year corresponds to df2‘s ones.

And for each correspondance to append to the global list tot df1 and df2 sales values

5 – tot is a global list for memorizing values and is passed to df1 for the Sales column creation :

df1['Sales'] = tot

Result with the new dataframes sample :

   ID  Year Primary_Location Secondary_Location  Sales
0  11  2023          NewYork            Chicago    250
1  11  2023           Berlin            Chicago    450
2  11  2022          Newyork            Chicago    450
3  11  2022           Berlin            Chicago    700

Answered By: Laurent B.

Answer 2

This should work:

s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)['Sales_2'].sum()
.add(df1['Sales']))

or

df3 = (df1.assign(Secondary_Location = df1['Secondary_Location'].str.split(',')) #split Secondary_Location column into list and explode it so each row has one value
.explode('Secondary_Location'))


(df3[['ID','Year','Secondary_Location']].apply(tuple,axis=1) #create a series where ID, Year and Secondary_Location are a combined into a tuple so we can map our series created below to bring in the values needed.
.map(df2.set_index(['ID','Year','Primary_Location'])['Sales']) #create a series with lookup values in index, and make a series by selecting Sales column
.groupby(level=0).sum() #when exploding the column above, the index was repeated, so groupby(level=0).sum() will combine back to original form.
.add(df1['Sales'])) #add in original sales column

Original Answer:

s = 'Secondary_Location'
(df.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)
.agg({**dict.fromkeys(df,'first'),**{s:','.join,'Sales_2':'sum'}})
.assign(Sales = lambda x: x['Sales'] + x['Sales_2'])
.drop('Sales_2',axis=1))

Output:

   ID  Year Primary_Location Secondary_Location  Sales
0  11  2023          NewYork            Chicago    250
1  11  2023             Lyon      Chicago,Paris    550
2  11  2023           Berlin              Paris    500
3  12  2022          Newyork            Chicago    450
4  12  2022             Lyon      Chicago,Paris    900
5  12  2022           Berlin              Paris    750

Answered By: rhug123

sum rows from two different data frames based on the value of columns

Question:

Answers: