Map keywords value to add extra column to long dataframe

Question:

I have two dataframes:

df_1_long = pd.DataFrame({
    'company_name': ['Company A', 'Company B', 'Company C'],
    'company_country': ['USA', 'Poland', 'Canada'],
    'keyword': ['holding', 'services', 'source'],
    'value': [1,0,1]
    })

and the second dataframe:

df_training = pd.DataFrame({
     'holding': [1, 0, 0],
     'services': [0, 1, 0],
     'source': [0, 0, 1],
     'sector': ['Retail', 'Finance', 'Energy']
     })

The columns in df_training ['holding', 'services', 'source'] are the keywords in column keyword in df_1_long.

I would like to assign a sector to df_1_long – if ‘keyword’ in df_1_long value is 1 and in df_training value is 1 for the keyword in a column -> then assign a sector from df_training.

The output should look like that:

expected_output = pd.DataFrame({
    'company_name': ['Company A', 'Company B', 'Company C'],
    'company_country': ['USA', 'Poland', 'Canada'],
    'keyword': ['holding', 'services', 'source'],
    'value': [1,0,1],
    'sector': ['Retail', 'no_sector', 'Energy']
    })

I tried this piece of code, but I keep getting errors:

merged_df = pd.merge(df_1_long, df_training, left_on='keyword', right_on=df_training.columns[:-1])
df_1_long['sector'] = merged_df['sector'].where(merged_df['value'] == 1, np.nan)
Asked By: katarz

||

Answers:

You can use from_dummies to reshape and map those values:

mapper = dict(zip(pd.from_dummies(df_training.set_index('sector')).iloc[:, 0],
                  df_training['sector']))

df_1_long['sector'] = (df_1_long['keyword'].map(mapper)
                       .mask(df_1_long['value'].eq(0), 'no_sector')
                       )

Output:

  company_name company_country   keyword  value     sector
0    Company A             USA   holding      1     Retail
1    Company B          Poland  services      0  no_sector
2    Company C          Canada    source      1     Energy
Answered By: mozway

Use DataFrame.stack for reshape with filter 1 values and get dictionary from MultiIndex used for Series.map:

s = df_training.set_index('sector').stack().loc[lambda x: x.eq(1)].index.swaplevel()

m = df_1_long['value'].eq(1)
df_1_long['sector'] = np.where(m, df_1_long['keyword'].map(dict(s)), 'no_sector')

Another solution with DataFrame.melt, DataFrame.query and DataFrame.set_index:

d = df_training.melt('sector').query('value == 1').set_index('variable')['sector']


m = df_1_long['value'].eq(1)
df_1_long['sector'] = np.where(m, df_1_long['keyword'].map(d), 'no_sector')
print (df_1_long)
  company_name company_country   keyword  value     sector
0    Company A             USA   holding      1     Retail
1    Company B          Poland  services      0  no_sector
2    Company C          Canada    source      1     Energy
Answered By: jezrael
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.