Create binary columns out of data nested in another df's columns

Question:

This one is weird —

let’s say I have a df like this:

user_id     city      state     network
123         austin    tx        att
113         houston   tx        tmobile
343         miami     fl        att
356         seattle   wa        verizon

and I have another df1 like this (these 2 dfs won't be the same shape):

col1
'network': 'att'
'city': 'austin'
'state': 'tx'
'city': 'seattle'

I’m trying to build a final_df like this:

user_id     is_network_att      is_city_austin      is_state_tx    is_city_seattle
123         1                   1                   1              0
113         0                   0                   1              0
343         1                   0                   0              0
356         0                   0                   0              1

Easier to just show it, but to describe it in a sentence:
I’m trying to create conditional/true-false columns in a new final_df out of df1.col1, using the data in df’s columns.

Strategies I’m trying:

-throw the df1 column into a list or dictionary, loop through each element, and then somehow loop through each row and incorporate an if statement for each row (see the sketch after this list)

-maybe make a makeshift column in df1 containing the exact code that would create the column in final_df, and somehow execute the text in that column as code
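
For reference, a rough sketch of what the first strategy could look like, with the (column, value) pairs hard-coded here instead of parsed out of df1.col1 (everything in this sketch is illustrative, not from the original post):

import pandas as pd

df = pd.DataFrame({'user_id': [123, 113, 343, 356],
                   'city': ['austin', 'houston', 'miami', 'seattle'],
                   'state': ['tx', 'tx', 'fl', 'wa'],
                   'network': ['att', 'tmobile', 'att', 'verizon']})

# (column, value) pairs that would come from df1.col1
pairs = [('network', 'att'), ('city', 'austin'), ('state', 'tx'), ('city', 'seattle')]

final_df = df[['user_id']].copy()
for col, val in pairs:
    # row-wise "if": 1 when the user's value matches, else 0
    final_df[f'is_{col}_{val}'] = (df[col] == val).astype(int)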

Here's a handful of the rows that I'm trying to put in a dictionary (a parsing sketch follows the listing):
912      'organization': 'atlantic metro communications'
913          'isp_name': 'Atlantic Metro Communications'
915                       'location_name': 'martinez ca'
917                       'location_name': 'martinez ca'
918                       'location_name': 'martinez ca'
919                       'location_name': 'martinez ca'
920                     'isp_name': 'Hurricane Electric'
922                 'organization': 'hurricane electric'
923                 'organization': 'hurricane electric'
924                     'isp_name': 'Hurricane Electric'
925                           'count_users_per_ip': 28.0
926      'organization': 'atlantic metro communications'
927          'isp_name': 'Atlantic Metro Communications'
928                     'isp_name': 'Hurricane Electric'
929                 'organization': 'hurricane electric'
930                     'isp_name': 'Hurricane Electric'
931                 'organization': 'hurricane electric'
932                    'location_name': 'hermosillo son'
933      'organization': 'atlantic metro communications'
934          'isp_name': 'Atlantic Metro Communications'
935                             'location_state': ' son'
966                           'count_users_per_ip': 28.0
1057                       'count_users_per_device': 4.0
1218                           'count_ips_per_user': 3.0
1408                    'moderated_action': 'SOFT_BLOCK'
1418                    'moderated_action': 'SOFT_BLOCK'
1430                    'moderated_action': 'SOFT_BLOCK'
1438                    'moderated_action': 'SOFT_BLOCK'
1517                            'app_build': '405000004'
1605                            'app_build': '405000004'
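
As an aside, a hedged sketch of one way those rows could be collected into a dict of lists, assuming they sit in a string column named col1 (the column name, sample values, and regex below are assumptions for illustration):

import pandas as pd

df1 = pd.DataFrame({'col1': ["'organization': 'atlantic metro communications'",
                             "'isp_name': 'Hurricane Electric'",
                             "'organization': 'hurricane electric'",
                             "'count_users_per_ip': 28.0"]})

# split each string into its key and value, then group the values under their key
kv = df1['col1'].str.extract(r"^'([^']+)':\s*(.*)$")
dic = kv.groupby(0)[1].apply(lambda v: v.str.strip(" '").unique().tolist()).to_dict()
# {'count_users_per_ip': ['28.0'],
#  'isp_name': ['Hurricane Electric'],
#  'organization': ['atlantic metro communications', 'hurricane electric']}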

Update: here's as far as I've got:

def transpose_features(df1, col1, main_df, attr1, attr2):
    from ast import literal_eval

    # dic = literal_eval(f"{{{', '.join(df1[col1])}}}")

    # map each attribute name in df1[attr1] to the list of values it takes in df1[attr2]
    dic = {}
    for i in df1[attr1].tolist():
        dic[i] = df1[df1[attr1] == i][attr2].tolist()

    # flag columns: 1 where main_df's value equals the dict entry, else 0
    df_final = (main_df.drop(columns=list(dic))
                       .join(main_df[list(dic)].eq(dic).astype(int)
                             .rename(columns=lambda x: f'is_{x}_{dic[x]}')
                            )
                )

    print(df_final.shape)
    return df_final
    
df_final = transpose_features(
    df1 = df_features
    ,col1 = 'attr'
    ,main_df = df
    ,attr1 = 'attr1'
    ,attr2 = 'attr2'
)

df_final.head()

This code pulls all the values into a list and attaches that list to each key in the dictionary. But the issue now is that I need basically an OR condition in the method @mozway provided, one that says "does the user have ANY of the values in the list for each dict key?" (see the sketch below).

Hard to even type that.
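
One possible shape for that OR check, sketched with DataFrame.isin (which accepts a mapping of column name to a list of allowed values). The dict below is an illustrative stand-in for what transpose_features builds, not code from the thread:

import pandas as pd

df = pd.DataFrame({'user_id': [123, 113, 343, 356],
                   'city': ['austin', 'houston', 'miami', 'seattle'],
                   'state': ['tx', 'tx', 'fl', 'wa'],
                   'network': ['att', 'tmobile', 'att', 'verizon']})

# stand-in for the dict of lists built from df1
dic = {'network': ['att'], 'city': ['austin', 'seattle'], 'state': ['tx']}

# 1 if the row's value is ANY of the listed values for that key, else 0
flags = (df[list(dic)].isin(dic)
                      .astype(int)
                      .rename(columns=lambda c: f'is_{c}_any'))
df_final = df.drop(columns=list(dic)).join(flags)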

Asked By: max


Answers:

Assuming that df1 contains strings, you can first merge them and convert them to a dictionary, then use it as a reference for comparison with eq:

from ast import literal_eval

# or use a different method to create the dictionary
dic = literal_eval(f"{{{', '.join(df1['col1'])}}}")
# {'network': 'att', 'city': 'austin', 'state': 'tx'}

out = (df.drop(columns=list(dic))
         .join(df[list(dic)].eq(dic).astype(int)
               .rename(columns=lambda x: f'is_{x}_{dic[x]}')
              )
      )

Output:

   user_id  is_network_att  is_city_austin  is_state_tx
0      123               1               1            1
1      113               0               0            1
2      343               1               0            0

Reproducible input:

df = pd.DataFrame({'user_id': [123, 113, 343],
                   'city': ['austin', 'houston', 'miami'],
                   'state': ['tx', 'tx', 'fl'],
                   'network': ['att', 'tmobile', 'att']})

df1 = pd.DataFrame({'col1': ['"network": "att"', '"city": "austin"', '"state": "tx"']})

Update to work with duplicated keys

Use a Series instead to handle duplicated keys:

# extract key and value from each "'key': 'value'" string; using a Series
# (rather than a dict) keeps duplicated keys like 'city'
s = df1['col1'].str.extract(r"^'(.*)':\s*'(.*)'$").set_index(0)[1]
it = iter(s)  # iterator over the values, consumed below when renaming columns

out = (df.drop(columns=s.index)
         .join(df[s.index].eq(s.tolist()).astype(int)
               .rename(columns=lambda x: f'is_{x}_{next(it)}')
              )
      )

Output:

   user_id  is_network_att  is_city_austin  is_state_tx  is_city_seattle
0      123               1               1            1                0
1      113               0               0            1                0
2      343               1               0            0                0
3      356               0               0            0                1

Reproducible input for the new df1:

df1 = pd.DataFrame({'col1': ["'network': 'att'",
                             "'city': 'austin'",
                             "'state': 'tx'",
                             "'city': 'seattle'"]})
Answered By: mozway