Create binary columns out of data nested in another df's columns
Question:
This one is weird. Let's say I have a df like this:
user_id city state network
123 austin tx att
113 houston tx tmobile
343 miami fl att
356 seattle wa verizon
and I have another df1 like this (these 2 dfs won't be the same shape):
col1
'network': 'att'
'city': 'austin'
'state': 'tx'
'city': 'seattle'
I'm trying to build a final_df like this:
user_id is_network_att is_city_austin is_state_tx is_city_seattle
123 1 1 1 0
113 0 0 1 0
343 1 0 0 0
356 0 0 0 1
It's easier to just show it, but here is a sentence to describe it: I'm trying to create conditional (true/false) columns in a new final_df, one per entry in df1.col1, based on the data in df's columns.
Strategies I'm trying:
- Throw the df1 column into a list or dictionary, loop through each element, then somehow loop through each row and apply an if statement to each row.
- Maybe make a makeshift column in df1 holding the exact code that would create the column in final_df, and somehow use the text in this column as code.
Here's a handful of the rows that I'm trying to put in a dictionary:
912 'organization': 'atlantic metro communications'
913 'isp_name': 'Atlantic Metro Communications'
915 'location_name': 'martinez ca'
917 'location_name': 'martinez ca'
918 'location_name': 'martinez ca'
919 'location_name': 'martinez ca'
920 'isp_name': 'Hurricane Electric'
922 'organization': 'hurricane electric'
923 'organization': 'hurricane electric'
924 'isp_name': 'Hurricane Electric'
925 'count_users_per_ip': 28.0
926 'organization': 'atlantic metro communications'
927 'isp_name': 'Atlantic Metro Communications'
928 'isp_name': 'Hurricane Electric'
929 'organization': 'hurricane electric'
930 'isp_name': 'Hurricane Electric'
931 'organization': 'hurricane electric'
932 'location_name': 'hermosillo son'
933 'organization': 'atlantic metro communications'
934 'isp_name': 'Atlantic Metro Communications'
935 'location_state': ' son'
966 'count_users_per_ip': 28.0
1057 'count_users_per_device': 4.0
1218 'count_ips_per_user': 3.0
1408 'moderated_action': 'SOFT_BLOCK'
1418 'moderated_action': 'SOFT_BLOCK'
1430 'moderated_action': 'SOFT_BLOCK'
1438 'moderated_action': 'SOFT_BLOCK'
1517 'app_build': '405000004'
1605 'app_build': '405000004'
Update: here's as far as I've got:
def transpose_features(df1, col1, main_df, attr1, attr2):
    from ast import literal_eval
    # dic = literal_eval(f"{{{', '.join(df1[col1])}}}")
    # Map each attribute name to the list of all its values
    dic = {}
    for i in df1[attr1].tolist():
        dic[i] = df1[df1[attr1] == i][attr2].tolist()
    df_final = (main_df.drop(columns=list(dic))
                .join(main_df[list(dic)].eq(dic).astype(int)
                      .rename(columns=lambda x: f'is_{x}_{dic[x]}')
                      )
                )
    print(df_final.shape)
    return df_final

df_final = transpose_features(
    df1=df_features,
    col1='attr',
    main_df=df,
    attr1='attr1',
    attr2='attr2',
)
df_final.head()
This code pulls all the values into a list and attaches that list to each key in the dictionary. The issue now is that I basically need an "or" statement in the method @mozway provided, one that says "does the user have ANY of the values in the list for each dict key". Hard to even type that.
Answers:
Assuming that df1 contains strings, you can first merge them and convert the result to a dictionary, then use it as a reference for comparison with eq:
from ast import literal_eval

# or use a different method to create the dictionary
dic = literal_eval(f"{{{', '.join(df1['col1'])}}}")
# {'network': 'att', 'city': 'austin', 'state': 'tx'}

out = (df.drop(columns=list(dic))
         .join(df[list(dic)].eq(dic).astype(int)
               .rename(columns=lambda x: f'is_{x}_{dic[x]}')
               )
       )
Output:
   user_id  is_network_att  is_city_austin  is_state_tx
0      123               1               1            1
1      113               0               0            1
2      343               1               0            0
Reproducible input:
import pandas as pd

df = pd.DataFrame({'user_id': [123, 113, 343],
                   'city': ['austin', 'houston', 'miami'],
                   'state': ['tx', 'tx', 'fl'],
                   'network': ['att', 'tmobile', 'att']})
df1 = pd.DataFrame({'col1': ['"network": "att"', '"city": "austin"', '"state": "tx"']})
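One limitation of the dictionary approach is worth spelling out: a Python dict keeps only one value per key, so if the same key appears twice in df1 (for example two 'city' rows), the earlier value is silently dropped. A minimal sketch, using a hypothetical list named rows in place of df1['col1']:

```python
from ast import literal_eval

# Hypothetical input rows; 'city' appears twice
rows = ["'network': 'att'", "'city': 'austin'", "'state': 'tx'", "'city': 'seattle'"]

# Same construction as above: join the rows into a dict literal and parse it
dic = literal_eval(f"{{{', '.join(rows)}}}")

# The second 'city' overwrites the first, so 'austin' is lost
print(dic)  # {'network': 'att', 'city': 'seattle', 'state': 'tx'}
```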
Update to work with duplicated keys
Use a Series instead to handle duplicated keys:
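# split each "'key': 'value'" string into a Series of values indexed by key
s = df1['col1'].str.extract(r"^'(.*)':\s*'(.*)'$").set_index(0)[1]
# iterator over the values, consumed in column order by rename
it = iter(s)

out = (df.drop(columns=s.index)
         .join(df[s.index].eq(s.tolist()).astype(int)
               .rename(columns=lambda x: f'is_{x}_{next(it)}')
               )
       )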
Output:
   user_id  is_network_att  is_city_austin  is_state_tx  is_city_seattle
0      123               1               1            1                0
1      113               0               0            1                0
2      343               1               0            0                0
3      356               0               0            0                1
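If the stateful iterator inside rename feels fragile, the same result can probably be obtained with an explicit loop over the (key, value) pairs. This is only a sketch of an alternative, not the method above; the names pairs and flags are made up for illustration, and the four-row df is taken from the question:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [123, 113, 343, 356],
                   'city': ['austin', 'houston', 'miami', 'seattle'],
                   'state': ['tx', 'tx', 'fl', 'wa'],
                   'network': ['att', 'tmobile', 'att', 'verizon']})
df1 = pd.DataFrame({'col1': ["'network': 'att'", "'city': 'austin'",
                             "'state': 'tx'", "'city': 'seattle'"]})

# Column 0 holds the key, column 1 the value of each "'key': 'value'" string
pairs = df1['col1'].str.extract(r"^'(.*)':\s*'(.*)'$")

# One 0/1 column per pair; duplicated keys are fine because the
# column name includes the value
flags = {f'is_{key}_{value}': df[key].eq(value).astype(int)
         for key, value in zip(pairs[0], pairs[1])}

out = df.drop(columns=pairs[0].unique()).join(pd.DataFrame(flags))
print(out)
```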
Reproducible input for the new df1:
df1 = pd.DataFrame({'col1': ["'network': 'att'",
                             "'city': 'austin'",
                             "'state': 'tx'",
                             "'city': 'seattle'"]})