Is it possible to join reference data into a nested dict in a pandas dataframe?
Question:
I am trying to join two pandas data frames: the "left" table, which contains a column with a complex type (an array of dicts), and the "right" table, which is a flat reference table.
A pseudo-table representation of these follows.
left_df

parent_id | array_column
---|---
1 | [{id: 1}, {id: 3}]
2 | [{id: 2}, {id: 4}]
right_df

id | value
---|---
1 | one
2 | two
3 | three
4 | four
I'm aiming to look up/join the values from the right df into the array in array_column of the left df using the ids, but have found this quite tricky.
desired outcome

parent_id | array_column
---|---
1 | [{id: 1, value: 'one'}, {id: 3, value: 'three'}]
2 | [{id: 2, value: 'two'}, {id: 4, value: 'four'}]
My naive first approach was to use a merge, along these lines:
desired_df = pd.merge(left_df, right_df, how='outer', left_on = 'array_column.['id']', right_on = 'id')
Obviously this failed, and I'm not quite sure how to progress further. Effectively the aim is to look up reference data onto dicts within an array, but after much searching I've not been able to articulate the problem well enough for a Google result to show something that can help.
Appreciate any guidance anyone can share on this, whether using pandas or not. Thank you!
Answers:
Merge might not be the right approach here, since you are storing complex object types (a list of dicts). That said, you can create a lookup from right_df, then use it with map
to substitute and append the new key-value pairs in left_df:
# Series indexed by id, so d.get(some_id) returns the matching value
d = right_df.set_index('id')['value']
left_df['array_column'] = left_df['array_column'].map(
    lambda x: [{**y, 'value': d.get(y['id'])} for y in x]
)
Result
parent_id array_column
0 1 [{'id': 1, 'value': 'one'}, {'id': 3, 'value': 'three'}]
1 2 [{'id': 2, 'value': 'two'}, {'id': 4, 'value': 'four'}]
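For reference, here is that approach end to end, with the sample frames built inline (assuming the dicts use an integer id key, as in the pseudo-tables above):

```python
import pandas as pd

left_df = pd.DataFrame({
    "parent_id": [1, 2],
    "array_column": [[{"id": 1}, {"id": 3}], [{"id": 2}, {"id": 4}]],
})
right_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": ["one", "two", "three", "four"],
})

# Build an id -> value lookup Series, then enrich each dict in each list
d = right_df.set_index("id")["value"]
left_df["array_column"] = left_df["array_column"].map(
    lambda x: [{**y, "value": d.get(y["id"])} for y in x]
)
print(left_df)
```

Note that `Series.get` returns None for an id that has no reference row, so unmatched dicts end up with `value: None` rather than raising a KeyError.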
With merge it would look like:
temp = left_df.explode("array_column")
temp = temp.merge(
    right_df,
    left_on=temp["array_column"].apply(lambda x: x.get("id")),
    right_on="id",
).drop(columns="id")
temp["array_column"] = temp.apply(
    lambda x: {**x["array_column"], "value": x["value"]}, axis=1
)
out = temp.groupby("parent_id")["array_column"].agg(list).reset_index()
print(out)
  parent_id                                       array_column
0         1  [{'id': 1, 'value': 'one'}, {'id': 3, 'value':...
1         2  [{'id': 2, 'value': 'two'}, {'id': 4, 'value':...
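One caveat, assuming your real data can contain ids with no reference row (this goes beyond the sample in the question): the default inner merge silently drops any exploded dict whose id has no match in right_df, whereas `how="left"` keeps it with a NaN value. A sketch:

```python
import pandas as pd

left_df = pd.DataFrame({
    "parent_id": [1],
    "array_column": [[{"id": 1}, {"id": 99}]],  # 99 has no reference row
})
right_df = pd.DataFrame({"id": [1], "value": ["one"]})

temp = left_df.explode("array_column")
temp = temp.merge(
    right_df,
    left_on=temp["array_column"].apply(lambda x: x.get("id")),
    right_on="id",
    how="left",  # keep dicts whose id has no match in right_df
).drop(columns="id")
temp["array_column"] = temp.apply(
    lambda x: {**x["array_column"], "value": x["value"]}, axis=1
)
out = temp.groupby("parent_id")["array_column"].agg(list).reset_index()
print(out)
```

Here the dict with id 99 survives with `value` set to NaN; with the default `how="inner"` it would disappear from the output entirely.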