How to concatenate multiple JSON columns in pandas
Question:
I have a df with the following format:
id json_1 json_2 json_3
1 {a:b} {a:c} {c:d}
2 {a:b} {b:c} null
3 {a:c} {c:d} {a:g}
I want to create a new column which concatenates (i.e., takes union) json_1, json_2, and json_3 columns.
json_1, json_2, and json_3 are dictionary text.
Desired output:
id json_1 json_2 json_3 final_json
1 {a:b} {a:c} {c:d} [{a:b}, {a:c}, {c:d}]
2 {a:b} {b:c} null [{a:b}, {b:c}]
3 {a:c} {c:d} {a:g} [{a:c}, {c:d}, {a:g}]
Answers:
If you need to filter out missing values and join the dictionaries, use a list comprehension with pd.notna.
If you need to select the columns whose names contain the substring json:
df['final_json'] = df.filter(like='json').apply(lambda x: [y for y in x if pd.notna(y)], axis=1)
If you need to select the columns by an explicit list:
df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(lambda x: [y for y in x if pd.notna(y)], axis=1)
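Both variants can be sketched end to end on a small frame mirroring the question (the dataframe below is a hypothetical reconstruction; None stands in for the null cell):

```python
import pandas as pd

# Hypothetical reconstruction of the question's dataframe; None marks the
# missing cell so pandas treats it as a missing value.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'json_1': ['{a:b}', '{a:b}', '{a:c}'],
    'json_2': ['{a:c}', '{b:c}', '{c:d}'],
    'json_3': ['{c:d}', None, '{a:g}'],
})

# filter(like='json') picks every column whose name contains 'json',
# then the list comprehension keeps only the non-missing values per row.
df['final_json'] = df.filter(like='json').apply(
    lambda x: [y for y in x if pd.notna(y)], axis=1
)
print(df['final_json'].tolist())
# [['{a:b}', '{a:c}', '{c:d}'], ['{a:b}', '{b:c}'], ['{a:c}', '{c:d}', '{a:g}']]
```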
Depending on the type of data and any additional requirements, this should do the job (note that a set is unordered and drops duplicate values):
df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(lambda x: set(x) - {'null'}, axis=1)
[Out]:
id json_1 json_2 json_3 final_json
0 1 {a:b} {a:c} {c:d} {{c:d}, {a:c}, {a:b}}
1 2 {a:b} {b:c} null {{b:c}, {a:b}}
2 3 {a:c} {c:d} {a:g} {{a:g}, {c:d}, {a:c}}
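A minimal sketch of that set-difference step on a single row shows the two side effects worth keeping in mind, unordered output and silent deduplication:

```python
import pandas as pd

# One row with a duplicate value and a literal 'null' string.
row = pd.Series(['{a:b}', '{a:b}', 'null'])

# Set difference removes the 'null'; the set also collapses the duplicate.
result = set(row) - {'null'}
print(result)  # {'{a:b}'}
```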
As per OP’s edit, if the goal is just to reach the desired output, and assuming one is proceeding from the previous operation, that can be achieved in various ways, for example:
- Using json.dumps() (a set is not JSON-serializable, so convert it to a list first)
import json as js
df['final_json'] = df['final_json'].apply(lambda x: js.dumps(list(x)))
- Using list()
df['final_json'] = df['final_json'].apply(lambda x: list(x))
- Using str()
df['final_json'] = df['final_json'].apply(lambda x: str(x))
They produce broadly similar results, though the exact representation differs by method; json.dumps, for instance, yields the following dataframe:
id json_1 json_2 json_3 final_json
0 1 {a:b} {a:c} {c:d} ["{c:d}", "{a:c}", "{a:b}"]
1 2 {a:b} {b:c} null ["{b:c}", "{a:b}"]
2 3 {a:c} {c:d} {a:g} ["{a:g}", "{c:d}", "{a:c}"]
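The three conversions are not interchangeable byte for byte, which a quick sketch makes clear: json.dumps emits double-quoted JSON text, str() emits a Python repr with single quotes, and list() keeps a real list object rather than a string.

```python
import json

# A sorted list standing in for the set produced by the previous step.
items = sorted({'{a:b}', '{a:c}'})

print(json.dumps(items))  # ["{a:b}", "{a:c}"]   JSON text, double quotes
print(str(items))         # ['{a:b}', '{a:c}']   Python repr, single quotes
print(list(items))        # an actual list object, not a string
```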
It is then a matter of selecting the approach that best fits OP's use case, noting that there might be other ways to do this.
As an alternative, here is a one-liner that produces the updated desired output, starting directly from OP's dataframe in the question:
df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(lambda x: [i for i in x if i != 'null'], axis=1)
[Out]:
id json_1 json_2 json_3 final_json
0 1 {a:b} {a:c} {c:d} [{a:b}, {a:c}, {c:d}]
1 2 {a:b} {b:c} null [{a:b}, {b:c}]
2 3 {a:c} {c:d} {a:g} [{a:c}, {c:d}, {a:g}]
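Incidentally, if "takes union" is meant literally as merging the key/value pairs into one dictionary, that requires parsing the cells. The sketch below assumes the cells hold valid JSON object strings (the shorthand {a:b} in the question is not valid JSON) and that later columns win on key conflicts:

```python
import json
import pandas as pd

# Assumed input: valid JSON object strings; None marks a missing cell.
df = pd.DataFrame({
    'json_1': ['{"a": "b"}', '{"a": "b"}'],
    'json_2': ['{"a": "c"}', '{"b": "c"}'],
    'json_3': ['{"c": "d"}', None],
})

def merge_row(row):
    # Parse each non-missing cell and fold its keys into one dict;
    # later columns overwrite earlier ones on duplicate keys.
    merged = {}
    for cell in row:
        if pd.notna(cell):
            merged.update(json.loads(cell))
    return json.dumps(merged)

df['final_json'] = df.apply(merge_row, axis=1)
print(df['final_json'].tolist())
# ['{"a": "c", "c": "d"}', '{"a": "b", "b": "c"}']
```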
If the columns can hold values that are actually NaN, one might consider the following (or jezrael's answer above). Note that i != np.nan is always True, because NaN never compares equal to anything, including itself, so use pd.notna instead:
df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(lambda x: [i for i in x if i != 'null' and pd.notna(i)], axis=1)
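That filter can be sketched on a frame mixing the literal string 'null' with a real NaN (the dataframe below is a hypothetical example):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 contains both the string 'null' and a real NaN.
df = pd.DataFrame({
    'json_1': ['{a:b}', '{a:b}'],
    'json_2': ['{b:c}', 'null'],
    'json_3': ['{c:d}', np.nan],
})

# pd.notna(i) catches the NaN (nan != nan is True, so an equality test
# would let it slip through); the string test drops the 'null'.
df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(
    lambda x: [i for i in x if i != 'null' and pd.notna(i)], axis=1
)
print(df['final_json'].tolist())  # [['{a:b}', '{b:c}', '{c:d}'], ['{a:b}']]
```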