How to concatenate multiple JSON columns in pandas

Question:

I have a df with the following format:

id json_1 json_2 json_3 
1  {a:b}  {a:c}  {c:d}
2  {a:b}  {b:c}  null
3  {a:c}  {c:d}  {a:g}

I want to create a new column that concatenates (i.e., takes the union of) the json_1, json_2, and json_3 columns.

json_1, json_2, and json_3 hold dictionaries stored as text.

Desired output:

 id json_1 json_2 json_3 final_json
 1  {a:b}  {a:c}  {c:d}   [{a:b}, {a:c}, {c:d}]
 2  {a:b}  {b:c}  null    [{a:b}, {b:c}]
 3  {a:c}  {c:d}  {a:g}   [{a:c}, {c:d}, {a:g}] 
Asked By: user2512443

||

Answers:

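To filter out missing values and collect the dictionaries into a list, use a list comprehension with pd.notna. This assumes the nulls are real missing values (NaN or None), not the string 'null'.

To select the columns whose names contain the substring json: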

df['final_json'] = df.filter(like='json').apply(lambda x: [y for y in x if pd.notna(y)], axis=1)

To select the columns with an explicit list:

df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(lambda x: [y for y in x if pd.notna(y)], axis=1)
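
As a self-contained sketch of this approach: the sample frame below is reconstructed from the question, and it assumes the null cell is a real missing value (None) and the other cells are plain strings.

```python
import pandas as pd

# Sample data reconstructed from the question; None stands in for the null cell.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "json_1": ["{a:b}", "{a:b}", "{a:c}"],
    "json_2": ["{a:c}", "{b:c}", "{c:d}"],
    "json_3": ["{c:d}", None, "{a:g}"],
})

json_cols = ["json_1", "json_2", "json_3"]

# For each row, keep only the non-missing cells, in column order.
df["final_json"] = df[json_cols].apply(
    lambda row: [v for v in row if pd.notna(v)], axis=1
)

print(df["final_json"].tolist())
```

If the nulls are stored as the literal string 'null' instead, pd.notna will not drop them and a string comparison is needed.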
Answered By: jezrael

Depending on the type of data and any additional requirements, this should do the job (it assumes the nulls are stored as the string 'null'):

df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(lambda x: set(x) - set(['null']), axis=1)

[Out]:
   id json_1 json_2 json_3             final_json
0   1  {a:b}  {a:c}  {c:d}  {{c:d}, {a:c}, {a:b}}
1   2  {a:b}  {b:c}   null         {{b:c}, {a:b}}
2   3  {a:c}  {c:d}  {a:g}  {{a:g}, {c:d}, {a:c}}

Following OP’s edit, if the goal is just to get the desired output, and continuing from the previous operation, there are several ways to convert the sets, such as:

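  • Using js.dumps() (a set is not JSON serializable, so convert it to a list first)

    import json as js
    
    df['final_json'] = df['final_json'].apply(lambda x: js.dumps(list(x)))
    
  • Using list()

    df['final_json'] = df['final_json'].apply(lambda x: list(x))
    
  • Using str()

    df['final_json'] = df['final_json'].apply(lambda x: str(x))
    

The js.dumps() variant gives the following dataframe. list() yields a Python list and str() yields the set's string representation. Element order may vary because sets are unordered.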

   id json_1 json_2 json_3                   final_json
0   1  {a:b}  {a:c}  {c:d}  ["{c:d}", "{a:c}", "{a:b}"]
1   2  {a:b}  {b:c}   null           ["{b:c}", "{a:b}"]
2   3  {a:c}  {c:d}  {a:g}  ["{a:g}", "{c:d}", "{a:c}"]

It is a matter of selecting the approach that best fits OP’s use case; there may be other ways to do this as well.
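
To make the differences between these conversions concrete, here is a small sketch on a single set. The variable names are illustrative, and sorted() is an addition (not part of the answer) used only to get a deterministic order.

```python
import json

# One row's set, as the set-based step would produce it.
cell = {"{a:b}", "{a:c}", "{c:d}"}

as_list = sorted(cell)              # a Python list with a stable order
as_json = json.dumps(sorted(cell))  # a JSON string built from that list
as_text = str(cell)                 # the set's repr as a string; order not guaranteed

# Passing the set itself to json.dumps fails, because sets are not JSON serializable.
try:
    json.dumps(cell)
    set_is_serializable = True
except TypeError:
    set_is_serializable = False

print(as_list)
print(as_json)
print(set_is_serializable)
```

Which form to keep depends on whether the column will be processed further in Python (a list) or written out as text (a JSON string).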


As an alternative, here is a one-liner that gives the updated desired output, starting from the dataframe in OP’s question:

df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(lambda x: [i for i in x if i != 'null'], axis=1)

[Out]:
   id json_1 json_2 json_3             final_json
0   1  {a:b}  {a:c}  {c:d}  [{a:b}, {a:c}, {c:d}]
1   2  {a:b}  {b:c}   null         [{a:b}, {b:c}]
2   3  {a:c}  {c:d}  {a:g}  [{a:c}, {c:d}, {a:g}]

If the columns can hold NaN values, consider the following operation (or jezrael’s answer):

df['final_json'] = df[['json_1', 'json_2', 'json_3']].apply(lambda x: [i for i in x if i != 'null' and pd.notna(i)], axis=1)
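
A note on NaN checks, as a minimal sketch: NaN never compares equal to anything, including itself, so an inequality test against np.nan cannot filter it out, whereas pd.notna can. The sample row is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical row mixing a 'null' string and a real NaN.
row = ["{a:b}", "null", np.nan]

# NaN != NaN evaluates to True, so this comparison keeps the NaN.
kept_by_inequality = [i for i in row if i != "null" and i != np.nan]

# pd.notna recognises NaN (and None), so only the real value survives.
kept_by_notna = [i for i in row if i != "null" and pd.notna(i)]

print(len(kept_by_inequality))  # 2
print(kept_by_notna)            # ['{a:b}']
```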
Answered By: Gonçalo Peres