Pandas dataframe to JSONL (JSON Lines) conversion
Question:
I need to convert pandas data frame to JSONL format. I couldn’t find a good package to do it and tried to implement myself, but it looks a bit ugly and not efficient.
For example, given a pandas df:
label pattern
0 DRUG aspirin
1 DRUG trazodone
2 DRUG citalopram
I need to convert to txt file of the form:
{"label":"DRUG","pattern":[{"lower":"aspirin"}]}
{"label":"DRUG","pattern":[{"lower":"trazodone"}]}
{"label":"DRUG","pattern":[{"lower":"citalopram"}]}
I tried with to_dict('records')
, but I’m missing [ ]
and nested ‘lower’ key.
df.to_dict('record')
creates:
[{'label': 'DRUG', 'pattern': 'aspirin'},
{'label': 'DRUG', 'pattern': 'trazodone'},
{'label': 'DRUG', 'pattern': 'citalopram'}]
I thought about converting the ‘pattern’ columns and include nested ‘lower’?
UPD
So far, I succeeded to convert ‘pattern’ into list:
df_new = pd.concat((df[['label']], df[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_new.columns = ['label', 'pattern']
df_new.head()
The result:
label pattern
0 DRUG [aspirin]
1 DRUG [trazodone]
2 DRUG [citalopram]
and then:
df_new.to_dict(orient=’records’)
[{'label': 'DRUG', 'pattern': ['aspirin']},
{'label': 'DRUG', 'pattern': ['trazodone']},
{'label': 'DRUG', 'pattern': ['citalopram']}]
UPD 2
Eventually, I managed to get what I want, but in the most non-pythonic way.
df_1 = pd.DataFrame(df[['pattern']].apply(lambda x: {'lower': x[0]}, axis=1))
df_1.columns = ['pattern']
df_fin = pd.concat((df[['label']], df_1[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_fin.columns = ['label', 'pattern']
df_fin.to_json(orient='records')
'{'label': 'DRUG', 'pattern': [{'lower': 'aspirin'}]}
{'label': 'DRUG', 'pattern': [{'lower': 'trazodone'}]}
{'label': 'DRUG', 'pattern': [{'lower': 'citalopram'}]}'
Any chance you can show a neat solution?
Answers:
In versions of Pandas > 0.19.0
, DataFrame.to_json
has a parameter, lines
, that will write out JSONL format.
Given that, a more succinct version of your solution might look like this:
import pandas as pd
data = [{'label': 'DRUG', 'pattern': 'aspirin'},
{'label': 'DRUG', 'pattern': 'trazodone'},
{'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)
# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})
# Output in JSONL format
print(df.to_json(orient='records', lines=True))
Output:
{"label":"DRUG","pattern":{"lower":"aspirin"}}
{"label":"DRUG","pattern":{"lower":"trazodone"}}
{"label":"DRUG","pattern":{"lower":"citalopram"}}
To write to a file I have modified the last line of @kmsquire
# Output in JSONL format
import pandas as pd
data = [{'label': 'DRUG', 'pattern': 'aspirin'},
{'label': 'DRUG', 'pattern': 'trazodone'},
{'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)
# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})
# Output in JSONL format into a file
f=open('records.jsonl')
print(df.to_json(orient='records', lines=True),file=f, flush=False)
Very short code that should work for easy coping-pasting.
output_path = "/data/meow/my_output.jsonl"
with open(output_path, "w") as f:
f.write(df_result.to_json(orient='records', lines=True, force_ascii=False))
If you are using jupyter notebook, you should use with open(output_path, "w") as f
instead of f = open(output_path, "w")
to make sure file is saved (correctly close) and ready to read in next cell.
I need to convert pandas data frame to JSONL format. I couldn’t find a good package to do it and tried to implement myself, but it looks a bit ugly and not efficient.
For example, given a pandas df:
label pattern
0 DRUG aspirin
1 DRUG trazodone
2 DRUG citalopram
I need to convert to txt file of the form:
{"label":"DRUG","pattern":[{"lower":"aspirin"}]}
{"label":"DRUG","pattern":[{"lower":"trazodone"}]}
{"label":"DRUG","pattern":[{"lower":"citalopram"}]}
I tried with to_dict('records')
, but I’m missing [ ]
and nested ‘lower’ key.
df.to_dict('record')
creates:
[{'label': 'DRUG', 'pattern': 'aspirin'},
{'label': 'DRUG', 'pattern': 'trazodone'},
{'label': 'DRUG', 'pattern': 'citalopram'}]
I thought about converting the ‘pattern’ columns and include nested ‘lower’?
UPD
So far, I succeeded to convert ‘pattern’ into list:
df_new = pd.concat((df[['label']], df[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_new.columns = ['label', 'pattern']
df_new.head()
The result:
label pattern
0 DRUG [aspirin]
1 DRUG [trazodone]
2 DRUG [citalopram]
and then:
df_new.to_dict(orient=’records’)
[{'label': 'DRUG', 'pattern': ['aspirin']},
{'label': 'DRUG', 'pattern': ['trazodone']},
{'label': 'DRUG', 'pattern': ['citalopram']}]
UPD 2
Eventually, I managed to get what I want, but in the most non-pythonic way.
df_1 = pd.DataFrame(df[['pattern']].apply(lambda x: {'lower': x[0]}, axis=1))
df_1.columns = ['pattern']
df_fin = pd.concat((df[['label']], df_1[['pattern']].apply(lambda x: x.tolist(), axis=1)), axis=1)
df_fin.columns = ['label', 'pattern']
df_fin.to_json(orient='records')
'{'label': 'DRUG', 'pattern': [{'lower': 'aspirin'}]}
{'label': 'DRUG', 'pattern': [{'lower': 'trazodone'}]}
{'label': 'DRUG', 'pattern': [{'lower': 'citalopram'}]}'
Any chance you can show a neat solution?
In versions of Pandas > 0.19.0
, DataFrame.to_json
has a parameter, lines
, that will write out JSONL format.
Given that, a more succinct version of your solution might look like this:
import pandas as pd
data = [{'label': 'DRUG', 'pattern': 'aspirin'},
{'label': 'DRUG', 'pattern': 'trazodone'},
{'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)
# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})
# Output in JSONL format
print(df.to_json(orient='records', lines=True))
Output:
{"label":"DRUG","pattern":{"lower":"aspirin"}}
{"label":"DRUG","pattern":{"lower":"trazodone"}}
{"label":"DRUG","pattern":{"lower":"citalopram"}}
To write to a file I have modified the last line of @kmsquire
# Output in JSONL format
import pandas as pd
data = [{'label': 'DRUG', 'pattern': 'aspirin'},
{'label': 'DRUG', 'pattern': 'trazodone'},
{'label': 'DRUG', 'pattern': 'citalopram'}]
df = pd.DataFrame(data)
# Wrap pattern column in a dictionary
df["pattern"] = df.pattern.apply(lambda x: {"lower": x})
# Output in JSONL format into a file
f=open('records.jsonl')
print(df.to_json(orient='records', lines=True),file=f, flush=False)
Very short code that should work for easy coping-pasting.
output_path = "/data/meow/my_output.jsonl"
with open(output_path, "w") as f:
f.write(df_result.to_json(orient='records', lines=True, force_ascii=False))
If you are using jupyter notebook, you should use with open(output_path, "w") as f
instead of f = open(output_path, "w")
to make sure file is saved (correctly close) and ready to read in next cell.