Pandas row to json
Question:
I have a dataframe in pandas and my goal is to write each row of the dataframe as a new json file.
I’m a bit stuck right now. My intuition was to iterate over the rows of the dataframe (using df.iterrows) and use json.dumps to write each file, but to no avail.
Any thoughts?
Answers:
Pandas DataFrames have a to_json method that will do it for you:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
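For example (a minimal sketch with a made-up two-row frame), orient='records' serializes each row as one JSON object:

```python
import pandas as pd

# toy frame standing in for the asker's data
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# each row becomes one JSON object in a list
print(df.to_json(orient="records"))
# [{"a":1,"b":"x"},{"a":2,"b":"y"}]
```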
If you want each row in its own file you can iterate over the index (and use the index to help name them):
for i in df.index:
    df.loc[i].to_json("row{}.json".format(i))
Looping over indices is very inefficient.
A faster technique:
df['json'] = df.apply(lambda x: x.to_json(), axis=1)
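To see what this produces (a minimal sketch with a made-up frame), each row Series is serialized as a {column: value, ...} object:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Series.to_json serializes one row as {column: value, ...}
df["json"] = df.apply(lambda x: x.to_json(), axis=1)
print(df["json"].iloc[0])
# {"a":1,"b":"x"}
```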
Using apply, this can be done as
import json

def writejson(row):
    # row["json"] holds a JSON string, so parse it first to avoid
    # double-encoding it when pretty-printing with indent=2
    with open(row["filename"] + ".json", "w") as outfile:
        json.dump(json.loads(row["json"]), outfile, indent=2)

in_df.apply(writejson, axis=1)
This assumes the dataframe has a column named “filename” containing the target filename for each row’s json file.
Extending @MrE’s answer: if you’re looking to convert multiple columns from a single row into another column holding the content in json format (rather than writing separate json files), I’ve had speed issues while using:
df['json'] = df.apply(lambda x: x.to_json(), axis=1)
I achieved significant speed improvements on a dataset of 175K records with 5 columns using this line of code instead:
df['json'] = df.to_json(orient='records', lines=True).splitlines()
Speed went from >1 min to 350 ms.
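The speedup comes from replacing a per-row Python call with one vectorized serialization of the whole frame followed by a plain string split. A minimal sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# one call serializes the whole frame as newline-delimited JSON;
# splitlines() then yields one JSON string per row
df["json"] = df.to_json(orient="records", lines=True).splitlines()
print(df["json"].tolist())
# ['{"a":1,"b":"x"}', '{"a":2,"b":"y"}']
```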
Here’s a simple solution:
Transform the dataframe to json with one record per line, then simply split the lines:
list_of_jsons = df.to_json(orient='records', lines=True).splitlines()
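To close the loop on the original question (one file per row), each string in the list can then be written out directly. A sketch, assuming a made-up frame and illustrative filenames:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
list_of_jsons = df.to_json(orient="records", lines=True).splitlines()

# write each record to its own file; the "row{i}.json" naming is
# illustrative, not from the original answer
outdir = tempfile.mkdtemp()
for i, line in enumerate(list_of_jsons):
    with open(os.path.join(outdir, "row{}.json".format(i)), "w") as f:
        f.write(line)
```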