How to Convert Dask DataFrame Into List of Dictionaries?
Question:
I need to convert a dask dataframe into a list of dictionaries as the response for an API endpoint. I know I can convert the dask dataframe to pandas, and then from there I can convert to dictionary, but it would be better to map each partition to a dict, and then concatenate.
What I tried:
df = dd.read_csv(path, usecols=cols)
dd.compute(df.to_dict(orient='records'))
Error I’m getting:
AttributeError: 'DataFrame' object has no attribute 'to_dict'
Answers:
You can do it as follows
import dask.bag as db
db.from_delayed(df.map_partitions(pd.DataFrame.to_dict, orient='records'
).to_delayed())
which gives you a bag which you could compute (if it fits in memory) or otherwise manipulate.
Note that to_delayed/from_delayed should not be necessary, there is also a to_bag
method, but it doesn’t seem to do the right thing.
Also, you are not really getting much from the dataframe
model here, you may want to start with db.read_text
and the builtin CSV module.
Try this:
data=list(df.map_partitions(lambda x:x.to_dict(orient="records")))
It will return a list of dictionaries wherein each row will be converted to the dictionary.
The answer of Kunal Bafna is easiest to implement, and has fewer dependencies.
data=list(df.map_partitions(lambda x:x.to_dict(orient="records")))
I need to convert a dask dataframe into a list of dictionaries as the response for an API endpoint. I know I can convert the dask dataframe to pandas, and then from there I can convert to dictionary, but it would be better to map each partition to a dict, and then concatenate.
What I tried:
df = dd.read_csv(path, usecols=cols)
dd.compute(df.to_dict(orient='records'))
Error I’m getting:
AttributeError: 'DataFrame' object has no attribute 'to_dict'
You can do it as follows
import dask.bag as db
db.from_delayed(df.map_partitions(pd.DataFrame.to_dict, orient='records'
).to_delayed())
which gives you a bag which you could compute (if it fits in memory) or otherwise manipulate.
Note that to_delayed/from_delayed should not be necessary, there is also a to_bag
method, but it doesn’t seem to do the right thing.
Also, you are not really getting much from the dataframe
model here, you may want to start with db.read_text
and the builtin CSV module.
Try this:
data=list(df.map_partitions(lambda x:x.to_dict(orient="records")))
It will return a list of dictionaries wherein each row will be converted to the dictionary.
The answer of Kunal Bafna is easiest to implement, and has fewer dependencies.
data=list(df.map_partitions(lambda x:x.to_dict(orient="records")))