Create multiple CSVs in Dask with different names

Question:

I am using dask.dataframe.to_csv to write a CSV.

I was expecting 13 CSVs, but it writes only 2 CSVs, overwriting the old ones (which I do not want):

for i in range(13):
    df_temp = df1.groupby("temp").get_group(unique_cases[i])
    df_temp.to_csv(path_store_csv + "*.csv")

I also tried this but it did not work:

for i in range(13):
    df_temp = df1.groupby("temp").get_group(unique_cases[i])
    df_temp.to_csv("to/my/path" + str(i) + ".csv")

This creates 13 folders, each containing 2 CSVs (completely wrong).

In pandas we can write DataFrames in a loop with df.to_csv(). Is it possible to do this with Dask?
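
For reference, this is roughly what I mean in pandas (the data and file names here are only illustrative):

import pandas as pd

df = pd.DataFrame({"temp": ["a", "b", "a", "c"], "value": range(4)})

# one CSV per group, named after the group value
for case, group in df.groupby("temp"):
    group.to_csv(f"group-{case}.csv", index=False)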

Thanks in advance

Asked By: Coder


Answers:

I would give the dask.dataframe.to_csv documentation a close read. Specifically:

Store Dask DataFrame to CSV files

One filename per partition will be created.

Dask is doing exactly what it says it will: writing one file per partition, either under the path you give it (treating the path as a directory) or expanding the asterisk into partition ids. Either way, it writes one file per partition on each iteration of your loop. So yes, the way you have it written, the files overwrite each other.
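
For illustration, a minimal sketch of that per-partition behavior (the dataframe, column name, and output path here are made up):

    import pandas as pd
    import dask.dataframe as dd

    # a tiny frame split into 2 partitions, purely for illustration
    pdf = pd.DataFrame({"temp": ["a", "a", "b", "b"], "value": range(4)})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # the "*" is replaced with the partition number, so this writes
    # out/0.csv and out/1.csv -- one file per partition
    ddf.to_csv("out/*.csv")

Calling to_csv with the same path on every iteration of a loop just rewrites those same per-partition files, which is why only a couple of CSVs survive.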

If you want to write outputs that are organized by group, you have a few options:

  • you could repartition the data by the group column and write out with dask. This would ensure files are organized by group, but any one file might still contain multiple groups. See the docs on dask.dataframe.set_index and dask.dataframe.repartition for more info. I can’t provide a working example without more information about your dataframe, because a working approach here will depend on your dask cluster setup and the size of your data and partitions (a rough sketch of the idea follows this list). But this is the only option that does not involve computing the entire frame multiple times, so it would be by far the fastest and most scalable.

  • you could write each group to a partitioned dataset. This will be faster than the third option and won’t break if the groups are larger than memory. But the frame will need to be computed multiple times in order to collect and write each group. For example:

    grouped = df1.groupby("temp")
    for case in unique_cases:
        grouped.get_group(case).to_csv(
            path_store_csv + f"/group-{case}/*.csv"
        )
    

    This will write out 13 folders, each containing multiple CSVs, one per partition of the dataframe. This is not wrong; it is how Dask writes partitioned output.

  • you could compute each grouped item so the write is done by pandas. This will be the slowest, and will only work if each group is small enough to fit into memory. But if you can fit every group into memory, this will give you one file per group:

    grouped = df1.groupby("temp")
    for case in unique_cases:
        grouped.get_group(case).compute().to_csv(
            path_store_csv + f"-{case}.csv"
        )
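
For the first option (repartitioning by the group column), here is a rough sketch of the idea; the toy data, npartitions, and output path below are placeholders to tune to your own cluster and data sizes:

    import pandas as pd
    import dask.dataframe as dd

    # toy stand-in for df1 -- replace with your real data
    pdf = pd.DataFrame({"temp": ["a", "b", "c"] * 4, "value": range(12)})
    df1 = dd.from_pandas(pdf, npartitions=3)

    # setting the index shuffles the data so rows with the same "temp"
    # value land in the same (or adjacent) partitions; npartitions is a
    # placeholder -- tune it to your cluster and data size
    by_group = df1.set_index("temp", npartitions=3)

    # one CSV per partition; a partition may still contain more than one group
    by_group.to_csv("out/by-temp-*.csv")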
    

Generally, dask is designed to behave similarly enough to pandas that the behavior is intuitive to people with pandas experience. But because it is intended to work with data which is too large to fit into memory or to parallelize across many machines, it necessarily cannot and should not work exactly like pandas. So it’s best to keep in mind how your data is partitioned and how this might affect operations like groupby, read/write, reshuffle, etc. If dask isn’t behaving in a way you expect, go to the docs right away to see if there is guidance on the difference.

Answered By: Michael Delgado