printing the uploading progress using `xarray.Dataset.to_zarr` function

Question:

I’m trying to upload an xarray dataset to GCP using the function ds.to_zarr(store=store), and it works perfect. However, I would like to show the progress of big datasets. Is there any option to chunk my dataset in a way I can use tqdm or someting like that to log the uploading progress?

Here is the code that I currently have:

import os

import xarray as xr
import numpy as np
import gcsfs
from dask.diagnostics import ProgressBar

if __name__ == '__main__':
    # for testing
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account.json"

    # create xarray
    data_arr = np.random.rand(5000, 100, 100)
    data_xarr = xr.DataArray(data_arr,
                             dims=["x", "y", "z"])

    # define store
    gcp_blob_uri = "gs://gprlib/test.zarr"
    gcs = gcsfs.GCSFileSystem()
    store = gcs.get_mapper(gcp_blob_uri)

    # delayed to_zarr computation -> seems that it does not work
    write_job = data_xarr
        .to_dataset(name="data")
        .to_zarr(store, mode="w", compute=False)

    print(write_job)
Asked By: Henry Ruiz

||

Answers:

xarray.Dataset.to_zarr has an optional argument compute which is True by default:

compute (bool, optional) – If True write array data immediately, otherwise return a dask.delayed.Delayed object that can be computed to write array data later. Metadata is always updated eagerly.

Using this, you can track the progress using dask’s own dask.distributed.progress bar:

write_job = ds.to_zarr(store, compute=False)
write_job = write_job.persist()

# this will return an interactive (non-blocking) widget if in a notebook
# environment. To force the widget to block, provide notebook=False.
distributed.progress(write_job, notebook=False)
[##############                          ] | 35% Completed |  4.5s

Note that for this to work, the dataset must consist of chunked dask arrays. If the data is in memory, you could use a single chunk per array with ds.chunk().to_zarr.

Answered By: Michael Delgado
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.