How can I speed up xarray resample (much slower than pandas resample)

Question:

Here is an MWE that resamples a time series to hourly means in xarray and in pandas. With 10-minute input data, the resample takes 6.8 seconds in xarray and 0.003 seconds in pandas. Is there some way to get pandas speed in xarray? The pandas timing seems to be independent of the input period, while the xarray timing grows with it (the same 100000 samples span more hourly bins as the period gets longer).

import numpy as np
import xarray as xr
import pandas as pd
import time

def make_ds(freq):
    size = 100000
    times = pd.date_range('2000-01-01', periods=size, freq=freq)
    ds = xr.Dataset({
        'foo': xr.DataArray(
            data   = np.random.random(size),
            dims   = ['time'],
            coords = {'time': times}
        )})
    return ds

for f in ["1s", "1Min", "10Min"]:
    ds = make_ds(f)

    start = time.time()
    ds_r = ds.resample({'time':"1H"}).mean()
    print(f, 'xr', str(time.time() - start))

    start = time.time()
    ds_r = ds.to_dataframe().resample("1H").mean()
    print(f, 'pd', str(time.time() - start))

Output:

1s xr 0.040313720703125
1s pd 0.0033435821533203125
1Min xr 0.5757267475128174
1Min pd 0.0025794506072998047
10Min xr 6.798743486404419
10Min pd 0.0029947757720947266

Asked By: mankoff

||

Answers:

According to the xarray GitHub issue, the slowness comes from how xarray implements resample, which is actually a GroupBy. The workaround is to do the resampling outside xarray. My solution is to use the fast pandas resample and then rebuild the xarray Dataset:

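# Resample in pandas (fast); the result is a DataFrame indexed by time.
df_h = ds.to_dataframe().resample("1H").mean()
# Rebuild one DataArray per column, carrying over each variable's attrs.
vals = [xr.DataArray(data=df_h[c], dims=['time'], coords={'time': df_h.index}, attrs=ds[c].attrs) for c in df_h.columns]
# Reassemble the Dataset, keeping the global attrs.
ds_h = xr.Dataset(dict(zip(df_h.columns, vals)), attrs=ds.attrs)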
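
For reuse, the same round trip through pandas can be wrapped in a small function. This is only a sketch: the name resample_mean_via_pandas is made up for illustration, it assumes every data variable has the single dimension 'time', and it uses xr.Dataset.from_dataframe for the rebuild instead of constructing each DataArray by hand. The lowercase "1h" and "10min" frequency strings are used because newer pandas versions warn about the uppercase "H" alias.

```python
import numpy as np
import pandas as pd
import xarray as xr


def resample_mean_via_pandas(ds, freq):
    """Hypothetical helper: mean-resample a time-only Dataset through pandas."""
    # Assumption: every data variable has the single dimension 'time'.
    df = ds.to_dataframe().resample(freq).mean()
    # The index name ('time') becomes the dimension of the rebuilt Dataset.
    out = xr.Dataset.from_dataframe(df)
    # from_dataframe does not carry attributes over, so copy them by hand.
    out.attrs.update(ds.attrs)
    for name in out.data_vars:
        out[name].attrs.update(ds[name].attrs)
    return out


# Small illustrative dataset: 12 samples at 10-minute spacing (two full hours).
times = pd.date_range("2000-01-01", periods=12, freq="10min")
ds = xr.Dataset(
    {"foo": ("time", np.arange(12.0), {"units": "m"})},
    coords={"time": times},
    attrs={"title": "demo"},
)

ds_h = resample_mean_via_pandas(ds, "1h")
print(ds_h["foo"].values)  # hourly means of 0..5 and 6..11
```

For data with more dimensions than time, to_dataframe() produces a MultiIndex and this sketch would not apply as written.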
Answered By: mankoff