How to handle large volume of data in a single array using xarray?

Question:

I have 16 years of daily meteorological data in NetCDF; each day is a grid of 501 x 572, so each year has dimensions 365 x 501 x 572. I convert it into a one-dimensional array and then try to plot a probability distribution, but because the data is so large, the Python kernel restarts. How can I optimize my code to convert 16 (years) x 365 (days) x 501 (lat) x 572 (lon) values into a single array to plot the distribution? I used chunks to optimize the input, but it still fails (the kernel restarts on my laptop) when I convert it into a single array. How can I handle this much data using xarray?

import matplotlib.pyplot as plt
import xarray as xr
import numpy as np
import seaborn as sns

fname='20*.nc'

ds=xr.open_mfdataset(fname,parallel=True,chunks=100)
prec = ds.irwin_cdr.values.flatten()
sns.displot(prec, bins=50, color="g")
Asked By: Krishnaap


Answers:

You mention:

16 (years) x 365 (days) x 501 (lat) x 572 (lon)

This is equal to 1.67e9 values. Assuming they’re float64, that’s eight bytes per value, i.e. 13.4 gigabytes of RAM. That’s challenging. You could halve the RAM usage by converting to float32.
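The back-of-the-envelope arithmetic can be checked quickly (a minimal sketch; the per-value sizes come from NumPy's dtype objects):

```python
import numpy as np

# Shape of the full dataset: years x days x lat x lon
n_values = 16 * 365 * 501 * 572  # ~1.67e9 values

bytes_f64 = n_values * np.dtype("float64").itemsize
bytes_f32 = n_values * np.dtype("float32").itemsize

print(f"{n_values:.2e} values")
print(f"float64: {bytes_f64 / 1e9:.1f} GB")  # ~13.4 GB
print(f"float32: {bytes_f32 / 1e9:.1f} GB")  # ~6.7 GB
```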

Xarray has many tools for dealing with large data, but calling .values turns the data into a numpy array and loads everything into memory. I’m not sure what displot does behind the scenes (you can read its source if you want to), but it seems like you want to compute some kind of histogram.

In that case, your real problem is: how do I compute a histogram of a very large array? — Fortunately, that question has been answered already:

Numpy histogram of large arrays

Compute a histogram piece by piece (np.histogram returns the edges + counts), then sum all the counts.

ds = xr.open_mfdataset(fname, parallel=True, chunks="auto")
da = ds["irwin_cdr"]
step = ...
bins = np.arange(da.min(), da.max() + step, step)

# Histogram the first timestep, then accumulate the counts of the rest,
# always using the same bin edges.
hist, _ = np.histogram(da.isel(time=0).values.ravel(), bins=bins)
for i in range(1, len(da["time"])):
    hist += np.histogram(da.isel(time=i).values.ravel(), bins=bins)[0]

# Now do your plotting with bins, hist

This will only read a single timestep into memory at a time.
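The reason this works: with a fixed set of bin edges, the per-chunk counts from np.histogram simply add up to the counts of the whole array. A small self-contained check (synthetic data, not your NetCDF files):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 50))  # stand-in for (time, space) values
bins = np.linspace(-4, 4, 51)     # fixed edges shared by all chunks

# Full histogram in one go
full, _ = np.histogram(data.ravel(), bins=bins)

# Same histogram accumulated one "timestep" at a time
partial = np.zeros(len(bins) - 1, dtype=int)
for row in data:
    partial += np.histogram(row, bins=bins)[0]

print("chunked counts match:", (full == partial).all())
```

The only requirement is that every chunk is histogrammed against the same bins, which is why the edges are computed once up front.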

There’s a gist here on how to use seaborn with precomputed histograms:
https://gist.github.com/pierdom/d639a1d3b8934ee31db8b2ab9997ae92

I reckon this might do the trick:

bin_midpoint = 0.5 * (bins[:-1] + bins[1:])
sns.histplot(x=bin_midpoint, weights=hist, discrete=True)
Answered By: Huite Bootsma