Creating a netCDF file from CSV using xarray with a 3D variable

Question:

I’m trying to transform a CSV file with year, lat, lon, and pressure columns into a 3-dimensional netCDF variable pressure(time, lat, lon).

However, my data contains duplicated coordinate values, as below:

year,lon,lat,pressure
1/1/00,79.4939,34.4713,11981569640
1/1/01,79.4939,34.4713,11870476671
1/1/02,79.4939,34.4713,11858633008
1/1/00,77.9513,35.5452,11254617090
1/1/01,77.9513,35.5452,11267424230
1/1/02,77.9513,35.5452,11297377976
1/1/00,77.9295,35.5188,1031160490

There is one pressure value for each (year, lon, lat) combination, and the same lon/lat pairs repeat across years.

My first attempt was simply:

import pandas as pd
import xarray as xr

csv_file = '.csv'
df = pd.read_csv(csv_file)
df = df.set_index(["year", "lon", "lat"])
ds = df.to_xarray()  # renamed from "xr", which would shadow the xarray module
ds.to_netcdf('netcdf.nc')

So I tried to follow How to convert a csv file to grid with Xarray?, but it crashed.

I think I need to rearrange this CSV to have unique lat/lon values as a function of time, varying only the pressure values.

Something like this:

longitude,latitude,1/1/2000,1/1/2001,1/1/2002,...
79.4939,34.4713,11981569640,...
77.9513,35.5452,11254617090,...
77.9295,35.5188,1031160490,...
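
That rearrangement could be done with a pivot, for example (a minimal sketch, assuming the long-format table from the sample above):

# one row per (lon, lat) pair, one column per year
wide = df.pivot_table(index=["lon", "lat"], columns="year", values="pressure")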

I can use "pd.melt" to create my netcdf:

df = pd.melt(df, id_vars=["lon", "lat"], var_name="year", value_name="pressure")

Just an example of my file with two years:

https://1drv.ms/u/s!AhZf0QH5jEVSjWQ7WNCwJsrKBwor?e=UndUkV

The code below is where I want to get to:

filename = '13.csv'
colnames = ['year', 'lon', 'lat', 'pressure']
# header=0 skips the header row, if the file has one as in the sample above
df = pd.read_csv(filename, names=colnames, header=0)

df["year"] = pd.to_datetime(df["year"], errors='coerce')
ds = df.set_index(['year', 'lon', 'lat']).to_xarray()  # "ds" rather than "xr", to avoid shadowing the xarray module

#ds['time'].attrs = {'units': 'hours since 2018-01-01'}
ds['lat'].attrs = {'units': 'degrees', 'long_name': 'Latitude'}
ds['lon'].attrs = {'units': 'degrees', 'long_name': 'Longitude'}
ds['pressure'].attrs = {'units': 'pa', 'long_name': 'Pressure'}

ds.to_netcdf('my_netcdf.nc')
Asked By: HLGNT


Answers:

The requested task is not directly possible with this data: it is not on a regular horizontal grid, but rather was collected at scattered points.
A scatter plot of the point locations (not reproduced here) shows this irregular coverage.

So, to put the data onto a regular grid, one would have to interpolate. But the data density is very high in some regions and very low in others, and there are more than ~40,000 unique longitude values and ~30,000 unique latitude values, so selecting a regular grid with a very small step is unwise: it would mean an array of roughly 40k x 30k cells.
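
For reference, interpolating onto a regular grid could look like the sketch below (the grid bounds and 0.5-degree spacing are made-up numbers, and scipy's griddata is just one option); note that each time step would need to be interpolated separately:

import numpy as np
from scipy.interpolate import griddata

# hypothetical coarse target grid covering the region of interest
grid_lon, grid_lat = np.meshgrid(np.arange(60.0, 100.0, 0.5),
                                 np.arange(20.0, 50.0, 0.5))
# one_year is a hypothetical subset of the dataframe for a single time step
gridded = griddata(
    (one_year['lon'].values, one_year['lat'].values),  # scattered point locations
    one_year['pres'].values,                           # values at those points
    (grid_lon, grid_lat),                              # regular target grid
    method='linear',
)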

I would suggest just creating a netCDF containing all the points (irregularly spaced) and using that dataset for further analysis.

Here is some code to turn the input xlsx file into a netCDF:

#!/usr/bin/env ipython
import xarray as xr
import pandas as pd
# -----------------
df = pd.read_excel('13.xlsx')
df.columns = ['date', 'lon', 'lat', 'pres']
# parse the date column as datetimes and the remaining columns as numbers
df['date'] = pd.to_datetime(df['date'], errors='coerce')
for cval in ['lon', 'lat', 'pres']:
    df[cval] = pd.to_numeric(df[cval], errors='coerce')
# --------------------------------------
ddf = xr.Dataset.from_dataframe(df)
ddf.to_netcdf('simple_netcdf.nc')
Answered By: msi_gerva

So you have a couple options if you want to save this data as a netCDF (or zarr/HDF5 or any other storage format for data on a regular grid).

The first would be to proceed with your current plan, in which case you absolutely need to address the total size of the resulting hypercube somehow. You could use the sparse library and save your data in a format which supports sparse data. I don’t recommend this option, as your giant sparse cube is going to be super unwieldy, but if you really want a 3D grid with your stations at irregular intervals within it, you can do this.

Alternatively, you can regrid your data, forcing it onto a regular grid. This will still result in very large, sparse data, but it would be slightly more usable than if the coordinates were irregularly spaced. This is a good option if you’re looking to overlay your data on another gridded dataset, for example. If you go that route, you should probably consider using pd.cut to discretize the lat/lon values into regular bins, as in the sketch below.
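
A minimal sketch of that binning (the 1-degree bin width and the mean aggregation are illustrative choices; df is assumed to be the long-format table from the question):

import numpy as np
import pandas as pd

lat_bins = np.arange(-90.0, 91.0, 1.0)
lon_bins = np.arange(-180.0, 181.0, 1.0)

# label each point with the centre of the grid cell it falls into
df['lat_bin'] = pd.cut(df['lat'], lat_bins, labels=lat_bins[:-1] + 0.5)
df['lon_bin'] = pd.cut(df['lon'], lon_bins, labels=lon_bins[:-1] + 0.5)

# average any points sharing a cell; the result can then go through to_xarray()
gridded = df.groupby(['year', 'lat_bin', 'lon_bin'], observed=True)['pressure'].mean()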

A third option is to treat your observations/stations/whatever your collections of points are as just that: a collection of points, and assign a common "point ID" to each point. Then lat/lon becomes an attribute of the point, not an indexing coordinate. This approach requires a bit of a shift in thinking about how xarray/netCDFs work, but this type of indexing is commonly used in observational data, where you may have many perpendicular dimensions such as point ID, positional time index, band, etc., while the position and timestamp of each observation are actually variables indexed by these other dimensions.

To demonstrate this, I’ve set up a little dataset similar in structure to yours:

import xarray as xr, numpy as np, pandas as pd

years = pd.date_range("2000-01-01", freq="YS", periods=3)
# generate 20 random stations on earth
n_stations = 20
lats = np.random.random(size=n_stations) * 180 - 90
lons = np.random.random(size=n_stations) * 360 - 180

# generate data for all combos of (lat, lon) pairs and time
pressure = (np.random.random(size=(n_stations * len(years))) * 1.1e10 + 1e9).astype(int)

df = pd.DataFrame({
    'year': list(years) * n_stations,
    'lat': [l for l in lats for _ in years],
    'lon': [l for l in lons for _ in years],
    'pressure': pressure,
})

The result looks like this:

In [4]: df
Out[4]:
         year        lat         lon     pressure
0  2000-01-01  47.518457 -122.971638   6592720223
1  2001-01-01  47.518457 -122.971638   3181381723
2  2002-01-01  47.518457 -122.971638   4295719754
3  2000-01-01 -61.557495  -80.201070   3843828897
4  2001-01-01 -61.557495  -80.201070  11028409576
5  2002-01-01 -61.557495  -80.201070   2369538294
6  2000-01-01 -69.549806 -108.064884   4736968141
7  2001-01-01 -69.549806 -108.064884   5362327422
8  2002-01-01 -69.549806 -108.064884   5786865879
...
55 2001-01-01   7.065455  -56.622611   1159025195
56 2002-01-01   7.065455  -56.622611   2861490045
57 2000-01-01  10.176521  -93.359717  10668195383
58 2001-01-01  10.176521  -93.359717   6179278941
59 2002-01-01  10.176521  -93.359717   8096958866

The important bit here is that we need to restructure the data so that lat and lon move together with a new point index. You can assign this index in a wide variety of ways, but one easy one if you have 2-dimensional data (point ID by time, here) is to unstack the data into a pandas dataframe:

In [11]: reshaped = df.set_index(['year', 'lat', 'lon']).pressure.unstack('year')
    ...: reshaped
Out[11]:
year                     2000-01-01   2001-01-01   2002-01-01
lat        lon
-69.549806 -108.064884   4736968141   5362327422   5786865879
-61.557495 -80.201070    3843828897  11028409576   2369538294
-26.232121 -42.518353   11071436453   3324450900  10017446009
-17.632865 -43.825574    9624163047   4327094339   5194657461
-10.397045  13.041766    3644097094   4970975759  10215709500
-5.046885  -160.372459  10848978249   5362828700   3165559292
 2.535630   105.366159   7565267947   9150340532   1244019860
 3.070028   54.610328    5774184805   2190428768   3410656879
 7.065455  -56.622611   10487542202   1159025195   2861490045
 10.176521 -93.359717   10668195383   6179278941   8096958866
 11.533859 -8.406768     2311635381   7860849630   9199114517
 15.157955 -113.279669  11984888049  10749492217   8554513278
 20.534460 -9.486914     4636773154  11988039892   7941587610
 32.064057 -55.641618    6209291077   7651976538   9282714003
 42.013715 -55.603621   10377165416  11385104693   7612481121
 43.445033  48.639165    7650284975   2174961057   5519531845
 47.518457 -122.971638   6592720223   3181381723   4295719754
 61.276641 -34.552255   11778765056   2864520584   8978044061
 71.118582  98.074277    8543534134   1709130344   4596373347
 86.568656 -32.057453    2511358407   5623460467  11854301741

Now, we can drop the lat/lon index (we’ll pick the values back up later) and replace it with a station ID index:

In [12]: press_df = reshaped.reset_index(drop=True).rename_axis('station_id')
    ...: press_df
Out[12]:
year         2000-01-01   2001-01-01   2002-01-01
station_id
0            4736968141   5362327422   5786865879
1            3843828897  11028409576   2369538294
2           11071436453   3324450900  10017446009
3            9624163047   4327094339   5194657461
4            3644097094   4970975759  10215709500
5           10848978249   5362828700   3165559292
6            7565267947   9150340532   1244019860
7            5774184805   2190428768   3410656879
8           10487542202   1159025195   2861490045
9           10668195383   6179278941   8096958866
10           2311635381   7860849630   9199114517
11          11984888049  10749492217   8554513278
12           4636773154  11988039892   7941587610
13           6209291077   7651976538   9282714003
14          10377165416  11385104693   7612481121
15           7650284975   2174961057   5519531845
16           6592720223   3181381723   4295719754
17          11778765056   2864520584   8978044061
18           8543534134   1709130344   4596373347
19           2511358407   5623460467  11854301741

Now, let’s keep track of the lat/lons, keeping their order (and thus station_id value) consistent:

In [13]: latlons = reshaped.index.to_frame().reset_index(drop=True).rename_axis('station_id')
    ...: latlons
Out[13]:
                  lat         lon
station_id
0          -69.549806 -108.064884
1          -61.557495  -80.201070
2          -26.232121  -42.518353
3          -17.632865  -43.825574
4          -10.397045   13.041766
5           -5.046885 -160.372459
6            2.535630  105.366159
7            3.070028   54.610328
8            7.065455  -56.622611
9           10.176521  -93.359717
10          11.533859   -8.406768
11          15.157955 -113.279669
12          20.534460   -9.486914
13          32.064057  -55.641618
14          42.013715  -55.603621
15          43.445033   48.639165
16          47.518457 -122.971638
17          61.276641  -34.552255
18          71.118582   98.074277
19          86.568656  -32.057453

We can now re-stack the table and convert to an xarray DataArray:

In [17]: press_da = press_df.stack().to_xarray()
    ...: press_da
Out[17]:
<xarray.DataArray (station_id: 20, year: 3)>
array([[ 4736968141,  5362327422,  5786865879],
       [ 3843828897, 11028409576,  2369538294],
       [11071436453,  3324450900, 10017446009],
       [ 9624163047,  4327094339,  5194657461],
       [ 3644097094,  4970975759, 10215709500],
       [10848978249,  5362828700,  3165559292],
       [ 7565267947,  9150340532,  1244019860],
       [ 5774184805,  2190428768,  3410656879],
       [10487542202,  1159025195,  2861490045],
       [10668195383,  6179278941,  8096958866],
       [ 2311635381,  7860849630,  9199114517],
       [11984888049, 10749492217,  8554513278],
       [ 4636773154, 11988039892,  7941587610],
       [ 6209291077,  7651976538,  9282714003],
       [10377165416, 11385104693,  7612481121],
       [ 7650284975,  2174961057,  5519531845],
       [ 6592720223,  3181381723,  4295719754],
       [11778765056,  2864520584,  8978044061],
       [ 8543534134,  1709130344,  4596373347],
       [ 2511358407,  5623460467, 11854301741]])
Coordinates:
  * station_id  (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
  * year        (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01

Note that the dimensions here are (station_id, year), not (lat, lon). We can add the (lat, lon), indexed by station_id, as coordinates:

In [19]: press_da = press_da.assign_coords(**latlons.to_xarray())

In [20]: press_da
Out[20]:
<xarray.DataArray (station_id: 20, year: 3)>
array([[ 4736968141,  5362327422,  5786865879],
       [ 3843828897, 11028409576,  2369538294],
       [11071436453,  3324450900, 10017446009],
       [ 9624163047,  4327094339,  5194657461],
       [ 3644097094,  4970975759, 10215709500],
       [10848978249,  5362828700,  3165559292],
       [ 7565267947,  9150340532,  1244019860],
       [ 5774184805,  2190428768,  3410656879],
       [10487542202,  1159025195,  2861490045],
       [10668195383,  6179278941,  8096958866],
       [ 2311635381,  7860849630,  9199114517],
       [11984888049, 10749492217,  8554513278],
       [ 4636773154, 11988039892,  7941587610],
       [ 6209291077,  7651976538,  9282714003],
       [10377165416, 11385104693,  7612481121],
       [ 7650284975,  2174961057,  5519531845],
       [ 6592720223,  3181381723,  4295719754],
       [11778765056,  2864520584,  8978044061],
       [ 8543534134,  1709130344,  4596373347],
       [ 2511358407,  5623460467, 11854301741]])
Coordinates:
  * station_id  (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
  * year        (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01
    lat         (station_id) float64 -69.55 -61.56 -26.23 ... 61.28 71.12 86.57
    lon         (station_id) float64 -108.1 -80.2 -42.52 ... -34.55 98.07 -32.06

And now we have all our data, with year perpendicular to station ID, making data analysis along the year dimension easy, but with no need to handle a sparse array.
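
For example (a small usage sketch continuing the session above; the target location is made up), reductions along year work directly, and the nearest station can still be found through the lat/lon coordinate variables:

# mean pressure per station across all years
station_means = press_da.mean(dim="year")

# time series of the station closest to a hypothetical target location
dist = np.hypot(press_da.lat - 47.5, press_da.lon + 123.0)
nearest_station = press_da.isel(station_id=dist.argmin())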

If you’d like, you can now document the DataArray & Dataset and then write to netcdf:

In [24]: import datetime
    ...: ds = press_da.to_dataset(name="pressure")
    ...: ds.pressure.attrs.update({
    ...:     "units": "big numbers",
    ...:     "long_name": "Pressure!",
    ...:     "cell_method": "random numbers",
    ...: })
    ...: ds.attrs.update({
    ...:     # netCDF attributes must be strings or numbers, so stringify the timestamp
    ...:     "created": str(datetime.datetime.now()),
    ...:     "author": "me",
    ...:     "method": "moving random data around",
    ...:     "etc": "etc",
    ...: })

In [25]: ds
Out[25]:
<xarray.Dataset>
Dimensions:     (station_id: 20, year: 3)
Coordinates:
  * station_id  (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
  * year        (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01
    lat         (station_id) float64 -69.55 -61.56 -26.23 ... 61.28 71.12 86.57
    lon         (station_id) float64 -108.1 -80.2 -42.52 ... -34.55 98.07 -32.06
Data variables:
    pressure    (station_id, year) int64 4736968141 5362327422 ... 11854301741
Attributes:
    created:  2022-11-08 10:36:39.875581
    author:   me
    method:   moving random data around
    etc:      etc
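
And finally, the write itself (the filename here is arbitrary):

ds.to_netcdf('stations.nc')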
Answered By: Michael Delgado