Creating a netCDF file from a csv using xarray with a 3D variable
Question:
I’m trying to transform a csv file with year, lat, lon and pressure into a 3-dimensional netCDF variable pressure(time, lat, lon).
However, my list contains repeated coordinate values, as below:
year,lon,lat,pressure
1/1/00,79.4939,34.4713,11981569640
1/1/01,79.4939,34.4713,11870476671
1/1/02,79.4939,34.4713,11858633008
1/1/00,77.9513,35.5452,11254617090
1/1/01,77.9513,35.5452,11267424230
1/1/02,77.9513,35.5452,11297377976
1/1/00,77.9295,35.5188,1031160490
There is exactly one pressure value for each (year, lon, lat) combination.
My first attempt was to use xarray directly:
import pandas as pd
import xarray as xr
csv_file = '.csv'
df = pd.read_csv(csv_file)
df = df.set_index(["year", "lon", "lat"])
ds = df.to_xarray()  # use a name other than `xr` so the module isn't shadowed
ds.to_netcdf('netcdf.nc')
So I tried to follow How to convert a csv file to grid with Xarray?, but it crashed.
I think I need to rearrange this csv so that each unique (lat, lon) pair appears once, with only the pressure values varying as a function of time.
Something like this:
longitude,latitude,1/1/2000,1/1/2001,1/1/2002,...
79.4939,34.4713,11981569640,...
77.9513,35.5452,11254617090,...
77.9295,35.5188,1031160490,...
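(For reference, one way to get a wide layout like this is pandas' pivot_table; the column names below mirror the sample csv, and the values are illustrative.)

```python
import pandas as pd

# sample rows mirroring the question's csv (values are illustrative)
df = pd.DataFrame({
    "year": ["1/1/00", "1/1/01", "1/1/00", "1/1/01"],
    "lon": [79.4939, 79.4939, 77.9513, 77.9513],
    "lat": [34.4713, 34.4713, 35.5452, 35.5452],
    "pressure": [11981569640, 11870476671, 11254617090, 11267424230],
})

# wide table: one row per (lon, lat), one column per year
wide = df.pivot_table(index=["lon", "lat"], columns="year", values="pressure")
```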
I thought I could use pd.melt to get back to long format for the netCDF:
df = pd.melt(df, id_vars=["lon", "lat"], var_name="year", value_name="PRESSURE")
Just an example of my file with two years:
https://1drv.ms/u/s!AhZf0QH5jEVSjWQ7WNCwJsrKBwor?e=UndUkV
Using the code below, this is where I want to end up:
filename = '13.csv'
colnames = ['year','lon','lat','pressure']
df = pd.read_csv(filename, names = colnames)
df["year"]= pd.to_datetime(df["year"], errors='coerce')
ds = df.set_index(['year','lon','lat']).to_xarray()  # avoid shadowing the xr module
#ds['time'].attrs={'units':'hours since 2018-01-01'}
ds['lat'].attrs={'units':'degrees', 'long_name':'Latitude'}
ds['lon'].attrs={'units':'degrees', 'long_name':'Longitude'}
ds['pressure'].attrs={'units':'Pa', 'long_name':'Pressure'}
ds.to_netcdf('my_netcdf.nc')
Answers:
The requested task is not directly possible with this data: it is not on a regular horizontal grid, but rather data collected at scattered points.
Here is a plot of the point locations (image not reproduced here):
To put the data on a regular grid, one would have to interpolate. But the density of the data is very high in some regions and very low in others, and there are more than ~40,000 unique longitude and ~30,000 unique latitude values, so it is unwise to pick a regular grid with a very small step: it would mean a 40,000 × 30,000 array.
I would suggest just making a netCDF containing all the points (irregularly spaced) and using that dataset for further analysis.
Here is some code to turn the input xlsx file to netCDF:
#!/usr/bin/env ipython
import pandas as pd
import xarray as xr
# -----------------
df = pd.read_excel('13.xlsx')
df.columns = ['date', 'lon', 'lat', 'pres']
# parse the date column as dates; coerce the remaining columns to numbers
df['date'] = pd.to_datetime(df['date'], errors='coerce')
for cval in ['lon', 'lat', 'pres']:
    df[cval] = pd.to_numeric(df[cval], errors='coerce')
# --------------------------------------
ds = xr.Dataset.from_dataframe(df)
ds.to_netcdf('simple_netcdf.nc')
So you have a couple of options if you want to save this data as netCDF (or Zarr/HDF5 or any other storage format for data on a regular grid).
The first would be to proceed with your current plan, in which case you absolutely need to address the total size of the resulting hypercube somehow. You could use the sparse library and save your data in a format which supports sparse data. I don’t recommend this option, as your giant sparse cube is going to be super unwieldy, but if you really want a 3D grid with your stations at irregular intervals within it, you can do this.
Alternatively, you can regrid your data to force it onto a regular grid. This will still result in very large, sparse data, but it would be slightly more usable than if the coordinates were irregularly spaced. This is a good option if you’re looking to overlay your data on another gridded dataset, for example. If you go that route, you should probably consider using pd.cut for discretizing the lat/lon values into regular bins.
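A minimal sketch of that binning idea (the 0.5-degree bin edges here are made up, and the mean is just one possible per-cell aggregation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "lat": [34.4713, 35.5452, 35.5188],
    "lon": [79.4939, 77.9513, 77.9295],
    "pressure": [11981569640, 11254617090, 1031160490],
})

# snap each point to a 0.5-degree cell; labels are the bin centers
lat_bins = np.arange(34.0, 36.5, 0.5)
lon_bins = np.arange(77.0, 80.5, 0.5)
df["lat_bin"] = pd.cut(df["lat"], lat_bins, labels=lat_bins[:-1] + 0.25)
df["lon_bin"] = pd.cut(df["lon"], lon_bins, labels=lon_bins[:-1] + 0.25)

# average pressure within each occupied cell
gridded = df.groupby(["lat_bin", "lon_bin"], observed=True)["pressure"].mean()
```

From here, `gridded` can be unstacked or converted with `.to_xarray()` into a (still sparse) regular grid.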
A third option is to treat your observations/stations/whatever your collections of points are as just that – a collection of points, and assign a common "point ID" to each point. Then, lat/lon becomes an attribute of the point, not an indexing coordinate. This approach requires a bit of a shift in thinking about how xarray/netCDFs work, but this type of indexing is used commonly in observational data, where you may have many perpendicular dimensions such as point ID, positional time index, band, etc., but the position and timestamp of each observation are actually variables indexed by these other dimensions.
To demonstrate this, I’ve set up a little dataset similar in structure to yours:
import xarray as xr, numpy as np, pandas as pd
years = pd.date_range("2000-01-01", freq="YS", periods=3)
# generate 20 random stations on earth
n_stations = 20
lats = np.random.random(size=n_stations) * 180 - 90
lons = np.random.random(size=n_stations) * 360 - 180
# generate data for all combos of (lat, lon) pairs and time
pressure = (np.random.random(size=(n_stations * len(years))) * 1.1e10 + 1e9).astype(int)
df = pd.DataFrame({
    'year': list(years) * n_stations,
    'lat': [l for l in lats for _ in years],
    'lon': [l for l in lons for _ in years],
    'pressure': pressure,
})
This looks like this:
In [4]: df
Out[4]:
year lat lon pressure
0 2000-01-01 47.518457 -122.971638 6592720223
1 2001-01-01 47.518457 -122.971638 3181381723
2 2002-01-01 47.518457 -122.971638 4295719754
3 2000-01-01 -61.557495 -80.201070 3843828897
4 2001-01-01 -61.557495 -80.201070 11028409576
5 2002-01-01 -61.557495 -80.201070 2369538294
6 2000-01-01 -69.549806 -108.064884 4736968141
7 2001-01-01 -69.549806 -108.064884 5362327422
8 2002-01-01 -69.549806 -108.064884 5786865879
...
55 2001-01-01 7.065455 -56.622611 1159025195
56 2002-01-01 7.065455 -56.622611 2861490045
57 2000-01-01 10.176521 -93.359717 10668195383
58 2001-01-01 10.176521 -93.359717 6179278941
59 2002-01-01 10.176521 -93.359717 8096958866
The important bit here is that we need to restructure the data so that lat and lon move together with a new point index. You can assign this index in a wide variety of ways, but one easy one if you have 2-dimensional data (point ID by time, here) is to unstack the data into a pandas dataframe:
In [11]: reshaped = df.set_index(['year', 'lat', 'lon']).pressure.unstack('year')
...: reshaped
Out[11]:
year 2000-01-01 2001-01-01 2002-01-01
lat lon
-69.549806 -108.064884 4736968141 5362327422 5786865879
-61.557495 -80.201070 3843828897 11028409576 2369538294
-26.232121 -42.518353 11071436453 3324450900 10017446009
-17.632865 -43.825574 9624163047 4327094339 5194657461
-10.397045 13.041766 3644097094 4970975759 10215709500
-5.046885 -160.372459 10848978249 5362828700 3165559292
2.535630 105.366159 7565267947 9150340532 1244019860
3.070028 54.610328 5774184805 2190428768 3410656879
7.065455 -56.622611 10487542202 1159025195 2861490045
10.176521 -93.359717 10668195383 6179278941 8096958866
11.533859 -8.406768 2311635381 7860849630 9199114517
15.157955 -113.279669 11984888049 10749492217 8554513278
20.534460 -9.486914 4636773154 11988039892 7941587610
32.064057 -55.641618 6209291077 7651976538 9282714003
42.013715 -55.603621 10377165416 11385104693 7612481121
43.445033 48.639165 7650284975 2174961057 5519531845
47.518457 -122.971638 6592720223 3181381723 4295719754
61.276641 -34.552255 11778765056 2864520584 8978044061
71.118582 98.074277 8543534134 1709130344 4596373347
86.568656 -32.057453 2511358407 5623460467 11854301741
Now, we can drop the lat/lon index (we’ll pick it back up later) and replace it with a station ID index:
In [12]: press_df = reshaped.reset_index(drop=True).rename_axis('station_id')
...: press_df
Out[12]:
year 2000-01-01 2001-01-01 2002-01-01
station_id
0 4736968141 5362327422 5786865879
1 3843828897 11028409576 2369538294
2 11071436453 3324450900 10017446009
3 9624163047 4327094339 5194657461
4 3644097094 4970975759 10215709500
5 10848978249 5362828700 3165559292
6 7565267947 9150340532 1244019860
7 5774184805 2190428768 3410656879
8 10487542202 1159025195 2861490045
9 10668195383 6179278941 8096958866
10 2311635381 7860849630 9199114517
11 11984888049 10749492217 8554513278
12 4636773154 11988039892 7941587610
13 6209291077 7651976538 9282714003
14 10377165416 11385104693 7612481121
15 7650284975 2174961057 5519531845
16 6592720223 3181381723 4295719754
17 11778765056 2864520584 8978044061
18 8543534134 1709130344 4596373347
19 2511358407 5623460467 11854301741
Now, let’s keep track of the lat/lons, keeping their order (and thus station_id value) consistent:
In [13]: latlons = reshaped.index.to_frame().reset_index(drop=True).rename_axis('station_id')
...: latlons
Out[13]:
lat lon
station_id
0 -69.549806 -108.064884
1 -61.557495 -80.201070
2 -26.232121 -42.518353
3 -17.632865 -43.825574
4 -10.397045 13.041766
5 -5.046885 -160.372459
6 2.535630 105.366159
7 3.070028 54.610328
8 7.065455 -56.622611
9 10.176521 -93.359717
10 11.533859 -8.406768
11 15.157955 -113.279669
12 20.534460 -9.486914
13 32.064057 -55.641618
14 42.013715 -55.603621
15 43.445033 48.639165
16 47.518457 -122.971638
17 61.276641 -34.552255
18 71.118582 98.074277
19 86.568656 -32.057453
We can now re-stack the table and convert to an xarray DataArray:
In [17]: press_da = press_df.stack().to_xarray()
...: press_da
Out[17]:
<xarray.DataArray (station_id: 20, year: 3)>
array([[ 4736968141, 5362327422, 5786865879],
[ 3843828897, 11028409576, 2369538294],
[11071436453, 3324450900, 10017446009],
[ 9624163047, 4327094339, 5194657461],
[ 3644097094, 4970975759, 10215709500],
[10848978249, 5362828700, 3165559292],
[ 7565267947, 9150340532, 1244019860],
[ 5774184805, 2190428768, 3410656879],
[10487542202, 1159025195, 2861490045],
[10668195383, 6179278941, 8096958866],
[ 2311635381, 7860849630, 9199114517],
[11984888049, 10749492217, 8554513278],
[ 4636773154, 11988039892, 7941587610],
[ 6209291077, 7651976538, 9282714003],
[10377165416, 11385104693, 7612481121],
[ 7650284975, 2174961057, 5519531845],
[ 6592720223, 3181381723, 4295719754],
[11778765056, 2864520584, 8978044061],
[ 8543534134, 1709130344, 4596373347],
[ 2511358407, 5623460467, 11854301741]])
Coordinates:
* station_id (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
* year (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01
Note that the dimensions here are (station_id, year), not (lat, lon). We can add the lat and lon, indexed by station_id, as coordinates:
In [19]: press_da = press_da.assign_coords(**latlons.to_xarray())
In [20]: press_da
Out[20]:
<xarray.DataArray (station_id: 20, year: 3)>
array([[ 4736968141, 5362327422, 5786865879],
[ 3843828897, 11028409576, 2369538294],
[11071436453, 3324450900, 10017446009],
[ 9624163047, 4327094339, 5194657461],
[ 3644097094, 4970975759, 10215709500],
[10848978249, 5362828700, 3165559292],
[ 7565267947, 9150340532, 1244019860],
[ 5774184805, 2190428768, 3410656879],
[10487542202, 1159025195, 2861490045],
[10668195383, 6179278941, 8096958866],
[ 2311635381, 7860849630, 9199114517],
[11984888049, 10749492217, 8554513278],
[ 4636773154, 11988039892, 7941587610],
[ 6209291077, 7651976538, 9282714003],
[10377165416, 11385104693, 7612481121],
[ 7650284975, 2174961057, 5519531845],
[ 6592720223, 3181381723, 4295719754],
[11778765056, 2864520584, 8978044061],
[ 8543534134, 1709130344, 4596373347],
[ 2511358407, 5623460467, 11854301741]])
Coordinates:
* station_id (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
* year (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01
lat (station_id) float64 -69.55 -61.56 -26.23 ... 61.28 71.12 86.57
lon (station_id) float64 -108.1 -80.2 -42.52 ... -34.55 98.07 -32.06
And now we have all our data, with year perpendicular to station ID, making data analysis along the year dimension easy, but with no need to handle a sparse array.
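As a small illustration of that convenience (using a toy two-station array rather than the data above), reductions along year become one-liners:

```python
import numpy as np
import pandas as pd
import xarray as xr

years = pd.date_range("2000-01-01", freq="YS", periods=3)
# toy station-indexed array with lat/lon carried as non-dimension coordinates
da = xr.DataArray(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    dims=("station_id", "year"),
    coords={
        "station_id": [0, 1],
        "year": years,
        "lat": ("station_id", [10.0, 20.0]),
        "lon": ("station_id", [30.0, 40.0]),
    },
)

# per-station mean over all years; lat/lon coordinates are carried along
mean_by_station = da.mean(dim="year")
```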
If you’d like, you can now document the DataArray and Dataset and then write to netCDF:
In [24]: import datetime
    ...: ds = press_da.to_dataset(name="pressure")
    ...: ds.pressure.attrs.update({
    ...:     "units": "big numbers",
    ...:     "long_name": "Pressure!",
    ...:     "cell_method": "random numbers",
    ...: })
    ...: ds.attrs.update({
    ...:     # netCDF attributes must be strings or numbers, so stringify the timestamp
    ...:     "created": str(datetime.datetime.now()),
    ...:     "author": "me",
    ...:     "method": "moving random data around",
    ...:     "etc": "etc",
    ...: })
In [25]: ds
Out[25]:
<xarray.Dataset>
Dimensions: (station_id: 20, year: 3)
Coordinates:
* station_id (station_id) int64 0 1 2 3 4 5 6 7 8 ... 12 13 14 15 16 17 18 19
* year (year) datetime64[ns] 2000-01-01 2001-01-01 2002-01-01
lat (station_id) float64 -69.55 -61.56 -26.23 ... 61.28 71.12 86.57
lon (station_id) float64 -108.1 -80.2 -42.52 ... -34.55 98.07 -32.06
Data variables:
pressure (station_id, year) int64 4736968141 5362327422 ... 11854301741
Attributes:
created: 2022-11-08 10:36:39.875581
author: me
method: moving random data around
etc: etc
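A common follow-up with station-indexed data like this is pulling out the station closest to a target location. A rough sketch (toy data, and a simple squared-degree distance rather than true great-circle distance):

```python
import numpy as np
import xarray as xr

# toy station-indexed array: 3 stations x 2 time steps
da = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=("station_id", "year"),
    coords={
        "station_id": [0, 1, 2],
        "lat": ("station_id", [-10.0, 25.0, 60.0]),
        "lon": ("station_id", [100.0, -50.0, 5.0]),
    },
)

# find the station nearest a target point (crude planar distance in degrees)
target_lat, target_lon = 20.0, -45.0
dist2 = (da.lat - target_lat) ** 2 + (da.lon - target_lon) ** 2
nearest = da.isel(station_id=int(dist2.argmin()))
```

For real use you would want a proper great-circle distance, but the indexing pattern (reduce over the lat/lon coordinates, then `isel` on `station_id`) is the same.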