How to load multiple CSV files into an xarray Dataset and concat along multiple dimensions?
Question:
There is a similar question to mine, but the data has a different structure and I run into errors. I have multiple .dat files that contain tables for different arbitrary times t=1,3,9,10,12, etc. The tables in the different .dat files have the same columns M_star, M_planet, separation, and M_star can be viewed as an index in steps of 0.5. Nevertheless, the lengths of the tables and the values of M_star vary from file to file, e.g. for time t=1 I have
M_star M_planet separation
10.0 0.022 7.11
10.5 0.019 2.30
11.0 0.008 14.01
while for t=3 I have
M_star M_planet separation
9.5 0.308 1.32
10.0 0.522 4.18
10.5 0.019 3.40
11.0 0.338 0.91
11.5 0.150 1.20
What I would like to do is load all the .dat files into an xarray Dataset (at least I think this would be useful), so that I can access data in the columns M_planet and separation by providing precise values for t and M_star, e.g. I would like to do something like ds.sel(t=9, M_star=10.5)['M_planet'] to get the value of M_planet at the given t and M_star coordinates. What I have tried so far, unsuccessfully, is:
from glob import glob

import pandas as pd
import xarray as xr

fnames = glob('table_t=*.dat')
fnames.sort()
kw = dict(delim_whitespace=True, names=['M_star', 'M_planet', 'separation'], skiprows=1)
# first I load all the tables into a list of dataframes
dfs = [pd.read_csv(fname, **kw) for fname in fnames]
# then I add the time as a column to each dataframe; all t-entries are the same within a dataframe
dfs2 = [df_i.assign(t=t) for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])]
# I try to make an xarray Dataset, but I run into an error
d = xr.concat([df_i.to_xarray() for df_i in dfs2], dim='t')
The last line throws an error: t already exists as coordinate or variable name.
How can I load my .dat files into xarray and make t and M_star the dimensions/coordinates? Thanks!
Answers:
The problem is occurring because you are assigning t as a column in the dataframes, which are converted to data variables in the xarray datasets (indexed only by M_star), so the t values are interpreted as conflicts during the merge.
Additionally, since you’re combining along both M_star and t, you should use xr.combine_by_coords rather than concat, which only works along one dimension. See the merging and combining data docs for an overview of the different options.
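To see the failure mode concretely, here is a minimal sketch (toy values taken from the t=1 table above; the in-memory dataframe stands in for one parsed .dat file):

import pandas as pd

# with a default RangeIndex, to_xarray() turns *every* column,
# including t, into a data variable indexed by 'index'
df = pd.DataFrame({'M_star': [10.0, 10.5, 11.0],
                   'M_planet': [0.022, 0.019, 0.008],
                   'separation': [7.11, 2.30, 14.01]}).assign(t=1)
print(df.to_xarray())
# t shows up under 'Data variables' rather than 'Coordinates',
# which is why xr.concat(..., dim='t') complains that t already exists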
You can fix this by making sure t becomes a dimension/coordinate before merging. You could assign it as a dimension right away by adding it to the pandas index rather than the columns:
dfs2 = [
    df_i.assign(t=t).set_index('t', append=True)
    for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
]
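Note that this assumes M_star is also part of the index (e.g. via set_index('M_star') first), so that the resulting (M_star, t) MultiIndex maps onto the two desired dimensions when the dataframes are converted with to_xarray().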
Alternatively you could move the t coordinate assignment into xarray:
d = xr.combine_by_coords(
    [
        df_i.to_xarray().expand_dims(t=[t])
        for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
    ],
)
Using Michael Delgado’s comments, a solution to my problem can be coded this way:
fnames = glob('table_t=*.dat')
fnames.sort()
# note: newer pandas deprecates delim_whitespace=True in favour of sep='\s+'
kw = dict(delim_whitespace=True, names=['M_star', 'M_planet', 'separation'], skiprows=1)
# first I load all the tables into a list of dataframes
dfs = [pd.read_csv(fname, **kw) for fname in fnames]
# set_index('M_star') turns M_star from a df-column into an index; as Michael said, this is necessary
# expand_dims(t=[t]) turns t into a dim/coordinate of the Dataset, which I also want
d = xr.combine_by_coords(
    [
        df_i.set_index('M_star').to_xarray().expand_dims(t=[t])
        for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
    ],
)
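To make the two building blocks concrete, here is a minimal sketch of what one transformed table looks like, using the t=1 toy values from the question (an in-memory dataframe standing in for a parsed .dat file):

import pandas as pd
import xarray as xr

df_t1 = pd.DataFrame({'M_star': [10.0, 10.5, 11.0],
                      'M_planet': [0.022, 0.019, 0.008],
                      'separation': [7.11, 2.30, 14.01]})

# set_index makes M_star the dimension; expand_dims adds t as a
# second, length-1 dimension with coordinate value [1]
ds_t1 = df_t1.set_index('M_star').to_xarray().expand_dims(t=[1])
# ds_t1 has dims (t: 1, M_star: 3) and data variables M_planet and
# separation; combine_by_coords then lines these single-t datasets up
# along both dimensions, filling missing (t, M_star) combinations with NaN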
With this I have the Dataset d in the form that I wanted, both t and M_star being my coordinates/dimensions (in my actual data the naming differs, e.g. M_star is called Log10M_h). This allows me to do what I wanted: access values in the Dataset by providing precise values along both M_star and t:
print(float(d.sel(t=9, Log10M_h=11.5)['M_planet'].values))
>>> 0.019
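A small caveat: .sel with plain values requires exact matches on the coordinate grid and raises a KeyError otherwise; passing method='nearest' (e.g. d.sel(t=9, Log10M_h=11.4, method='nearest')) would select the closest grid point instead.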
But as Michael stated, I can also get an alternative solution by using only a pandas dataframe instead of an xarray Dataset. For that I concatenate all the dataframes into one long one and assign an additional column t to keep track of this value; I actually don’t need t to be an index. This is how the alternative using exclusively pandas looks:
# in the 3 lines below we create a df with all the data files concatenated
df_s = [pd.read_csv(fname, **kw) for fname in fnames]
df_s2 = [df_i.assign(t=t) for df_i, t in zip(df_s, [1, 2, 3, 4, 9, 10, 12])]
df = pd.concat(df_s2).reset_index(drop=True)
print(df[(df.t == 3) & (df.M_star == 10.5)]['M_planet'].values[0])
>>> 0.171
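As a small variation on this pandas route (a sketch, not part of the original answer): setting a (t, M_star) MultiIndex turns the boolean-mask lookup into a direct .loc call:

df_idx = df.set_index(['t', 'M_star']).sort_index()
# scalar lookup by the two 'coordinates', mirroring ds.sel(t=..., M_star=...)
print(df_idx.loc[(3, 10.5), 'M_planet'])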