How to load multiple csv-files into xarray Dataset and concat along multiple dimensions?

Question:

There is a similar question to mine, but the data has a different structure and I run into errors. I have multiple .dat files that contain tables for different arbitrary times t=1, 3, 9, 10, 12, etc. The tables in the different .dat files have the same columns M_star, M_planet, and separation, and M_star can be viewed as an index in steps of 0.5. Nevertheless, the length of the tables and the values of M_star vary from file to file, e.g. for time t=1 I have

M_star M_planet separation
10.0   0.022    7.11
10.5   0.019    2.30
11.0   0.008    14.01

while for t=3 I have

M_star M_planet separation
9.5    0.308    1.32
10.0   0.522    4.18
10.5   0.019    3.40
11.0   0.338    0.91
11.5   0.150    1.20

What I would like to do is to load all the .dat files into an xarray Dataset (at least I think this would be useful), so that I can access data in the columns M_planet and separation by providing precise values for t and M_star, e.g. I would like to do something like ds.sel(t=9, M_star=10.5)['M_planet'] to get the value of M_planet at the given t and M_star coordinates. What I have tried so far, unsuccessfully, is:

from glob import glob
import pandas as pd
import xarray as xr

fnames = glob('table_t=*.dat')
fnames.sort()
kw = dict(delim_whitespace=True, names=['M_star', 'M_planet', 'separation'], skiprows=1)

# first I load all the tables into a list of dataframes
dfs = [pd.read_csv(fname, **kw) for fname in fnames]
# then I add the time as a column to each dataframe (all t entries are the same within one dataframe)
dfs2 = [df_i.assign(t=t) for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])]
# finally I try to combine everything into an xarray Dataset, but I run into an error
d = xr.concat([df_i.to_xarray() for df_i in dfs2], dim='t')

The last line throws an error: t already exists as coordinate or variable name.

How can I load my .dat files into xarray and make t and M_star the dimensions/coordinates? Thanks!

Asked By: NeStack


Answers:

The problem occurs because you assign t as a plain column in the dataframes. When the frames are converted with to_xarray, every column becomes a data variable (the datasets are indexed only by the default integer index, since M_star was never set as the index), so each dataset already contains a variable named t, and asking concat to create a new t dimension conflicts with it.
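For illustration, here is a minimal sketch of how the conflict arises, using two made-up two-row frames in place of the real tables:

import pandas as pd
import xarray as xr

# two toy frames standing in for the real tables; t is a plain column
df_a = pd.DataFrame({'M_star': [10.0, 10.5], 'M_planet': [0.022, 0.019]}).assign(t=1)
df_b = pd.DataFrame({'M_star': [10.0, 10.5], 'M_planet': [0.522, 0.019]}).assign(t=3)

# after to_xarray() each dataset carries a data variable named "t",
# so asking concat to create a new "t" dimension collides with it
xr.concat([df_a.to_xarray(), df_b.to_xarray()], dim='t')
# raises an error like: "t already exists as coordinate or variable name"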

Additionally, since you’re combining along both M_star and t, you should use xr.combine_by_coords rather than concat, which only works along one dimension. See the merging and combining data docs for an overview of the different options.

You can fix this by making sure t becomes a dimension/coordinate before merging. You could assign it as a dimension right away by adding it to the pandas index rather than the columns:

dfs2 = [
    # set both M_star and t as index levels (the frames are read with a
    # default RangeIndex), so to_xarray can turn them into dimensions
    df_i.assign(t=t).set_index(['M_star', 't'])
    for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
]
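With M_star and t in the index, the frames can then be stacked and converted in one step. A sketch, relying on the fact that pandas' to_xarray unstacks each index level into its own dimension, filling missing combinations with NaN:

# concatenate along the shared (M_star, t) MultiIndex and convert once
d = pd.concat(dfs2).to_xarray()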

Alternatively you could move the t coordinate assignment into xarray:

d = xr.combine_by_coords(
    [
        # M_star must already be the index so it becomes a dimension;
        # expand_dims then adds t as a second, length-1 dimension
        df_i.set_index('M_star').to_xarray().expand_dims(t=[t])
        for df_i, t in zip(dfs, [1, 2, 3, 4, 9, 10, 12])
    ],
)

Answered By: Michael Delgado

Using Michael Delgado’s answer, a solution to my problem can be coded this way:

from glob import glob
import pandas as pd
import xarray as xr

fnames = glob('table_t=*.dat')
fnames.sort()
kw = dict(delim_whitespace=True, names=['M_star', 'M_planet', 'separation'], skiprows=1)

# first I load all the tables into a list of dataframes
dfs = [pd.read_csv(fname, **kw) for fname in fnames]

# set_index('M_star') turns M_star from a df column into an index; as Michael said, this is necessary
# expand_dims(t=[t]) turns t into a dim/coordinate of the Dataset, which I also want
d = xr.combine_by_coords(
    [
        df_i.set_index('M_star').to_xarray().expand_dims(t=[t])
        for df_i, t in zip(dfs, [1,2,3,4,9,10,12])
    ],
)

With this I have the Dataset d in the form that I wanted, with both t and M_star being my coordinates/dimensions, see below (in my actual data the naming differs, e.g. Log10M_h instead of M_star):

[screenshot: the repr of the resulting Dataset, showing t and Log10M_h as dimensions/coordinates]

This allows me to do what I wanted: access values in the Dataset by providing precise values along both M_star and t:

print(float(d.sel(t=9, Log10M_h=11.5)['M_planet'].values))
>>> 0.019
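One thing worth noting (my own observation, using the sample tables and column names from the question): since the tables cover different M_star ranges, combine_by_coords aligns them on the union of the coordinates and fills missing combinations with NaN. So if the sample tables above were combined this way:

# the t=1 table has no row for M_star=9.5, so the combined
# Dataset holds NaN at that (t, M_star) combination
print(float(d.sel(t=1, M_star=9.5)['M_planet'].values))
>>> nan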

But as Michael stated, I can also get an alternative solution by using only a pandas dataframe instead of an xarray Dataset. For that I concatenate all the dataframes into one long one and assign an additional column t to keep track of this value; I actually don’t need t to be an index. This is how the alternative using exclusively pandas would look:

# in the 3 lines below we create a df with all the data files concatenated
df_s = [pd.read_csv(fname, **kw) for fname in fnames]
df_s2 = [df_i.assign(t=t) for df_i, t in zip(df_s, [1, 2, 3, 4, 9, 10, 12])]
df = pd.concat(df_s2).reset_index(drop=True)

print(df[(df.t==3) & (df.M_star==10.5)]['M_planet'].values[0])
>>> 0.171
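If I needed many such lookups, a possible variant (my sketch, not required for the solution) would be to put t and M_star into a MultiIndex, so that .loc can do the selection directly by label:

# a MultiIndex on (t, M_star) allows direct label-based lookup
df_idx = df.set_index(['t', 'M_star']).sort_index()
print(df_idx.loc[(3, 10.5), 'M_planet'])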
Answered By: NeStack