Nested structured array to pandas dataframe with new column names

Question

How can I convert/explode a nested numpy structured array into a pandas dataframe, while keeping the headers from the nested arrays?

Using Python 3.8.3, numpy 1.18.5, pandas 1.3.4.

Example structured array

I am given a nested numpy structured array that looks like this; I am just rebuilding it here for an MRE.

import numpy as np
import numpy.lib.recfunctions as rfn
import pandas as pd

arr1 = np.array([4, 5, 4, 5])
arr2 = np.array([0, 0, -1, -1])
arr3 = np.array([0.51, 0.89, 0.59, 0.94])
arr4 = np.array(
    [[0.52, 0.80, 0.62, 1.1], [0.41, 0.71, 0.46, 0.77], [0.68, 1.12, 0.78, 1.19]]
).T
arr5 = np.repeat(np.array([0.6, 0.2, 0.2]), 4).reshape(3, 4).T
arrs = (arr1, arr2, arr3, arr4, arr5)

dtypes = [
    ("state", "f8"),
    ("variability", "f8"),
    ("target", "f8"),
    ("measured", [("mean", "f8"), ("low", "f8"), ("hi", "f8")]),
    ("var", [("mid", "f8"), ("low", "f8"), ("hi", "f8")]),
]

example = np.column_stack(arrs)
example = rfn.unstructured_to_structured(example, dtype=np.dtype(dtypes))

Inspect example array

print(example)
print(example.dtype.names)

[(4.,  0., 0.51, (0.52, 0.41, 0.68), (0.6, 0.2, 0.2))
 (5.,  0., 0.89, (0.8 , 0.71, 1.12), (0.6, 0.2, 0.2))
 (4., -1., 0.59, (0.62, 0.46, 0.78), (0.6, 0.2, 0.2))
 (5., -1., 0.94, (1.1 , 0.77, 1.19), (0.6, 0.2, 0.2))]
('state', 'variability', 'target', 'measured', 'var')

print(example["measured"].dtype.names)

('mean', 'low', 'hi')

print(example["var"].dtype.names)

('mid', 'low', 'hi')

Desired pandas dataframe

state	variability	target	measured_mean	measured_low	measured_hi	var_mid	var_low	var_hi
4	0	0.51	0.52	0.41	0.68	0.6	0.2	0.2
5	0	0.89	0.8	0.71	1.12	0.6	0.2	0.2
4	-1	0.59	0.62	0.46	0.78	0.6	0.2	0.2
5	-1	0.94	1.1	0.77	1.19	0.6	0.2	0.2

Attempts

test = pd.DataFrame(example)
print(test)

   state  variability  target            measured              var
0    4.0          0.0    0.51  (0.52, 0.41, 0.68)  (0.6, 0.2, 0.2)
1    5.0          0.0    0.89   (0.8, 0.71, 1.12)  (0.6, 0.2, 0.2)
2    4.0         -1.0    0.59  (0.62, 0.46, 0.78)  (0.6, 0.2, 0.2)
3    5.0         -1.0    0.94   (1.1, 0.77, 1.19)  (0.6, 0.2, 0.2)

How do I unpack the measured and var columns and concatenate the column names, as shown above, based on the structured array?

Asked By: a11


Answer 1

I did not find an elegant solution, but this brute-force approach seems to work:

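df1 = pd.DataFrame(example)

lodf = []         # one dataframe per nested field
cols_to_use = []  # names of the plain (non-nested) fields
for col_name in example.dtype.names:
    sub_names = example[col_name].dtype.names
    if sub_names:
        # Nested field: expand its tuples into columns named <parent>_<child>
        df0 = pd.DataFrame(
            list(df1[col_name].values),
            columns=[f"{col_name}_{c0}" for c0 in sub_names],
        )
        lodf.append(df0)
    else:
        cols_to_use.append(col_name)

# Plain columns first, then the expanded nested columns
result = pd.concat([df1[cols_to_use]] + lodf, axis=1)
print(result)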

Answered By: Ezer K
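
As an alternative sketch (not from the answer above), the same result can probably be reached by walking the dtype recursively and collecting each leaf field as a column. The helper name flatten_structured is hypothetical, and the example array is rebuilt here directly from nested tuples rather than with rfn. It assumes every field is either a scalar or a nested struct; subarray fields are not handled.

```python
import numpy as np
import pandas as pd


def flatten_structured(arr, prefix=""):
    """Return {column_name: 1-D array} for every leaf field of a structured array.

    Hypothetical helper: nested field names are joined as <parent>_<child>.
    """
    columns = {}
    for name in arr.dtype.names:
        field = arr[name]
        if field.dtype.names:
            # Nested struct: recurse, prefixing child names with the parent name
            columns.update(flatten_structured(field, prefix=f"{prefix}{name}_"))
        else:
            # Leaf field: keep it as a single column
            columns[f"{prefix}{name}"] = field
    return columns


# Same dtype and values as the example in the question
dtypes = [
    ("state", "f8"),
    ("variability", "f8"),
    ("target", "f8"),
    ("measured", [("mean", "f8"), ("low", "f8"), ("hi", "f8")]),
    ("var", [("mid", "f8"), ("low", "f8"), ("hi", "f8")]),
]
example = np.array(
    [
        (4.0, 0.0, 0.51, (0.52, 0.41, 0.68), (0.6, 0.2, 0.2)),
        (5.0, 0.0, 0.89, (0.80, 0.71, 1.12), (0.6, 0.2, 0.2)),
        (4.0, -1.0, 0.59, (0.62, 0.46, 0.78), (0.6, 0.2, 0.2)),
        (5.0, -1.0, 0.94, (1.10, 0.77, 1.19), (0.6, 0.2, 0.2)),
    ],
    dtype=dtypes,
)

df = pd.DataFrame(flatten_structured(example))
print(df.columns.tolist())
```

Because the helper recurses, it should also cope with structs nested more than one level deep, and the columns keep the field order of the dtype.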
