2D to 3D numpy array by blocks

Question:

I have the following 2D dataframe conc, containing gas concentrations on 4 layers at a series of wavelengths wl:

    conc =  
        wl    gas1  gas2  gas3  layer
    0   5000  10    13    250    1
    1   5000  20    14    260    2
    2   5000  30    15    270    3
    3   5000  40    16    280    4
    4   5001  50    17    290    1
    5   5001  60    18    300    2
    6   5001  70    19    310    3
    7   5001  80    20    320    4
    ...
    497 5125  20    25    650    1
    498 5125  35    15    550    2
    499 5125  55    30    750    3
    500 5125  95    21    650    4

I would like to transform it into a 3D array by placing the 4-line blocks (4 layers for 1 wavelength) along the third dimension.

I did it using a loop, but as my array is very large it takes forever. Is there a way to do it without a loop?

wls = conc["wl"].unique()                   # preserves row order (set() would not)
new_3D_array = np.zeros((len(wls), 4, 3))   # 4 layers, 3 gases

for k, wl in enumerate(wls):
    sub_array = conc[conc["wl"] == wl]
    new_3D_array[k, :, :] = sub_array[["gas1", "gas2", "gas3"]]
    

The desired output is

[[[ 10.  13. 250.]
  [ 20.  14. 260.]
  [ 30.  15. 270.]
  [ 40.  16. 280.]]

 [[ 50.  17. 290.]
  [ 60.  18. 300.]
  [ 70.  19. 310.]
  [ 80.  20. 320.]]

 ....

 [[ 20.  25. 650.]
  [ 35.  15. 550.]
  [ 55.  30. 750.]
  [ 95.  21. 650.]]]
Asked By: Dr. Paprika


Answers:

You can reshape:

size = conc[['wl', 'layer']].nunique()

out = (conc
  .set_index(['wl', 'layer'])
  .unstack('layer').stack('layer', dropna=False)
  .to_numpy().reshape((size['wl'], size['layer'], -1))
)

NB: the stack/unstack steps are only required if you don't have all combinations of wl/layer, or if they are not already in order.

Output:

array([[[ 10,  13, 250],
        [ 20,  14, 260],
        [ 30,  15, 270],
        [ 40,  16, 280]],

       [[ 50,  17, 290],
        [ 60,  18, 300],
        [ 70,  19, 310],
        [ 80,  20, 320]],

       [[ 20,  25, 650],
        [ 35,  15, 550],
        [ 55,  30, 750],
        [ 95,  21, 650]]])
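Following the NB above: when every wl/layer combination is present and the rows are already sorted by wl then layer, a plain reshape of the gas columns is enough. A minimal sketch on a hypothetical toy frame mimicking the question's layout:

```python
import numpy as np
import pandas as pd

# toy frame with the question's layout (2 wavelengths, 4 layers each)
conc = pd.DataFrame({
    'wl':    [5000] * 4 + [5001] * 4,
    'gas1':  [10, 20, 30, 40, 50, 60, 70, 80],
    'gas2':  [13, 14, 15, 16, 17, 18, 19, 20],
    'gas3':  [250, 260, 270, 280, 290, 300, 310, 320],
    'layer': [1, 2, 3, 4] * 2,
})

n_wl = conc['wl'].nunique()
# rows are already grouped and ordered, so reshape alone does the job
out = conc[['gas1', 'gas2', 'gas3']].to_numpy().reshape(n_wl, 4, -1)
print(out.shape)  # (2, 4, 3)
```

No sorting or reindexing happens here, so this only matches the desired output under the stated assumptions.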
Answered By: mozway

If you want to do it with a loop you can just get each 2D tensor for each wavelength and concatenate them:

tensor_3D = []
for w in conc["wl"].unique():
    tensor_2D = conc[conc["wl"] == w][["gas1", "gas2", "gas3"]]
    tensor_3D.append(tensor_2D.to_numpy())

tensor_3D = np.array(tensor_3D)

EDIT:
I suspect you may always need a loop for this, but you can also write it in a more elegant way:

conc_grouped = conc.groupby("wl")[["gas1", "gas2", "gas3"]]
tensor_3D = np.array([conc_grouped.get_group(c).to_numpy() for c in conc_grouped.groups.keys()])
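The same groupby loop can also be written by iterating over the groups directly, selecting only the gas columns so each block matches the desired 3-column output. A sketch on a hypothetical small frame:

```python
import numpy as np
import pandas as pd

# hypothetical small frame with 2 wavelengths, 4 layers each
conc = pd.DataFrame({
    'wl':    [5000] * 4 + [5001] * 4,
    'gas1':  [10, 20, 30, 40, 50, 60, 70, 80],
    'gas2':  [13, 14, 15, 16, 17, 18, 19, 20],
    'gas3':  [250, 260, 270, 280, 290, 300, 310, 320],
    'layer': [1, 2, 3, 4] * 2,
})

# one (4, 3) block per wavelength, stacked along a new first axis
tensor_3D = np.stack([
    g[['gas1', 'gas2', 'gas3']].to_numpy()
    for _, g in conc.groupby('wl')
])
print(tensor_3D.shape)  # (2, 4, 3)
```

np.stack requires all blocks to have the same shape, i.e. every wavelength must have the same number of layers.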
Answered By: João Santos

That's xarray's job:

np.moveaxis(conc.set_index(['wl','layer']).to_xarray().to_array().to_numpy(),0,-1)

xarray is a package for multidimensional data based on labelled multi-indexes. It has its own API and logic; here I just use pandas' ability to convert to xarray.

Hence the to_array() to convert it into an array indexable with integers: you couldn't do df.to_xarray()[0,0,0], while df.to_xarray().to_array()[0,0,0] is a legitimate request. But even then, it is still an xarray array: that [0,0,0] returns an xarray item. Hence the seemingly redundant to_numpy() to convert it to a proper numpy array (whose values are plain floats). So even though those three conversions (pandas dataframe to xarray dataset; xarray dataset to xarray array; xarray array to numpy array) seem redundant, they are all needed. But the last two do practically nothing anyway; they don't move or copy data.

I need the final moveaxis because otherwise the "remaining columns" axis (gas1, gas2, gas3: the columns that are not used as index) comes first. To get "gas2" of the 4th layer of the 100th wavelength, you would need to index with [1,99,3], whereas your desired output should be indexed with [99,3,1]. But likewise, that moveaxis costs nothing anyway (no data copy or move, just a tweak of the strides).
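The "just a tweak of the strides" claim can be checked in pure NumPy: moveaxis returns a view over the same buffer, so no data is copied. A small sketch with a hypothetical array shaped like the to_array() result (variable, wl, layer):

```python
import numpy as np

# hypothetical (variable, wl, layer)-shaped array: 3 gases, 4 wavelengths, 2 layers
a = np.arange(24).reshape(3, 4, 2)
b = np.moveaxis(a, 0, -1)  # -> (wl, layer, variable)

print(b.shape)                 # (4, 2, 3)
print(np.shares_memory(a, b))  # True: a view, no data copied
```

Indexing is simply permuted: b[i, j, k] reads the same element as a[k, i, j].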

Timing-wise, it is indistinguishable from mozway's answer for an example of the same size as yours: on my machine, 5.6 ms for both (vs 10.1 ms for the loop-based solution).

It takes quite a lot of data before it becomes clear which of the two vectorized solutions is faster as n grows. After 1000 rows the difference is visible, and with 100000 rows, for example, this solution takes 260 ms vs 505 ms for mozway's. It seems to asymptotically reach this ~2× timing ratio (vs 820 seconds for the loop-based solutions with 100000 rows; as often, vectorized solutions beat non-vectorized ones by a factor on the order of 1000, so in comparison the factor of 2 between two vectorized solutions is of course not much).

Note that even though you don't need any import (other than pandas, which you already have) to run this line, it depends on the package xarray, which is not always installed with pandas (on a system with pandas but no xarray, the pandas method to_xarray exists but just raises an error). In other words, you may need to pip install xarray.

Answered By: chrslg
import pandas as pd
import numpy as np
data = {
    'wl': [5000, 5000, 5000, 5000, 5001, 5001, 5001, 5001, 5125, 5125, 5125, 5125],
    'gas1': [10, 20, 30, 40, 50, 60, 70, 80, 20, 35, 55, 95],
    'gas2': [13, 14, 15, 16, 17, 18, 19, 20, 25, 15, 30, 21],
    'gas3': [250, 260, 270, 280, 290, 300, 310, 320, 650, 550, 750, 650],
    'layer': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
}

conc = pd.DataFrame(data)
print(conc)


# Get unique wavelengths
wls = conc['wl'].unique()

# Reshape the DataFrame into the desired 3D structure:
# each wavelength group yields a (4, 3) block of gas values,
# and np.stack places the blocks along a new first axis
new_3D_array = np.stack(
    conc.groupby('wl')[['gas1', 'gas2', 'gas3']]
        .apply(lambda x: x.to_numpy())
        .to_numpy()
)
print(new_3D_array.shape)  # (3, 4, 3): wavelengths x layers x gases
print(new_3D_array)

"""
[[[ 10  13 250]
  [ 20  14 260]
  [ 30  15 270]
  [ 40  16 280]]

 [[ 50  17 290]
  [ 60  18 300]
  [ 70  19 310]
  [ 80  20 320]]

 [[ 20  25 650]
  [ 35  15 550]
  [ 55  30 750]
  [ 95  21 650]]]
"""
Answered By: Soudipta Dutta