2D to 3D numpy array by blocks
Question:
I have the following 2D dataframe conc, corresponding to gas concentrations on 4 layers at a series of wavelengths wl:
conc =
        wl  gas1  gas2  gas3  layer
0     5000    10    13   250      1
1     5000    20    14   260      2
2     5000    30    15   270      3
3     5000    40    16   280      4
4     5001    50    17   290      1
5     5001    60    18   300      2
6     5001    70    19   310      3
7     5001    80    20   320      4
...
497   5125    20    25   650      1
498   5125    35    15   550      2
499   5125    55    30   750      3
500   5125    95    21   650      4
I would like to transform it into a 3D array by placing the 4-line blocks (4 layers for one wavelength) along the third dimension.
I did it with a loop, but as my array is very large it takes forever. Is there a way to do it without a loop?
wls = set(conc.loc[:, "wl"])
new_3D_array = np.zeros((len(wls), 4, 3))  # 4 layers, 3 gases
for k, wl in enumerate(wls):
    sub_array = conc[conc.loc[:, "wl"] == wl]
    new_3D_array[k, :, :] = sub_array.loc[:, ["gas1", "gas2", "gas3"]]
The desired output is
[[[ 10. 13. 250.]
[ 20. 14. 260.]
[ 30. 15. 270.]
[ 40. 16. 280.]]
[[ 50. 17. 290.]
[ 60. 18. 300.]
[ 70. 19. 310.]
[ 80. 20. 320.]]
....
[[ 20. 25. 650.]
[ 35. 15. 550.]
[ 55. 30. 750.]
[ 95. 21. 650.]]]
Answers:
You can reshape:
size = conc[['wl', 'layer']].nunique()
out = (conc
       .set_index(['wl', 'layer'])
       .unstack('layer').stack('layer', dropna=False)
       .to_numpy().reshape((size['wl'], size['layer'], -1))
      )
NB: the stack/unstack steps are only required if the data doesn't already contain all wl/layer combinations in order.
Output:
array([[[ 10,  13, 250],
        [ 20,  14, 260],
        [ 30,  15, 270],
        [ 40,  16, 280]],

       [[ 50,  17, 290],
        [ 60,  18, 300],
        [ 70,  19, 310],
        [ 80,  20, 320]],

       ...,

       [[ 20,  25, 650],
        [ 35,  15, 550],
        [ 55,  30, 750],
        [ 95,  21, 650]]])
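As the NB above implies, when the frame is guaranteed complete (every wavelength has all 4 layers) and already sorted by wl then layer, the unstack/stack round-trip can be skipped entirely and one reshape over the gas columns suffices. A minimal sketch on a toy 2-wavelength frame, using the column names from the question:

```python
import numpy as np
import pandas as pd

# Toy frame: 2 wavelengths x 4 layers, complete and already sorted
conc = pd.DataFrame({
    "wl":    [5000, 5000, 5000, 5000, 5001, 5001, 5001, 5001],
    "gas1":  [10, 20, 30, 40, 50, 60, 70, 80],
    "gas2":  [13, 14, 15, 16, 17, 18, 19, 20],
    "gas3":  [250, 260, 270, 280, 290, 300, 310, 320],
    "layer": [1, 2, 3, 4, 1, 2, 3, 4],
})

n_wl = conc["wl"].nunique()
# One reshape over the gas columns: (n_wl * 4, 3) -> (n_wl, 4, 3)
out = conc[["gas1", "gas2", "gas3"]].to_numpy().reshape(n_wl, 4, 3)
print(out.shape)  # (2, 4, 3)
```

The reshape is a view on the underlying buffer, so this costs no data copy at all; it is only safe under the completeness-and-order assumption stated above.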
If you want to do it with a loop, you can get the 2D block for each wavelength and stack them (keeping only the three gas columns):
tensor_3D = []
for w in conc["wl"].unique():
    tensor_2D = conc[conc["wl"] == w].drop(columns=["wl", "layer"])
    tensor_3D.append(tensor_2D.to_numpy())
tensor_3D = np.array(tensor_3D)
EDIT:
You will probably always need some form of loop for this, but it can be written more compactly:
conc_grouped = conc.groupby("wl")
tensor_3D = np.array([conc_grouped.get_group(c)[["gas1", "gas2", "gas3"]].to_numpy()
                      for c in conc_grouped.groups])
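The same idea can be written without get_group, since iterating a GroupBy yields (key, sub-frame) pairs directly, in sorted key order. A sketch, assuming the column names from the question:

```python
import numpy as np
import pandas as pd

conc = pd.DataFrame({
    "wl":    [5000] * 4 + [5001] * 4,
    "gas1":  [10, 20, 30, 40, 50, 60, 70, 80],
    "gas2":  [13, 14, 15, 16, 17, 18, 19, 20],
    "gas3":  [250, 260, 270, 280, 290, 300, 310, 320],
    "layer": [1, 2, 3, 4] * 2,
})

# Iterating the GroupBy yields one sub-frame per wavelength;
# np.stack turns the list of (4, 3) blocks into a (n_wl, 4, 3) array
tensor_3D = np.stack([g[["gas1", "gas2", "gas3"]].to_numpy()
                      for _, g in conc.groupby("wl")])
print(tensor_3D.shape)  # (2, 4, 3)
```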
That's xarray's job:
np.moveaxis(conc.set_index(['wl', 'layer']).to_xarray().to_array().to_numpy(), 0, -1)
xarray is another Python package, for multidimensional data with labelled multi-indexes. It has its own API and logic; here I just use pandas' ability to convert to xarray.
Hence the to_array() call, to convert the result into an array indexable with integers: df.to_xarray()[0,0,0] is not valid, while df.to_xarray().to_array()[0,0,0] is. But even then it is still an xarray array, and that [0,0,0] returns an xarray item. Hence the seemingly redundant to_numpy(), to convert it to a proper numpy array whose values are plain scalars. So even though those three conversions (pandas DataFrame to xarray Dataset; Dataset to xarray DataArray; DataArray to numpy array) seem redundant, they are all needed. The last two do practically nothing anyway; they don't move or copy data.
The final moveaxis is needed because otherwise the axis for the "remaining columns" (gas1, gas2, gas3: the columns not used as index) comes first. To get gas2 of the 4th layer of the 100th wavelength, you would have to index with [1, 99, 3], whereas your desired output is indexed with [99, 3, 1]. But likewise, that moveaxis costs nothing (no data copy or move, just a tweak of strides).
Timing-wise, it is indistinguishable from mozway's answer on an example the same size as yours: on my machine, 5.6 ms for both (vs 10.1 ms for the for-based solution).
It takes quite a lot of data before it becomes clear which of the two vectorized solutions is faster as n grows. After 1000 rows the difference is visible; with 100000 rows, for example, this solution takes 260 ms vs 505 ms for mozway's, and the ratio seems to approach ~2 asymptotically. (Compare with 820 seconds for the for-based solutions at 100000 rows: as often, vectorized solutions beat non-vectorized ones by a factor on the order of 1000, so the factor of 2 between two vectorized solutions is not much in comparison.)
Note that even though you don't need any import (other than pandas, which you already have) to run this line, it depends on the xarray package, which is not always installed alongside pandas. On a system with pandas but no xarray, the method to_xarray exists but raises an error when called. In other words, you may need to pip install xarray.
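One way to check up front whether the conversion will work, without triggering that runtime error, is to probe for the package (a small sketch):

```python
import importlib.util

# DataFrame.to_xarray is defined even when xarray is absent;
# it only fails when called, so test for the package first
have_xarray = importlib.util.find_spec("xarray") is not None
print(have_xarray)
```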
import pandas as pd
import numpy as np
data = {
    'wl':    [5000, 5000, 5000, 5000, 5001, 5001, 5001, 5001, 5125, 5125, 5125, 5125],
    'gas1':  [10, 20, 30, 40, 50, 60, 70, 80, 20, 35, 55, 95],
    'gas2':  [13, 14, 15, 16, 17, 18, 19, 20, 25, 15, 30, 21],
    'gas3':  [250, 260, 270, 280, 290, 300, 310, 320, 650, 550, 750, 650],
    'layer': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
}
conc = pd.DataFrame(data)
print(conc)
# Get unique wavelengths
wls = conc['wl'].unique()
# Reshape the DataFrame into the desired 3D structure
new_3D_array = conc.groupby('wl')[['gas1', 'gas2', 'gas3']].apply(
    lambda x: x.values.reshape(-1, 4, 3)
)
"""
reshape(-1, 4, 3) : where -1 represents the first
two dimensions automatically
inferred from the array size,
4 represents the number of layers (rows within a group),
and 3 represents the number of gas columns.
To be very specific :
-1 indicates that NumPy will calculate the first dimension (number of rows)
automatically to accommodate all the elements in
your DataFrame group for a specific wavelength.
"""
# Stack the per-wavelength (1, 4, 3) blocks into one (n_wl, 4, 3) NumPy array
# (plain .to_numpy() would yield an object array of separate blocks)
new_3D_array = np.concatenate(new_3D_array.to_list())
print(new_3D_array)
"""
[[[ 10  13 250]
  [ 20  14 260]
  [ 30  15 270]
  [ 40  16 280]]

 [[ 50  17 290]
  [ 60  18 300]
  [ 70  19 310]
  [ 80  20 320]]

 [[ 20  25 650]
  [ 35  15 550]
  [ 55  30 750]
  [ 95  21 650]]]
"""