Random Sampling a Multi-level column
Question:
I have a multi-level column DataFrame that looks like this:
df
Solid             Liquid              Gas
pen  paper  pipe  water  juice  milk  oxygen  nitrogen  helium
  5      2     1      4      3     1       7         8      10
  5      2     1      4      3     1       7         8      10
  5      2     1      4      3     1       7         8      10
  4      4     7      3      2     0       6         7       9
  3      7     9      4      6     5       3         3       4
What I want is to randomly choose 2 of the top-level columns ("Solid", "Liquid", and "Gas"), keeping all 3 sub-columns of each.
For example, if Solid and Gas were randomly selected, the expected result would be:
Solid             Gas
pen  paper  pipe  oxygen  nitrogen  helium
  5      2     1       7         8      10
  5      2     1       7         8      10
  5      2     1       7         8      10
  4      4     7       6         7       9
  3      7     9       3         3       4
I tried this code, but it did not give me the expected result:
result = df.sample(n=5, axis=1)
result
[output]
Solid  Gas
 pipe  oxygen
    1       7
    1       7
    1       7
    1       7
    7       6
    9       3
Can anyone please help me figure this one out? Thank you 🙂
Answers:
You can sample the first level columns and then select the sampled columns:
df[pd.Series(df.columns.levels[0]).sample(2)]
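To see why this works where `df.sample(..., axis=1)` does not, here is a minimal sketch that rebuilds the DataFrame from the question. The `random_state=0` argument and the variable names are assumptions added for reproducibility and illustration; they are not part of the original answer.

```python
import pandas as pd

# Column layout and values copied from the question.
columns = pd.MultiIndex.from_tuples([
    ("Solid", "pen"), ("Solid", "paper"), ("Solid", "pipe"),
    ("Liquid", "water"), ("Liquid", "juice"), ("Liquid", "milk"),
    ("Gas", "oxygen"), ("Gas", "nitrogen"), ("Gas", "helium"),
])
data = [
    [5, 2, 1, 4, 3, 1, 7, 8, 10],
    [5, 2, 1, 4, 3, 1, 7, 8, 10],
    [5, 2, 1, 4, 3, 1, 7, 8, 10],
    [4, 4, 7, 3, 2, 0, 6, 7, 9],
    [3, 7, 9, 4, 6, 5, 3, 3, 4],
]
df = pd.DataFrame(data, columns=columns)

# df.sample(axis=1) treats every (group, sub-column) pair as one column,
# so it returns 2 leaf columns rather than 2 whole groups.
leaf_sample = df.sample(n=2, axis=1, random_state=0)
print(leaf_sample.shape)   # (5, 2)

# Sampling the top-level labels instead keeps each chosen group intact.
top_labels = pd.Series(df.columns.levels[0])
chosen = top_labels.sample(2, random_state=0)
group_sample = df[chosen.tolist()]
print(group_sample.shape)  # (5, 6)
```

Passing the sampled labels as a plain list selects every sub-column under each chosen group.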
Or use the random.sample function:
import random
df[random.sample(df.columns.levels[0].tolist(), 2)]
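One caveat with `levels[0]`: a MultiIndex keeps all of its defined levels even after some groups have been sliced away, so sampling from it on an already-filtered frame can pick a label that no longer exists. The sketch below is one possible way around that, reading the labels from `get_level_values(0)` instead. The helper name `sample_top_level`, the `seed` parameter, and the placeholder values are assumptions for illustration only.

```python
import random

import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("Solid", "pen"), ("Solid", "paper"), ("Solid", "pipe"),
    ("Liquid", "water"), ("Liquid", "juice"), ("Liquid", "milk"),
    ("Gas", "oxygen"), ("Gas", "nitrogen"), ("Gas", "helium"),
])
# Placeholder values; only the column structure matters here.
df = pd.DataFrame(np.arange(18).reshape(2, 9), columns=columns)


def sample_top_level(frame, n, seed=None):
    """Keep n randomly chosen top-level column groups (hypothetical helper)."""
    # Labels actually present in the columns, in order of appearance.
    present = list(frame.columns.get_level_values(0).unique())
    chosen = random.Random(seed).sample(present, n)
    # Indexing with a list of top-level labels keeps all their sub-columns.
    return frame[chosen]


picked = sample_top_level(df, 2, seed=42)
print(picked.columns.tolist())

# After slicing, levels[0] still lists the dropped group,
# while get_level_values(0) reflects only what remains.
sub = df[["Solid", "Gas"]]
print(list(sub.columns.levels[0]))
print(list(sub.columns.get_level_values(0).unique()))
```

Passing a fixed `seed` makes the selection repeatable; leaving it as `None` gives a different pair of groups on each call.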
import pandas as pd
import numpy as np
arrays = [np.array(['Liquid', 'Liquid', 'Liquid', 'Solid', 'Solid', 'Solid', 'Gas', 'Gas', 'Gas']),
          np.array(['water', 'coke', 'juice', 'pen', 'paper', 'pipe', 'oxygen', 'helium', 'nitrogen'])]
df = pd.DataFrame(np.random.randn(3, 9), columns=arrays)
print(df.to_string())
"""
     Liquid                         Solid                           Gas
      water      coke     juice       pen     paper      pipe    oxygen    helium  nitrogen
0 -0.502442  0.625491  0.875741  1.207261  0.593263  1.121527  1.524995  1.097759  0.290256
1  0.008101  0.584346 -0.031353 -0.697306  1.802649 -0.536247 -0.329816 -0.884734 -1.844672
2 -0.335997  2.174418  0.022494 -0.524555 -0.245928 -0.646416 -1.444915 -0.728856 -0.698570
"""
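# Top-level groups and, for each group, the sub-columns to keep
l0 = ['Liquid', 'Solid', 'Gas']
l1 = [['water', 'coke'], ['pen'], ['helium', 'nitrogen']]
# One small DataFrame of (group, sub-column) pairs per group
aa = [pd.DataFrame({'a': a, 'b': b}) for a, b in zip(l0, l1)]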
print(aa)
"""
[        a      b
0  Liquid  water
1  Liquid   coke,        a    b
0  Solid  pen,      a         b
0  Gas    helium
1  Gas  nitrogen]
"""
bb = pd.concat(aa)
print(bb)
"""
        a         b
0  Liquid     water
1  Liquid      coke
0   Solid       pen
0     Gas    helium
1     Gas  nitrogen
"""
cc = pd.concat(aa).values
print(cc)
"""
[['Liquid' 'water']
 ['Liquid' 'coke']
 ['Solid' 'pen']
 ['Gas' 'helium']
 ['Gas' 'nitrogen']]
"""
dd = df[cc]
print(dd)
"""
     Liquid               Solid       Gas
      water      coke       pen    helium  nitrogen
0  1.683209 -0.523310 -0.440744  0.158327 -1.114051
1  0.616965 -0.586281 -0.007159  0.812071 -1.336370
2  1.569241 -0.190732 -1.504943 -0.813679  0.248096
"""
"""
In a similar way, if we want only 2 of the top-level columns,
for example 2 items from Liquid and 2 from Gas:
"""
l2 = ['Liquid', 'Gas']
l3 = [['water', 'coke'], ['helium', 'nitrogen']]
p = pd.concat([pd.DataFrame({'a': a, 'b': b}) for a, b in zip(l2, l3)]).values
print(p)
p1 = df[p]
print(p1)
"""
     Liquid                 Gas
      water      coke    helium  nitrogen
0 -1.047081  0.195301 -1.709490 -1.483606
1 -0.685039 -0.038681  0.305787 -0.889225
2 -1.034577 -0.504169 -0.087984 -1.079425
"""