Random Sampling a Multi-level column

Question:

I have a multi-level column DataFrame that looks like this:

df

Solid             Liquid                Gas
pen paper pipe    water juice milk      oxygen nitrogen helium
5   2     1       4     3     1         7      8        10
5   2     1       4     3     1         7      8        10
5   2     1       4     3     1         7      8        10
4   4     7       3     2     0         6      7        9
3   7     9       4     6     5         3      3        4

I want to randomly choose 2 of the top-level columns "Solid", "Liquid", and "Gas", keeping all 3 sub-columns under each one.

For example, if Solid and Gas were randomly selected, the expected result would be:

Solid             Gas
pen paper pipe    oxygen nitrogen helium
5   2     1       7      8        10
5   2     1       7      8        10
5   2     1       7      8        10
4   4     7       6      7        9
3   7     9       3      3        4

I have tried this code, but it did not give me the expected result.

result = df.sample(n=5, axis=1)
result

[output]

Solid    Gas
pipe     oxygen
1        7
1        7
1        7
1        7
7        6
9        3

Can anyone please help me figure this one out? Thank you 🙂

Asked By: Kim Yejun

||

Answers:

You can sample the first level columns and then select the sampled columns:

df[pd.Series(df.columns.levels[0]).sample(2)]

Or use the random.sample function:

import random
df[random.sample(df.columns.levels[0].tolist(),2)]
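
To try both one-liners end to end, here is a minimal sketch that rebuilds the DataFrame from the question. The names `groups`, `top_levels`, `chosen` and `result` are only illustrative. It reads the top-level labels with `get_level_values(0).unique()` rather than `levels[0]`, since `levels[0]` is stored sorted and can still list labels that are no longer used after the frame has been subset.

```python
import random

import pandas as pd

# Rebuild the two-level columns from the question (illustrative helper dict).
groups = {
    "Solid": ["pen", "paper", "pipe"],
    "Liquid": ["water", "juice", "milk"],
    "Gas": ["oxygen", "nitrogen", "helium"],
}
columns = pd.MultiIndex.from_tuples(
    [(top, sub) for top, subs in groups.items() for sub in subs]
)
data = [
    [5, 2, 1, 4, 3, 1, 7, 8, 10],
    [5, 2, 1, 4, 3, 1, 7, 8, 10],
    [5, 2, 1, 4, 3, 1, 7, 8, 10],
    [4, 4, 7, 3, 2, 0, 6, 7, 9],
    [3, 7, 9, 4, 6, 5, 3, 3, 4],
]
df = pd.DataFrame(data, columns=columns)

# Top-level labels in their left-to-right order, without unused levels.
top_levels = df.columns.get_level_values(0).unique().tolist()

# Pick 2 of the 3 groups; indexing with a list of top-level labels
# keeps every sub-column under each chosen label.
chosen = random.sample(top_levels, 2)
result = df[chosen]

print(chosen)
print(result)
```

Calling `random.seed(...)` beforehand makes the pick repeatable, if that matters for your use case.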
Answered By: Allen Qin

import numpy as np
import pandas as pd


arrays = [np.array(['Liquid', 'Liquid','Liquid', 'Solid', 'Solid','Solid', 'Gas', 'Gas', 'Gas']),
          np.array(['water', 'coke', 'juice', 'pen', 'paper', 'pipe', 'oxygen', 'helium','nitrogen'])]

df = pd.DataFrame(np.random.randn(3, 9), columns=arrays)

print(df.to_string())
"""
     Liquid                         Solid                           Gas                    
      water      coke     juice       pen     paper      pipe    oxygen    helium  nitrogen
0 -0.502442  0.625491  0.875741  1.207261  0.593263  1.121527  1.524995  1.097759  0.290256
1  0.008101  0.584346 -0.031353 -0.697306  1.802649 -0.536247 -0.329816 -0.884734 -1.844672
2 -0.335997  2.174418  0.022494 -0.524555 -0.245928 -0.646416 -1.444915 -0.728856 -0.698570


"""

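# Top-level labels and, for each one, the sub-columns to keep.
l0 = ['Liquid','Solid','Gas']
l1 = [['water','coke'],['pen'],['helium','nitrogen']]

# One small frame per label: the label 'a' is repeated for each of its sub-columns 'b'.
aa = [pd.DataFrame({'a': a,'b':b}) for a,b in zip(l0,l1)]
print(aa)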

"""
[        a      b
0  Liquid  water
1  Liquid   coke,        a    b
0  Solid  pen,      a         b
0  Gas    helium
1  Gas  nitrogen]
"""
bb = pd.concat(aa)
print(bb)
"""
       a         b
0  Liquid     water
1  Liquid      coke
0   Solid       pen
0     Gas    helium
1     Gas  nitrogen
"""
cc = pd.concat(aa).values
print(cc)

"""
[['Liquid' 'water']
 ['Liquid' 'coke']
 ['Solid' 'pen']
 ['Gas' 'helium']
 ['Gas' 'nitrogen']]
"""
# Index df with the array of (top-level, sub-column) pairs to keep just those columns.
dd = df[cc]
print(dd)

"""
     Liquid               Solid       Gas          
      water      coke       pen    helium  nitrogen
0  1.683209 -0.523310 -0.440744  0.158327 -1.114051
1  0.616965 -0.586281 -0.007159  0.812071 -1.336370
2  1.569241 -0.190732 -1.504943 -0.813679  0.248096
"""

"""
In the same way, to keep only 2 of the top-level columns,
for example 2 sub-columns from Liquid and 2 from Gas:
"""
l2 = ['Liquid','Gas']
l3 = [['water','coke'],['helium','nitrogen']]


p = pd.concat([pd.DataFrame({'a': a, 'b': b}) for a, b in zip(l2, l3)]).values
print(p)
p1 = df[p]
print(p1)


"""
     Liquid                 Gas          
      water      coke    helium  nitrogen
0 -1.047081  0.195301 -1.709490 -1.483606
1 -0.685039 -0.038681  0.305787 -0.889225
2 -1.034577 -0.504169 -0.087984 -1.079425
"""
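
The groups and sub-columns above are hard-coded. As a hedged sketch of how the same idea could be combined with random selection, the snippet below picks 2 top-level groups at random, then 2 sub-columns from each, and selects them with a list of `(top, sub)` tuples via `df.loc`. The names `chosen`, `pairs` and `subset` are illustrative and not part of the answer above.

```python
import random

import numpy as np
import pandas as pd

# Same column layout as above, built explicitly as a MultiIndex.
columns = pd.MultiIndex.from_arrays([
    ["Liquid", "Liquid", "Liquid", "Solid", "Solid", "Solid", "Gas", "Gas", "Gas"],
    ["water", "coke", "juice", "pen", "paper", "pipe", "oxygen", "helium", "nitrogen"],
])
df = pd.DataFrame(np.random.randn(3, 9), columns=columns)

# Randomly choose 2 of the top-level groups.
top_levels = df.columns.get_level_values(0).unique().tolist()
chosen = random.sample(top_levels, 2)

# For each chosen group, randomly keep 2 of its sub-columns.
pairs = []
for top in chosen:
    subs = [sub for t, sub in df.columns if t == top]
    pairs += [(top, sub) for sub in random.sample(subs, 2)]

# A list of (top, sub) tuples selects exactly those columns.
subset = df.loc[:, pairs]
print(subset)
```

Replacing `random.sample(subs, 2)` with `subs` keeps every sub-column of the chosen groups, which is what the question asks for.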
Answered By: Soudipta Dutta