Generating combinations in pandas dataframe
Question:
I have a dataset with ["Uni", ‘Region’, "Profession", "Level_Edu", ‘Financial_Base’, ‘Learning_Time’, ‘GENDER’] columns. All values in ["Uni", ‘Region’, "Profession"] are filled while ["Level_Edu", ‘Financial_Base’, ‘Learning_Time’, ‘GENDER’] always contain NAs.
For each column with NAs there are several possible values
Level_Edu = ['undergrad', 'grad', 'PhD']
Financial_Base = ['personal', 'grant']
Learning_Time = ["morning", "day", "evening"]
GENDER = ['Male', 'Female']
I want to generate all possible combinations of ["Level_Edu", ‘Financial_Base’, ‘Learning_Time’, ‘GENDER’] for each observation in the initial data. So that each initial observation would be represented by 36 new observations (obtained by the formula of combinatorics: N1 * N2 * N3 * N4, where Ni is the length of the i-th vector of possible values for a column)
Here is a Python code for recreating two initial observations and approximation of the result I desire to get (showing 3 combinations out of 36 for each initial observation I want).
import pandas as pd
import numpy as np
sample_data_as_is = pd.DataFrame([["X1", "Y1", "Z1", np.nan, np.nan, np.nan, np.nan], ["X2", "Y2", "Z2", np.nan, np.nan, np.nan, np.nan]], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])
sample_data_to_be = pd.DataFrame([["X1", "Y1", "Z1", "undergrad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "personal", "morning", 'Male'],
["X1", "Y1", "Z1", "grad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "grad", "personal", "morning", 'Male'],
["X1", "Y1", "Z1", "undergrad", "grant", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "grant", "morning", 'Male']], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])
Answers:
You can combine itertools.product
and a cross-merge
:
from itertools import product
data = {'Level_Edu': ['undergrad', 'grad', 'PhD'],
'Financial_Base': ['personal', 'grant'],
'Learning_Time': ['morning', 'day', 'evening'],
'GENDER': ['Male', 'Female']}
out = (sample_data_as_is[['Uni', 'Region', 'Profession']]
.merge(pd.DataFrame(product(*data.values()), columns=data.keys()), how='cross')
)
Output:
Uni Region Profession Level_Edu Financial_Base Learning_Time GENDER
0 X1 Y1 Z1 undergrad personal morning Male
1 X1 Y1 Z1 undergrad personal morning Female
2 X1 Y1 Z1 undergrad personal day Male
3 X1 Y1 Z1 undergrad personal day Female
4 X1 Y1 Z1 undergrad personal evening Male
.. .. ... ... ... ... ... ...
67 X2 Y2 Z2 PhD grant morning Female
68 X2 Y2 Z2 PhD grant day Male
69 X2 Y2 Z2 PhD grant day Female
70 X2 Y2 Z2 PhD grant evening Male
71 X2 Y2 Z2 PhD grant evening Female
[72 rows x 7 columns]
If you want the specific order of rows/columns from your expected output:
cols = ['Uni', 'Region', 'Profession']
out = (pd.DataFrame(product(*data.values()), columns=data.keys())
.merge(sample_data_as_is[cols], how='cross')
[cols+list(data)]
)
Output:
Uni Region Profession Level_Edu Financial_Base Learning_Time GENDER
0 X1 Y1 Z1 undergrad personal morning Male
1 X2 Y2 Z2 undergrad personal morning Male
2 X1 Y1 Z1 undergrad personal morning Female
3 X2 Y2 Z2 undergrad personal morning Female
4 X1 Y1 Z1 undergrad personal day Male
.. .. ... ... ... ... ... ...
67 X2 Y2 Z2 PhD grant day Female
68 X1 Y1 Z1 PhD grant evening Male
69 X2 Y2 Z2 PhD grant evening Male
70 X1 Y1 Z1 PhD grant evening Female
71 X2 Y2 Z2 PhD grant evening Female
[72 rows x 7 columns]
One option is with pyjanitor’s expand_grid, which is a wrapper around np.meshgrid
, which could come in handy for large sizes and can be more performant than itertools.product
at relatively large sizes (at really large sizes you can get memory error, which can happen with cartesian join, which is what this is essentially):
# pip install pyjanitor
import pandas as pd
import janitor
trimmed = sample_data_as_is.iloc[:, :3]
others = {'Level_Edu':Level_Edu,
'Financial_Base':Financial_Base,
'Learning_Time':Learning_Time,
'GENDER':GENDER}
new_columns = trimmed.columns.union(others,sort=False)
(trimmed
.expand_grid(df_key='df',others=others)
.set_axis(new_columns,axis='columns')
)
Uni Region Profession Level_Edu Financial_Base Learning_Time GENDER
0 X1 Y1 Z1 undergrad personal morning Male
1 X1 Y1 Z1 undergrad personal morning Female
2 X1 Y1 Z1 undergrad personal day Male
3 X1 Y1 Z1 undergrad personal day Female
4 X1 Y1 Z1 undergrad personal evening Male
.. .. ... ... ... ... ... ...
67 X2 Y2 Z2 PhD grant morning Female
68 X2 Y2 Z2 PhD grant day Male
69 X2 Y2 Z2 PhD grant day Female
70 X2 Y2 Z2 PhD grant evening Male
71 X2 Y2 Z2 PhD grant evening Female
[72 rows x 7 columns]
I have a dataset with ["Uni", ‘Region’, "Profession", "Level_Edu", ‘Financial_Base’, ‘Learning_Time’, ‘GENDER’] columns. All values in ["Uni", ‘Region’, "Profession"] are filled while ["Level_Edu", ‘Financial_Base’, ‘Learning_Time’, ‘GENDER’] always contain NAs.
For each column with NAs there are several possible values
Level_Edu = ['undergrad', 'grad', 'PhD']
Financial_Base = ['personal', 'grant']
Learning_Time = ["morning", "day", "evening"]
GENDER = ['Male', 'Female']
I want to generate all possible combinations of ["Level_Edu", ‘Financial_Base’, ‘Learning_Time’, ‘GENDER’] for each observation in the initial data. So that each initial observation would be represented by 36 new observations (obtained by the formula of combinatorics: N1 * N2 * N3 * N4, where Ni is the length of the i-th vector of possible values for a column)
Here is a Python code for recreating two initial observations and approximation of the result I desire to get (showing 3 combinations out of 36 for each initial observation I want).
import pandas as pd
import numpy as np
sample_data_as_is = pd.DataFrame([["X1", "Y1", "Z1", np.nan, np.nan, np.nan, np.nan], ["X2", "Y2", "Z2", np.nan, np.nan, np.nan, np.nan]], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])
sample_data_to_be = pd.DataFrame([["X1", "Y1", "Z1", "undergrad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "personal", "morning", 'Male'],
["X1", "Y1", "Z1", "grad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "grad", "personal", "morning", 'Male'],
["X1", "Y1", "Z1", "undergrad", "grant", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "grant", "morning", 'Male']], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])
You can combine itertools.product
and a cross-merge
:
from itertools import product
data = {'Level_Edu': ['undergrad', 'grad', 'PhD'],
'Financial_Base': ['personal', 'grant'],
'Learning_Time': ['morning', 'day', 'evening'],
'GENDER': ['Male', 'Female']}
out = (sample_data_as_is[['Uni', 'Region', 'Profession']]
.merge(pd.DataFrame(product(*data.values()), columns=data.keys()), how='cross')
)
Output:
Uni Region Profession Level_Edu Financial_Base Learning_Time GENDER
0 X1 Y1 Z1 undergrad personal morning Male
1 X1 Y1 Z1 undergrad personal morning Female
2 X1 Y1 Z1 undergrad personal day Male
3 X1 Y1 Z1 undergrad personal day Female
4 X1 Y1 Z1 undergrad personal evening Male
.. .. ... ... ... ... ... ...
67 X2 Y2 Z2 PhD grant morning Female
68 X2 Y2 Z2 PhD grant day Male
69 X2 Y2 Z2 PhD grant day Female
70 X2 Y2 Z2 PhD grant evening Male
71 X2 Y2 Z2 PhD grant evening Female
[72 rows x 7 columns]
If you want the specific order of rows/columns from your expected output:
cols = ['Uni', 'Region', 'Profession']
out = (pd.DataFrame(product(*data.values()), columns=data.keys())
.merge(sample_data_as_is[cols], how='cross')
[cols+list(data)]
)
Output:
Uni Region Profession Level_Edu Financial_Base Learning_Time GENDER
0 X1 Y1 Z1 undergrad personal morning Male
1 X2 Y2 Z2 undergrad personal morning Male
2 X1 Y1 Z1 undergrad personal morning Female
3 X2 Y2 Z2 undergrad personal morning Female
4 X1 Y1 Z1 undergrad personal day Male
.. .. ... ... ... ... ... ...
67 X2 Y2 Z2 PhD grant day Female
68 X1 Y1 Z1 PhD grant evening Male
69 X2 Y2 Z2 PhD grant evening Male
70 X1 Y1 Z1 PhD grant evening Female
71 X2 Y2 Z2 PhD grant evening Female
[72 rows x 7 columns]
One option is with pyjanitor’s expand_grid, which is a wrapper around np.meshgrid
, which could come in handy for large sizes and can be more performant than itertools.product
at relatively large sizes (at really large sizes you can get memory error, which can happen with cartesian join, which is what this is essentially):
# pip install pyjanitor
import pandas as pd
import janitor
trimmed = sample_data_as_is.iloc[:, :3]
others = {'Level_Edu':Level_Edu,
'Financial_Base':Financial_Base,
'Learning_Time':Learning_Time,
'GENDER':GENDER}
new_columns = trimmed.columns.union(others,sort=False)
(trimmed
.expand_grid(df_key='df',others=others)
.set_axis(new_columns,axis='columns')
)
Uni Region Profession Level_Edu Financial_Base Learning_Time GENDER
0 X1 Y1 Z1 undergrad personal morning Male
1 X1 Y1 Z1 undergrad personal morning Female
2 X1 Y1 Z1 undergrad personal day Male
3 X1 Y1 Z1 undergrad personal day Female
4 X1 Y1 Z1 undergrad personal evening Male
.. .. ... ... ... ... ... ...
67 X2 Y2 Z2 PhD grant morning Female
68 X2 Y2 Z2 PhD grant day Male
69 X2 Y2 Z2 PhD grant day Female
70 X2 Y2 Z2 PhD grant evening Male
71 X2 Y2 Z2 PhD grant evening Female
[72 rows x 7 columns]