Group dataframes using series as a reference
Question:
There is 1 Dataframe and 1 series
dataframe(df1) is information about the model and parts.
A model consists of several parts.
df1 = pd.DataFrame({'model':['A','A','A','A','A','B','B','B','B','B'],
'part':['part_1','part_1','part_2','part_2','part_3','part_4','part_4','part_5','part_5','part_5']})
Looking at s1, "A" is the model and "B" is the part.
s1 = pd.Series({'A':'model', 'B':'part'})
Series as a reference
"A" is grouped between the same models, and "B" is grouped together with the same parts.
should be displayed in the "batch" column of df1
Answers:
You can do it this way by creating a grouping column:
grp = pd.concat([df1[df1['model'] == k][v] for k, v in s1.items()])
df1['batch'] = 'batch_' + (grp != grp.shift()).cumsum().astype(str)
df1
Output:
model part batch
0 A part_1 batch_1
1 A part_1 batch_1
2 A part_2 batch_1
3 A part_2 batch_1
4 A part_3 batch_1
5 B part_4 batch_2
6 B part_4 batch_2
7 B part_5 batch_3
8 B part_5 batch_3
9 B part_5 batch_3
Using list comprehension, build a list of series from df1 using the keys and values from s1. pd.concat this list of series together to create your groupings, then use string manipulation to create batch strings for each row and assign to new dataframe column header ‘batch’.
Update for @mozway comment.
df1['batch'] = 'batch_' + (df1.groupby(grp).ngroup()+1).astype(str)
If you really want to group by an arbitrary column mapped by a dictionary, you should use an indexing lookup:
# map column names per model
idx, cols = pd.factorize(df1['model'].map(s1))
# get column value
grp = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
# use this as grouper
df1['batch'] = (df1.groupby(['model', grp])
.ngroup().add(1)
.astype(str).radd('batch_')
)
If you don’t have values in "Model" that can be found in other columns, you can simplify the last part:
df1['batch'] = 'batch_' + pd.Series(pd.factorize(grp)[0]+1, index=df1.index).astype(str)
Output:
model part batch
0 A part_1 batch_1
1 A part_1 batch_1
2 A part_2 batch_1
3 A part_2 batch_1
4 A part_3 batch_1
5 B part_4 batch_2
6 B part_4 batch_2
7 B part_5 batch_3
8 B part_5 batch_3
9 B part_5 batch_3
Intermediate grp
:
['A' 'A' 'A' 'A' 'A' 'part_4' 'part_4' 'part_5' 'part_5' 'part_5']
There is 1 Dataframe and 1 series
dataframe(df1) is information about the model and parts.
A model consists of several parts.
df1 = pd.DataFrame({'model':['A','A','A','A','A','B','B','B','B','B'],
'part':['part_1','part_1','part_2','part_2','part_3','part_4','part_4','part_5','part_5','part_5']})
Looking at s1, "A" is the model and "B" is the part.
s1 = pd.Series({'A':'model', 'B':'part'})
Series as a reference
"A" is grouped between the same models, and "B" is grouped together with the same parts.
should be displayed in the "batch" column of df1
You can do it this way by creating a grouping column:
grp = pd.concat([df1[df1['model'] == k][v] for k, v in s1.items()])
df1['batch'] = 'batch_' + (grp != grp.shift()).cumsum().astype(str)
df1
Output:
model part batch
0 A part_1 batch_1
1 A part_1 batch_1
2 A part_2 batch_1
3 A part_2 batch_1
4 A part_3 batch_1
5 B part_4 batch_2
6 B part_4 batch_2
7 B part_5 batch_3
8 B part_5 batch_3
9 B part_5 batch_3
Using list comprehension, build a list of series from df1 using the keys and values from s1. pd.concat this list of series together to create your groupings, then use string manipulation to create batch strings for each row and assign to new dataframe column header ‘batch’.
Update for @mozway comment.
df1['batch'] = 'batch_' + (df1.groupby(grp).ngroup()+1).astype(str)
If you really want to group by an arbitrary column mapped by a dictionary, you should use an indexing lookup:
# map column names per model
idx, cols = pd.factorize(df1['model'].map(s1))
# get column value
grp = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]
# use this as grouper
df1['batch'] = (df1.groupby(['model', grp])
.ngroup().add(1)
.astype(str).radd('batch_')
)
If you don’t have values in "Model" that can be found in other columns, you can simplify the last part:
df1['batch'] = 'batch_' + pd.Series(pd.factorize(grp)[0]+1, index=df1.index).astype(str)
Output:
model part batch
0 A part_1 batch_1
1 A part_1 batch_1
2 A part_2 batch_1
3 A part_2 batch_1
4 A part_3 batch_1
5 B part_4 batch_2
6 B part_4 batch_2
7 B part_5 batch_3
8 B part_5 batch_3
9 B part_5 batch_3
Intermediate grp
:
['A' 'A' 'A' 'A' 'A' 'part_4' 'part_4' 'part_5' 'part_5' 'part_5']