Group dataframes using series as a reference

Question:

There is 1 Dataframe and 1 series

dataframe(df1) is information about the model and parts.
A model consists of several parts.

df1 = pd.DataFrame({'model':['A','A','A','A','A','B','B','B','B','B'],
                'part':['part_1','part_1','part_2','part_2','part_3','part_4','part_4','part_5','part_5','part_5']})

enter image description here

Looking at s1, "A" is the model and "B" is the part.

s1 = pd.Series({'A':'model', 'B':'part'})

enter image description here

Series as a reference
"A" is grouped between the same models, and "B" is grouped together with the same parts.
should be displayed in the "batch" column of df1

The desired output is df1
enter image description here

Asked By: ghost_like

||

Answers:

You can do it this way by creating a grouping column:

grp = pd.concat([df1[df1['model'] == k][v] for k, v in s1.items()])
df1['batch'] = 'batch_' + (grp != grp.shift()).cumsum().astype(str)
df1

Output:

  model    part    batch
0     A  part_1  batch_1
1     A  part_1  batch_1
2     A  part_2  batch_1
3     A  part_2  batch_1
4     A  part_3  batch_1
5     B  part_4  batch_2
6     B  part_4  batch_2
7     B  part_5  batch_3
8     B  part_5  batch_3
9     B  part_5  batch_3

Using list comprehension, build a list of series from df1 using the keys and values from s1. pd.concat this list of series together to create your groupings, then use string manipulation to create batch strings for each row and assign to new dataframe column header ‘batch’.

Update for @mozway comment.

df1['batch'] = 'batch_' + (df1.groupby(grp).ngroup()+1).astype(str)
Answered By: Scott Boston

If you really want to group by an arbitrary column mapped by a dictionary, you should use an indexing lookup:

# map column names per model
idx, cols = pd.factorize(df1['model'].map(s1))

# get column value
grp = df1.reindex(cols, axis=1).to_numpy()[np.arange(len(df1)), idx]

# use this as grouper
df1['batch'] = (df1.groupby(['model', grp])
                .ngroup().add(1)
                .astype(str).radd('batch_')
                )

If you don’t have values in "Model" that can be found in other columns, you can simplify the last part:

df1['batch'] = 'batch_' + pd.Series(pd.factorize(grp)[0]+1, index=df1.index).astype(str)

Output:


  model    part    batch
0     A  part_1  batch_1
1     A  part_1  batch_1
2     A  part_2  batch_1
3     A  part_2  batch_1
4     A  part_3  batch_1
5     B  part_4  batch_2
6     B  part_4  batch_2
7     B  part_5  batch_3
8     B  part_5  batch_3
9     B  part_5  batch_3

Intermediate grp:

['A' 'A' 'A' 'A' 'A' 'part_4' 'part_4' 'part_5' 'part_5' 'part_5']
Answered By: mozway
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.