Search, under a condition on one column, whether a value in another column is unique, and write the result to a third column

Question:

I have a data frame with 70k rows and two columns. Col1 contains the bill-of-materials (BOM) name and the customer, and Col2 contains a part number (which is part of the BOM).

         Col1  Col2
0  TUR, Cust1  1001
1  GAR, Cust2  1001
2  FOR, Cust3  1001
3  ERB, Cust1  1002
4  PNR, Cust1  1002
5  DUL, Cust2  1003
6  COC, Cust3  1003
7  ETM, Cust1  1004
8  ROW, Cust3  1005
9  HON, Cust3  1005 

When searching for Cust1, I want to see in Col3 whether the part number is purchased exclusively for that customer. Something like this:

         Col1  Col2   Col3
0  TUR, Cust1  1001  false
1  GAR, Cust2  1001  false
2  FOR, Cust3  1001  false
3  ERB, Cust1  1002   true
4  PNR, Cust1  1002   true
5  DUL, Cust2  1003  false
6  COC, Cust3  1003  false
7  ETM, Cust1  1004   true
8  ROW, Cust3  1005  false
9  HON, Cust3  1005  false

I already tried extracting duplicates with df.duplicated and evaluating the customer name with str.contains, but without a satisfying result. Is there a smart solution that I don't know about?
I am new to Python and getting nowhere with this problem.
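
For reference, the sample frame above can be reconstructed like this (a minimal sketch of the example data only; the real frame has 70k rows):

import pandas as pd

df = pd.DataFrame({
    'Col1': ['TUR, Cust1', 'GAR, Cust2', 'FOR, Cust3', 'ERB, Cust1', 'PNR, Cust1',
             'DUL, Cust2', 'COC, Cust3', 'ETM, Cust1', 'ROW, Cust3', 'HON, Cust3'],
    'Col2': [1001, 1001, 1001, 1002, 1002, 1003, 1003, 1004, 1005, 1005],
})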

Asked By: dawn


Answers:

Extract the customer from Col1, then group by part number and count the number of unique customers, which should be equal to 1:

df['Cust'] = df['Col1'].str.split(', ').str[-1]                    # customer is the last comma-separated token
df['Col3'] = df.groupby('Col2')['Cust'].transform('nunique').eq(1) # exactly one unique customer per part?
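
If the helper column is not wanted in the frame, the same check can be done with a temporary Series instead (a sketch of the same idea, not a different method):

cust = df['Col1'].str.split(', ').str[-1]                        # customer, not stored in df
df['Col3'] = cust.groupby(df['Col2']).transform('nunique').eq(1) # one unique customer per part?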

If you are only interested in checking the duplicates for one customer at a time, here is a simpler version:

m = df['Col1'].str.split(', ').str[-1] == 'Cust1' # is cust1?
df['Col3'] = m.groupby(df['Col2']).transform('all') # are all cust1 per part?
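
If the customer of interest changes often, the mask version can be wrapped in a small helper (a sketch; the function name exclusive_to is just for illustration):

def exclusive_to(df, customer):
    # True per row when every purchase of that row's part number is for `customer`
    is_cust = df['Col1'].str.split(', ').str[-1] == customer
    return is_cust.groupby(df['Col2']).transform('all')

df['Col3'] = exclusive_to(df, 'Cust1')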

Result

         Col1  Col2   Cust   Col3
0  TUR, Cust1  1001  Cust1  False
1  GAR, Cust2  1001  Cust2  False
2  FOR, Cust3  1001  Cust3  False
3  ERB, Cust1  1002  Cust1   True
4  PNR, Cust1  1002  Cust1   True
5  DUL, Cust2  1003  Cust2  False
6  COC, Cust3  1003  Cust3  False
7  ETM, Cust1  1004  Cust1   True
Answered By: Shubham Sharma

One approach could be as follows:

df['Col3'] = df.Col1.str.extract(r',\s(.*)$')[0]
# or: `df['Col1'].str.split(', ', expand=True)[1]`

df['Col3'] = df.Col2.map(df.drop_duplicates(subset=['Col2','Col3'])
                         ['Col2'].value_counts().eq(1))

print(df)

         Col1  Col2   Col3
0  TUR, Cust1  1001  False
1  GAR, Cust2  1001  False
2  FOR, Cust3  1001  False
3  ERB, Cust1  1002   True
4  PNR, Cust1  1002   True
5  DUL, Cust2  1003  False
6  COC, Cust3  1003  False
7  ETM, Cust1  1004   True

Explanation

  • Use Series.str.extract to get the customer names from Col1 and assign to a new column, Col3. (Alternatively, use Series.str.split.)
  • Next, use df.drop_duplicates with the subset parameter set to ['Col2','Col3']. E.g. for a duplicate like 1002, Cust1, we will keep only the first occurrence.
  • Now, select Col2 and apply Series.value_counts. Result at this stage for df.drop_duplicates(subset=['Col2','Col3'])['Col2'].value_counts() would look as follows:
1001    3
1003    2
1002    1
1004    1
Name: Col2, dtype: int64
  • The values from Col2 with 1 as the count (i.e. 1002, 1004) will occur only for one customer, so we can chain Series.eq to get back True for these values only, and False for the others.
  • Finally, we want to place this result inside Series.map, applied to Col2 to match the correct booleans, and overwrite Col3 again with the result. The sketch below walks through these intermediate steps.
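
Put together, the intermediate steps can be inspected like this (a sketch over the same example frame; the names deduped and counts are just for illustration):

# customer extracted from Col1
df['Col3'] = df.Col1.str.extract(r',\s(.*)$')[0]

# keep one row per (part number, customer) pair
deduped = df.drop_duplicates(subset=['Col2', 'Col3'])

# count how many distinct customers buy each part
counts = deduped['Col2'].value_counts()

# parts bought by exactly one customer -> True, mapped back onto every row
df['Col3'] = df['Col2'].map(counts.eq(1))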
Answered By: ouroboros1