Check a column for substring to assign new columns with values

Question:

This is my dataframe with 2 columns:

ID      CODES
36233   LEH,PW
6175    N/A
6242    
6680    MS,XL,JFK

In column CODES, I need to identify the comma (",") and then count the number of commas and return it in a dataframe:

Output:

ID      CODES       HAS COMMA   NO. OF COMMAS
36233   LEH,PW      TRUE        1
6175    N/A         FALSE       0
6242                FALSE       0
6680    MS,XL,JFK   TRUE        2

So far I’ve tried DF['HAS COMMA'] = np.where(DF['CODES'].str.contains(','), True, False), but this returns TRUE where there are blanks. 🙁

Additionally, DF['NO OF COMMAS'] = DF['CODES'].count(",") returns an error.

Asked By: Sumit


Answers:

How about this:

df['HAS COMMA'],df['NO. OF COMMA'] = [df.CODES.str.contains(',').fillna(False), df.CODES.str.count(',').fillna(0)]

prints:

      ID      CODES  HAS COMMA  NO. OF COMMA
0  36233     LEH,PW       True           1.0
1   6175        N/A      False           0.0
2   6242        NaN      False           0.0
3   6680  MS,XL,JFK       True           2.0
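
A self-contained variant of the same idea, as a minimal sketch. It assumes the blank CODES cell is NaN (as in the output above) and uses the name NO. OF COMMAS from the question. Passing na=False to str.contains avoids the separate fillna, and casting the count to int gives 1 rather than 1.0:

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; the blank CODES cell is assumed to be NaN.
df = pd.DataFrame({
    "ID": [36233, 6175, 6242, 6680],
    "CODES": ["LEH,PW", "N/A", np.nan, "MS,XL,JFK"],
})

# na=False returns False for missing values; regex=False treats "," as a literal.
df["HAS COMMA"] = df["CODES"].str.contains(",", na=False, regex=False)

# str.count leaves missing values as missing, so fill with 0 before casting to int.
df["NO. OF COMMAS"] = df["CODES"].str.count(",").fillna(0).astype(int)

print(df)
```
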
Answered By: sophocles

Pandas string methods are not optimized, so a plain Python list comprehension can be more efficient for this task. For example, the code below was about 8 times faster than the equivalent pandas str methods on a df with 4k rows.

Simply check whether each value of df.CODES is a string containing a comma, and count only when it is. The isinstance check matters because the blank cell is NaN (a float), and ',' in s raises a TypeError on a float.

df[['HAS COMMA', 'NO. OF COMMA']] = [[True, s.count(',')] if isinstance(s, str) and ',' in s else [False, 0] for s in df['CODES'].tolist()]
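
The list-comprehension approach can also be sketched as a small helper. The function name comma_stats is hypothetical, the sample frame assumes the blank cell is NaN, and the timing at the end is only a rough way to compare the two approaches; results will vary with the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def comma_stats(values):
    """Return (has_comma, n_commas) lists; non-string entries such as NaN count as 0."""
    counts = [v.count(",") if isinstance(v, str) else 0 for v in values]
    return [c > 0 for c in counts], counts


# Sample frame rebuilt from the question; the blank CODES cell is assumed to be NaN.
df = pd.DataFrame({
    "ID": [36233, 6175, 6242, 6680],
    "CODES": ["LEH,PW", "N/A", np.nan, "MS,XL,JFK"],
})
df["HAS COMMA"], df["NO. OF COMMAS"] = comma_stats(df["CODES"].tolist())
print(df)

# Rough comparison on 4,000 rows (the sample column repeated 1,000 times).
big = pd.concat([df[["CODES"]]] * 1000, ignore_index=True)
t_list = timeit.timeit(lambda: comma_stats(big["CODES"].tolist()), number=20)
t_str = timeit.timeit(lambda: big["CODES"].str.count(",").fillna(0), number=20)
print(f"list comprehension: {t_list:.4f}s, pandas str.count: {t_str:.4f}s")
```

Assigning the two columns separately keeps their dtypes as bool and int instead of mixing them in one nested list.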

