How to select rows whose values start with specific letters, by group, in a Python dataframe?

Question:

I have the following dataframe "data", composed of IDs and their associated cluster numbers:

ID      cluster 
FP_101   1  
FP_102   1     
SP_209   3
SP_300   3
SP_209   1
FP_45    90
SP_50    90
FP_398   100
...

I would like to print the clusters that contain more than one ID starting with SP and/or FP.
I think I have the two parts of the answer, but I do not know how to combine them properly:

  • data = data[data['ID'].str.startswith('FP')] (same for SP)
  • selection function: data = data.groupby(['cluster']).filter(lambda x: x['ID'].nunique() > 1)

For the example above, the result should be:

    ID      cluster 
    FP_101   1  
    FP_102   1
    SP_209   1     
    SP_209   3
    SP_300   3

How can I combine these functions to obtain this result?

Asked By: JEG

||

Answers:

This is my understanding of your question; let me know if it helps:

  1. Separating SP & FP

df['Prefix'] = df['ID'].apply(lambda x: x.split('_')[0])

  2. Grouping by clusters

df2 = df.groupby(['cluster', 'Prefix'], as_index=False).agg({'ID': ['nunique', 'unique']})

  3. Filtering

df2.columns = df2.columns.to_flat_index().str.join('')

df2[df2['IDnunique'] > 1]
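The steps above give one summary row per (cluster, prefix) pair. If you want the original rows back, as in the expected output of the question, a minimal sketch could look like the following. It assumes that a cluster qualifies as soon as one prefix (FP or SP) has more than one distinct ID in it, which is my reading of the example, and that duplicate rows do not matter. The names counts, keep and result are only illustrative, and the sample frame contains just the rows visible in the question.

```python
import pandas as pd

# Sample rows copied from the question (anything hidden behind "..." is left out).
data = pd.DataFrame({
    "ID": ["FP_101", "FP_102", "SP_209", "SP_300", "SP_209", "FP_45", "SP_50", "FP_398"],
    "cluster": [1, 1, 3, 3, 1, 90, 90, 100],
})

# Prefix before the underscore, e.g. "FP" or "SP".
data["Prefix"] = data["ID"].str.split("_").str[0]

# Number of distinct IDs per (cluster, Prefix) pair, restricted to FP/SP IDs.
# Assumption: only these two prefixes are of interest.
wanted = data[data["Prefix"].isin(["FP", "SP"])]
counts = wanted.groupby(["cluster", "Prefix"])["ID"].nunique()

# Clusters in which at least one prefix has more than one distinct ID.
keep = counts[counts > 1].index.get_level_values("cluster").unique()

# All original rows that belong to those clusters.
result = (
    data[data["cluster"].isin(keep)]
    .drop(columns="Prefix")
    .sort_values(["cluster", "ID"])
)
print(result)
```

On the sample data this keeps clusters 1 and 3. Cluster 90 is dropped because it has only one FP ID and one SP ID.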

Answered By: Megha John