How to find first occurence of a specified integer over multiple columns using Pandas?

Question:

I have this dataset:

        2010            2011            2012        
0   NaN                 NaN             505303.0
1   542225.0            NaN             210530.0
2   123210.0            429439.0        543964.0
3   434304.0            540325.0        NaN
4   750450.0            143430.0        540425.0
5   543015.0            549320.0        104365.0

and I want first to find the first digit for each cell like this (see MWE):

    2010    2011    2012
0   -       -       5
1   5       -       2
2   1       4       5
3   4       5       -
4   7       1       5
5   5       5       1

but finally I want to count the first occurence of 5 in each row, and which year it occured. If 5 occurs several places, I only want to know the first one. How do I accomplish this?

    2010    2011    2012    Year
0   -       -       5       2012
1   5       -       2       2010
2   1       4       5       2012
3   4       5       -       2011
4   7       1       5       2012
5   5       5       1       2010

Below you will find the MWE:

import numpy as np

data = {"2010": [np.nan, 542225, 123210, 434304, 750450, 543015],
        "2011": [np.nan, np.nan, 429439, 540325, 143430, 549320],
        "2012": [505303, 210530, 543964, np.nan, 540425, 104365]
       }

df_t = pd.DataFrame(data)

for col in df_t.columns:
    df_t[col] = (df_t[col]
           .fillna(-1)
           .astype(str)
           .str[0]
           )
Asked By: snate

||

Answers:

Your solution should be used with DataFrame.apply:

df = df_t.fillna(-1).astype(str).apply(lambda x: x.str[0])
print (df)
  2010 2011 2012
0    -    -    5
1    5    -    2
2    1    4    5
3    4    5    -
4    7    1    5
5    5    5    1

Then compare by string '5' and get first matched year by DataFrame.idxmax, if no match get None:

m = df.eq('5')
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
  2010 2011 2012  Year
0    -    -    5  2012
1    5    -    2  2010
2    1    4    5  2012
3    4    5    -  2011
4    7    1    5  2012
5    5    5    1  2010

Another idea with numeric only values:

df = df_t // (10 ** np.log10(df_t).fillna(1).astype(int))
print (df)
   2010  2011  2012
0   NaN   NaN   5.0
1   5.0   NaN   2.0
2   1.0   4.0   5.0
3   4.0   5.0   NaN
4   7.0   1.0   5.0
5   5.0   5.0   1.0

m = df.eq(5)
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
   2010  2011  2012  Year
0   NaN   NaN   5.0  2012
1   5.0   NaN   2.0  2010
2   1.0   4.0   5.0  2012
3   4.0   5.0   NaN  2011
4   7.0   1.0   5.0  2012
5   5.0   5.0   1.0  2010
Answered By: jezrael
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.