How to find first occurence of a specified integer over multiple columns using Pandas?
Question:
I have this dataset:
2010 2011 2012
0 NaN NaN 505303.0
1 542225.0 NaN 210530.0
2 123210.0 429439.0 543964.0
3 434304.0 540325.0 NaN
4 750450.0 143430.0 540425.0
5 543015.0 549320.0 104365.0
and I want first to find the first digit for each cell like this (see MWE
):
2010 2011 2012
0 - - 5
1 5 - 2
2 1 4 5
3 4 5 -
4 7 1 5
5 5 5 1
but finally I want to count the first occurence of 5
in each row, and which year it occured. If 5
occurs several places, I only want to know the first one. How do I accomplish this?
2010 2011 2012 Year
0 - - 5 2012
1 5 - 2 2010
2 1 4 5 2012
3 4 5 - 2011
4 7 1 5 2012
5 5 5 1 2010
Below you will find the MWE
:
import numpy as np
data = {"2010": [np.nan, 542225, 123210, 434304, 750450, 543015],
"2011": [np.nan, np.nan, 429439, 540325, 143430, 549320],
"2012": [505303, 210530, 543964, np.nan, 540425, 104365]
}
df_t = pd.DataFrame(data)
for col in df_t.columns:
df_t[col] = (df_t[col]
.fillna(-1)
.astype(str)
.str[0]
)
Answers:
Your solution should be used with DataFrame.apply
:
df = df_t.fillna(-1).astype(str).apply(lambda x: x.str[0])
print (df)
2010 2011 2012
0 - - 5
1 5 - 2
2 1 4 5
3 4 5 -
4 7 1 5
5 5 5 1
Then compare by string '5'
and get first matched year by DataFrame.idxmax
, if no match get None
:
m = df.eq('5')
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
2010 2011 2012 Year
0 - - 5 2012
1 5 - 2 2010
2 1 4 5 2012
3 4 5 - 2011
4 7 1 5 2012
5 5 5 1 2010
Another idea with numeric only values:
df = df_t // (10 ** np.log10(df_t).fillna(1).astype(int))
print (df)
2010 2011 2012
0 NaN NaN 5.0
1 5.0 NaN 2.0
2 1.0 4.0 5.0
3 4.0 5.0 NaN
4 7.0 1.0 5.0
5 5.0 5.0 1.0
m = df.eq(5)
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
2010 2011 2012 Year
0 NaN NaN 5.0 2012
1 5.0 NaN 2.0 2010
2 1.0 4.0 5.0 2012
3 4.0 5.0 NaN 2011
4 7.0 1.0 5.0 2012
5 5.0 5.0 1.0 2010
I have this dataset:
2010 2011 2012
0 NaN NaN 505303.0
1 542225.0 NaN 210530.0
2 123210.0 429439.0 543964.0
3 434304.0 540325.0 NaN
4 750450.0 143430.0 540425.0
5 543015.0 549320.0 104365.0
and I want first to find the first digit for each cell like this (see MWE
):
2010 2011 2012
0 - - 5
1 5 - 2
2 1 4 5
3 4 5 -
4 7 1 5
5 5 5 1
but finally I want to count the first occurence of 5
in each row, and which year it occured. If 5
occurs several places, I only want to know the first one. How do I accomplish this?
2010 2011 2012 Year
0 - - 5 2012
1 5 - 2 2010
2 1 4 5 2012
3 4 5 - 2011
4 7 1 5 2012
5 5 5 1 2010
Below you will find the MWE
:
import numpy as np
data = {"2010": [np.nan, 542225, 123210, 434304, 750450, 543015],
"2011": [np.nan, np.nan, 429439, 540325, 143430, 549320],
"2012": [505303, 210530, 543964, np.nan, 540425, 104365]
}
df_t = pd.DataFrame(data)
for col in df_t.columns:
df_t[col] = (df_t[col]
.fillna(-1)
.astype(str)
.str[0]
)
Your solution should be used with DataFrame.apply
:
df = df_t.fillna(-1).astype(str).apply(lambda x: x.str[0])
print (df)
2010 2011 2012
0 - - 5
1 5 - 2
2 1 4 5
3 4 5 -
4 7 1 5
5 5 5 1
Then compare by string '5'
and get first matched year by DataFrame.idxmax
, if no match get None
:
m = df.eq('5')
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
2010 2011 2012 Year
0 - - 5 2012
1 5 - 2 2010
2 1 4 5 2012
3 4 5 - 2011
4 7 1 5 2012
5 5 5 1 2010
Another idea with numeric only values:
df = df_t // (10 ** np.log10(df_t).fillna(1).astype(int))
print (df)
2010 2011 2012
0 NaN NaN 5.0
1 5.0 NaN 2.0
2 1.0 4.0 5.0
3 4.0 5.0 NaN
4 7.0 1.0 5.0
5 5.0 5.0 1.0
m = df.eq(5)
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
2010 2011 2012 Year
0 NaN NaN 5.0 2012
1 5.0 NaN 2.0 2010
2 1.0 4.0 5.0 2012
3 4.0 5.0 NaN 2011
4 7.0 1.0 5.0 2012
5 5.0 5.0 1.0 2010