pandas – find first occurrence
Question:
Suppose I have a structured dataframe as follows:
df = pd.DataFrame({"A":['a','a','a','b','b'],
"B":[1]*5})
The A
column has previously been sorted. I wish to find the first row index of where df[df.A!='a']
. The end goal is to use this index to break the data frame into groups based on A
.
Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A
has been sorted already, it would be faster if I can just find the 1st index of where df.A!='a'
. Therefore it is important that whatever method that you use the scanning stops once the first element is found.
Answers:
idxmax
and argmax
will return the position of the maximal value or the first position if the maximal value occurs more than once.
use idxmax
on df.A.ne('a')
df.A.ne('a').idxmax()
3
or the numpy
equivalent
(df.A.values != 'a').argmax()
3
However, if A
has already been sorted, then we can use searchsorted
df.A.searchsorted('a', side='right')
array([3])
Or the numpy
equivalent
df.A.values.searchsorted('a', side='right')
3
If you just want to find the first instance without going through the entire dataframe, you can go the for-loop way.
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
if df['A'][index] != 'a':
print(index)
break
The index is the row number of the 1st index of where df.A!='a'
I found there is first_valid_index function for Pandas DataFrames that will do the job, one could use it as follows:
df[df.A!='a'].first_valid_index()
3
However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:
df.loc[df.A!='a','A'].index[0]
Below I compare the total time(sec) of repeating calculations 100 times for these two options and all the codes above:
total_time_sec ratio wrt fastest algo
searchsorted numpy: 0.0007 1.00
argmax numpy: 0.0009 1.29
for loop: 0.0045 6.43
searchsorted pandas: 0.0075 10.71
idxmax pandas: 0.0267 38.14
index[0]: 0.0295 42.14
first_valid_index pandas: 0.1181 168.71
Notice numpy’s searchsorted is the winner and first_valid_index shows worst performance. Generally, numpy algorithms are faster, and the for loop does not do so bad, but it’s just because the dataframe has very few entries.
For a dataframe with 10,000 entries where the desired entries are closer to the end the results are different, with searchsorted delivering the best performance:
total_time_sec ratio wrt fastest algo
searchsorted numpy: 0.0007 1.00
searchsorted pandas: 0.0076 10.86
argmax numpy: 0.0117 16.71
index[0]: 0.0815 116.43
idxmax pandas: 0.0904 129.14
first_valid_index pandas: 0.1691 241.57
for loop: 9.6504 13786.29
The code to produce these results is below:
import timeit
# code snippet to be executed only once
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''
# code snippets whose execution time is to be measured
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]
mycode_set.append( '''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")
mycode_set.append( '''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")
mycode_set.append( '''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")
mycode_set.append( '''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")
mycode_set.append( '''df.A.values.searchsorted('a', side='right')''' )
message.append("searchsorted numpy: ")
mycode_set.append( '''for index in range(len(df['A'])):
if df['A'][index] != 'a':
ans = index
break
''')
message.append("for loop: ")
total_time_in_sec = []
for i in range(len(mycode_set)):
mycode = mycode_set[i]
total_time_in_sec.append(np.round(timeit.timeit(setup = mysetup,
stmt = mycode, number = 100),4))
output = pd.DataFrame(total_time_in_sec, index = message,
columns = ['total_time_sec' ])
output["ratio wrt fastest algo"] =
np.round(output.total_time_sec/output["total_time_sec"].min(),2)
output = output.sort_values(by = "total_time_sec")
display(output)
For the larger dataframe:
mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''
For multiple conditions:
Let’s say we have:
s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
And we want to find the first item different than a and c, we do:
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
Times:
import numpy as np
import pandas as pd
from datetime import datetime
ITERS = 1000
def pandas_multi_condition(s):
ts = datetime.now()
for i in range(ITERS):
n = s[(s != 'a') & (s != 'c')].index[0]
print(n)
print(datetime.now() - ts)
def numpy_bitwise_and(s):
ts = datetime.now()
for i in range(ITERS):
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
print(n)
print(datetime.now() - ts)
s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)
Output:
pandas_multi_condition():
4
0:00:01.144767
numpy_bitwise_and():
4
0:00:00.019013
You can iterate by dataframe rows (it is slow) and create your own logic to get values that you wanted:
def getMaxIndex(df, col)
max = -999999
rtn_index = 0
for index, row in df.iterrows():
if row[col] > max:
max = row[col]
rtn_index = index
return rtn_index
Using pandas groupby()
to group by column or list of columns. Then first()
to get the first value in each group.
import pandas as pd
df = pd.DataFrame({"A":['a','a','a','b','b'],
"B":[1]*5})
#Group df by column and get the first value in each group
grouped_df = df.groupby("A").first()
#Reset indices to match format
first_values = grouped_df.reset_index()
print(first_values)
>>> A B
0 a 1
1 b 1
Generalized Form:
index = df.loc[df.column_name == 'value_you_looking_for'].index[0]
Example:
index_of_interest = df.loc[df.A == 'a'].index[0]
Suppose I have a structured dataframe as follows:
df = pd.DataFrame({"A":['a','a','a','b','b'],
"B":[1]*5})
The A
column has previously been sorted. I wish to find the first row index of where df[df.A!='a']
. The end goal is to use this index to break the data frame into groups based on A
.
Now I realise that there is a groupby functionality. However, the dataframe is quite large and this is a simplified toy example. Since A
has been sorted already, it would be faster if I can just find the 1st index of where df.A!='a'
. Therefore it is important that whatever method that you use the scanning stops once the first element is found.
idxmax
and argmax
will return the position of the maximal value or the first position if the maximal value occurs more than once.
use idxmax
on df.A.ne('a')
df.A.ne('a').idxmax()
3
or the numpy
equivalent
(df.A.values != 'a').argmax()
3
However, if A
has already been sorted, then we can use searchsorted
df.A.searchsorted('a', side='right')
array([3])
Or the numpy
equivalent
df.A.values.searchsorted('a', side='right')
3
If you just want to find the first instance without going through the entire dataframe, you can go the for-loop way.
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
for index in range(len(df['A'])):
if df['A'][index] != 'a':
print(index)
break
The index is the row number of the 1st index of where df.A!='a'
I found there is first_valid_index function for Pandas DataFrames that will do the job, one could use it as follows:
df[df.A!='a'].first_valid_index()
3
However, this function seems to be very slow. Even taking the first index of the filtered dataframe is faster:
df.loc[df.A!='a','A'].index[0]
Below I compare the total time(sec) of repeating calculations 100 times for these two options and all the codes above:
total_time_sec ratio wrt fastest algo
searchsorted numpy: 0.0007 1.00
argmax numpy: 0.0009 1.29
for loop: 0.0045 6.43
searchsorted pandas: 0.0075 10.71
idxmax pandas: 0.0267 38.14
index[0]: 0.0295 42.14
first_valid_index pandas: 0.1181 168.71
Notice numpy’s searchsorted is the winner and first_valid_index shows worst performance. Generally, numpy algorithms are faster, and the for loop does not do so bad, but it’s just because the dataframe has very few entries.
For a dataframe with 10,000 entries where the desired entries are closer to the end the results are different, with searchsorted delivering the best performance:
total_time_sec ratio wrt fastest algo
searchsorted numpy: 0.0007 1.00
searchsorted pandas: 0.0076 10.86
argmax numpy: 0.0117 16.71
index[0]: 0.0815 116.43
idxmax pandas: 0.0904 129.14
first_valid_index pandas: 0.1691 241.57
for loop: 9.6504 13786.29
The code to produce these results is below:
import timeit
# code snippet to be executed only once
mysetup = '''import pandas as pd
import numpy as np
df = pd.DataFrame({"A":['a','a','a','b','b'],"B":[1]*5})
'''
# code snippets whose execution time is to be measured
mycode_set = ['''
df[df.A!='a'].first_valid_index()
''']
message = ["first_valid_index pandas:"]
mycode_set.append( '''df.loc[df.A!='a','A'].index[0]''')
message.append("index[0]: ")
mycode_set.append( '''df.A.ne('a').idxmax()''')
message.append("idxmax pandas: ")
mycode_set.append( '''(df.A.values != 'a').argmax()''')
message.append("argmax numpy: ")
mycode_set.append( '''df.A.searchsorted('a', side='right')''')
message.append("searchsorted pandas: ")
mycode_set.append( '''df.A.values.searchsorted('a', side='right')''' )
message.append("searchsorted numpy: ")
mycode_set.append( '''for index in range(len(df['A'])):
if df['A'][index] != 'a':
ans = index
break
''')
message.append("for loop: ")
total_time_in_sec = []
for i in range(len(mycode_set)):
mycode = mycode_set[i]
total_time_in_sec.append(np.round(timeit.timeit(setup = mysetup,
stmt = mycode, number = 100),4))
output = pd.DataFrame(total_time_in_sec, index = message,
columns = ['total_time_sec' ])
output["ratio wrt fastest algo"] =
np.round(output.total_time_sec/output["total_time_sec"].min(),2)
output = output.sort_values(by = "total_time_sec")
display(output)
For the larger dataframe:
mysetup = '''import pandas as pd
import numpy as np
n = 10000
lt = ['a' for _ in range(n)]
b = ['b' for _ in range(5)]
lt[-5:] = b
df = pd.DataFrame({"A":lt,"B":[1]*n})
'''
For multiple conditions:
Let’s say we have:
s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
And we want to find the first item different than a and c, we do:
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
Times:
import numpy as np
import pandas as pd
from datetime import datetime
ITERS = 1000
def pandas_multi_condition(s):
ts = datetime.now()
for i in range(ITERS):
n = s[(s != 'a') & (s != 'c')].index[0]
print(n)
print(datetime.now() - ts)
def numpy_bitwise_and(s):
ts = datetime.now()
for i in range(ITERS):
n = np.logical_and(s.values != 'a', s.values != 'c').argmax()
print(n)
print(datetime.now() - ts)
s = pd.Series(['a', 'a', 'c', 'c', 'b', 'd'])
print('pandas_multi_condition():')
pandas_multi_condition(s)
print()
print('numpy_bitwise_and():')
numpy_bitwise_and(s)
Output:
pandas_multi_condition():
4
0:00:01.144767
numpy_bitwise_and():
4
0:00:00.019013
You can iterate by dataframe rows (it is slow) and create your own logic to get values that you wanted:
def getMaxIndex(df, col)
max = -999999
rtn_index = 0
for index, row in df.iterrows():
if row[col] > max:
max = row[col]
rtn_index = index
return rtn_index
Using pandas groupby()
to group by column or list of columns. Then first()
to get the first value in each group.
import pandas as pd
df = pd.DataFrame({"A":['a','a','a','b','b'],
"B":[1]*5})
#Group df by column and get the first value in each group
grouped_df = df.groupby("A").first()
#Reset indices to match format
first_values = grouped_df.reset_index()
print(first_values)
>>> A B
0 a 1
1 b 1
Generalized Form:
index = df.loc[df.column_name == 'value_you_looking_for'].index[0]
Example:
index_of_interest = df.loc[df.A == 'a'].index[0]