Generating a new column in a dataframe given a value falls within a certain range of another column value
Question:
Given the following dataframe:
import random
import pandas as pd
df = pd.DataFrame({'A': [random.randrange(0, 9, 1) for i in range(10000000)],
                   'B': [random.randrange(0, 9, 1) for i in range(10000000)]})
That may look like this:
   A  B
0  8  3
1  3  0
2  8  4
3  6  5
4  8  2
...
I’d like to generate a new column called Eq. This column indicates whether the A value in each row falls within a certain range of the B value in the same row. If so, the value is 1; if not, it is 0.
If the range is 2 the result should look like this:
   A  B  Eq
0  8  3   0
1  3  0   0
2  8  4   0
3  6  5   1
4  8  2   0
...
Essentially:
8 does NOT fall in range of (3-2,3+2)
3 does NOT fall in range of (0-2,0+2)
6 DOES fall in range of (5-2,5+2)
In my first attempt, I wrote a simple function to apply to each row of the df.
def CountingMatches(row, range_limit):
    if row['A'] in range(-range_limit + row['B'], range_limit + row['B']):
        return 1
    else:
        return 0
df['Eq'] = df.apply(CountingMatches, axis=1, range_limit=3)
This worked but took an incredibly long time, so kinda useless.
I then used my function with swifter https://towardsdatascience.com/speed-up-your-pandas-processing-with-swifter-6aa314600a13
df['Eq'] = df.swifter.apply(CountingMatches, axis=1, range_limit=3)
This also took a really long time.
I then checked how long it would take to check if the columns matched, no range.
df['Eq'] = (df['A'].astype(int) == df['B'].astype(int)).astype(int)
This was incredibly fast ~ 1s.
Given this hopeful result, I tried to incorporate the ranges.
range_limit=2
df['Eq'] = (df['A'].astype(int) in range(df['B'].astype(int)-range_limit,df['B'].astype(int) + range_limit)).astype(int)
But I get the following error, rightfully so:
'Series' object cannot be interpreted as an integer
How can I efficiently complete this task on this dataframe?
Answers:
Use Series.between. It should be fast.
df['Eq'] = df['A'].between(df['B'].sub(2), df['B'].add(2)).astype(int)
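To check the behaviour on a small frame, here is a minimal sketch using the five sample rows from the question plus one extra boundary row (A = 7, B = 5) that I added for illustration. It assumes the range should include both endpoints, which is what Series.between does by default.

```python
import pandas as pd

# Five sample rows from the question, plus one hypothetical boundary row
# (A=7, B=5) where A is exactly range_limit away from B.
df = pd.DataFrame({'A': [8, 3, 8, 6, 8, 7],
                   'B': [3, 0, 4, 5, 2, 5]})
range_limit = 2

# between() compares element-wise and includes both endpoints by default.
df['Eq'] = df['A'].between(df['B'] - range_limit,
                           df['B'] + range_limit).astype(int)

print(df['Eq'].tolist())  # [0, 0, 0, 1, 0, 1]
```

The last row gets 1 because 7 sits exactly on the upper endpoint 5+2. If the endpoints should not count, the comparison needs to be strict.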
This shouldn’t take much longer than maybe 2x the time it took when you checked if the columns matched (as int):
range_limit = 2
lower_limits = df['B'] - range_limit
upper_limits = df['B'] + range_limit
df['Eq'] = ((lower_limits < df['A']) & (df['A'] < upper_limits)).astype(int)
# df['Eq'] = (((df['B'] - 2) < df['A']) & (df['A'] < (df['B'] + 2))).astype(int)
[It took about 0.1s on colab.]
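One thing worth noting is that the strict comparisons above exclude the endpoints, whereas Series.between includes them by default, so the two approaches can disagree when A is exactly range_limit away from B. The sketch below illustrates this on the question's sample rows plus one boundary row (A = 7, B = 5) added for illustration. The absolute-difference form is not from the answers; it is just an equivalent way to write the strict check.

```python
import pandas as pd

# Sample rows from the question, plus one hypothetical boundary row (A=7, B=5).
df = pd.DataFrame({'A': [8, 3, 8, 6, 8, 7],
                   'B': [3, 0, 4, 5, 2, 5]})
range_limit = 2

# Strict comparisons: endpoints are excluded.
strict = ((df['B'] - range_limit < df['A'])
          & (df['A'] < df['B'] + range_limit)).astype(int)

# between(): endpoints are included by default.
inclusive = df['A'].between(df['B'] - range_limit,
                            df['B'] + range_limit).astype(int)

# Equivalent to the strict version: |A - B| < range_limit.
abs_diff = ((df['A'] - df['B']).abs() < range_limit).astype(int)

print(strict.tolist())     # [0, 0, 0, 1, 0, 0]
print(inclusive.tolist())  # [0, 0, 0, 1, 0, 1]
print(abs_diff.tolist())   # [0, 0, 0, 1, 0, 0]
```

Which version is right depends on whether a difference of exactly range_limit should count as a match; the question's expected output is consistent with either.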