Efficiently filtering out the last row where a column has a duplicated value
Question:
I need to filter out the last row where col1 = 3,
but preserve the rest of the dataframe.
I can do that like so, while maintaining the order relative to the index:
import pandas
d = {
'col1': [0, 1, 2, 3, 3, 3, 3, 4, 5, 6],
'col2': [0, 11, 21, 31, 32, 33, 34, 41, 51, 61]
}
df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
pandas.concat([df2,df3]).sort_index()
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
But for a larger dataframe, this operation gets progressively more expensive to perform.
Is there a more efficient way?
UPDATE
Based on the answers provided thus far, here are the results:
import pandas
import random
dupes = 1000
rows = 10000000
d = {'col1': [random.choice(range(dupes)) for i in range(rows)], 'col2': list(range(rows))}
df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
%timeit pandas.concat([df2,df3]).sort_index()
df = pandas.DataFrame(d)
%timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
df = pandas.DataFrame(d)
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
%timeit df.drop(idx)
df = pandas.DataFrame(d)
%timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
df = pandas.DataFrame(d)
%timeit df.drop(df.index[df['col1'].eq(3)][-1])
df = pandas.DataFrame(d)
%timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())
df = pandas.DataFrame(d)
%timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
df = pandas.DataFrame(d)
%timeit df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)
703 ms ± 60.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (concat + sort_index)
497 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (last_valid_index)
413 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (drop precomputed idxmax)
253 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (ne | duplicated)
408 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (index of last eq(3))
404 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (reversed-idxmax drop)
792 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (reversed rank)
491 ms ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (drop index[-1:])
Answers:
Find the index of the last 3 and remove it:
df1 = df.drop(df.index[df['col1'].eq(3)][-1])
If the value 3 is guaranteed to exist,
you can find its index and drop it:
df1 = df.drop((df['col1'].iloc[::-1] == 3).idxmax())
Same idea in numpy (this needs import numpy as np):
df1 = df.drop(np.argwhere(df['col1'].to_numpy() == 3)[-1])
Timings are very similar:
#Last value of 1M is 3
np.random.seed(100)
df = pd.DataFrame({'col1': np.random.randint(100, size=1000000)})
df.loc[len(df), 'col1'] = 3
#print (df)
In [261]: %timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
43.8 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [262]: %timeit df.drop(df[df['col1'].eq(3)].index[-1])
44.4 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [263]: %timeit df.drop(df[df['col1'].eq(3)].index[-1:])
44.8 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())
44.6 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %timeit df.drop(np.argwhere(df['col1'].to_numpy() == 3)[-1])
43.3 ms ± 422 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [266]: %timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
64.3 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [267]: %timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
168 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use:
# get the last index of 3 in col1
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
# if there was no 3 in col1, this would give a false positive
# idxmax would return the last non-3 instead
# ensure that we drop the correct row
if df.loc[idx, 'col1'] == 3:
    df = df.drop(idx)
NB. If you already know that 3
is in col1, then a simple df = df.drop(idx)
is sufficient.
Output:
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
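To illustrate the caveat the guard protects against, here is a minimal sketch (toy data, not from the question): on a column that contains no 3 at all, the reversed equality mask is all False, so idxmax returns the first label of the reversed view, i.e. the last row of the frame.

```python
import pandas as pd

# Toy frame with no 3 anywhere in col1
df = pd.DataFrame({'col1': [0, 1, 2, 4]})

# All-False mask: idxmax falls back to the first label of the
# reversed view, which is the LAST row -- a false positive
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
print(idx)  # 3

# The guard prevents dropping an unrelated row
if df.loc[idx, 'col1'] == 3:
    df = df.drop(idx)
print(len(df))  # 4 -- nothing was dropped
```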
comparison of timings
DataFrame sizes from 8 to 33M rows. Timings of all answers are similar, except that of Chrysophylaxs, which is reproducibly faster (and those of rhug123 and Corralien/LaurentB, which are slower for large and small DataFrames, respectively).
Also possible:
out = df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
out:
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
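To see why this mask works, a small sketch on toy data: duplicated(keep="last") is True for every occurrence of a value except its final one, and ne(3) is already True for every non-3 row, so the OR keeps everything except the last 3.

```python
import pandas as pd

s = pd.Series([0, 3, 3, 5, 3])

# ne(3):                    [True, False, False, True, False]
# duplicated(keep="last"):  [False, True, True, False, False]
mask = s.ne(3) | s.duplicated(keep="last")
print(mask.tolist())  # [True, True, True, True, False]
```

Only the final 3 (the last element here) gets a False, so it alone is filtered out.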
You can use:
>>> df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
Here is another solution:
df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
or
df.loc[~(df.groupby(df['col1'].eq(3)).cumcount(ascending = False).eq(0) & df['col1'].eq(3))]
or, if the last 3 of every streak of consecutive 3s should be dropped (not just the overall last one):
df.loc[~(df['col1'].diff(-1).ne(0) & df['col1'].eq(3))]
Output:
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
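A quick sketch (toy data) of how the third variant differs: with two separate streaks of 3s, it removes the final 3 of each streak rather than only the last 3 overall.

```python
import pandas as pd

df = pd.DataFrame({'col1': [3, 3, 1, 3, 3, 2]})

# diff(-1).ne(0) flags rows whose value differs from the NEXT row,
# i.e. the last element of every run; AND eq(3) restricts this to 3s
out = df.loc[~(df['col1'].diff(-1).ne(0) & df['col1'].eq(3))]
print(out['col1'].tolist())  # [3, 1, 3, 2]
```

The last 3 of each of the two streaks is gone, while the first 3 of each streak survives.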
df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
Here is one possible solution that uses Polars (https://www.pola.rs/) instead of Pandas.
It seems to run about 20 times faster than the Pandas approaches (roughly 15 ms vs 300+ ms).
import polars as pl
df = pl.DataFrame(d)
id_ = df.select(pl.col('col1').eq(3).arg_true().max()).item()
# drop row id_: keep the rows before it and the rows after it
df = df.slice(0, id_).vstack(df.slice(id_ + 1))
# reconvert to Pandas if needed
df_pandas = df.to_pandas()
Time comparison
import pandas
import random
dupes = 1000
rows = 10000000
d = {'col1': [random.choice(range(dupes)) for i in range(rows)], 'col2': list(range(rows))}
df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
%timeit pandas.concat([df2,df3]).sort_index()
df = pandas.DataFrame(d)
%timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
df = pandas.DataFrame(d)
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
%timeit df.drop(idx)
df = pandas.DataFrame(d)
%timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
df = pandas.DataFrame(d)
%timeit df.drop(df.index[df['col1'].eq(3)][-1])
df = pandas.DataFrame(d)
%timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())
df = pandas.DataFrame(d)
%timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
df = pandas.DataFrame(d)
%timeit df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)
%%timeit
id_ = df.select(pl.col('col1').eq(3).arg_true().max()).item()
df.slice(0, id_).vstack(df.slice(id_ + 1))
878 ms ± 88.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (concat + sort_index)
868 ms ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (last_valid_index)
574 ms ± 53.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (drop precomputed idxmax)
373 ms ± 45.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (ne | duplicated)
627 ms ± 65.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (index of last eq(3))
597 ms ± 62.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (reversed-idxmax drop)
1.07 s ± 60.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (reversed rank)
720 ms ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (drop index[-1:])
14.9 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) (Polars)
If we include converting back to Pandas, the time goes up to about 73 ms:
%%timeit
id_ = df.select(pl.col('col1').eq(3).arg_true().max()).item()
(df.slice(0, id_).vstack(df.slice(id_ + 1))).to_pandas(use_pyarrow_extension_array=True)
73 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)