Efficiently filtering out the last row where a column has a duplicated value
Question:
I need to filter out the last row where col1 = 3,
but preserve the rest of the dataframe.
I can do that like so, while maintaining the order relative to the index:
import pandas
d = {
'col1': [0, 1, 2, 3, 3, 3, 3, 4, 5, 6],
'col2': [0, 11, 21, 31, 32, 33, 34, 41, 51, 61]
}
df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
pandas.concat([df2,df3]).sort_index()
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
But for a larger dataframe, this operation gets progressively more expensive to perform.
Is there a more efficient way?
UPDATE
Based on the answers provided thus far, here are the results:
import pandas
import random
dupes = 1000
rows = 10000000
d = {'col1': [random.choice(range(dupes)) for i in range(rows)], 'col2': list(range(rows))}
df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
%timeit pandas.concat([df2,df3]).sort_index()
df = pandas.DataFrame(d)
%timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
df = pandas.DataFrame(d)
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
%timeit df.drop(idx)
df = pandas.DataFrame(d)
%timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
df = pandas.DataFrame(d)
%timeit df.drop(df.index[df['col1'].eq(3)][-1])
df = pandas.DataFrame(d)
%timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())
df = pandas.DataFrame(d)
%timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
df = pandas.DataFrame(d)
%timeit df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)
703 ms ± 60.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (concat + sort_index)
497 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (last_valid_index)
413 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (drop precomputed idxmax)
253 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (ne | duplicated)
408 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (index of last eq(3))
404 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (reversed-idxmax drop)
792 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (reversed rank)
491 ms ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (drop index[-1:])
Answers:
Find the index of the last 3 and remove it:
df1 = df.drop(df.index[df['col1'].eq(3)][-1])
If the value 3 is guaranteed to exist,
you can find its index and drop it:
df1 = df.drop((df['col1'].iloc[::-1] == 3).idxmax())
Same idea in numpy (this needs import numpy as np):
df1 = df.drop(np.argwhere(df['col1'].to_numpy() == 3)[-1])
Timings are very similar:
#Last value of 1M is 3
np.random.seed(100)
df = pd.DataFrame({'col1': np.random.randint(100, size=1000000)})
df.loc[len(df), 'col1'] = 3
#print (df)
In [261]: %timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
43.8 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [262]: %timeit df.drop(df[df['col1'].eq(3)].index[-1])
44.4 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [263]: %timeit df.drop(df[df['col1'].eq(3)].index[-1:])
44.8 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())
44.6 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %timeit df.drop(np.argwhere(df['col1'].to_numpy() == 3)[-1])
43.3 ms ± 422 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [266]: %timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
64.3 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [267]: %timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
168 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use:
# get the last index of 3 in col1
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
# if there was no 3 in col1, this would give a false positive
# idxmax would return the last non-3 instead
# ensure that we drop the correct row
if df.loc[idx, 'col1'] == 3:
    df = df.drop(idx)
NB. If you already know that 3
is in col1, then a simple df = df.drop(idx)
is sufficient.
Output:
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
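To illustrate the caveat the guard protects against, here is a minimal sketch (toy data, not from the question): on a column that contains no 3 at all, the reversed equality mask is all False, so idxmax returns the first label of the reversed view, i.e. the last row of the frame.

```python
import pandas as pd

# Toy frame with no 3 anywhere in col1
df = pd.DataFrame({'col1': [0, 1, 2, 4]})

# All-False mask: idxmax falls back to the first label of the
# reversed view, which is the LAST row -- a false positive
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
print(idx)  # 3

# The guard prevents dropping an unrelated row
if df.loc[idx, 'col1'] == 3:
    df = df.drop(idx)
print(len(df))  # 4 -- nothing was dropped
```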
comparison of timings
DataFrame sizes from 8 to 33M rows. Timings of all answers are similar, except that of Chrysophylaxs, which is reproducibly faster (and those of rhug123 and Corralien/LaurentB, which are slower for large and small DataFrames, respectively).
Also possible:
out = df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
out:
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
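To see why this mask works, a small sketch on toy data: duplicated(keep="last") is True for every occurrence of a value except its final one, and ne(3) is already True for every non-3 row, so the OR keeps everything except the last 3.

```python
import pandas as pd

s = pd.Series([0, 3, 3, 5, 3])

# ne(3):                    [True, False, False, True, False]
# duplicated(keep="last"):  [False, True, True, False, False]
mask = s.ne(3) | s.duplicated(keep="last")
print(mask.tolist())  # [True, True, True, True, False]
```

Only the final 3 (the last element here) gets a False, so it alone is filtered out.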
You can use:
>>> df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
Here is another solution:
df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
or
df.loc[~(df.groupby(df['col1'].eq(3)).cumcount(ascending = False).eq(0) & df['col1'].eq(3))]
or, if the last 3 of every streak of consecutive 3s should be dropped (not just the overall last one):
df.loc[~(df['col1'].diff(-1).ne(0) & df['col1'].eq(3))]
Output:
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
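A quick sketch (toy data) of how the third variant differs: with two separate streaks of 3s, it removes the final 3 of each streak rather than only the last 3 overall.

```python
import pandas as pd

df = pd.DataFrame({'col1': [3, 3, 1, 3, 3, 2]})

# diff(-1).ne(0) flags rows whose value differs from the NEXT row,
# i.e. the last element of every run; AND eq(3) restricts this to 3s
out = df.loc[~(df['col1'].diff(-1).ne(0) & df['col1'].eq(3))]
print(out['col1'].tolist())  # [3, 1, 3, 2]
```

The last 3 of each of the two streaks is gone, while the first 3 of each streak survives.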
df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)
col1 col2
0 0 0
1 1 11
2 2 21
3 3 31
4 3 32
5 3 33
7 4 41
8 5 51
9 6 61
Here is one possible solution that uses Polars (https://www.pola.rs/) instead of Pandas.
It seems to run about 20 times faster than the Pandas approaches (roughly 15 ms vs 300+ ms).
import polars as pl
df = pl.DataFrame(d)
id_ = df.select(pl.col('col1').eq(3).arg_true().max()).item()
# drop row id_: keep the rows before it and the rows after it
df = df.slice(0, id_).vstack(df.slice(id_ + 1))
# reconvert to Pandas if needed
df_pandas = df.to_pandas()
Time comparison
import pandas
import random
dupes = 1000
rows = 10000000
d = {'col1': [random.choice(range(dupes)) for i in range(rows)], 'col2': list(range(rows))}
df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
%timeit pandas.concat([df2,df3]).sort_index()
df = pandas.DataFrame(d)
%timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())
df = pandas.DataFrame(d)
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
%timeit df.drop(idx)
df = pandas.DataFrame(d)
%timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]
df = pandas.DataFrame(d)
%timeit df.drop(df.index[df['col1'].eq(3)][-1])
df = pandas.DataFrame(d)
%timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())
df = pandas.DataFrame(d)
%timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]
df = pandas.DataFrame(d)
%timeit df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)
%%timeit
id_ = df.select(pl.col('col1').eq(3).arg_true().max()).item()
df.slice(0, id_).vstack(df.slice(id_ + 1))
878 ms ± 88.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (concat + sort_index)
868 ms ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (last_valid_index)
574 ms ± 53.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (drop precomputed idxmax)
373 ms ± 45.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (ne | duplicated)
627 ms ± 65.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (index of last eq(3))
597 ms ± 62.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (reversed-idxmax drop)
1.07 s ± 60.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (reversed rank)
720 ms ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) (drop index[-1:])
14.9 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) (Polars)
If we include converting back to Pandas, the time goes up to about 73 ms:
%%timeit
id_ = df.select(pl.col('col1').eq(3).arg_true().max()).item()
(df.slice(0, id_).vstack(df.slice(id_ + 1))).to_pandas(use_pyarrow_extension_array=True)
73 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)