How to drop rows in pandas if column 1 equals a certain value and column 2 is NaN?

Question:

I’m trying to do the following: "# drop all rows where tag == train_loop and start is NaN".

Here’s my current attempt (thanks Copilot):

# drop all rows where tag == train_loop and start is NaN
# apply filter function to each row
# return True if row should be dropped
def filter_fn(row):
    return row["tag"] == "train_loop" and pd.isna(row["start"])

old_len = len(df)
df = df[~df.apply(filter_fn, axis=1)]

It works well, but I’m wondering if there is a less verbose way.

Asked By: Foobar

||

Answers:

Using apply is a slow way to do this, since it loops over every row and calls your Python function once per row. Instead, use vectorized operations, which act on whole columns at once and run optimized code written in C under the hood:

df = df[~((df["tag"] == "train_loop") & df["start"].isnull())]
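To make the one-liner above concrete, here is a minimal sketch on a made-up four-row frame. The sample data and the names drop_mask and filtered are illustrative assumptions, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: only row 1 has tag == "train_loop" AND a NaN start.
df = pd.DataFrame({
    "tag": ["train_loop", "train_loop", "eval", "eval"],
    "start": [1.0, np.nan, np.nan, 4.0],
})

# Build the boolean mask column-wise; & is the element-wise "and".
# The parentheses around the comparison are required because & binds
# more tightly than ==.
drop_mask = (df["tag"] == "train_loop") & df["start"].isna()

# ~ inverts the mask, so we keep every row that should NOT be dropped.
filtered = df[~drop_mask]
print(filtered)
```

Note that the "eval" row with a NaN start is kept, because both conditions have to hold for a row to be dropped.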

If your data is large (more than roughly 100k rows), the pandas query method can be faster still, and it lets you write both conditions in one expression:

df = df.query(
    '~((tag == "train_loop") and (start != start))'
)

This makes use of the fact that NaN never equals anything, including itself, so start != start is true exactly where start is NaN, and we can use simple comparison operators to find NaNs (.isnull() isn’t available in the compiled query mini-language). For the query method to be faster, you need to have numexpr installed, which will compile your queries on the fly before they’re run on the data.
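As a rough check of the self-inequality trick, the sketch below compares it with isna() on a small made-up frame and then runs the same query string. It assumes start is a float column; in an object column, missing values stored as None would not be caught by start != start.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "start" is a float column containing NaN.
df = pd.DataFrame({
    "tag": ["train_loop", "train_loop", "eval", "eval"],
    "start": [1.0, np.nan, np.nan, 4.0],
})

# NaN is the only float value that is not equal to itself,
# so this mask marks exactly the NaN entries.
nan_mask = df["start"] != df["start"]
same_as_isna = nan_mask.equals(df["start"].isna())

# The same idea inside query(); it also runs without numexpr installed,
# just without the speed benefit.
via_query = df.query('~((tag == "train_loop") and (start != start))')
print(via_query)
```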

See the docs on enhancing performance for more info and examples.

Answered By: Michael Delgado

You can do:

df = df.loc[~(df['tag'].eq('train_loop') & df['start'].isna())]
Answered By: BENY