How to invert a regular expression in pandas filter function

Question:

I have the following pandas dataframe df (which is actually just the last lines of a much larger one):

                           count
gene                            
WBGene00236788                56
WBGene00236807                 3
WBGene00249816                12
WBGene00249825                20
WBGene00255543                 6
__no_feature            11697881
__ambiguous                 1353
__too_low_aQual                0
__not_aligned                  0
__alignment_not_unique         0

I can use filter's regex option to get only the lines starting with two underscores:

df.filter(regex="^__", axis=0)

This returns the following:

                           count
gene                            
__no_feature            11697881
__ambiguous                 1353
__too_low_aQual                0
__not_aligned                  0
__alignment_not_unique         0

Actually, what I want is to have the complement: Only those lines that do not start with two underscores.

I can do it with another regular expression: df.filter(regex="^[^_][^_]", axis=0).

Is there a way to more simply specify that I want the inverse of the initial regular expression?

Is such regexp-based filtering efficient?

Edit: Testing some proposed solutions

df.filter(regex="(?!^__)", axis=0) and df.filter(regex="^\w+", axis=0) both return all lines.

According to the re module documentation, the \w special character actually includes the underscore, which explains the behaviour of the second expression.

I guess that the first one doesn’t work because (?!...) applies to what follows the current match position, and the search can succeed at a position other than the start of the label. Here, “^” should be put outside, as in the following proposed solution:

df.filter(regex="^(?!__).*?$", axis=0) works.

So does df.filter(regex="^(?!__)", axis=0).

Asked By: bli


Answers:

You have two possibilities here:

(?!^__) # a negative lookahead
        # making sure that there are no underscores right at the beginning of the line

Or:

^\w+  # match word characters (a-z, A-Z, 0-9 and the underscore) at least once
Answered By: Jan

Matching all lines with no two leading underscores:

^(?!__)

^ matches the beginning of the line
(?!__) makes sure the line (what follows the preceding ^ match) does not begin with two underscores

Edit:
Dropped the .*?$ since it is not needed to filter the lines.

Answered By: Robin Koch
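
To make the difference between the two lookahead placements concrete, here is a minimal sketch on a shortened copy of the question's dataframe. It assumes that filter tests each label with search-style matching (a match anywhere in the label counts), which is consistent with the behaviour reported in the question's edit; the variable names kept and unfiltered are only illustrative.

```python
import pandas as pd

# Small stand-in for the dataframe in the question (values copied from it).
df = pd.DataFrame(
    {"count": [56, 3, 11697881, 1353]},
    index=pd.Index(
        ["WBGene00236788", "WBGene00236807", "__no_feature", "__ambiguous"],
        name="gene",
    ),
)

# Anchor first, then look ahead: matches only labels not starting with "__".
kept = df.filter(regex=r"^(?!__)", axis=0)

# Lookahead containing the anchor: at any position after the first character,
# "^__" cannot match, so the negative lookahead succeeds somewhere in every label.
unfiltered = df.filter(regex=r"(?!^__)", axis=0)

print(kept.index.tolist())
print(len(unfiltered))
```

With this data, kept should contain only the two WBGene rows, while unfiltered still has all four.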

I had the same problem, but I wanted to filter the columns. Thus I am using axis=1, but the concept should be similar.

df.drop(df.filter(regex='my_expression').columns,axis=1)
Answered By: harsshal
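
The same drop-what-matched idea can be sketched for rows, which is the case in the question. The example below is an illustration on a shortened copy of the data, not the answerer's own code. It also shows a different technique that avoids filter altogether: a negated boolean mask built with the index's str.startswith accessor. The names unwanted, via_drop and via_mask are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame(
    {"count": [20, 6, 11697881, 0]},
    index=pd.Index(
        ["WBGene00249825", "WBGene00255543", "__no_feature", "__not_aligned"],
        name="gene",
    ),
)

# Select the unwanted rows with the original regex, then drop them by label.
unwanted = df.filter(regex=r"^__", axis=0).index
via_drop = df.drop(unwanted)

# Alternative without filter: negate a boolean mask built from the index.
via_mask = df.loc[~df.index.str.startswith("__")]

print(via_drop.index.tolist())
print(via_mask.index.tolist())
```

Both approaches keep the original regex (or prefix) untouched and express the inversion separately, which may be easier to read than a lookahead. Note that drop works by label, so it assumes the unwanted labels do not also appear on rows that should be kept.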

Inverted selection of multiple column groups (A, B, C), i.e. keep every column that matches none of them:

df.filter(regex=r'^(?!(A_|B_.*_|C_.*_.*_))',axis='columns')
Answered By: BSalita
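
As a quick check of how that alternation behaves, here is a small sketch with invented column names (A_x, B_1_y, C_1_2_z, D_x and B_only are assumptions chosen for illustration, not taken from the answer).

```python
import pandas as pd

# One-row frame; only the column labels matter for this example.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=["A_x", "B_1_y", "C_1_2_z", "D_x", "B_only"],
)

# Keep columns whose name does NOT start with any of the three patterns.
kept = df.filter(regex=r"^(?!(A_|B_.*_|C_.*_.*_))", axis="columns")

print(kept.columns.tolist())
```

Here D_x is kept because it matches none of the alternatives, and B_only is kept as well because the B_.*_ alternative requires a second underscore after the prefix.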