Pandas replace regex: why this negation does not work

Question:

I have the following dataframe:

>>> df = pd.DataFrame(['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA'], columns=['P'])
>>> df
                 P
0  0123_GRP_LE_BNS
1      ABC_GRP_BNS
2          DEF_GRP
3     456A_GRP_SSA

and want to remove whatever follows GRP unless it is '_LE'; in that case, I want to remove whatever follows GRP_LE.

The desired output is:

0     0123_GRP_LE
1         ABC_GRP
2         DEF_GRP
3        456A_GRP

I used the following patterns, but the output was not what I expected:

>>> df['P'].replace({r'(.*_GRP)[^_LE].*':r'\1', r'(.*GRP_LE)_.*':r'\1'}, regex=True)
0     0123_GRP_LE
1     ABC_GRP_BNS
2         DEF_GRP
3    456A_GRP_SSA
Name: P, dtype: object

Why does the negation in r'(.*_GRP)[^_LE].*' not work?

Asked By: techie11


Answers:

Why not make _LE optional?

df['P'].str.replace(r'(GRP(?:_LE)?).*', r'\1', regex=True)

Output:

0    0123_GRP_LE
1        ABC_GRP
2        DEF_GRP
3       456A_GRP
Name: P, dtype: object
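
On why the original negation fails: [^_LE] is a negated character class, so it consumes exactly one character that is not '_', 'L', or 'E'. It does not mean "not followed by the string _LE". In the sample data the character right after GRP is always '_' (or the string ends), so that pattern can never match. A negative lookahead, (?!_LE), is the construct that expresses "not followed by _LE". Here is a minimal sketch using Python's built-in re module with the same regex syntax; the names values, char_class, and fix_with_lookahead are illustrative and do not come from the original post.

```python
import re

values = ['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA']

# Negated character class: needs ONE character after _GRP that is not
# '_', 'L', or 'E'. Every sample has '_' (or nothing) there, so no match.
char_class = re.compile(r'(.*_GRP)[^_LE].*')
print([char_class.sub(r'\1', v) for v in values])
# ['0123_GRP_LE_BNS', 'ABC_GRP_BNS', 'DEF_GRP', '456A_GRP_SSA']


def fix_with_lookahead(value):
    """Apply the question's two-step idea with a real negation."""
    # (?!_LE) asserts that _LE does not follow _GRP, without consuming text.
    value = re.sub(r'(.*_GRP)(?!_LE).*', r'\1', value)
    # Second pattern from the question: trim whatever follows GRP_LE.
    return re.sub(r'(.*GRP_LE)_.*', r'\1', value)


print([fix_with_lookahead(v) for v in values])
# ['0123_GRP_LE', 'ABC_GRP', 'DEF_GRP', '456A_GRP']
```

The optional group (?:_LE)? shown above gets the same result with a single pattern, which is likely the simpler choice for this data.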
Answered By: mozway

I find Python's string operations easier to work with and less error-prone than regex. I think this does what you're looking for:

def strip_code(code_str):
    if "GRP_LE" in code_str:
        return "".join(code_str.partition("GRP_LE")[0:2])
    elif "GRP" in code_str:
        return "".join(code_str.partition("GRP")[0:2])
    return code_str


df.P.apply(strip_code)

output:

0    0123_GRP_LE
1        ABC_GRP
2        DEF_GRP
3       456A_GRP
Name: P, dtype: object
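
For readers unfamiliar with str.partition, this small sketch shows what the function above relies on: partition splits at the first occurrence of the separator and returns a 3-tuple of (before, separator, after), so joining the first two parts drops everything after the marker. The helper name keep_through is hypothetical and only used for illustration.

```python
def keep_through(text, marker):
    """Return text up to and including the first occurrence of marker."""
    head, sep, _tail = text.partition(marker)
    # If marker is absent, partition returns (text, '', ''), so the
    # input comes back unchanged.
    return head + sep


print("ABC_GRP_BNS".partition("GRP"))             # ('ABC_', 'GRP', '_BNS')
print(keep_through("0123_GRP_LE_BNS", "GRP_LE"))  # 0123_GRP_LE
print(keep_through("NO_MARKER", "GRP"))           # NO_MARKER
```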
Answered By: anon01