Using regex matched groups in pandas dataframe replace function
Question:
I’m just learning Python/pandas and like how powerful and concise it is.
During data cleaning I want to use replace with a regex on a column in a dataframe, but I also want to reinsert parts of the match (groups).
Simple Example:
lastname, firstname -> firstname lastname
I tried something like the following (actual case is more complex so excuse the simple regex):
df['Col1'].replace({'([A-Za-z])+, ([A-Za-z]+)' : '\2 \1'}, inplace=True, regex=True)
However, this results in empty values. The match part works as expected, but the value part doesn’t.
I guess this could be achieved by some splitting and merging, but I am looking for a general answer as to whether the regex group can be used in replace.
Answers:
setup
import pandas as pd
df = pd.DataFrame(dict(name=['Smith, Sean']))
print(df)
          name
0  Smith, Sean
using replace
df.name.str.replace(r'(\w+),\s*(\w+)', r'\2 \1', regex=True)
0    Sean Smith
Name: name, dtype: object
using extract
split to two columns
df.name.str.extract(r'(?P<Last>\w+),\s*(?P<First>\w+)', expand=True)
    Last First
0  Smith  Sean
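The question mentions "splitting and merging" as a fallback. A minimal sketch of that route, building on the extract call above, might look like the following. The variable names parts and full are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Smith, Sean']})

# Split the name into two named columns, as in the extract example above.
parts = df['name'].str.extract(r'(?P<Last>\w+),\s*(?P<First>\w+)', expand=True)

# Merge the pieces back together in the desired order.
full = parts['First'] + ' ' + parts['Last']

print(full.tolist())  # ['Sean Smith']
```

This takes two steps where a replace with group references takes one, but it leaves the individual parts available as columns if they are needed later.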
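I think you have a few issues with the regexes.
As @Abdou just said, use either '\\2 \\1' or, better, the raw string r'\2 \1', because in a regular string literal '\1' is the character with ASCII code 1, not a reference to group 1.
Note also that ([A-Za-z])+ captures only the last letter it matches; the + belongs inside the group, as in ([A-Za-z]+).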
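To see the difference between the two literals outside pandas, here is a small sketch using only Python's standard re module. The sample name is made up for illustration.

```python
import re

# In a regular string literal, '\2' and '\1' are octal escapes:
# they become the control characters chr(2) and chr(1).
plain = '\2 \1'
raw = r'\2 \1'

print(plain == chr(2) + ' ' + chr(1))  # True
print(len(plain), len(raw))            # 3 5

pattern = r'(\w+),\s*(\w+)'
name = 'Smith, Sean'  # illustrative sample value

swapped = re.sub(pattern, raw, name)   # group references are substituted
broken = re.sub(pattern, plain, name)  # control characters are inserted

print(swapped)       # Sean Smith
print(repr(broken))  # '\x02 \x01'
```

The control characters are invisible when printed, which is probably why the result in the question looked like empty values.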
Your solution should work if you use correct regexes:
In [193]: df
Out[193]:
              name
0        John, Doe
1  Max, Mustermann
In [194]: df.name.replace({r'(\w+),\s+(\w+)' : r'\2 \1'}, regex=True)
Out[194]:
0          Doe John
1    Mustermann Max
Name: name, dtype: object
In [195]: df.name.replace({r'(\w+),\s+(\w+)' : r'\2 \1', 'Max':'Fritz'}, regex=True)
Out[195]:
0            Doe John
1    Mustermann Fritz
Name: name, dtype: object
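The two answers use different methods, Series.str.replace and Series.replace. As a rough side-by-side sketch, both can be run on the same sample data with the same raw-string pattern. The names pattern, repl, via_replace and via_str_replace are illustrative only.

```python
import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({'name': ['John, Doe', 'Max, Mustermann']})

pattern = r'(\w+),\s+(\w+)'
repl = r'\2 \1'  # raw string, so \2 and \1 stay group references

# Series.replace takes a {pattern: replacement} dict together with regex=True.
via_replace = df['name'].replace({pattern: repl}, regex=True)

# Series.str.replace takes pattern and replacement as separate arguments;
# regex=True is passed explicitly because newer pandas versions treat the
# pattern as a literal string by default.
via_str_replace = df['name'].str.replace(pattern, repl, regex=True)

print(via_replace.tolist())
print(via_str_replace.tolist())
```

Both calls return a new Series, so the result has to be assigned back to the column if the dataframe itself should change.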