Remove everything after second caret regex and apply to pandas dataframe column
Question:
I have a dataframe with a column that looks like this:
0 EIAB^EIAB^6
1 8W^W844^A
2 8W^W844^A
3 8W^W858^A
4 8W^W844^A
...
826136 EIAB^EIAB^6
826137 SICU^6124^A
826138 SICU^6124^A
826139 SICU^6128^A
826140 SICU^6128^A
I just want to keep everything before the second caret, e.g. 8W^W844. What regex would I use in Python? Similarly, PACU^SPAC^06 would be PACU^SPAC. I also need to apply it to the whole column.
I tried r'[\^].+$' since I thought it would take the last caret and everything after, but it didn’t work.
Answers:
I don’t think regex is really necessary here; just slice the string up to the position of the second caret:
>>> s = 'PACU^SPAC^06'
>>> s[:s.find("^", s.find("^") + 1)]
'PACU^SPAC'
Explanation: str.find accepts a second argument that says where to start the search; here it is the position just after the first caret.
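To apply the same slicing to a whole column, something like the sketch below might work. The column name `col` and the helper `before_second_caret` are hypothetical names for illustration. One thing to watch: str.find returns -1 when no caret is found, so the bare one-liner would silently drop the last character of a value with fewer than two carets; the helper returns such values unchanged instead.

```python
import pandas as pd

def before_second_caret(s: str) -> str:
    """Return the part of s before its second caret (hypothetical helper).

    Values with fewer than two carets are returned unchanged.
    """
    first = s.find("^")
    if first == -1:
        return s
    second = s.find("^", first + 1)  # start searching just after the first caret
    if second == -1:
        return s
    return s[:second]

# "col" is an assumed column name; the last row has only one caret
df = pd.DataFrame({"col": ["EIAB^EIAB^6", "8W^W844^A", "SICU^6124^A", "PACU^SPAC"]})
df["trimmed"] = df["col"].map(before_second_caret)
print(df["trimmed"].tolist())
```

This assumes every value in the column is a string; missing values would need to be handled separately.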
You can negate the character group to match everything except ^ and put it in a match group. You don’t need to escape the ^ inside the character group, but you do need to escape the one outside it.
re.match(r"([^^]+\^[^^]+)", "8W^W844^A").group(1)
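As a quick check with the standard re module, the sketch below compares three patterns on one sample value: the pattern from the question, a version with the outer caret left unescaped, and the escaped version. The variable names are only for illustration.

```python
import re

s = "8W^W844^A"

# The question's pattern matches from the FIRST caret to the end of the string,
# so removing the match leaves only the text before the first caret.
question_attempt = re.sub(r"[\^].+$", "", s)

# With the outer caret unescaped, ^ is a start-of-string anchor in the middle
# of the pattern, which can never match after [^^]+ has consumed characters.
unescaped = re.match(r"([^^]+^[^^]+)", s)

# With the outer caret escaped, the group captures everything before the second caret.
escaped = re.match(r"([^^]+\^[^^]+)", s).group(1)

print(question_attempt)  # 8W
print(unescaped)         # None
print(escaped)           # 8W^W844
```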
This is quite useful in a pandas dataframe. Assuming you want to do this on a single column, you can extract the string you want with
df['col'].str.extract(r'^([^^]+\^[^^]+)', expand=False)
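A small sketch of how this behaves on sample values, using a standalone Series rather than a real dataframe. Rows that do not contain a caret do not match and come back as NaN; filling them from the original values, as done here, is one possible choice and not something the original answer prescribes.

```python
import pandas as pd

s = pd.Series(["EIAB^EIAB^6", "8W^W844^A", "PACU^SPAC^06", "NOCARET"])

# Outer caret escaped; the leading ^ anchors the match at the start of the string
pattern = r"^([^^]+\^[^^]+)"

# expand=False returns a Series because the pattern has a single capture group
trimmed = s.str.extract(pattern, expand=False)

# Optional: keep the original value where the pattern did not match
result = trimmed.fillna(s)

print(trimmed.tolist())
print(result.tolist())
```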
NOTE
Originally, I used replace, but the extract solution suggested in the comments executed in 1/4 the time.
import pandas as pd
import numpy as np
from timeit import timeit

# One million identical rows to time both approaches
df = pd.DataFrame({"foo": np.arange(1_000_000)})
df["bar"] = "8W^W844^A"

def t1():
    # Replace the whole string with the captured prefix (\1 is the first group)
    df.bar.str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)

def t2():
    # Extract the captured prefix directly
    df.bar.str.extract(r'^([^^]+\^[^^]+)', expand=False)

print("replace", timeit("t1()", globals=globals(), number=20))
print("extract", timeit("t2()", globals=globals(), number=20))
Output (total seconds for 20 runs):
replace 39.73989862400049
extract 9.910304663004354