Regular expression in Pandas: Get substring between a space and a colon

Question:

I have a Pandas dataframe with the column store. It contains a list of stores that look like this:

H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX
H-E-B PLUS 39:2509 N MAIN ST,BELTON,TX

I want the store number, which are 721 and 39 in the above examples.

Here is my process for getting it:

  1. Find the position of the colon.
  2. Slice backwards until reaching a space.

How do I do this in Python/Pandas? I’m guessing that I need to use regex, but I have no idea how to start.

Asked By: Iterator516

||

Answers:

You can apply a function to the store column

def retrieve_store_number(text):
    return text.split(":")[0].split()[-1]

df["store_number"] = df.store.apply(retrieve_store_number)
Answered By: thomask

You can use str.extract with the (d+): regex:

df['number'] = df['store'].str.extract('(d+):', expand=False).astype(int)

Output:

                                             store  number
0  H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX     721
1           H-E-B PLUS 39:2509 N MAIN ST,BELTON,TX      39

regex demo

Answered By: mozway

You can do something along those lines:

def extract_number(string: str) -> int:
    return [int(i.split(":")[0]) for i in string.split(" ") if ":" in i][0]
df = pd.DataFrame([["H-E-B 721:1101 W STAN SCHLUETER LOOP,KILLEEN,TX"],
                   ["H-E-B PLUS 39:2509 N MAIN ST,BELTON,TX"]], columns = ["store"])
df["number"] = df["store"].apply(extract_number)
Answered By: Gleb Ryabov

To use regex in Pandas, you can use the extract function:

df['store_number'] = df['store'].str.extract(r'(d+(?=:))')
Answered By: Freda Xin