Use regex to extract number before a list of words in pandas dataframe

Question:

I want to extract only the numbers before a list of specific words. Then put the extracted numbers in a new column.

The list of words is: l = ["car", "truck", "van"]. I only put singular form here, but it should also apply to plural.

import pandas as pd
df = pd.DataFrame(columns=["description"], data=[["have 3 cars"], ["a 1-car situation"], ["may be 2 trucks"]])

The new column for the extracted numbers can be called df["extracted_num"].

Thank you!

Asked By: DanZimmerman


Answers:

You can use Series.str.extract

l = ["car", "truck", "van"]

pat = rf"(\d+)[\s-](?:{'|'.join(l)})"
df['extracted_num'] = df['description'].str.extract(pat)

Output:

>>> print(pat)
(\d+)[\s-](?:car|truck|van)

>>> df

         description extracted_num
0        have 3 cars             3
1  a 1-car situation             1
2    may be 2 trucks             2

Explanation:

  • (\d+) – Matches one or more digits and captures the group;
  • [\s-] – Matches a single whitespace character or hyphen;
  • (?:{'|'.join(l)}) – Matches any word from the list l without capturing it.
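For reference, here is a self-contained sketch of the same approach that can be run as-is. It makes a few assumptions that are not part of the answer above: the word list is renamed to words, each word is passed through re.escape in case a word ever contains regex metacharacters, expand=False is used so that str.extract returns a Series, and an extra non-matching row is added to show what happens when no number is found.

```python
import re

import pandas as pd

# Sample data from the question, plus one hypothetical row with no match.
df = pd.DataFrame(
    columns=["description"],
    data=[
        ["have 3 cars"],
        ["a 1-car situation"],
        ["may be 2 trucks"],
        ["no vehicles here"],
    ],
)

words = ["car", "truck", "van"]

# Raw f-string so that \d and \s reach the regex engine unchanged.
# re.escape is a precaution; it leaves plain words such as "car" as they are.
pat = rf"(\d+)[\s-](?:{'|'.join(map(re.escape, words))})"

# With a single capture group, expand=False returns a Series instead of a DataFrame.
df["extracted_num"] = df["description"].str.extract(pat, expand=False)

print(pat)
print(df)
```

Because the pattern does not anchor the end of the word, plural forms such as "cars" and "trucks" match as well. The extracted values are strings, and rows without a match contain NaN.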
Answered By: Rodalm