regex extract digit + word

Question:

I want to extract the exact value+unit from the 8 rows below. But my code is not working properly. For example, my code extracts in row 4 "1000 2 ounce" instead of only "2 ounce". Same issue with row 5 and 6.

  1. includes chalkboard labels, marker & measuring cup capacity s | 135.2 ounce capacity
  2. 33% thicker – 12 glass spice – 3 cubic inch capacity
  3. nolopau 1.53 gallon capacity lon glass jar,
  4. [1000 2 ounce capacity compostable condiment souffluide bagasse
  5. bia cordon bleu inc 900712 8.5 ounce capacity porcelain
  6. bekith 15 9 liter capacity glass jars,
  7. echtpower bento box, 1.5 milliliter lunch box with handle
  8. aislor plastic straining 0.50 gallon capacity lon pitcher, dishwasher

Outcome for each row:
135.2 ounce
3 cubic inch
1.53 gallon
2 ounce
8.5 ounce
9 liter
1.5 milliliter
0.50 gallon

I tried:

def extract_c(df):
    pattern = r'(\d+.\d+\s*ounce|\d+.\d+\s*cubic\s*inch|\d+.\d+\s*gallon|\d+.\d+\s*liter|\d+.\d+\s*milliliter)'
    return df.str.extract(pattern)

Can anyone support me in extracting these 8 rows in 1 line?

Asked By: C C

||

Answers:

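You need to escape . (write \.) to match a literal dot. An unescaped . matches any character, including a space, which is why row 4 yields "1000 2 ounce": the . consumes the space between "1000" and "2". The decimal part also needs to be optional, (?:\.\d+)?, so that whole numbers such as "3 cubic inch" still match. The following regex should work: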

(\d+(?:\.\d+)?\s*(?:ounce|cubic\s*inch|gallon|liter|milliliter))

s.str.extract(r'(\d+(?:\.\d+)?\s*(?:ounce|cubic\s*inch|gallon|liter|milliliter))')
                0
0     135.2 ounce
1    3 cubic inch
2     1.53 gallon
3         2 ounce
4       8.5 ounce
5         9 liter
6  1.5 milliliter
7     0.50 gallon
Answered By: Psidom
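
To see the difference between the two patterns without pandas, here is a minimal sketch using only the standard re module. The names rows, UNITS, BUGGY, FIXED and first_match are made up for this illustration, and BUGGY is a condensed version of the question's pattern (assuming it had an unescaped dot and a mandatory second run of digits), not necessarily the exact original.

```python
import re

# The 8 sample rows from the question.
rows = [
    "includes chalkboard labels, marker & measuring cup capacity s | 135.2 ounce capacity",
    "33% thicker – 12 glass spice – 3 cubic inch capacity",
    "nolopau 1.53 gallon capacity lon glass jar,",
    "[1000 2 ounce capacity compostable condiment souffluide bagasse",
    "bia cordon bleu inc 900712 8.5 ounce capacity porcelain",
    "bekith 15 9 liter capacity glass jars,",
    "echtpower bento box, 1.5 milliliter lunch box with handle",
    "aislor plastic straining 0.50 gallon capacity lon pitcher, dishwasher",
]

UNITS = r"(?:ounce|cubic\s*inch|gallon|liter|milliliter)"

# Unescaped dot: "." matches ANY character (e.g. a space), and the
# second \d+ is mandatory, so plain integers like "3" cannot match.
BUGGY = re.compile(r"\d+.\d+\s*" + UNITS)

# Escaped dot inside an optional group: matches "8.5" as well as "9".
FIXED = re.compile(r"\d+(?:\.\d+)?\s*" + UNITS)


def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


buggy = [first_match(BUGGY, row) for row in rows]
fixed = [first_match(FIXED, row) for row in rows]

for b, f in zip(buggy, fixed):
    print(f"{b!r:20} -> {f!r}")
```

The same FIXED pattern, wrapped in one capture group as in the answer above, is what gets passed to Series.str.extract, which needs at least one capture group to know what to return.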