Regex, that matches variable grams sizes

Question:

I have the following sample series

 s = pd.Series({0: 'Açúcar Refinado UNIÃO Pacote 1kg',
 1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
 2: 'Açúcar Refinado DA BARRA Pacote 1kg',
 3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
 4: 'Açúcar Refinado GUARANI Pacote 1Kg',
 5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
 6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
 7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
 8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
 9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
 10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
 11: 'Acucar Refinado Uniao 1kg'})

What I want to do is to capture the string part that represents the weights of the given products. In specific, the "1kg" string or the "500g" string.
I need to capture one or another, so I can easily interact through the pandas.Series object.

What I tried

s.str.extract(r"(.kg)|(.g)",flags = re.IGNORECASE)

Since the number of number before the string can vary I would like a different approach.

Asked By: INGl0R1AM0R1

||

Answers:

Use the following regex matching (assuming that the numeric part can be also a float number):

s.str.extract(r"(d+.?d*?k?g)",flags=re.IGNORECASE)

       0
0    1kg
1    1Kg
2    1kg
3    1kg
4    1Kg
5    1kg
6   500g
7    1kg
8    1kg
9   250g
10   1kg
11   1kg
Answered By: RomanPerekhrest

With this extended data:

>>> s = pd.Series({
...     0: 'Açúcar Refinado UNIÃO Pacote 1kg',
...     1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
...     2: 'Açúcar Refinado DA BARRA Pacote 1kg',
...     3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
...     4: 'Açúcar Refinado GUARANI Pacote 1Kg',
...     5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
...     6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
...     7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
...     8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
...     9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
...     10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
...     11: 'Acucar Refinado Uniao 1kg',
...     12: 'something something 1.25kg',
...     13: 'something something 1,25kg'})

Parsing out the numbers and the units:

>>> s.str.extract(r'(d+(?:[.,]d*)?)( ?k?g)', flags=re.IGNORECASE) 
...     .assign(k=lambda d: d[0].str
...             .replace('(?<=d),(?=d)', '.', regex=True)
...             .pipe(pd.to_numeric))
       0   1       k
0      1  kg    1.00
1      1  Kg    1.00
2      1  kg    1.00
3      1  kg    1.00
4      1  Kg    1.00
5      1  kg    1.00
6    500   g  500.00
7      1  kg    1.00
8      1  kg    1.00
9    250   g  250.00
10     1  kg    1.00
11     1  kg    1.00
12  1.25  kg    1.25
13  1,25  kg    1.25

I also allow for an optional space between the decimal and the units. Extended also to deal with non-integer numbers, accounting also for different decimal markers: eg in continental Europe, decimals are marked like 1,25 rather than 1.25 as in the Anglosphere.

I use a non-capturing group for the decimal portion; Roman’s version also works. For parsing the number, I would normalise the decimal format if mixed. If otherwise, you can re-parse by import io; pd.read_csv(io.StringIO(your_df.to_csv()), decimal=',').

You will get more capture groups on the row if you have a string like 250g ... 1kg. You may want to filter or otherwise clean that before throwing it into this function. Also consider appending a b to ensure that you don’t match something like 50grandmas.

Thanks also for providing the data frame constructor ab initio.

Answered By: ifly6
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.