Regex, that matches variable grams sizes
Question:
I have the following sample series
s = pd.Series({0: 'Açúcar Refinado UNIÃO Pacote 1kg',
1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
2: 'Açúcar Refinado DA BARRA Pacote 1kg',
3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
4: 'Açúcar Refinado GUARANI Pacote 1Kg',
5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
11: 'Acucar Refinado Uniao 1kg'})
What I want to do is to capture the string part that represents the weights of the given products. In specific, the "1kg" string or the "500g" string.
I need to capture one or another, so I can easily interact through the pandas.Series object.
What I tried
s.str.extract(r"(.kg)|(.g)",flags = re.IGNORECASE)
Since the number of number before the string can vary I would like a different approach.
Answers:
Use the following regex matching (assuming that the numeric part can be also a float number):
s.str.extract(r"(d+.?d*?k?g)",flags=re.IGNORECASE)
0
0 1kg
1 1Kg
2 1kg
3 1kg
4 1Kg
5 1kg
6 500g
7 1kg
8 1kg
9 250g
10 1kg
11 1kg
With this extended data:
>>> s = pd.Series({
... 0: 'Açúcar Refinado UNIÃO Pacote 1kg',
... 1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
... 2: 'Açúcar Refinado DA BARRA Pacote 1kg',
... 3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
... 4: 'Açúcar Refinado GUARANI Pacote 1Kg',
... 5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
... 6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
... 7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
... 8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
... 9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
... 10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
... 11: 'Acucar Refinado Uniao 1kg',
... 12: 'something something 1.25kg',
... 13: 'something something 1,25kg'})
Parsing out the numbers and the units:
>>> s.str.extract(r'(d+(?:[.,]d*)?)( ?k?g)', flags=re.IGNORECASE)
... .assign(k=lambda d: d[0].str
... .replace('(?<=d),(?=d)', '.', regex=True)
... .pipe(pd.to_numeric))
0 1 k
0 1 kg 1.00
1 1 Kg 1.00
2 1 kg 1.00
3 1 kg 1.00
4 1 Kg 1.00
5 1 kg 1.00
6 500 g 500.00
7 1 kg 1.00
8 1 kg 1.00
9 250 g 250.00
10 1 kg 1.00
11 1 kg 1.00
12 1.25 kg 1.25
13 1,25 kg 1.25
I also allow for an optional space between the decimal and the units. Extended also to deal with non-integer numbers, accounting also for different decimal markers: eg in continental Europe, decimals are marked like 1,25
rather than 1.25
as in the Anglosphere.
I use a non-capturing group for the decimal portion; Roman’s version also works. For parsing the number, I would normalise the decimal format if mixed. If otherwise, you can re-parse by import io; pd.read_csv(io.StringIO(your_df.to_csv()), decimal=',')
.
You will get more capture groups on the row if you have a string like 250g ... 1kg
. You may want to filter or otherwise clean that before throwing it into this function. Also consider appending a b
to ensure that you don’t match something like 50grandmas
.
Thanks also for providing the data frame constructor ab initio.
I have the following sample series
s = pd.Series({0: 'Açúcar Refinado UNIÃO Pacote 1kg',
1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
2: 'Açúcar Refinado DA BARRA Pacote 1kg',
3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
4: 'Açúcar Refinado GUARANI Pacote 1Kg',
5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
11: 'Acucar Refinado Uniao 1kg'})
What I want to do is to capture the string part that represents the weights of the given products. In specific, the "1kg" string or the "500g" string.
I need to capture one or another, so I can easily interact through the pandas.Series object.
What I tried
s.str.extract(r"(.kg)|(.g)",flags = re.IGNORECASE)
Since the number of number before the string can vary I would like a different approach.
Use the following regex matching (assuming that the numeric part can be also a float number):
s.str.extract(r"(d+.?d*?k?g)",flags=re.IGNORECASE)
0
0 1kg
1 1Kg
2 1kg
3 1kg
4 1Kg
5 1kg
6 500g
7 1kg
8 1kg
9 250g
10 1kg
11 1kg
With this extended data:
>>> s = pd.Series({
... 0: 'Açúcar Refinado UNIÃO Pacote 1kg',
... 1: 'Açúcar Refinado QUALITÁ Pacote 1Kg',
... 2: 'Açúcar Refinado DA BARRA Pacote 1kg',
... 3: 'Açúcar Refinado CARAVELAS Pacote 1kg',
... 4: 'Açúcar Refinado GUARANI Pacote 1Kg',
... 5: 'Açúcar Refinado Granulado Doçúcar UNIÃO Pacote 1kg',
... 6: 'Açúcar Refinado Light UNIÃO Fit Pacote 500g',
... 7: 'Açúcar Refinado Granulado Premium UNIÃO Pacote 1kg',
... 8: 'Açúcar Refinado UNIÃO 1kg - Pacote com 10 Unidades',
... 9: 'Açúcar Refinado Granulado em Cubos UNIÃO Pote 250g',
... 10: 'Açúcar Refinado Granulado Premium Caravelas Pacote 1kg',
... 11: 'Acucar Refinado Uniao 1kg',
... 12: 'something something 1.25kg',
... 13: 'something something 1,25kg'})
Parsing out the numbers and the units:
>>> s.str.extract(r'(d+(?:[.,]d*)?)( ?k?g)', flags=re.IGNORECASE)
... .assign(k=lambda d: d[0].str
... .replace('(?<=d),(?=d)', '.', regex=True)
... .pipe(pd.to_numeric))
0 1 k
0 1 kg 1.00
1 1 Kg 1.00
2 1 kg 1.00
3 1 kg 1.00
4 1 Kg 1.00
5 1 kg 1.00
6 500 g 500.00
7 1 kg 1.00
8 1 kg 1.00
9 250 g 250.00
10 1 kg 1.00
11 1 kg 1.00
12 1.25 kg 1.25
13 1,25 kg 1.25
I also allow for an optional space between the decimal and the units. Extended also to deal with non-integer numbers, accounting also for different decimal markers: eg in continental Europe, decimals are marked like 1,25
rather than 1.25
as in the Anglosphere.
I use a non-capturing group for the decimal portion; Roman’s version also works. For parsing the number, I would normalise the decimal format if mixed. If otherwise, you can re-parse by import io; pd.read_csv(io.StringIO(your_df.to_csv()), decimal=',')
.
You will get more capture groups on the row if you have a string like 250g ... 1kg
. You may want to filter or otherwise clean that before throwing it into this function. Also consider appending a b
to ensure that you don’t match something like 50grandmas
.
Thanks also for providing the data frame constructor ab initio.