Extract Numeric info from Pandas column using regex
Question:
I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:
Text
Dimensions: 23"/60
Dimensions: 23" / 60
Dimensions: 48"
Dimensions: 22.5X8.25
Dimensions: 80IN
Dimensions: 567 S
Dimensions: 22.5X8.25
Dimensions: 26INNP
Dimensions: 24" x 55"
with pipe 16 x 7
I am using regex and is as follows: regex = r"(d([^s]*)s.sd*[^s])|(d([^s])*)"
I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with ‘_’
df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))
Can someone tell me what I am doing wrong?
Edit:
Specifying rules to the pattern to solidify the regex:
- Sentences containing words with numbers should be extracted (for sentence
short A20-6014 complete
, A20-6014
should be extracted)
- Sentences beginning with words with numbers should be extracted (for sentence
A20-6014 short complete
, A20-6014
should be extracted)
- Sentences ending with words with numbers should be extracted (for sentence
short complete A20-6014
, A20-6014
should be extracted)
- Two letter words in between words containing numbers should be considered (ie. for sentence
ELBOW- 2.5 IN 90 SIL
, 2.5 IN 90
should be extracted)
- Greater than 2 letter words in between words containing numbers should extracted separately and joined with a
_
. This also applies for alphanumeric words occurring at two different places in a sentence (ie. for sentence ELBOW- 2.5 IND 90 SIL
, 2.5_90
should be extracted. )
- Another Test Case: for a sentence
90 CONN 3/4 ST-#8 FL
, 90_3/4_ST-#8
should be extracted
Answers:
You can use
regex = r'd+(?:.d+)?"?(?:s*[Xx/.]s*d+(?:.d+)?"?)?(?:s*[A-Z]+b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')
See the regex demo.
Details:
d+(?:.d+)?
– an int or float value
"?
– an optional "
char
(?:s*[Xx/.]s*d+(?:.d+)?"?)?
– an optional sequence of
s*[Xx/.]s*
– X
, x
, /
or .
enclosed with zero or more whitespaces
d+(?:.d+)?
– an int or float value
"?
– an optional "
char
(?:s*[A-Z]+b)?
– an optional sequence of
s*
– zero or more whitespaces
[A-Z]+
– one or more ASCII uppercase letters
b
– a word boundary.
Here is a Pandas test:
import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'd+(?:.d+)?"?(?:s*[Xx/.]s*d+(?:.d+)?"?)?(?:s*[A-Z]+b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')
Output:
>>> df
Text Numeric_Info
0 No data here
1 Dimensions: 23"/60 23"/60
2 Dimensions: 23" / 60 23" / 60
3 Dimensions: 48" 48"
4 Dimensions: 22.5X8.25 22.5X8.25
5 Dimensions: 80IN 80IN
6 Dimensions: 567 S 567 S
7 Dimensions: 22.5X8.25 22.5X8.25
8 Dimensions: 26INNP 26INNP
9 Dimensions: 24" x 55" with pipe 16 x 7 24" x 55"_16 x 7
I am trying to extract the highlighted "numeric information" from a Pandas DataFrame column:
Text |
---|
Dimensions: 23"/60 |
Dimensions: 23" / 60 |
Dimensions: 48" |
Dimensions: 22.5X8.25 |
Dimensions: 80IN |
Dimensions: 567 S |
Dimensions: 22.5X8.25 |
Dimensions: 26INNP |
Dimensions: 24" x 55" with pipe 16 x 7 |
I am using regex and is as follows: regex = r"(d([^s]*)s.sd*[^s])|(d([^s])*)"
I am using to the below script to create a new column with all the numeric information extracted from each sentence and combined with ‘_’
df2['Numeric_Info'] = df2['Text'].apply(lambda x: '_'.join([i[0] for i in re.findall(regex, str(x))]))
Can someone tell me what I am doing wrong?
Edit:
Specifying rules to the pattern to solidify the regex:
- Sentences containing words with numbers should be extracted (for sentence
short A20-6014 complete
,A20-6014
should be extracted) - Sentences beginning with words with numbers should be extracted (for sentence
A20-6014 short complete
,A20-6014
should be extracted) - Sentences ending with words with numbers should be extracted (for sentence
short complete A20-6014
,A20-6014
should be extracted) - Two letter words in between words containing numbers should be considered (ie. for sentence
ELBOW- 2.5 IN 90 SIL
,2.5 IN 90
should be extracted) - Greater than 2 letter words in between words containing numbers should extracted separately and joined with a
_
. This also applies for alphanumeric words occurring at two different places in a sentence (ie. for sentenceELBOW- 2.5 IND 90 SIL
,2.5_90
should be extracted. ) - Another Test Case: for a sentence
90 CONN 3/4 ST-#8 FL
,90_3/4_ST-#8
should be extracted
You can use
regex = r'd+(?:.d+)?"?(?:s*[Xx/.]s*d+(?:.d+)?"?)?(?:s*[A-Z]+b)?'
df2['Numeric_Info'] = df2['Text'].str.findall(regex).str.join('_')
See the regex demo.
Details:
d+(?:.d+)?
– an int or float value"?
– an optional"
char(?:s*[Xx/.]s*d+(?:.d+)?"?)?
– an optional sequence ofs*[Xx/.]s*
–X
,x
,/
or.
enclosed with zero or more whitespacesd+(?:.d+)?
– an int or float value"?
– an optional"
char
(?:s*[A-Z]+b)?
– an optional sequence ofs*
– zero or more whitespaces[A-Z]+
– one or more ASCII uppercase lettersb
– a word boundary.
Here is a Pandas test:
import pandas as pd
df = pd.DataFrame({'Text':['No data here','Dimensions: 23"/60', 'Dimensions: 23" / 60', 'Dimensions: 48"', 'Dimensions: 22.5X8.25', 'Dimensions: 80IN', 'Dimensions: 567 S','Dimensions: 22.5X8.25', 'Dimensions: 26INNP','Dimensions: 24" x 55" with pipe 16 x 7']})
regex = r'd+(?:.d+)?"?(?:s*[Xx/.]s*d+(?:.d+)?"?)?(?:s*[A-Z]+b)?'
df['Numeric_Info'] = df['Text'].str.findall(regex).str.join('_')
Output:
>>> df
Text Numeric_Info
0 No data here
1 Dimensions: 23"/60 23"/60
2 Dimensions: 23" / 60 23" / 60
3 Dimensions: 48" 48"
4 Dimensions: 22.5X8.25 22.5X8.25
5 Dimensions: 80IN 80IN
6 Dimensions: 567 S 567 S
7 Dimensions: 22.5X8.25 22.5X8.25
8 Dimensions: 26INNP 26INNP
9 Dimensions: 24" x 55" with pipe 16 x 7 24" x 55"_16 x 7