Pandas Extract Number from String
Question:
Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['1a', np.nan, '10a', '100b', '0b']})
df
A
0 1a
1 NaN
2 10a
3 100b
4 0b
I’d like to extract the numbers from each cell (where they exist).
The desired result is:
A
0 1
1 NaN
2 10
3 100
4 0
I know it can be done with str.extract, but I’m not sure how.
Answers:
Give it a regex capture group:
df.A.str.extract(r'(\d+)', expand=False)
Gives you:
0 1
1 NaN
2 10
3 100
4 0
Name: A, dtype: object
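Note that the result above holds strings, not numbers. If you need actual numeric values, a minimal sketch (assuming every extracted value is a plain integer string, and using the illustrative names digits and numbers) could pass the result through pd.to_numeric:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['1a', np.nan, '10a', '100b', '0b']})

# expand=False returns a Series rather than a one-column DataFrame
digits = df['A'].str.extract(r'(\d+)', expand=False)

# The extracted values are still strings; convert them to numbers.
# The missing value is kept as NaN.
numbers = pd.to_numeric(digits)
print(numbers)
```

Because of the missing value, the converted column will typically end up as a float dtype rather than an integer one.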
To answer @Steven G's question in the comment above, this should work:
df.A.str.extract(r'(^\d*)', expand=False)
You can replace your column with the result using the assign function:
df = df.assign(A=lambda x: x['A'].str.extract(r'(\d+)', expand=False))
If you have multiple disjoint sets of digits, as in 1a2b3c, from which you would like to extract 123, you can do it with Series.str.replace:
>>> df
A
0 1a
1 b2
2 a1b2
3 1a2b3c
>>> df['A'] = df['A'].str.replace(r'\D+', '', regex=True)
0 1
1 2
2 12
3 123
You could also work around this with Series.str.extractall and groupby, but I think that this one is easier.
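For comparison, here is a rough sketch of what the extractall and groupby route might look like. The variable names matches and joined are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': ['1a', 'b2', 'a1b2', '1a2b3c']})

# extractall returns one row per match, indexed by
# (original row label, match number); the unnamed group becomes column 0
matches = df['A'].str.extractall(r'(\d+)')

# Join the matches that belong to the same original row (index level 0)
joined = matches[0].groupby(level=0).agg(''.join)
print(joined.tolist())  # ['1', '2', '12', '123']
```

Rows with no digits at all produce no matches and so drop out of the grouped result; reindexing against df.index would bring them back as NaN.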
Hope this helps!