Creating New Column In Pandas Dataframe Using Regex
Question:
I have a column in a pandas df of type object that I want to parse to get the first number in the string, and create a new column containing that number as an int.
For example:
Existing df
col
'foo 12 bar 8'
'bar 3 foo'
'bar 32bar 98'
Desired df
col col1
'foo 12 bar 8' 12
'bar 3 foo' 3
'bar 32bar 98' 32
I have code that works on any individual cell in the column:
int(re.search(r'\d+', df.iloc[0]['col']).group())
The above code works fine and returns 12 as it should. But when I try to create a new column using the whole series:
df['col1'] = int(re.search(r'\d+', df['col']).group())
I get the following error:
TypeError: expected string or bytes-like object
I tried wrapping str() around df['col'], which got rid of the error but yielded all 0's in col1. I've also tried converting col to a list of strings and iterating through the list, which only yields the same error. Does anyone know what I'm doing wrong? Help would be much appreciated.
Answers:
This will do the trick:
import re

new_column = []
for value in df['col']:
    # Take the first run of digits in each string and convert it to int
    new_column.append(int(re.search(r'\d+', value).group()))
df['col1'] = new_column
The output looks like this:
            col col1
0  foo 12 bar 8   12
1     bar 3 foo    3
2  bar 32bar 98   32
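For what it's worth, the original attempt fails because re.search expects a single string, while df['col'] is a whole Series. A loop-free alternative is to let pandas apply the pattern row by row with Series.str.extract. The sketch below rebuilds the example df from the question; the variable name first_number and the choice of the nullable Int64 dtype are my own assumptions, not something the question requires.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'col': ['foo 12 bar 8', 'bar 3 foo', 'bar 32bar 98']})

# str.extract returns the first match of the capture group for each row;
# expand=False gives back a Series instead of a one-column DataFrame.
first_number = df['col'].str.extract(r'(\d+)', expand=False)

# Rows without any digits come back as missing values, so convert to the
# nullable Int64 dtype rather than plain int, which cannot hold missing values.
df['col1'] = pd.to_numeric(first_number).astype('Int64')

print(df)
```

Unlike the loop, this does not raise when a row contains no digits; such rows should simply end up as missing values in col1. It also hints at why wrapping str() around df['col'] gave all 0's: str() of a Series is its printed form, which starts with the index label 0, so that is the first digit the regex finds.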