applying regex to a pandas dataframe
Question:
I’m having trouble applying a regex function a column in a python dataframe. Here is the head of my dataframe:
Name Season School G MP FGA 3P 3PA 3P%
74 Joe Dumars 1982-83 McNeese State 29 NaN 487 5 8 0.625
84 Sam Vincent 1982-83 Michigan State 30 1066 401 5 11 0.455
176 Gerald Wilkins 1982-83 Chattanooga 30 820 350 0 2 0.000
177 Gerald Wilkins 1983-84 Chattanooga 23 737 297 3 10 0.300
243 Delaney Rudd 1982-83 Wake Forest 32 1004 324 13 29 0.448
I thought I had a pretty good grasp of applying functions to Dataframes, so maybe my Regex skills are lacking.
Here is what I put together:
import re
def split_it(year):
return re.findall('(dddd)', year)
df['Season2'] = df['Season'].apply(split_it(x))
TypeError: expected string or buffer
Output would be a column called Season2 that contains the year before the hyphen. I’m sure theres an easier way to do it without regex, but more importantly, i’m trying to figure out what I did wrong
Thanks for any help in advance.
Answers:
When I try (a variant of) your code I get NameError: name 'x' is not defined
— which it isn’t.
You could use either
df['Season2'] = df['Season'].apply(split_it)
or
df['Season2'] = df['Season'].apply(lambda x: split_it(x))
but the second one is just a longer and slower way to write the first one, so there’s not much point (unless you have other arguments to handle, which we don’t here.) Your function will return a list, though:
>>> df["Season"].apply(split_it)
74 [1982]
84 [1982]
176 [1982]
177 [1983]
243 [1982]
Name: Season, dtype: object
although you could easily change that. FWIW, I’d use vectorized string operations and do something like
>>> df["Season"].str[:4].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
or
>>> df["Season"].str.split("-").str[0].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
The asked problem can be solved by writing the following code :
import re
def split_it(year):
x = re.findall('([d]{4})', year)
if x :
return(x.group())
df['Season2'] = df['Season'].apply(split_it)
You were facing this problem as some rows didn’t had year in the string
I had the exact same issue. Thanks for the answers @DSM.
FYI @itjcms, you can improve the function by removing the repetition of the 'dddd'
.
def split_it(year):
return re.findall('(dddd)', year)
Becomes:
def split_it(year):
return re.findall('(d{4})', year)
You can simply use str.extract
df['Season2']=df['Season'].str.extract(r'(d{4})-d{2}')
Here you locate d{4}-d{2}
(for example 1982-83) but only extracts the captured group between parenthesis d{4}
(for example 1982)
you can use pandas native function to do it too.
check this page for the pandas functions that accepts regular expression. for your case, you can do
df["Season"].str.extract(r'([d]{4}))')
I’d extract with:
df['Season2']=df['Season'].str.extract(r'(d{4}))
I’m having trouble applying a regex function a column in a python dataframe. Here is the head of my dataframe:
Name Season School G MP FGA 3P 3PA 3P%
74 Joe Dumars 1982-83 McNeese State 29 NaN 487 5 8 0.625
84 Sam Vincent 1982-83 Michigan State 30 1066 401 5 11 0.455
176 Gerald Wilkins 1982-83 Chattanooga 30 820 350 0 2 0.000
177 Gerald Wilkins 1983-84 Chattanooga 23 737 297 3 10 0.300
243 Delaney Rudd 1982-83 Wake Forest 32 1004 324 13 29 0.448
I thought I had a pretty good grasp of applying functions to Dataframes, so maybe my Regex skills are lacking.
Here is what I put together:
import re
def split_it(year):
return re.findall('(dddd)', year)
df['Season2'] = df['Season'].apply(split_it(x))
TypeError: expected string or buffer
Output would be a column called Season2 that contains the year before the hyphen. I’m sure theres an easier way to do it without regex, but more importantly, i’m trying to figure out what I did wrong
Thanks for any help in advance.
When I try (a variant of) your code I get NameError: name 'x' is not defined
— which it isn’t.
You could use either
df['Season2'] = df['Season'].apply(split_it)
or
df['Season2'] = df['Season'].apply(lambda x: split_it(x))
but the second one is just a longer and slower way to write the first one, so there’s not much point (unless you have other arguments to handle, which we don’t here.) Your function will return a list, though:
>>> df["Season"].apply(split_it)
74 [1982]
84 [1982]
176 [1982]
177 [1983]
243 [1982]
Name: Season, dtype: object
although you could easily change that. FWIW, I’d use vectorized string operations and do something like
>>> df["Season"].str[:4].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
or
>>> df["Season"].str.split("-").str[0].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
The asked problem can be solved by writing the following code :
import re
def split_it(year):
x = re.findall('([d]{4})', year)
if x :
return(x.group())
df['Season2'] = df['Season'].apply(split_it)
You were facing this problem as some rows didn’t had year in the string
I had the exact same issue. Thanks for the answers @DSM.
FYI @itjcms, you can improve the function by removing the repetition of the 'dddd'
.
def split_it(year):
return re.findall('(dddd)', year)
Becomes:
def split_it(year):
return re.findall('(d{4})', year)
You can simply use str.extract
df['Season2']=df['Season'].str.extract(r'(d{4})-d{2}')
Here you locate d{4}-d{2}
(for example 1982-83) but only extracts the captured group between parenthesis d{4}
(for example 1982)
you can use pandas native function to do it too.
check this page for the pandas functions that accepts regular expression. for your case, you can do
df["Season"].str.extract(r'([d]{4}))')
I’d extract with:
df['Season2']=df['Season'].str.extract(r'(d{4}))