Series split column with condition

Question:

My pandas series contains year values. They’re not formatted consistently. For example,

df['Year']

1994-1996
circa 1990
1995-1998
circa 2010

I’d like to grab the year from the string.

df['Year'] = df['Year'].astype(str)
df['Year'] = df['Year'].str[:4]

This doesn’t work for rows with circa.
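A small sketch of the failure, assuming the sample values above are held in a Series (the name `s` is illustrative, not from my real data). Slicing the first four characters only works when the string happens to start with the year:

```python
import pandas as pd

# Sample values copied from the example above; `s` is a hypothetical name.
s = pd.Series(["1994-1996", "circa 1990", "1995-1998", "circa 2010"])

# Positional slicing takes the first four characters of every string,
# regardless of whether they are digits.
first_four = s.str[:4]
print(first_four.tolist())  # ['1994', 'circ', '1995', 'circ']
```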

I’d like to handle the rows with circa and grab only the year if it exists.

df['Year'] 

1994
1990
1995
2010
Asked By: kms

||

Answers:

df['Year_Only'] = df['Year'].str.extract(r'(\d{4})', expand=False)

You can use str.extract to pull out the first run of four digits, then convert the result to pd.Int16Dtype:

df['Year'] = df['Year'].str.extract(r'(\d{4})', expand=False).astype(pd.Int16Dtype())
print(df)

# Output
   Year
0  1994
1  1990
2  1995
3  2010
Answered By: Corralien
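
For reference, here is a self-contained sketch of the str.extract approach that can be run as-is. It is an illustration under a few assumptions rather than part of either answer: the extra "unknown" row is made up to show what happens when no year exists, the column name Year_Only is borrowed from the first answer, and pd.to_numeric with errors="coerce" is swapped in as a conservative way to turn the extracted strings into numbers before casting to the nullable Int16 dtype.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical row with no year.
df = pd.DataFrame(
    {"Year": ["1994-1996", "circa 1990", "1995-1998", "circa 2010", "unknown"]}
)

# \d{4} matches the first run of four digits anywhere in the string.
# expand=False returns a Series; rows without a match become NaN.
extracted = df["Year"].astype(str).str.extract(r"(\d{4})", expand=False)

# Convert the strings to numbers, then to a nullable integer dtype so that
# missing years are kept as <NA> instead of forcing the column to float.
df["Year_Only"] = pd.to_numeric(extracted, errors="coerce").astype("Int16")

print(df)
```

For ranges such as 1994-1996 this keeps only the first year, which matches the desired output in the question.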