Fixing date shape in pandas

Question:

Hello, I have been trying to standardize the dates in the Year column: I want to get rid of the decimals and the inconsistent formats and keep only the years.

Is there an efficient way to do this in pandas?

Asked By: Aroune

Answers:

Setup

import pandas as pd  # 1.5.1


so = pd.DataFrame({
    "Countries": [*["Canada"]*5, *["Brazil"]*5],
    "Year": [1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 2020.0, 2021.0, 2021.0, "2011-21", 2021.0],
    "Value": 1  # placeholder
})

print(so)
  Countries     Year  Value
0    Canada   1990.0      1
1    Canada   1991.0      1
2    Canada   1992.0      1
3    Canada   1993.0      1
4    Canada   1994.0      1
5    Brazil   2020.0      1
6    Brazil   2021.0      1
7    Brazil   2021.0      1
8    Brazil  2011-21      1
9    Brazil   2021.0      1

Explanation

Inspecting the .dtype of so.Year gives object, because the column mixes floats with a string:

print(so.Year.dtype)
object
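
To see why the dtype is object, it can help to count the Python types stored in the column. This is a minimal sketch that rebuilds the same frame; the name type_counts is only illustrative.

```python
import pandas as pd

so = pd.DataFrame({
    "Countries": [*["Canada"] * 5, *["Brazil"] * 5],
    "Year": [1990.0, 1991.0, 1992.0, 1993.0, 1994.0,
             2020.0, 2021.0, 2021.0, "2011-21", 2021.0],
    "Value": 1,  # placeholder
})

# Map each entry to the name of its Python type, then count the names.
# type_counts is a hypothetical helper name, not part of the original answer.
type_counts = so["Year"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

A single str entry among the floats is enough to make pandas store the whole column as object.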

I’m assuming that every entry in so.Year starts with a 4-digit year, so I convert to str and keep only the first four characters:

so["NewYear"] = so.Year.astype(str).str[:4]

print(so)
  Countries     Year  Value NewYear
0    Canada   1990.0      1    1990
1    Canada   1991.0      1    1991
2    Canada   1992.0      1    1992
3    Canada   1993.0      1    1993
4    Canada   1994.0      1    1994
5    Brazil   2020.0      1    2020
6    Brazil   2021.0      1    2021
7    Brazil   2021.0      1    2021
8    Brazil  2011-21      1    2011
9    Brazil   2021.0      1    2021
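
If the year is not guaranteed to sit in the first four characters, a regular expression is one possible alternative to slicing. The sketch below assumes the first run of four digits in each entry is the year; the column name YearFromRegex is made up for illustration.

```python
import pandas as pd

so = pd.DataFrame({
    "Countries": [*["Canada"] * 5, *["Brazil"] * 5],
    "Year": [1990.0, 1991.0, 1992.0, 1993.0, 1994.0,
             2020.0, 2021.0, 2021.0, "2011-21", 2021.0],
    "Value": 1,  # placeholder
})

# Pull out the first run of four consecutive digits from each entry.
# expand=False returns a Series instead of a one-column DataFrame.
# Entries with no four-digit run would come back as missing values.
so["YearFromRegex"] = so["Year"].astype(str).str.extract(r"(\d{4})", expand=False)
print(so)
```

On this data the result matches the slicing approach, but it would also handle an entry with leading text or whitespace.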

Now you can either use the NewYear column as-is (it holds strings), or convert it to another dtype such as an integer.
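
For example, one way to get integer years is pd.to_numeric followed by a cast to the nullable Int64 dtype. This is a sketch under the same 4-digit assumption; errors="coerce" is my choice here so that anything unparseable becomes a missing value instead of raising.

```python
import pandas as pd

so = pd.DataFrame({
    "Countries": [*["Canada"] * 5, *["Brazil"] * 5],
    "Year": [1990.0, 1991.0, 1992.0, 1993.0, 1994.0,
             2020.0, 2021.0, 2021.0, "2011-21", 2021.0],
    "Value": 1,  # placeholder
})

# Keep the first four characters of each entry, as in the answer above.
year_text = so["Year"].astype(str).str[:4]

# Parse the strings as numbers; unparseable entries become NaN,
# and the nullable Int64 dtype can hold those as <NA>.
so["NewYear"] = pd.to_numeric(year_text, errors="coerce").astype("Int64")
print(so.dtypes)
```

If every entry is known to parse cleanly, so["NewYear"].astype(int) on the string column should work as well.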

Answered By: Ian Thompson