Calculate year-wise age in Pandas
Question:
Let's say I have an Employees table and a yearly survey filled in by each person.
I have to transform the transactional data into year-wise prediction data.
Available Data:
E_ID | TestYear | DateOfBirth |
---|---|---|
1 | 2010 | 1947-01-01 |
1 | 2011 | 1947-01-01 |
1 | 2012 | 1947-01-01 |
2 | 2010 | 1990-01-01 |
3 | 2011 | 1999-01-01 |
4 | 2011 | 1991-01-01 |
4 | 2012 | 1991-01-01 |
5 | 2010 | 1989-01-01 |
5 | 2011 | 1989-01-01 |
5 | 2012 | 1989-01-01 |
5 | 2013 | 1989-01-01 |
DataFrame I need:
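E_ID | Year | Age |
---|---|---|
1 | 2010 | 63 |
1 | 2011 | 64 |
1 | 2012 | 65 |
2 | 2010 | 20 |
2 | 2011 | 21 |
2 | 2012 | 22 |
3 | 2010 | 11 |
3 | 2011 | 12 |
3 | 2012 | 13 |
4 | 2010 | 19 |
4 | 2011 | 20 |
4 | 2012 | 21 |
5 | 2010 | 21 |
5 | 2011 | 22 |
5 | 2012 | 23 |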
In the new df I need all employees for all 3 years (2010, 2011, 2012), with their ages in 2010, 2011, and 2012 respectively.
How can I achieve this, given that the transactional data has records for some employees for some years but not for others?
Answers:
You can produce a Series of birth years from a substring of the DateOfBirth column, then subtract that Series from the TestYear Series to get the age. Both Series originate from the same DataFrame, so they have the same size and order.
dob_years = df['DateOfBirth'].str[:4].astype(int)
df['Age'] = df['TestYear'] - dob_years
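A self-contained sketch of this approach on the question's sample data, assuming DateOfBirth is stored as strings in YYYY-MM-DD form (the .str accessor would not work on a datetime column). Note that it only computes the age for rows that already exist; it does not add the missing employee/year combinations the question asks for.

```python
import pandas as pd

# Sample data copied from the question; DateOfBirth is assumed to be a string column.
df = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012, 2010, 2011, 2012, 2013],
    "DateOfBirth": (["1947-01-01"] * 3 + ["1990-01-01", "1999-01-01"]
                    + ["1991-01-01"] * 2 + ["1989-01-01"] * 4),
})

# Take the first four characters (the year) and convert them to integers.
dob_years = df["DateOfBirth"].str[:4].astype(int)

# Both Series share the DataFrame's index, so the subtraction is row by row.
df["Age"] = df["TestYear"] - dob_years
print(df)
```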
Use DataFrame.drop_duplicates to keep the first row for each E_ID, then repeat the rows with Index.repeat, assign the repeated list of years with numpy.tile, and finally subtract the birth year:
import numpy as np
import pandas as pd

y = [2010, 2011, 2012]
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])
# keep one row per employee, then repeat each row once per target year
df1 = df.drop_duplicates('E_ID')
df1 = df1.loc[df1.index.repeat(len(y))].assign(TestYear=np.tile(y, len(df1)))
df1['Age'] = df1['TestYear'].sub(df1['DateOfBirth'].dt.year)
print(df1)
E_ID TestYear DateOfBirth Age
0 1 2010 1947-01-01 63
0 1 2011 1947-01-01 64
0 1 2012 1947-01-01 65
3 2 2010 1990-01-01 20
3 2 2011 1990-01-01 21
3 2 2012 1990-01-01 22
4 3 2010 1999-01-01 11
4 3 2011 1999-01-01 12
4 3 2012 1999-01-01 13
5 4 2010 1991-01-01 19
5 4 2011 1991-01-01 20
5 4 2012 1991-01-01 21
7 5 2010 1989-01-01 21
7 5 2011 1989-01-01 22
7 5 2012 1989-01-01 23
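The same employee-by-year grid can also be built with a cross join. This is a different technique from the repeat/tile approach above, shown only as a sketch; it assumes pandas 1.2 or later (where merge accepts how="cross"), and the names years, employees, and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012, 2010, 2011, 2012, 2013],
    "DateOfBirth": (["1947-01-01"] * 3 + ["1990-01-01", "1999-01-01"]
                    + ["1991-01-01"] * 2 + ["1989-01-01"] * 4),
})
df["DateOfBirth"] = pd.to_datetime(df["DateOfBirth"])

# The target years as a one-column frame.
years = pd.DataFrame({"Year": [2010, 2011, 2012]})

# One row per employee with their date of birth.
employees = df.drop_duplicates("E_ID")[["E_ID", "DateOfBirth"]]

# Cartesian product: every employee paired with every target year.
result = employees.merge(years, how="cross")
result["Age"] = result["Year"] - result["DateOfBirth"].dt.year
result = result[["E_ID", "Year", "Age"]]
print(result)
```

Unlike the repeat/tile version, the result here gets a fresh 0..n-1 index and does not depend on the original index being unique.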
If performance is important, avoid groupby:
#[22000 rows x 3 columns] - groups sizes like in sample data
df = pd.concat([df.assign(E_ID = df['E_ID'].astype(str) + '_' + str(i)) for i in range(2000)], ignore_index=True)
print (df)
E_ID TestYear DateOfBirth
0 1_0 2010 1947-01-01
1 1_0 2011 1947-01-01
2 1_0 2012 1947-01-01
3 2_0 2010 1990-01-01
4 3_0 2011 1999-01-01
... ... ...
21995 4_1999 2012 1991-01-01
21996 5_1999 2010 1989-01-01
21997 5_1999 2011 1989-01-01
21998 5_1999 2012 1989-01-01
21999 5_1999 2013 1989-01-01
In [213]: %%timeit
...: df1 = df.groupby('E_ID').agg({"TestYear": lambda x: [2010, 2011, 2012],
...: 'DateOfBirth': lambda x: list(x)[0]}).explode("TestYear")
...:
...: df1['DateOfBirth'] = pd.to_datetime(df1['DateOfBirth'])
...:
...: df1['Age'] = df1['TestYear'] - df1['DateOfBirth'].dt.year
...:
...:
374 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [214]: %%timeit
...: y = [2010, 2011, 2012]
...: df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])
...:
...: df1 = df.drop_duplicates('E_ID')
...: df1 = df1.loc[df1.index.repeat(len(y))].assign(TestYear = np.tile(y, len(df1)))
...:
...: df1['Age'] = df1['TestYear'].sub(df1['DateOfBirth'].dt.year)
...:
...:
21.7 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
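Outside IPython, a comparison along these lines could be reproduced with the standard timeit module. The sketch below is only an approximation of the benchmark above: it uses 200 copies of the sample instead of 2000 to keep the run short, the function names are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

YEARS = [2010, 2011, 2012]

base = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012, 2010, 2011, 2012, 2013],
    "DateOfBirth": (["1947-01-01"] * 3 + ["1990-01-01", "1999-01-01"]
                    + ["1991-01-01"] * 2 + ["1989-01-01"] * 4),
})

# Replicate the sample with distinct IDs (200 copies here, 2000 in the answer).
big = pd.concat(
    [base.assign(E_ID=base["E_ID"].astype(str) + "_" + str(i)) for i in range(200)],
    ignore_index=True,
)


def with_groupby(df):
    """groupby + explode, as in the first timed cell."""
    out = (
        df.groupby("E_ID")
        .agg({"TestYear": lambda x: YEARS, "DateOfBirth": lambda x: list(x)[0]})
        .explode("TestYear")
    )
    out["DateOfBirth"] = pd.to_datetime(out["DateOfBirth"])
    out["Age"] = out["TestYear"] - out["DateOfBirth"].dt.year
    return out.reset_index()


def with_drop_duplicates(df):
    """drop_duplicates + Index.repeat + numpy.tile, as in the second timed cell."""
    df = df.assign(DateOfBirth=pd.to_datetime(df["DateOfBirth"]))
    first = df.drop_duplicates("E_ID")
    out = first.loc[first.index.repeat(len(YEARS))].assign(
        TestYear=np.tile(YEARS, len(first))
    )
    out["Age"] = out["TestYear"].sub(out["DateOfBirth"].dt.year)
    return out


runs = 5
t_groupby = timeit.timeit(lambda: with_groupby(big), number=runs)
t_dedup = timeit.timeit(lambda: with_drop_duplicates(big), number=runs)
print(f"groupby:         {t_groupby / runs * 1000:.1f} ms per run")
print(f"drop_duplicates: {t_dedup / runs * 1000:.1f} ms per run")

# Keep one result from each approach so they can be compared.
res_groupby = with_groupby(big)
res_dedup = with_drop_duplicates(big)
```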
Since each employee ID E_ID has a single date of birth DateOfBirth, you can group by the employee ID and take the date of birth from the group.
For the TestYear aggregation, return the list of years for which you want the age. For DateOfBirth, every value in a group is identical (the same date of birth), so you can take the first entry:
df = df.groupby('E_ID').agg({"TestYear": lambda x: [2010, 2011, 2012],
'DateOfBirth': lambda x: list(x)[0]}).explode("TestYear")
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])
df['Age'] = df['TestYear'] - df['DateOfBirth'].dt.year
Output:
TestYear DateOfBirth Age
E_ID
1 2010 1947-01-01 63
1 2011 1947-01-01 64
1 2012 1947-01-01 65
2 2010 1990-01-01 20
2 2011 1990-01-01 21
2 2012 1990-01-01 22
3 2010 1999-01-01 11
3 2011 1999-01-01 12
3 2012 1999-01-01 13
4 2010 1991-01-01 19
4 2011 1991-01-01 20
4 2012 1991-01-01 21
5 2010 1989-01-01 21
5 2011 1989-01-01 22
5 2012 1989-01-01 23
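For completeness, a sketch of the groupby approach run end to end on the sample data, reshaped to the E_ID / Year / Age layout requested in the question. Two details here are my own substitutions rather than part of the answer: the built-in "first" aggregation replaces the lambda that takes the first list element, and the exploded year column is cast back to int because explode leaves it with object dtype.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012, 2010, 2011, 2012, 2013],
    "DateOfBirth": (["1947-01-01"] * 3 + ["1990-01-01", "1999-01-01"]
                    + ["1991-01-01"] * 2 + ["1989-01-01"] * 4),
})

years = [2010, 2011, 2012]

out = (
    df.groupby("E_ID")
    # Every group gets the same list of years; DateOfBirth keeps its first value.
    .agg({"TestYear": lambda x: years, "DateOfBirth": "first"})
    # One row per (employee, year).
    .explode("TestYear")
    .reset_index()
    .rename(columns={"TestYear": "Year"})
)

# explode produces an object column, so restore an integer dtype.
out["Year"] = out["Year"].astype(int)
out["Age"] = out["Year"] - pd.to_datetime(out["DateOfBirth"]).dt.year
out = out[["E_ID", "Year", "Age"]]
print(out)
```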