Calculate Year wise age in Pandas

Question:

Let's say I have an Employees table and a yearly survey filled in by each person.
I have to transform this transactional data into year-wise prediction data.

Available Data:

E_ID TestYear DateOfBirth
1 2010 1947-01-01
1 2011 1947-01-01
1 2012 1947-01-01
2 2010 1990-01-01
3 2011 1999-01-01
4 2011 1991-01-01
4 2012 1991-01-01
5 2010 1989-01-01
5 2011 1989-01-01
5 2012 1989-01-01
5 2013 1989-01-01

DataFrame I need:

E_ID Year Age
1 2010 63
1 2011 64
1 2012 65
2 2010 20
2 2011 21
2 2012 22
3 2010 11
3 2011 12
3 2012 13
4 2010 19
4 2011 20
4 2012 21
5 2010 21
5 2011 22
5 2012 23

In the new DataFrame I need all employees for all three years (2010, 2011, 2012), each with their age in that year.

How can I achieve this, given that the transactional data has records for some employees in some years but not in others?

Asked By: TheSarfaraz


Answers:

You can produce a Series of birth years from a substring of the DateOfBirth column, then subtract it from the TestYear Series to get the age. Both Series come from the same DataFrame, so they share the same index and line up row by row.

dob_years = df['DateOfBirth'].str[:4].astype(int)
df['Age'] = df['TestYear'] - dob_years
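
For reference, here is a self-contained sketch of the two lines above applied to the sample data from the question. It assumes DateOfBirth is stored as 'YYYY-MM-DD' strings. Note that it only adds an Age to rows that already exist; it does not create the missing employee/year combinations.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012,
                 2010, 2011, 2012, 2013],
    "DateOfBirth": ["1947-01-01", "1947-01-01", "1947-01-01",
                    "1990-01-01", "1999-01-01",
                    "1991-01-01", "1991-01-01",
                    "1989-01-01", "1989-01-01", "1989-01-01", "1989-01-01"],
})

# Assumption: dates are 'YYYY-MM-DD' strings, so the first four
# characters are the birth year.
dob_years = df["DateOfBirth"].str[:4].astype(int)

# Age in the survey year, for the rows that are present.
df["Age"] = df["TestYear"] - dob_years

print(df)
```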
Answered By: timtim17

Use DataFrame.drop_duplicates to keep the first row for each E_ID, repeat each of those rows with Index.repeat, assign the repeated list of years with numpy.tile, and finally subtract the birth year:

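import numpy as np
import pandas as pd

y = [2010, 2011, 2012]
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])

# keep one row per employee
df1 = df.drop_duplicates('E_ID')
# repeat each employee once per year, then fill in the years
df1 = df1.loc[df1.index.repeat(len(y))].assign(TestYear=np.tile(y, len(df1)))

# age = survey year minus birth year
df1['Age'] = df1['TestYear'].sub(df1['DateOfBirth'].dt.year)
print(df1)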
   E_ID  TestYear DateOfBirth  Age
0     1      2010  1947-01-01   63
0     1      2011  1947-01-01   64
0     1      2012  1947-01-01   65
3     2      2010  1990-01-01   20
3     2      2011  1990-01-01   21
3     2      2012  1990-01-01   22
4     3      2010  1999-01-01   11
4     3      2011  1999-01-01   12
4     3      2012  1999-01-01   13
5     4      2010  1991-01-01   19
5     4      2011  1991-01-01   20
5     4      2012  1991-01-01   21
7     5      2010  1989-01-01   21
7     5      2011  1989-01-01   22
7     5      2012  1989-01-01   23
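
The same employee-by-year grid can also be sketched with a cross join. This is an alternative, not what the answer uses, and it assumes pandas 1.2 or later, where merge accepts how="cross". The names employees, years and out are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012,
                 2010, 2011, 2012, 2013],
    "DateOfBirth": ["1947-01-01", "1947-01-01", "1947-01-01",
                    "1990-01-01", "1999-01-01",
                    "1991-01-01", "1991-01-01",
                    "1989-01-01", "1989-01-01", "1989-01-01", "1989-01-01"],
})

# The years every employee should appear in.
years = pd.DataFrame({"TestYear": [2010, 2011, 2012]})

# One row per employee, keeping only the ID and date of birth.
employees = df.drop_duplicates("E_ID")[["E_ID", "DateOfBirth"]]

# Cross join: every employee paired with every year (pandas >= 1.2).
out = employees.merge(years, how="cross")

# Age = survey year minus birth year.
out["Age"] = out["TestYear"] - pd.to_datetime(out["DateOfBirth"]).dt.year
out = out[["E_ID", "TestYear", "Age"]]

print(out)
```

Unlike the Index.repeat version, the result here has a fresh 0..n-1 index, so no reset_index call is needed afterwards.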

If performance is important, avoid groupby:

# [22000 rows x 3 columns] - group sizes as in the sample data
df = pd.concat([df.assign(E_ID = df['E_ID'].astype(str) + '_' + str(i)) for i in range(2000)], ignore_index=True)
print (df)

         E_ID  TestYear DateOfBirth
0         1_0      2010  1947-01-01
1         1_0      2011  1947-01-01
2         1_0      2012  1947-01-01
3         2_0      2010  1990-01-01
4         3_0      2011  1999-01-01
      ...       ...         ...
21995  4_1999      2012  1991-01-01
21996  5_1999      2010  1989-01-01
21997  5_1999      2011  1989-01-01
21998  5_1999      2012  1989-01-01
21999  5_1999      2013  1989-01-01

In [213]: %%timeit
     ...: df1 = df.groupby('E_ID').agg({"TestYear": lambda x: [2010, 2011, 2012], 
     ...:                         'DateOfBirth': lambda x: list(x)[0]}).explode("TestYear")
     ...: 
     ...: df1['DateOfBirth'] = pd.to_datetime(df1['DateOfBirth'])
     ...: 
     ...: df1['Age'] = df1['TestYear'] - df1['DateOfBirth'].dt.year
     ...: 
     ...: 
374 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [214]: %%timeit
     ...: y = [2010, 2011, 2012]
     ...: df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])
     ...: 
     ...: df1 = df.drop_duplicates('E_ID')
     ...: df1 = df1.loc[df1.index.repeat(len(y))].assign(TestYear = np.tile(y, len(df1)))
     ...: 
     ...: df1['Age'] = df1['TestYear'].sub(df1['DateOfBirth'].dt.year)
     ...: 
     ...: 
21.7 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answered By: jezrael

Since each employee ID (E_ID) identifies one person and the date of birth (DateOfBirth) is the same on every row for that employee, you can group by E_ID and take the date of birth from any row in the group.

In the aggregation, TestYear is replaced by the list of years for which you want the age, and DateOfBirth is reduced to its first entry, since all values within a group are identical. Exploding TestYear then gives one row per employee per year:

df = df.groupby('E_ID').agg({"TestYear": lambda x: [2010, 2011, 2012], 
                        'DateOfBirth': lambda x: list(x)[0]}).explode("TestYear")

df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])

df['Age'] = df['TestYear'] - df['DateOfBirth'].dt.year

Output:

    TestYear    DateOfBirth     Age
E_ID            
1   2010    1947-01-01  63
1   2011    1947-01-01  64
1   2012    1947-01-01  65
2   2010    1990-01-01  20
2   2011    1990-01-01  21
2   2012    1990-01-01  22
3   2010    1999-01-01  11
3   2011    1999-01-01  12
3   2012    1999-01-01  13
4   2010    1991-01-01  19
4   2011    1991-01-01  20
4   2012    1991-01-01  21
5   2010    1989-01-01  21
5   2011    1989-01-01  22
5   2012    1989-01-01  23
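
As a variant sketch of the same group-then-expand idea, the built-in first aggregation can replace the lambdas, and MultiIndex.from_product can build the employee-by-year pairs. This is not the code from the answer above; the names birth_year, idx and out are illustrative, and the Year column name follows the layout requested in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "E_ID": [1, 1, 1, 2, 3, 4, 4, 5, 5, 5, 5],
    "TestYear": [2010, 2011, 2012, 2010, 2011, 2011, 2012,
                 2010, 2011, 2012, 2013],
    "DateOfBirth": ["1947-01-01", "1947-01-01", "1947-01-01",
                    "1990-01-01", "1999-01-01",
                    "1991-01-01", "1991-01-01",
                    "1989-01-01", "1989-01-01", "1989-01-01", "1989-01-01"],
})

years = [2010, 2011, 2012]

# Birth year per employee: first DateOfBirth in each E_ID group.
birth_year = pd.to_datetime(df.groupby("E_ID")["DateOfBirth"].first()).dt.year

# Every (E_ID, Year) combination, as a plain two-column DataFrame.
idx = pd.MultiIndex.from_product([birth_year.index, years],
                                 names=["E_ID", "Year"])
out = idx.to_frame(index=False)

# Look up each employee's birth year and subtract it from the year.
out["Age"] = out["Year"] - out["E_ID"].map(birth_year)

print(out)
```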
Answered By: Lucas M. Uriarte