Using np.select(), with a single-valued custom function as input

Question:

I have the following example code:

import numpy as np
import pandas as pd

data = ['05/21/2021', '05/21/2022', '05/21/2023']
df = pd.DataFrame(data, columns=['register_date'])
df['register_year'] = pd.to_datetime(df['register_date']).dt.year
df['study_year'] = 2022

Here, I have a dataframe that looks like this:

register_date register_year study_year
05/21/2021 2021 2022
05/21/2022 2022 2022
05/21/2023 2023 2022

The goal is to create another column named "duration_start" that is 1 if register_year < study_year, 0 if register_year > study_year, and, if register_year == study_year, the fraction of the year remaining between the register date and the end of register_year.

I created the following conditions to be used with np.select()

register_year_lt_study_year = df['register_year'] < df['study_year']
register_year_gt_study_year = df['register_year'] > df['study_year']

And a function that calculates the proportion of the year from year end:

def proportion_to_year_end(date):
    start = pd.to_datetime(date)
    year_end = pd.to_datetime('12/31/' + str(start.year))
    return (year_end - start).days/365

However, I’m not sure how I should fill in the ??, since proportion_to_year_end() is a single-valued function taking in a string, but np.select() accepts a column of the same length.

df['duration_start'] = np.select([register_year_lt_study_year, register_year_gt_study_year], [1, 0], ??)

I thought about using the apply() function, to perhaps generate another column and then drop it, but that would entail more logic to take care of the 0 and 1 first, then on top of that, apply proportion_to_year_end, then drop the temporary column.

Alternatively, I thought about changing proportion_to_year_end() to take in two columns, but I’m not sure how to write it without for-loops, which is something we should avoid.

Are there better ways to handle this type of problem, where there is an apparent dimension mismatch between a single-valued function and the columns of a dataframe?

Asked By: kd8

Answers:

The np.select documentation describes default as a scalar, so I don’t think you can apply your function there directly.

One thing you could do is fill the array with 1s and 0s where the first two conditions hold, give the remaining rows (where register_year == study_year) a placeholder value such as nan or -1, and then overwrite those rows using a mask. For example:

df['duration_start'] = np.select([register_year_lt_study_year, register_year_gt_study_year],
                                 [1, 0],
                                 np.nan)

mask = df['register_year'] == df['study_year']
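# Assign through .loc to avoid chained indexing, and pass the date strings
# (not study_year) to proportion_to_year_end
df.loc[mask, 'duration_start'] = np.vectorize(proportion_to_year_end)(df.loc[mask, 'register_date'])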

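I had to use np.vectorize here because proportion_to_year_end works on one value at a time. Note that np.vectorize is a convenience wrapper that still loops in Python, so if you change proportion_to_year_end to accept a whole column of dates, you can drop it.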
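
As a rough sketch of that last idea, here is one way the function could be rewritten to take a whole column. The name proportion_to_year_end_vec and the explicit '%m/%d/%Y' format are my own assumptions, not part of the question. Since the choice list of np.select accepts arrays, the result can then go in as a third choice with an explicit == condition, instead of relying on default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'register_date': ['05/21/2021', '05/21/2022', '05/21/2023']})
df['register_year'] = pd.to_datetime(df['register_date']).dt.year
df['study_year'] = 2022

def proportion_to_year_end_vec(dates):
    """Hypothetical column-wise variant of proportion_to_year_end.

    Expects a pandas Series of 'MM/DD/YYYY' strings (an assumption based
    on the sample data) and returns a Series of floats.
    """
    start = pd.to_datetime(dates, format='%m/%d/%Y')
    # Build December 31 of each row's own year
    year_end = pd.to_datetime(start.dt.year.astype(str) + '-12-31')
    return (year_end - start).dt.days / 365

conditions = [
    df['register_year'] < df['study_year'],
    df['register_year'] > df['study_year'],
    df['register_year'] == df['study_year'],
]
# Scalars and the computed column are broadcast against each other
choices = [1.0, 0.0, proportion_to_year_end_vec(df['register_date'])]

# default only applies to rows matching none of the conditions (e.g. missing years)
df['duration_start'] = np.select(conditions, choices, default=np.nan)
```

The proportion is computed for every row, including those where it ends up unused, which is usually cheaper than a Python-level loop over the masked rows. Like the original function, it divides by 365 even in leap years.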

Answered By: Guimoute