Using np.select(), with a single-valued custom function as input
Question:
I have the following example code:
import numpy as np
import pandas as pd

data = ['05/21/2021','05/21/2022','05/21/2023']
df = pd.DataFrame(data, columns=['register_date'])
df['register_year'] = pd.to_datetime(df['register_date']).dt.year
df['study_year'] = 2022
Here, I have a dataframe that looks like this:
register_date | register_year | study_year |
---|---|---|
05/21/2021 | 2021 | 2022 |
05/21/2022 | 2022 | 2022 |
05/21/2023 | 2023 | 2022 |
The goal is to create another column named "duration_start" where it is 1 if register_year < study_year, 0 if register_year > study_year, and the proportion of the year between year end of register_year and the register date if register_year == study_year.
I created the following conditions to be used with np.select()
register_year_lt_study_year = df['register_year'] < df['study_year']
register_year_gt_study_year = df['register_year'] > df['study_year']
And a function that calculates the proportion of the year from year end:
def proportion_to_year_end(date):
    start = pd.to_datetime(date)
    year_end = pd.to_datetime('12/31/' + str(start.year))
    return (year_end - start).days/365
However, I'm not sure how I should fill in the ??, since proportion_to_year_end() is a single-valued function taking in a string, but np.select() accepts a column of the same length.
df['duration_start'] = np.select([register_year_lt_study_year, register_year_gt_study_year], [1, 0], ??)
I thought about using the apply() function, to perhaps generate another column and then drop it, but that would entail more logic to take care of the 0 and 1 first, then on top of that, apply proportion_to_year_end, then drop the temporary column.
Alternatively, I thought about changing proportion_to_year_end() to take in two columns, but I’m not sure how to write it without for-loops, which is something we should avoid.
I wonder if there are better ways to handle this type of problem, where there is an apparent dimension mismatch between a single-valued function and the columns of a dataframe?
Answers:
The documentation of np.select indicates that the default must be a value, so I don't think you can apply your function there.
One thing you could do is partially fill the array with 1s and 0s, give a placeholder value such as nan or -1 to the rest of the array (where register_year == study_year), and then fill that section using a mask. For example:
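# Fill the 1s and 0s first; rows where register_year == study_year stay NaN for now
df['duration_start'] = np.select([register_year_lt_study_year, register_year_gt_study_year],
                                 [1, 0],
                                 np.nan)
mask = df['register_year'] == df['study_year']
# Apply the scalar function element-wise to the register dates of the remaining rows
df.loc[mask, 'duration_start'] = np.vectorize(proportion_to_year_end)(df.loc[mask, 'register_date'])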
I had to use vectorize here, but if you change proportion_to_year_end to support an array of datetimes we can remove it.
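As a rough sketch of that last suggestion, the function could be rewritten to take the whole column at once, so that neither vectorize nor a mask is needed. The name proportion_to_year_end_vec is hypothetical (it is not in the original code), and the sketch assumes every date string uses the MM/DD/YYYY format from the example. The computed column is passed as a third choice rather than as the default, which keeps the call within the documented use of np.select:

```python
import numpy as np
import pandas as pd

def proportion_to_year_end_vec(dates: pd.Series) -> pd.Series:
    """Hypothetical column-wise variant of proportion_to_year_end."""
    # Assumes MM/DD/YYYY strings, as in the example data
    start = pd.to_datetime(dates, format="%m/%d/%Y")
    # Build December 31 of each row's own year
    year_end = pd.to_datetime(start.dt.year.astype(str) + "-12-31", format="%Y-%m-%d")
    return (year_end - start).dt.days / 365

df = pd.DataFrame({"register_date": ["05/21/2021", "05/21/2022", "05/21/2023"]})
df["register_year"] = pd.to_datetime(df["register_date"], format="%m/%d/%Y").dt.year
df["study_year"] = 2022

conditions = [
    df["register_year"] < df["study_year"],
    df["register_year"] > df["study_year"],
    df["register_year"] == df["study_year"],
]
# Scalars and a full-length column can be mixed in the choice list;
# np.select picks element-wise from whichever choice matches first.
choices = [1.0, 0.0, proportion_to_year_end_vec(df["register_date"])]
df["duration_start"] = np.select(conditions, choices, default=np.nan)
```

Note that the proportion is computed for every row and then discarded where another condition matches, which is usually an acceptable trade-off for avoiding row-by-row Python calls.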