Pandas get position of last value based on condition for each column (efficiently)
Question:
I want to find the row in which the value 1 occurs last in each column of my dataframe. Given this last row index, I want to calculate the "recency" of the occurrence, like so:
>> df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
   a  b  c  d
0  0  1  1  0
1  0  1  0  0
2  1  1  0  0
3  0  1  0  0
4  0  1  1  0
Desired result:
>> calculate_recency_vector(df)
[3,1,1,None]
The desired result shows, for each column, "how many rows ago" the value 1 appeared for the last time. E.g. for column a, the value 1 appears last in the 3rd-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?
Edit: To avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.
Edit II: Thanks for the great answers! I have to calculate this recency vector approx. 150k times on dataframes shaped (42,250). A more efficient solution would be much appreciated.
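To make the definition concrete, here is a small worked check for column a only. It is a sketch of the arithmetic, not a proposed solution: recency is taken as the number of rows minus the zero-based position of the last 1. The names `last_pos` and `recency_a` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})

n_rows = len(df)
col = df["a"].to_list()

# Zero-based position of the last 1 in column "a".
last_pos = max(i for i, v in enumerate(col) if v == 1)

# 5 rows, last 1 at position 2, so the value sits in the 3rd-last row.
recency_a = n_rows - last_pos
```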
Answers:
With this example dataframe, you can define a function as follows:
import pandas as pd

def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []
    for col in df.columns:
        values = df[col].to_list()
        last = None  # position of the last match; stays None if the value never occurs
        for i, y in enumerate(values):
            if y == condition:
                last = i
        # Recency counts rows from the end: a match in the last row has recency 1.
        recency = None if last is None else len(values) - last
        recency_vector.append(recency)
    return recency_vector
Running the function returns:
calculate_recency_vector(df, 1)
[3, 1, 1, None]
This
import pandas
df = pandas.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x: ([df.shape[0] - i for i, v in x.items() if v == 1] or [None])[-1], axis=0)
produces the desired output as a pd.Series, with the only difference that the result is float and None is replaced by pandas NaN. It relies on the default 0..n-1 row index, since i is the row label. You could then take the desired column.
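If you need the plain list with None from the question rather than a float Series, one way is to convert afterwards. This is a minimal sketch, assuming the same example dataframe and the default integer index; `recency_vector` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})

# Same apply call as above: per column, keep the recency of the last 1, or None.
result = df.apply(
    lambda x: ([df.shape[0] - i for i, v in x.items() if v == 1] or [None])[-1],
    axis=0,
)

# Map missing values back to None and everything else to a plain int.
recency_vector = [None if pd.isna(v) else int(v) for v in result]
```

pd.isna() is used here so the conversion works whether the missing entry comes back as NaN or as None.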
A loop-less solution which is faster & cleaner:
>> import pandas as pd
>> def calculate_recency_for_one_column(column: pd.Series) -> int:
>>     """Returns the inverse position of the last non-zero value in a pd.Series of
>>     numerics. If the last value is non-zero, returns 1. If all values are zero,
>>     returns 0. Assumes the default 0..n-1 integer index."""
>>     non_zero_values_of_col = column[column.astype(bool)]
>>     if non_zero_values_of_col.empty:
>>         return 0
>>     return len(column) - non_zero_values_of_col.index[-1]
>> df = pd.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df.apply(calculate_recency_for_one_column, axis=0)
a    3
b    1
c    1
d    0
dtype: int64
Sidenote: Using df.apply() is slow (explanation why). I am still looking for a faster solution to this problem. I posted this in a new SO question.
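For what it's worth, one way to avoid apply() entirely is to work on the underlying NumPy array. The sketch below is not taken from any of the answers above, and the function name `calculate_recency_vector_np` is made up for illustration. It reverses the rows and uses argmax, which returns the first True per column, so on the reversed array it finds the last match. I have not benchmarked it, so measure it on your own (42, 250) frames before relying on it.

```python
import numpy as np
import pandas as pd


def calculate_recency_vector_np(df: pd.DataFrame, condition: int = 1) -> list:
    """Hypothetical vectorized variant: recency of the last `condition` per column."""
    mask = df.to_numpy() == condition        # boolean array, shape (n_rows, n_cols)
    # argmax gives the first True along axis 0; on the row-reversed mask that is
    # the distance of the last match from the bottom (0 for the last row).
    recency = mask[::-1].argmax(axis=0) + 1
    # argmax also returns 0 for all-False columns, so track those separately.
    has_match = mask.any(axis=0)
    return [int(r) if ok else None for r, ok in zip(recency, has_match)]


df = pd.DataFrame({"a": [0, 0, 1, 0, 0], "b": [1, 1, 1, 1, 1],
                   "c": [1, 0, 0, 0, 1], "d": [0, 0, 0, 0, 0]})
recency_vector = calculate_recency_vector_np(df, 1)
```

Unlike the index-based versions, this one uses row positions only, so it does not depend on the dataframe having the default integer index.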