Pandas get postion of last value based on condition for each column (efficiently)

Question:

I want to get the information in which row the value 1 occurs last for each column of my dataframe. Given this last row index I want to calculate the "recency" of the occurence. Like so:

>> df = pandas.DataFrame({"a":[0,0,1,0,0]," b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
>> df
   a  b  c  d
0  0  1  1  0
1  0  1  0  0
2  1  1  0  0
3  0  1  0  0
4  0  1  1  0

Desired result:

>> calculate_recency_vector(df)
[3,1,1,None]

The desired result shows for each column "how many rows ago" the value 1 appeared for the last time. Eg for the column a the value 1 appears last in the 3rd-last row, hence the recency of 3 in the result vector. Any ideas how to implement this?

Edit: to avoid confusion, I changed the desired output for the last column from 0 to None. This column has no recency because the value 1 does not occur at all.

Edit II: Thanks for the great answers! I have to calculate this recency vector approx. 150k times on dataframes shaped (42,250). A more efficient solution would be much appreciated.

Asked By: Viktor

||

Answers:

With this example dataframe, you can define a function as follow:

def calculate_recency_vector(df: pd.DataFrame, condition: int) -> list:
    recency_vector = []

    for col in df.columns:
        last = 0
        for i, y in enumerate(df[col].to_list()):
            if y == condition:
                last = i

        recency = len(df[col].to_list()) - last
        if recency == len(df[col].to_list()):
            recency = None

        recency_vector.append(recency)

    return recency_vector

Running the function, it will return this:

calculate_recency_vector(df, 1)
[3, 1, 1, None]
Answered By: Adrien Riaux

This

df = pandas.DataFrame({"a":[0,0,1,0,0]," b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})
df.apply(lambda x : ([df.shape[0] - i for i ,v in x.items() if v==1] or [None])[-1], axis=0)

produces the desired output as a pd.Series , with the only diffrence that the result is float and None is replaced by pandas Nan, u could then take the desired column

Answered By: ach2ashes

A loop-less solution which is faster & cleaner:

>> def calculate_recency_for_one_column(column: pd.Series) -> int:
>>     """Returns the inverse position of the last non-zero value in a pd.Series of >> numerics. If the last value is non-zero, returns 1. If all values are non-zero, >> returns 0."""
>>     non_zero_values_of_col = column[column.astype(bool)]
>>     if non_zero_values_of_col.empty:
>>         return 0
>>     return len(column) - non_zero_values_of_col.index[-1]

>> df = pd.DataFrame({"a":[0,0,1,0,0],"b":[1,1,1,1,1],"c":[1,0,0,0,1],"d":[0,0,0,0,0]})

>> df.apply(lambda column: calculate_recency_for_one_column(column),axis=0)

a    3
b    1
c    1
d    0
dtype: int64

Sidenote: Using pd.apply() is slow (explanation why). I am still looking for a faster solution to this problem. I posted this in a new SO question.

Answered By: Viktor
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.