How to returns the number of records that have passed since EXPR in Python
Question:
I’m trying to replicate the IBM SPSS function @SINCE using Python and Pandas, but unfortunately, I’ve got stuck in a part of my process.
Is there a direct function that replicates IBM SPSS CLEM @SINCE using python?
Here is the link for more info: Link
IMB @SINCE function description
"This function returns the offset of the last record where this condition was true–that is, the number of records before this one in which the condition was true. If the condition has never been true, @SINCE returns @INDEX + 1." (IBM, 2020)
I’ve been trying to replicate this function from scratch, but I haven’t find the right way to do it:
How do I do this with pandas and Python?
Here the problem,
My data looks like this:
+------+----------+
| Type | Flag |
+------+----------+
| d | |
+------+----------+
| A | myStatus |
+------+----------+
| c | |
+------+----------+
| B | myStatus |
+------+----------+
| c | |
+------+----------+
| c | myStatus |
+------+----------+
| c | |
+------+----------+
| d | |
+------+----------+
| d | |
+------+----------+
| A | myStatus |
+------+----------+
In IBM SPSS I use this formula to get this data:
if Type = 'A' or Type = 'B' then @SINCE(Flag = 'myStatus') else -1 endif
And this is the output:
+------+----------+----------------+
| Type | Flag | Expected Count |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| A | myStatus | 0 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| B | myStatus | 2 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| c | myStatus | -1 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| A | myStatus | 4 |
+------+----------+----------------+
Answers:
So, I found the way to fix this problem: here is the code:
df = pd.DataFrame({"Type":["d", "A", "c", "B", "c", "c", "c", "d", "d", "A"],
"Flag":[np.nan, "myStatus", np.nan, "myStatus", np.nan, "myStatus", np.nan, np.nan, np.nan, "myStatus"]})
Function to solve the problem:
def spssSince(df):
df_temp = df
df_temp = df[df.Flag=="myStatus"]
df_temp['last_ind'] = df_temp.index
df_temp['last_ind'] = df_temp.last_ind.shift(1)
df_temp['last_ind'] = df_temp['last_ind'].fillna(1)
df_temp["Expected Count"] = df_temp.index - df_temp.last_ind
df_temp.loc[~df_temp.Type.isin(["A", "B"]), "Expected Count"] = -1
DFreturn = pd.merge(left=df, right=df_temp.drop(['Type', 'Flag', 'last_ind'], axis=1), how="left", left_index=True, right_index=True)
DFreturn["Expected Count"] = DFreturn["Expected Count"].fillna(-1)
return DFreturn
Basically the function calculates the last SINCE value from a condition, calculating the actual index among the index that has the validation (using shift()).
I’m trying to replicate the IBM SPSS function @SINCE using Python and Pandas, but unfortunately, I’ve got stuck in a part of my process.
Is there a direct function that replicates IBM SPSS CLEM @SINCE using python?
Here is the link for more info: Link
IMB @SINCE function description
"This function returns the offset of the last record where this condition was true–that is, the number of records before this one in which the condition was true. If the condition has never been true, @SINCE returns @INDEX + 1." (IBM, 2020)
I’ve been trying to replicate this function from scratch, but I haven’t find the right way to do it:
How do I do this with pandas and Python?
Here the problem,
My data looks like this:
+------+----------+
| Type | Flag |
+------+----------+
| d | |
+------+----------+
| A | myStatus |
+------+----------+
| c | |
+------+----------+
| B | myStatus |
+------+----------+
| c | |
+------+----------+
| c | myStatus |
+------+----------+
| c | |
+------+----------+
| d | |
+------+----------+
| d | |
+------+----------+
| A | myStatus |
+------+----------+
In IBM SPSS I use this formula to get this data:
if Type = 'A' or Type = 'B' then @SINCE(Flag = 'myStatus') else -1 endif
And this is the output:
+------+----------+----------------+
| Type | Flag | Expected Count |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| A | myStatus | 0 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| B | myStatus | 2 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| c | myStatus | -1 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| A | myStatus | 4 |
+------+----------+----------------+
So, I found the way to fix this problem: here is the code:
df = pd.DataFrame({"Type":["d", "A", "c", "B", "c", "c", "c", "d", "d", "A"],
"Flag":[np.nan, "myStatus", np.nan, "myStatus", np.nan, "myStatus", np.nan, np.nan, np.nan, "myStatus"]})
Function to solve the problem:
def spssSince(df):
df_temp = df
df_temp = df[df.Flag=="myStatus"]
df_temp['last_ind'] = df_temp.index
df_temp['last_ind'] = df_temp.last_ind.shift(1)
df_temp['last_ind'] = df_temp['last_ind'].fillna(1)
df_temp["Expected Count"] = df_temp.index - df_temp.last_ind
df_temp.loc[~df_temp.Type.isin(["A", "B"]), "Expected Count"] = -1
DFreturn = pd.merge(left=df, right=df_temp.drop(['Type', 'Flag', 'last_ind'], axis=1), how="left", left_index=True, right_index=True)
DFreturn["Expected Count"] = DFreturn["Expected Count"].fillna(-1)
return DFreturn
Basically the function calculates the last SINCE value from a condition, calculating the actual index among the index that has the validation (using shift()).