how to identify sequence order and cumsum the transactions?
Question:
I have the following dataframe:
df = pd.DataFrame({'id':[1,1,1,2,2,3,3,4,5,6,6,6,6,6,8,8,9,11,12,12],'letter':['A','A','Q','Q','Q','F','F','G','D','G','I','I','K','Q','E','S','S','I','I','F']})
My objective is to add another column tx that shows the followings: if it finds Q and there after an I – mark it as 1st transaction. Both Q and I must exists and must have the same comes as last_Q –> first_I.
so the end result should look like this:
Answers:
To easily find the Q-I
pattern, I’d use a regex like Q[^QI]*?I
, that will also provide the positions where it matches
Then use iloc
to set your counter
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 4, 5, 6, 6, 6, 6, 6, 8, 8, 9, 11, 12, 12],
'letter': ['A', 'A', 'Q', 'Q', 'Q', 'F', 'F', 'G', 'D', 'G', 'I', 'I',
'K', 'Q', 'E', 'S', 'S', 'I', 'I', 'F']})
df['tx'] = 0
count = 1
for x in re.finditer("Q[^QI]*?I", df['letter'].str.cat()):
df.iloc[x.start():x.end(), 2] = count
count += 1
I would use boolean arithmetic:
# map True/False to Q/I
m1 = df['letter'].map({'Q': True, 'I': False})
# ffill the values
m2 = m1.ffill().fillna(False)
# only keep last Qs
m3 = m1.shift(-1).ne(True)
m4 = m2&m3
df['tx'] = (m1&m3).cumsum().where(m4|m4.shift(), 0)
Output:
id letter tx
0 1 A 0
1 1 A 0
2 1 Q 0
3 2 Q 0
4 2 Q 1
5 3 F 1
6 3 F 1
7 4 G 1
8 5 D 1
9 6 G 1
10 6 I 1
11 6 I 0
12 6 K 0
13 6 Q 2
14 8 E 2
15 8 S 2
16 9 S 2
17 11 I 2
18 12 I 0
19 12 F 0
Intermediates:
id letter tx m1 m2 m3 m4
0 1 A 0 NaN False True False
1 1 A 0 NaN False False False
2 1 Q 0 True True False False
3 2 Q 0 True True False False
4 2 Q 1 True True True True
5 3 F 1 NaN True True True
6 3 F 1 NaN True True True
7 4 G 1 NaN True True True
8 5 D 1 NaN True True True
9 6 G 1 NaN True True True
10 6 I 1 False False True False
11 6 I 0 False False True False
12 6 K 0 NaN False False False
13 6 Q 2 True True True True
14 8 E 2 NaN True True True
15 8 S 2 NaN True True True
16 9 S 2 NaN True True True
17 11 I 2 False False True False
18 12 I 0 False False True False
19 12 F 0 NaN False True False
Here is another way:
q = df['letter'].eq('Q')
i = df['letter'].eq('I')
m1 = q.diff(-1).ne(0) & q
m2 = i.diff().ne(0) & i
m1.cumsum().where(m1.where(m1|m2.shift()).ffill().fillna(False),0)
Output:
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 0
12 0
13 2
14 2
15 2
16 2
17 2
18 0
19 0
I have the following dataframe:
df = pd.DataFrame({'id':[1,1,1,2,2,3,3,4,5,6,6,6,6,6,8,8,9,11,12,12],'letter':['A','A','Q','Q','Q','F','F','G','D','G','I','I','K','Q','E','S','S','I','I','F']})
My objective is to add another column tx that shows the followings: if it finds Q and there after an I – mark it as 1st transaction. Both Q and I must exists and must have the same comes as last_Q –> first_I.
so the end result should look like this:
To easily find the Q-I
pattern, I’d use a regex like Q[^QI]*?I
, that will also provide the positions where it matches
Then use iloc
to set your counter
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 4, 5, 6, 6, 6, 6, 6, 8, 8, 9, 11, 12, 12],
'letter': ['A', 'A', 'Q', 'Q', 'Q', 'F', 'F', 'G', 'D', 'G', 'I', 'I',
'K', 'Q', 'E', 'S', 'S', 'I', 'I', 'F']})
df['tx'] = 0
count = 1
for x in re.finditer("Q[^QI]*?I", df['letter'].str.cat()):
df.iloc[x.start():x.end(), 2] = count
count += 1
I would use boolean arithmetic:
# map True/False to Q/I
m1 = df['letter'].map({'Q': True, 'I': False})
# ffill the values
m2 = m1.ffill().fillna(False)
# only keep last Qs
m3 = m1.shift(-1).ne(True)
m4 = m2&m3
df['tx'] = (m1&m3).cumsum().where(m4|m4.shift(), 0)
Output:
id letter tx
0 1 A 0
1 1 A 0
2 1 Q 0
3 2 Q 0
4 2 Q 1
5 3 F 1
6 3 F 1
7 4 G 1
8 5 D 1
9 6 G 1
10 6 I 1
11 6 I 0
12 6 K 0
13 6 Q 2
14 8 E 2
15 8 S 2
16 9 S 2
17 11 I 2
18 12 I 0
19 12 F 0
Intermediates:
id letter tx m1 m2 m3 m4
0 1 A 0 NaN False True False
1 1 A 0 NaN False False False
2 1 Q 0 True True False False
3 2 Q 0 True True False False
4 2 Q 1 True True True True
5 3 F 1 NaN True True True
6 3 F 1 NaN True True True
7 4 G 1 NaN True True True
8 5 D 1 NaN True True True
9 6 G 1 NaN True True True
10 6 I 1 False False True False
11 6 I 0 False False True False
12 6 K 0 NaN False False False
13 6 Q 2 True True True True
14 8 E 2 NaN True True True
15 8 S 2 NaN True True True
16 9 S 2 NaN True True True
17 11 I 2 False False True False
18 12 I 0 False False True False
19 12 F 0 NaN False True False
Here is another way:
q = df['letter'].eq('Q')
i = df['letter'].eq('I')
m1 = q.diff(-1).ne(0) & q
m2 = i.diff().ne(0) & i
m1.cumsum().where(m1.where(m1|m2.shift()).ffill().fillna(False),0)
Output:
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 0
12 0
13 2
14 2
15 2
16 2
17 2
18 0
19 0