Avoid df.iterrows to drop dataframe rows under certain conditions
Question:
I have a dataframe similar to this:
import pandas as pd
colA = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
colB = [(21,1,2), (0,1,21), (2,1,21), (1,12,5), (21,1,0), (12,5,6), (18,7,14), (7,5,12), (14,7,18), (12,7,11), (11,7,12), (3,5,7)]
df = pd.DataFrame(list(zip(colA, colB)), columns = ['colA', 'colB'])
display(df)
output:
colA colB
0 a (21, 1, 2)
1 a (0, 1, 21)
2 a (2, 1, 21)
3 a (1, 12, 5)
4 b (21, 1, 0)
5 b (12, 5, 6)
6 b (18, 7, 14)
7 b (7, 5, 12)
8 c (14, 7, 18)
9 c (12, 7, 11)
10 c (11, 7, 12)
11 c (3, 5, 7)
I’d need to drop (or filter out) all the rows where, within the same value of colA, the value of colB in one row is equal to the reversed value of colB in another row.
In the example provided:
- within colA='a', row 2 has colB=(2,1,21), which is the reverse of row 0's colB=(21,1,2), and thus should be dropped
- for colA='b', row 4 has colB=(21,1,0), which is the reverse of row 1's colB=(0,1,21), but that row has colA='a', so nothing to drop here
- within colA='c', row 10 has colB=(11,7,12), which is the reverse of row 9's colB=(12,7,11), and thus should be dropped
The final result would be something like:
colA colB
0 a (21, 1, 2)
1 a (0, 1, 21)
2 a (1, 12, 5)
3 b (21, 1, 0)
4 b (12, 5, 6)
5 b (18, 7, 14)
6 b (7, 5, 12)
7 c (14, 7, 18)
8 c (12, 7, 11)
9 c (3, 5, 7)
Observations:
- Preferably, drop the rows on a copy of the dataframe and keep the original intact
- Very important: my real dataframe has shape (3 million, 11), so I am looking for an efficient way to do this (.apply, lambda, etc.). I did this in the past with df.iterrows; it was already not the best way, my bad. Now it’s completely unfeasible.
Current df.iterrows solution:
unique_df = df.copy()
seen_a_b = set()
for i, row in df.iterrows():
    val_a = row['colA']
    val_b = row['colB']
    a_b = (val_a, val_b)
    a_revb = (val_a, val_b[::-1])
    if a_b in seen_a_b:
        unique_df.drop(i, inplace=True)
        continue
    seen_a_b.add(a_b)
    seen_a_b.add(a_revb)
Answers:
Try this:
df.groupby(['colA',df['colB'].map(lambda x: frozenset((x[0],x[-1])))],as_index=False).first()
This solution creates a frozenset, an immutable set that can be used as a groupby key. This key, along with colA, is used to take the first row of each group. We only use the first and last values of colB, since the middle value reads the same forwards and backwards.
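To see why the frozenset works as a symmetric key, here is a small check using a tuple from the question:

```python
# A frozenset is unordered, so a tuple and its reverse
# produce the same (first, last) key.
t = (21, 1, 2)
key = frozenset((t[0], t[-1]))
rev = t[::-1]                       # (2, 1, 21)
rev_key = frozenset((rev[0], rev[-1]))
print(key == rev_key)  # True: both are frozenset({2, 21})
```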
or
df.loc[df['colB'].map(lambda x: frozenset((x[0],x[-1]))).groupby(df['colA']).transform(lambda x: ~x.duplicated())]
Output:
colA colB
0 a (21, 1, 2)
1 a (0, 1, 21)
2 a (1, 12, 5)
3 b (21, 1, 0)
4 b (12, 5, 6)
5 b (18, 7, 14)
6 b (7, 5, 12)
7 c (14, 7, 18)
8 c (12, 7, 11)
9 c (3, 5, 7)
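One caveat worth noting: the frozenset key looks only at the endpoints, so two rows like (21,5,2) and (21,9,2) would also be collapsed, even though neither is the reverse of the other. If that can happen in the real data, a canonical key built from the whole tuple avoids it. A sketch, reusing the question's data:

```python
import pandas as pd

colA = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
colB = [(21, 1, 2), (0, 1, 21), (2, 1, 21), (1, 12, 5), (21, 1, 0), (12, 5, 6),
        (18, 7, 14), (7, 5, 12), (14, 7, 18), (12, 7, 11), (11, 7, 12), (3, 5, 7)]
df = pd.DataFrame({'colA': colA, 'colB': colB})

# Canonical key: a tuple and its exact reverse map to the same value,
# but tuples that merely share endpoints do not.
key = df['colB'].map(lambda t: min(t, t[::-1]))

# Keep the first row per (colA, key) pair; the original df stays untouched.
mask = ~pd.Series(list(zip(df['colA'], key)), index=df.index).duplicated()
unique_df = df.loc[mask].reset_index(drop=True)
print(unique_df)
```

This keeps the same 10 rows as the answers above on the example data, and is still a single vectorized pass (one map plus one duplicated), so it should scale to millions of rows.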