Check if pandas row contains exact quantity of strings
Question:
I have a df1 32611 x 17:
0 1 2 3 4 5 ... 11 12 13 14 15 16
0 BSO PRV BSI TUR WSP ACP ... HLR HEX HEX None None None
1 BSO PRV BSI TUR WSP ACP ... HLF HLR HEX HEX HEX None
2 BSO PRV BSI HLF HLR TUR ... HEX RSO RSI HEX HEX HEX
3 BSO PRV BSI HLF HLR TUR ... RSO RSI HEX HEX HEX None
4 BSO PRV BSI HLF TUR WSP ... RSO RSI HLR HEX HEX HEX
... ... ... ... ... ... ... ... ... ... ... ... ...
32607 BSO PRV BSI TUR WSP ACP ... HEX None None None None None
32608 BSO PRV BSI TUR WSP ACP ... HEX None None None None None
32609 BSO PRV BSI TUR WSP ACP ... HEX None None None None None
32610 BSO PRV BSI TUR WSP ACP ... HEX None None None None None
32611 BSO PRV BSI TUR WSP ACP ... HEX None None None None None
I have another df2 6 x 17:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 ACP HEX HEX HEX HEX TUR NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ACP HEX HEX HEX HEX HEX HEX TUR NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 ACP HEX HEX HEX HEX HEX HEX TUR TUR NaN NaN NaN NaN NaN NaN NaN NaN
4 ACP HEX HEX HEX HEX TUR TUR NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 ACP HEX HEX TUR NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 ACP HEX HEX TUR TUR NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I specifically care about df2’s value counts for each row. What I am trying to find out is: does df1.loc[i] contain df2.loc[j].value_counts()?
For example, df2.loc[1].value_counts() is:
HEX 4
ACP 1
TUR 1
Name: 1, dtype: int64
I want to iterate through each row of df1 and check whether it contains exactly 4 HEX, 1 ACP, and 1 TUR. If it does, assign it a number (in a separate list; this part doesn’t matter); if not, pass.
Answers:
Per the conversation in the comments, here is one way to compare on a row-by-row basis (not sure how performant this will be if operating on many records):
import pandas as pd

def contains_value_counts(row1: pd.Series, row2: pd.Series) -> bool:
    """Check if `row1` contains the value counts of `row2`."""
    vc1 = row1.value_counts()  # None/NaN are dropped by default
    vc2 = row2.value_counts()
    # Keep only the values that occur in row2, then compare the counts.
    return vc1.filter(vc2.index).equals(vc2)

df1 = pd.DataFrame(...)
df2 = pd.DataFrame(...)
idx1 = 0
idx2 = 0
equal = contains_value_counts(df1.iloc[idx1], df2.iloc[idx2])
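To run the same check over every row of df1, as the question asks, a rough sketch could look like the following. It is not part of the answer above: the helpers `row_counts` and `label_rows` are hypothetical names, the two small frames are made up for illustration, and `collections.Counter` is swapped in for `value_counts()` so that each df2 pattern is counted only once. It also assumes that the first matching df2 row should win and that unmatched rows get None.

```python
from collections import Counter

import pandas as pd


def row_counts(row: pd.Series) -> Counter:
    """Count the non-missing values in a row (None/NaN are ignored)."""
    return Counter(row.dropna())


def label_rows(df1: pd.DataFrame, df2: pd.DataFrame) -> list:
    """For each row of df1, return the index label of the first df2 row
    whose value counts it matches exactly, or None if nothing matches."""
    # Count every df2 row once up front instead of once per comparison.
    patterns = {label: row_counts(row) for label, row in df2.iterrows()}
    labels = []
    for _, row in df1.iterrows():
        counts = row_counts(row)
        match = None
        for label, pattern in patterns.items():
            # Exact quantity: each value in the pattern must occur the same
            # number of times in the df1 row; other values are ignored.
            if all(counts[value] == n for value, n in pattern.items()):
                match = label
                break
        labels.append(match)
    return labels


# Tiny made-up frames, only to show the call.
df1 = pd.DataFrame([
    ["BSO", "ACP", "HEX", "HEX", "TUR", None],
    ["BSO", "ACP", "HEX", "TUR", "TUR", None],
    ["BSO", "PRV", "HEX", None, None, None],
])
df2 = pd.DataFrame(
    [["ACP", "HEX", "HEX", "TUR"], ["ACP", "HEX", "TUR", "TUR"]],
    index=[1, 2],
)
labels = label_rows(df1, df2)
print(labels)  # [1, 2, None]
```

With roughly 32,000 rows in df1 and 6 in df2 this is still a plain Python loop, so it may be slow; it is meant to show the logic rather than to be the fastest option.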