Searching Dataframe Using Two Criteria
Question:
I have a dataframe called test_df (below) and I am trying to search for the ID based on two criteria: the Factor and the name.
Here’s my code:
test_id = test_df.loc[test_df['Name'].str.contains(test_name, case=False) & test_df['Factor'].str.contains(test_factor, case=False), 'ID'].item()
But I get the following error:
Traceback (most recent call last):
File "C:/Users/v3.py", line 508, in <module>
test_id = test_df.loc[test_df['Name'].str.contains(test_name, case=False) & test_df['Factor'].str.contains(test_factor, case=False), 'ID'].item()
File "C:UsersAppDataLocalProgramsPythonPython37libsite-packagespandascorebase.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
I thought the issue was the Factor column being a float format but I have converted it to a string with the same result.
Can anyone see what the problem is?
FIELD1
ID
Factor
Status
Speed
Name
0
49513622
-4
ACTIVE
II
1
11193741
4
ACTIVE
AP
2
49513622
-3.75
ACTIVE
II
3
11193741
3.75
ACTIVE
AP
4
49513622
-3.5
ACTIVE
II
5
11193741
3.5
ACTIVE
AP
6
49513622
-3.25
ACTIVE
II
7
11193741
3.25
ACTIVE
AP
8
49513622
-3
ACTIVE
II
9
11193741
3
ACTIVE
AP
10
49513622
-2.75
ACTIVE
II
11
11193741
2.75
ACTIVE
AP
12
49513622
-2.5
ACTIVE
II
13
11193741
2.5
ACTIVE
AP
14
49513622
-2.25
ACTIVE
II
15
11193741
2.25
ACTIVE
AP
16
49513622
-2
ACTIVE
II
17
11193741
2
ACTIVE
AP
18
49513622
-1.75
ACTIVE
II
19
11193741
1.75
ACTIVE
AP
20
49513622
-1.5
ACTIVE
II
21
11193741
1.5
ACTIVE
AP
22
49513622
-1.25
ACTIVE
II
23
11193741
1.25
ACTIVE
AP
24
49513622
-1
ACTIVE
II
25
11193741
1
ACTIVE
AP
26
49513622
-0.75
ACTIVE
II
27
11193741
0.75
ACTIVE
1.02
AP
28
49513622
-0.5
ACTIVE
II
29
11193741
0.5
ACTIVE
AP
30
49513622
-0.25
ACTIVE
II
31
11193741
0.25
ACTIVE
AP
32
49513622
0
ACTIVE
2.68
II
33
11193741
0
ACTIVE
1.03
AP
34
49513622
0.25
ACTIVE
II
35
11193741
-0.25
ACTIVE
1.99
AP
36
49513622
0.5
ACTIVE
II
37
11193741
-0.5
ACTIVE
2.3
AP
38
49513622
0.75
ACTIVE
II
39
11193741
-0.75
ACTIVE
AP
40
49513622
1
ACTIVE
II
41
11193741
-1
ACTIVE
AP
42
49513622
1.25
ACTIVE
II
43
11193741
-1.25
ACTIVE
AP
44
49513622
1.5
ACTIVE
II
45
11193741
-1.5
ACTIVE
AP
46
49513622
1.75
ACTIVE
II
47
11193741
-1.75
ACTIVE
AP
48
49513622
2
ACTIVE
II
49
11193741
-2
ACTIVE
AP
50
49513622
2.25
ACTIVE
II
51
11193741
-2.25
ACTIVE
AP
52
49513622
2.5
ACTIVE
II
53
11193741
-2.5
ACTIVE
AP
54
49513622
2.75
ACTIVE
II
55
11193741
-2.75
ACTIVE
AP
56
49513622
3
ACTIVE
II
57
11193741
-3
ACTIVE
AP
58
49513622
3.25
ACTIVE
II
59
11193741
-3.25
ACTIVE
AP
60
49513622
3.5
ACTIVE
II
61
11193741
-3.5
ACTIVE
AP
62
49513622
3.75
ACTIVE
II
63
11193741
-3.75
ACTIVE
AP
64
49513622
4
ACTIVE
II
65
11193741
-4
ACTIVE
AP
Answers:
Problem is no match any value, so DataFrame.loc
return empty Series
.
Possible solution is use next
+ iter
for assign default value if no match:
mask = test_df['Name'].str.contains(test_name, case=False) &
test_df['Factor'].str.contains(test_factor, case=False)
#if no match assign 'no match'
test_id = next(iter(test_df.loc[mask, 'ID']), 'no match')
#if no match assign None
test_id = next(iter(test_df.loc[mask, 'ID']), None)
Or use if-else
with test if at least one value match:
test_id = test_df.loc[mask, 'ID'].item() if m.any() else 'no match'
Or:
if m.any():
test_id = test_df.loc[mask, 'ID'].item()
EDIT: For testing si possible create helper columns:
m1 = test_df['Name'].str.contains(test_name, case=False)
m2 = test_df['Factor'].str.contains(test_factor, case=False)
test_df = test_df.assign(name_mask = m1, factor_mask = m1, both = m1 & m2)
I have a dataframe called test_df (below) and I am trying to search for the ID based on two criteria: the Factor and the name.
Here’s my code:
test_id = test_df.loc[test_df['Name'].str.contains(test_name, case=False) & test_df['Factor'].str.contains(test_factor, case=False), 'ID'].item()
But I get the following error:
Traceback (most recent call last):
File "C:/Users/v3.py", line 508, in <module>
test_id = test_df.loc[test_df['Name'].str.contains(test_name, case=False) & test_df['Factor'].str.contains(test_factor, case=False), 'ID'].item()
File "C:UsersAppDataLocalProgramsPythonPython37libsite-packagespandascorebase.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
I thought the issue was the Factor column being a float format but I have converted it to a string with the same result.
Can anyone see what the problem is?
FIELD1 | ID | Factor | Status | Speed | Name |
---|---|---|---|---|---|
0 | 49513622 | -4 | ACTIVE | II | |
1 | 11193741 | 4 | ACTIVE | AP | |
2 | 49513622 | -3.75 | ACTIVE | II | |
3 | 11193741 | 3.75 | ACTIVE | AP | |
4 | 49513622 | -3.5 | ACTIVE | II | |
5 | 11193741 | 3.5 | ACTIVE | AP | |
6 | 49513622 | -3.25 | ACTIVE | II | |
7 | 11193741 | 3.25 | ACTIVE | AP | |
8 | 49513622 | -3 | ACTIVE | II | |
9 | 11193741 | 3 | ACTIVE | AP | |
10 | 49513622 | -2.75 | ACTIVE | II | |
11 | 11193741 | 2.75 | ACTIVE | AP | |
12 | 49513622 | -2.5 | ACTIVE | II | |
13 | 11193741 | 2.5 | ACTIVE | AP | |
14 | 49513622 | -2.25 | ACTIVE | II | |
15 | 11193741 | 2.25 | ACTIVE | AP | |
16 | 49513622 | -2 | ACTIVE | II | |
17 | 11193741 | 2 | ACTIVE | AP | |
18 | 49513622 | -1.75 | ACTIVE | II | |
19 | 11193741 | 1.75 | ACTIVE | AP | |
20 | 49513622 | -1.5 | ACTIVE | II | |
21 | 11193741 | 1.5 | ACTIVE | AP | |
22 | 49513622 | -1.25 | ACTIVE | II | |
23 | 11193741 | 1.25 | ACTIVE | AP | |
24 | 49513622 | -1 | ACTIVE | II | |
25 | 11193741 | 1 | ACTIVE | AP | |
26 | 49513622 | -0.75 | ACTIVE | II | |
27 | 11193741 | 0.75 | ACTIVE | 1.02 | AP |
28 | 49513622 | -0.5 | ACTIVE | II | |
29 | 11193741 | 0.5 | ACTIVE | AP | |
30 | 49513622 | -0.25 | ACTIVE | II | |
31 | 11193741 | 0.25 | ACTIVE | AP | |
32 | 49513622 | 0 | ACTIVE | 2.68 | II |
33 | 11193741 | 0 | ACTIVE | 1.03 | AP |
34 | 49513622 | 0.25 | ACTIVE | II | |
35 | 11193741 | -0.25 | ACTIVE | 1.99 | AP |
36 | 49513622 | 0.5 | ACTIVE | II | |
37 | 11193741 | -0.5 | ACTIVE | 2.3 | AP |
38 | 49513622 | 0.75 | ACTIVE | II | |
39 | 11193741 | -0.75 | ACTIVE | AP | |
40 | 49513622 | 1 | ACTIVE | II | |
41 | 11193741 | -1 | ACTIVE | AP | |
42 | 49513622 | 1.25 | ACTIVE | II | |
43 | 11193741 | -1.25 | ACTIVE | AP | |
44 | 49513622 | 1.5 | ACTIVE | II | |
45 | 11193741 | -1.5 | ACTIVE | AP | |
46 | 49513622 | 1.75 | ACTIVE | II | |
47 | 11193741 | -1.75 | ACTIVE | AP | |
48 | 49513622 | 2 | ACTIVE | II | |
49 | 11193741 | -2 | ACTIVE | AP | |
50 | 49513622 | 2.25 | ACTIVE | II | |
51 | 11193741 | -2.25 | ACTIVE | AP | |
52 | 49513622 | 2.5 | ACTIVE | II | |
53 | 11193741 | -2.5 | ACTIVE | AP | |
54 | 49513622 | 2.75 | ACTIVE | II | |
55 | 11193741 | -2.75 | ACTIVE | AP | |
56 | 49513622 | 3 | ACTIVE | II | |
57 | 11193741 | -3 | ACTIVE | AP | |
58 | 49513622 | 3.25 | ACTIVE | II | |
59 | 11193741 | -3.25 | ACTIVE | AP | |
60 | 49513622 | 3.5 | ACTIVE | II | |
61 | 11193741 | -3.5 | ACTIVE | AP | |
62 | 49513622 | 3.75 | ACTIVE | II | |
63 | 11193741 | -3.75 | ACTIVE | AP | |
64 | 49513622 | 4 | ACTIVE | II | |
65 | 11193741 | -4 | ACTIVE | AP |
Problem is no match any value, so DataFrame.loc
return empty Series
.
Possible solution is use next
+ iter
for assign default value if no match:
mask = test_df['Name'].str.contains(test_name, case=False) &
test_df['Factor'].str.contains(test_factor, case=False)
#if no match assign 'no match'
test_id = next(iter(test_df.loc[mask, 'ID']), 'no match')
#if no match assign None
test_id = next(iter(test_df.loc[mask, 'ID']), None)
Or use if-else
with test if at least one value match:
test_id = test_df.loc[mask, 'ID'].item() if m.any() else 'no match'
Or:
if m.any():
test_id = test_df.loc[mask, 'ID'].item()
EDIT: For testing si possible create helper columns:
m1 = test_df['Name'].str.contains(test_name, case=False)
m2 = test_df['Factor'].str.contains(test_factor, case=False)
test_df = test_df.assign(name_mask = m1, factor_mask = m1, both = m1 & m2)