Using regex in contains() to select rows from a pandas data frame matching a string value (upper or lower case)
Question:
I want to extract rows from a pandas data frame based on the values of a column, using a regex in the contains() method.
I am using the following line to keep a row if its ‘COMPTYPE’ column contains any of the strings passed to contains():
df = df[df['COMPTYPE'].astype(str).str.contains('MCCB|ACB|VCB|CONTACTOR', regex=True)]
It works, but it does not select rows whose ‘COMPTYPE’ value is MccB, Vcb, Contactor, acb, and so on.
How can I change this command so that it selects rows irrespective of the case of the string values?
Input:
BOARDIBNO | SUBCOMP_IBNO | COMPTYPE |
---|---|---|
1044444001 | 9044444001 | ACB |
1044444001 | 9044444002 | Relay |
1044444001 | 9044444003 | Meters |
1044444001 | 9044444004 | MCCB/MPCB |
1044444001 | 9044444005 | vcb |
1044444001 | 9044444006 | MCCB/MPCB |
1044444001 | 9044444007 | acb |
1044444001 | 9044444008 | mccb |
1044444001 | 9044444009 | MCCB/MPCB |
1044444001 | 9044444010 | Power Contactor |
1044444001 | 9044444011 | Power Contactor |
1044444001 | 9044444012 | Control Contactor |
1044444001 | 9044444013 | VCB |
Expected output is this,
BOARDIBNO | SUBCOMP_IBNO | COMPTYPE |
---|---|---|
1044444001 | 9044444001 | ACB |
1044444001 | 9044444004 | MCCB/MPCB |
1044444001 | 9044444005 | vcb |
1044444001 | 9044444006 | MCCB/MPCB |
1044444001 | 9044444007 | acb |
1044444001 | 9044444008 | mccb |
1044444001 | 9044444009 | MCCB/MPCB |
1044444001 | 9044444010 | Power Contactor |
1044444001 | 9044444011 | Power Contactor |
1044444001 | 9044444012 | Control Contactor |
1044444001 | 9044444013 | VCB |
However, I’m getting the following output:
BOARDIBNO | SUBCOMP_IBNO | COMPTYPE |
---|---|---|
1044444001 | 9044444001 | ACB |
1044444001 | 9044444004 | MCCB/MPCB |
1044444001 | 9044444005 | MCCB/MPCB |
1044444001 | 9044444006 | MCCB/MPCB |
1044444001 | 9044444010 | VCB |
How to do it? Please help!
Answers:
Just pass flags=re.IGNORECASE as a parameter of str.contains, or use case=False as suggested by @JoanLara:
import re
# flags=re.IGNORECASE makes the regex match regardless of letter case
out = df[df['COMPTYPE'].astype(str)
         .str.contains('MCCB|ACB|VCB|CONTACTOR', regex=True, flags=re.IGNORECASE)]
print(out)
# Output
BOARDIBNO SUBCOMP_IBNO COMPTYPE
0 1044444001 9044444001 ACB
3 1044444001 9044444004 MCCB/MPCB
4 1044444001 9044444005 vcb
5 1044444001 9044444006 MCCB/MPCB
6 1044444001 9044444007 acb
7 1044444001 9044444008 mccb
8 1044444001 9044444009 MCCB/MPCB
9 1044444001 9044444010 Power Contactor
10 1044444001 9044444011 Power Contactor
11 1044444001 9044444012 Control Contactor
12 1044444001 9044444013 VCB
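The case=False variant mentioned above is not spelled out, so here is a minimal sketch of it. It assumes a small data frame built from a few rows of the question's input; the variable names `pattern` and `mask` are only for illustration.

```python
import pandas as pd

# A handful of rows copied from the question's input table.
df = pd.DataFrame({
    "BOARDIBNO": [1044444001] * 5,
    "SUBCOMP_IBNO": [9044444001, 9044444002, 9044444005,
                     9044444008, 9044444010],
    "COMPTYPE": ["ACB", "Relay", "vcb", "mccb", "Power Contactor"],
})

pattern = "MCCB|ACB|VCB|CONTACTOR"

# case=False asks str.contains to ignore letter case when matching.
mask = df["COMPTYPE"].astype(str).str.contains(pattern, case=False, regex=True)
out = df[mask]

print(out["COMPTYPE"].tolist())
# ['ACB', 'vcb', 'mccb', 'Power Contactor']
```

Both spellings should select the same rows here; case=False simply avoids the extra `import re`.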
Or upper-case the column first:
# Upper-case only for the comparison; the original values are kept in the result
out = df[df['COMPTYPE'].astype(str).str.upper()
         .str.contains('MCCB|ACB|VCB|CONTACTOR', regex=True)]
print(out)
# Output
BOARDIBNO SUBCOMP_IBNO COMPTYPE
0 1044444001 9044444001 ACB
3 1044444001 9044444004 MCCB/MPCB
4 1044444001 9044444005 vcb
5 1044444001 9044444006 MCCB/MPCB
6 1044444001 9044444007 acb
7 1044444001 9044444008 mccb
8 1044444001 9044444009 MCCB/MPCB
9 1044444001 9044444010 Power Contactor
10 1044444001 9044444011 Power Contactor
11 1044444001 9044444012 Control Contactor
12 1044444001 9044444013 VCB
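A related point that may be worth knowing: if the only reason for astype(str) is missing values in the column, the na parameter of str.contains might be an alternative. The sketch below is not part of the original answer, and the small data frame with a NaN entry is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value mixed in with the strings.
df = pd.DataFrame({
    "COMPTYPE": ["ACB", np.nan, "Relay", "Control Contactor", "Vcb"],
})

# na=False treats missing values as "no match", so the mask stays boolean
# and can be used for indexing without converting the column to str first.
mask = df["COMPTYPE"].str.contains("MCCB|ACB|VCB|CONTACTOR",
                                   case=False, na=False)
out = df[mask]

print(out["COMPTYPE"].tolist())
# ['ACB', 'Control Contactor', 'Vcb']
```

This only helps when the non-missing values are already strings; for a column that mixes in numbers, astype(str) as used in the answer is still the simpler route.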