Extract with multiple Patterns
Question:
I'm having an issue that maybe someone can help me with. I am trying to extract two patterns from a string and place them in another column. It's extracting the first pattern fine, but I am missing something in getting the second one. Here's the code.
jobseries['New Column'] = jobseries['Occupation'].str.extract('(GS-\d+)(|)(WG-\d+)').fillna('')
The first string is (GS-\d+)
and the second string is (WG-\d+)
I've tried a ton of variations, but none have worked.
Answers:
You can use either
jobseries['New Column'] = jobseries['Occupation'].str.extract(r'(GS-\d+|WG-\d+)').fillna('')
or a shorter version:
jobseries['New Column'] = jobseries['Occupation'].str.extract(r'((?:GS|WG)-\d+)').fillna('')
The points are:
- There must be only one capturing group in the regex, since you are using Series.str.extract and assigning the result to a single column (New Column).
- The regex must match either one string or the other, but you can factor out the shared -\d+ part of the pattern and simply use ((?:GS|WG)-\d+) instead of (GS-\d+|WG-\d+). That is a capturing group that matches either GS or WG, then a hyphen, then one or more digits.
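To see both patterns side by side, here is a minimal sketch. The sample Occupation values are made up for illustration and are not from the question, and expand=False is my own addition so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data; the real 'Occupation' values were not shown in the question.
jobseries = pd.DataFrame(
    {"Occupation": ["Clerk GS-0303", "Mechanic WG-5803", "Contractor"]}
)

# Single capturing group; the non-capturing (?:GS|WG) picks the prefix,
# then the shared "-\d+" matches the hyphen and the digits.
short_pattern = r"((?:GS|WG)-\d+)"
# Equivalent, with the alternatives spelled out in full.
long_pattern = r"(GS-\d+|WG-\d+)"

# expand=False (an assumption, not in the original answer) yields a Series.
jobseries["New Column"] = (
    jobseries["Occupation"].str.extract(short_pattern, expand=False).fillna("")
)
long_result = (
    jobseries["Occupation"].str.extract(long_pattern, expand=False).fillna("")
)

print(jobseries["New Column"].tolist())
```

Rows with no GS or WG code come back as NaN from str.extract, which is why the fillna('') call is kept from the original code.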