Regular Expression to Pull Information from String in Python
Question:
What I am trying to do is take my current string and remove all data from it that doesn’t contain the actual software version. Here is the string I am currently working with:
print (CurrentVersion)
Delivers the output:
2018, \\some\directory\is\here, \\some\directory\is\here, 2019, \\here\is\another\directory, \\here\is\another\directory, 2021, \\here\is\another\path_2021, 2020, http://some.will/even/look/like/this, 2022r2, 2023
When what I really want is this for an output:
2018, 2019, 2020, 2021, 2022r2, 2023
What I have tried was to come up with a regular expression to remove the excess data. It looks like ‘[0-9, ]’ will pull out the numbers and commas getting me closer to my goal. So I came up with this code:
RegexVersion = re.compile(r'[0-9, ]')
CurrentVersion = RegexVersion.search(CurrentVersion)
print (CurrentVersion.group())
But this only prints "2". Based on a regex calculator, it looked like the result would be closer to my expected output. From there I was planning to use .replace to get rid of the extra commas and spaces, but I can’t seem to get that far.
So the question is, how do I go from the current output of "CurrentVersion" stripped down to only versions, preferably in numerical order?
Answers:
You might use a capture group:
(?:^|,\s*)(\d{4}\w*)(?=,|$)
The pattern matches:
(?:^|,\s*)
Match either the start of the string, or a comma followed by optional whitespace chars
(\d{4}\w*)
Capture 4 digits followed by optional word characters
(?=,|$)
Assert either a comma or the end of the string to the right
Example
import re

# Capture a 4-digit year (plus any trailing word characters) that fills a whole comma-separated field
pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"
s = ("2018, \\\\some\\directory\\is\\here, \\\\some\\directory\\is\\here, 2019, \\\\here\\is\\another\\directory, \\\\here\\is\\another\\directory, 2021, \\\\here\\is\\another\\path_2021, 2020, http://some.will/even/look/like/this, 2022r2, 2023\n")
# findall returns the contents of the single capture group for every match
print(re.findall(pattern, s))
Output
['2018', '2019', '2021', '2020', '2022r2', '2023']
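The list above keeps the order in which the versions appear in the input, while the question asks for numerical order. A minimal sketch of one way to sort it, assuming every version starts with a 4-digit year and that any suffix such as r2 can be compared as plain text; the helper name version_key and the shortened sample string are only for illustration:

```python
import re

pattern = r"(?:^|,\s*)(\d{4}\w*)(?=,|$)"
# Shortened, reordered sample (raw string, so backslashes are literal).
s = r"2018, \\some\directory\is\here, 2021, \\here\is\another\path_2021, 2020, 2022r2, 2019, 2023"


def version_key(version):
    """Sort by the leading 4-digit year, then by whatever suffix follows it."""
    return (int(version[:4]), version[4:])


versions = sorted(re.findall(pattern, s), key=version_key)
print(", ".join(versions))  # 2018, 2019, 2020, 2021, 2022r2, 2023
```

Sorting on a (year, suffix) tuple keeps 2022 ahead of 2022r2, because the empty suffix compares lower than "r2".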
Other options could be finding all the years that start with 20, optionally followed by r and 1 or more digits:
(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)
Or matching 4 digits followed by any characters except a comma:
(?:^|,\s*)(\d{4}[^,]*)
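To see where the three patterns differ, here is a small comparison sketch. The sample string is made up for illustration: the entries 2021sp1 and "2023 beta" are hypothetical and do not come from the question.

```python
import re

# Hypothetical sample: "2021sp1" and "2023 beta" are invented to expose the differences.
s = r"2018, \\here\is\another\path_2021, 2021sp1, 2022r2, 2023 beta"

patterns = {
    "year + word chars": r"(?:^|,\s*)(\d{4}\w*)(?=,|$)",
    "20xx + optional rN": r"(?:^|,\s*)(20\d\d(?:r\d+)?)(?=,|$)",
    "year + anything but comma": r"(?:^|,\s*)(\d{4}[^,]*)",
}

results = {name: re.findall(p, s) for name, p in patterns.items()}
for name, found in results.items():
    print(f"{name}: {found}")
# year + word chars: ['2018', '2021sp1', '2022r2']
# 20xx + optional rN: ['2018', '2022r2']
# year + anything but comma: ['2018', '2021sp1', '2022r2', '2023 beta']
```

The second pattern is the strictest, and the third is the most permissive because it accepts spaces inside a field. None of them picks up the 2021 at the end of the path, since it is not preceded by a comma or the start of the string.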
Your first problem is that the regex [0-9, ] matches any single character that is a digit from 0 to 9, a comma, or a space. It matches each digit in a number individually, as well as commas and spaces, which you don’t want. Additionally, it won’t match the r in your expected output of 2022r2, and it will match the digits of 2021 in "\\here\is\another\path_2021".
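A quick way to see this is to run the character class over a shortened sample string (the sample below is an assumption, trimmed from the question's data) and list everything it matches:

```python
import re

# Shortened sample (raw string, so backslashes are literal).
sample = r"2018, \\some\path_2021, 2022r2"

# Every match is a single character: a digit, a comma, or a space.
matches = re.findall(r"[0-9, ]", sample)
print(matches)
print("".join(matches))  # 2018, 2021, 20222
```

The joined result shows both problems at once: the 2021 from the path is kept, and 2022r2 loses its r and collapses into 20222.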
I would instead recommend using (?: |^)(\d+(?:r\d)?). First, this checks that the year is preceded by either a space or the start of the string. Next is a capturing group, which matches 1 or more digits (\d+) and optionally an extension ((?:r\d)?) consisting of the letter "r" and one more digit. If your input could contain more than one digit following the letter "r", you could replace this part with (?:r\d+)?.
Your second, bigger problem is that you use RegexVersion.search(CurrentVersion), which only returns the first match in the string.
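This is why the code in the question prints only "2". A minimal sketch that reproduces it on a shortened sample string (an assumption, trimmed from the question's data) and contrasts it with findall(), using the space-or-start pattern described above in its (?:r\d+)? variant:

```python
import re

# Shortened sample of the string from the question (raw string, literal backslashes).
CurrentVersion = r"2018, \\some\directory\is\here, 2019, \\here\is\another\path_2021, 2022r2"

# Original attempt: the class matches one character at a time,
# and search() stops at the first match.
first = re.compile(r"[0-9, ]").search(CurrentVersion)
print(first.group())  # 2

# Contrast: findall() collects every match of the capture group.
RegexVersion = re.compile(r"(?: |^)(\d+(?:r\d+)?)")
versions = RegexVersion.findall(CurrentVersion)
print(", ".join(versions))  # 2018, 2019, 2022r2
```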
I would instead recommend using RegexVersion.findall(CurrentVersion), which returns a list of all matches. If you assign that list back to CurrentVersion, you could then optionally join it into one long comma-separated string using ", ".join(CurrentVersion).